Probabilistic Balancing Algorithms: Development and Evaluation of a Software System for Binary Classification

Authors

  • Ireimis Leguen de Varona, Universidad de Camagüey "Ignacio Agramonte Loynaz"
  • Julio Madera Quintana, Universidad de Camagüey "Ignacio Agramonte Loynaz"
  • Alfredo Simon-Cuevas, Universidad Tecnológica de La Habana "José Antonio Echevarría", CUJAE
  • Marcos Antonio Rodríguez Guerra, Universidad de Camagüey "Ignacio Agramonte Loynaz"

Keywords

Machine Learning, Class Imbalance, Probabilistic Oversampling, Probabilistic Models, Software Systems

Abstract

Class imbalance is a common problem in supervised classification tasks, where the minority class is represented by a significantly smaller proportion of instances. This situation compromises the model's ability to correctly recognize the most relevant cases, such as in text classification, fraud detection, or medical diagnosis. While multiple oversampling algorithms address this challenge, the probabilistic solutions proposed in this work (SMOTE-COV-LW, SMOTE-RL, and SMOTE-EN) present an innovative approach based on probabilistic models. However, their use has been limited to programming environments, which restricts their application by non-specialist users. To facilitate access, a software system was designed to apply these probabilistic oversampling algorithms to binary classification datasets through a graphical interface, without requiring advanced technical knowledge. The system was developed with technologies such as Python and PyQt6 and includes tools for importing binary classification datasets, applying balancing techniques, and exporting results. To evaluate the performance of the proposed probabilistic algorithms, a comparison was made with widely used classical techniques (SMOTE, Borderline-SMOTE, SMOTE-RSB, and ADASYN). The evaluation was conducted on multiple imbalanced binary classification datasets, using C4.5, MLP, KNN, Random Forest, and SVM as classifiers, and AUC and F1-Score as performance metrics. The results showed that the probabilistic algorithms achieved similar or even superior results in several scenarios, demonstrating their competitiveness compared to traditional methods.
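The classical SMOTE baseline used in the comparison interpolates each synthetic minority sample between an existing minority instance and one of its k nearest minority-class neighbors (Chawla et al., 2002). A minimal NumPy sketch of that baseline is shown below; it does not reproduce the paper's probabilistic variants (SMOTE-COV-LW, SMOTE-RL, SMOTE-EN), whose covariance- and regression-based machinery is not detailed here, and the toy dataset sizes are illustrative assumptions only.

```python
import numpy as np

def smote_oversample(X_min, n_new, k=5, seed=None):
    """Minimal SMOTE-style oversampling: each synthetic point lies on
    the segment between a minority sample and one of its k nearest
    minority-class neighbors (Chawla et al., 2002)."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    # pairwise Euclidean distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                 # exclude self-neighbors
    nn = np.argsort(d, axis=1)[:, :k]           # k nearest neighbors per sample
    base = rng.integers(0, n, size=n_new)       # seed sample for each new point
    neigh = nn[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                # interpolation factor in [0, 1]
    return X_min[base] + gap * (X_min[neigh] - X_min[base])

# toy imbalanced dataset: 100 majority vs. 10 minority samples
rng = np.random.default_rng(0)
X_maj = rng.normal(0.0, 1.0, size=(100, 2))
X_min = rng.normal(3.0, 1.0, size=(10, 2))
X_syn = smote_oversample(X_min, n_new=90, k=5, seed=1)

X_bal = np.vstack([X_maj, X_min, X_syn])
y_bal = np.array([0] * 100 + [1] * (10 + 90))
print(X_bal.shape, int(y_bal.sum()))  # classes now balanced at 100/100
```

In the evaluation protocol described above, a balanced set like `X_bal`, `y_bal` would then be fed to the classifiers (C4.5, MLP, KNN, Random Forest, SVM) and scored with AUC and F1.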

References

Alghamdi, M., Alghamdi, M., & Al-Barakati, A. (2022). A systematic review on oversampling techniques for imbalanced data classification. IEEE Access, *10*, 12458-12478.

Algamal, Z. Y., & Lee, M. H. (2019). A new adaptive elastic net for high-dimensional data. Journal of Statistical Computation and Simulation, *89*(9), 1689-1702.

Bodnar, T., Okhrin, Y., & Parolya, N. (2021). Optimal shrinkage covariance matrix estimation in high-dimensional problems. Journal of Multivariate Analysis, *185*, Article 104767. https://doi.org/10.1016/j.jmva.2021.104767

Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, *16*, 321-357.

Chawla, N. V., Lazarevic, A., Hall, L. O., & Bowyer, K. W. (2003). SMOTEBoost: Improving prediction of the minority class in boosting. In Proceedings of the 7th European Conference on Principles and Practice of Knowledge Discovery in Databases (pp. 107-119). Springer.

Das, S., Datta, S., & Chaudhuri, B. B. (2023). Handling class overlap and imbalance to detect rare events. Pattern Recognition, *133*, Article 109018.

Fernández, A., García, S., del Jesus, M. J., & Herrera, F. (2008). A study of the behaviour of linguistic fuzzy rule-based classification systems in the framework of imbalanced data sets. Fuzzy Sets and Systems, *159*(18), 2378-2398.

Fernández, A., García, S., Herrera, F., & Chawla, N. V. (2018). SMOTE for learning from imbalanced data: Progress and challenges, marking the 15-year anniversary. Journal of Artificial Intelligence Research, *61*, 863-905.

Fernández, A., García, S., Galar, M., Prati, R. C., Krawczyk, B., & Herrera, F. (2018). Learning from imbalanced data sets. Springer.

Google Cloud. (n.d.). ¿Qué es la inteligencia artificial o IA? [What is artificial intelligence or AI?]. Retrieved June 22, 2025, from https://cloud.google.com/learn/what-is-artificial-intelligence

Han, H., Wang, W. Y., & Mao, B. H. (2005). Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In Advances in intelligent computing (pp. 878-887). Springer.

Hastie, T., Tibshirani, R., & Wainwright, M. (2019). Statistical learning with sparsity: The Lasso and generalizations. CRC Press.

He, H., Bai, Y., Garcia, E. A., & Li, S. (2008). ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In 2008 IEEE International Joint Conference on Neural Networks (pp. 1322-1328). IEEE.

Krawczyk, B. (2016). Learning from imbalanced data: Open challenges and future directions. Progress in Artificial Intelligence, *5*(4), 221-232.

Ledoit, O., & Wolf, M. (2022). The power of (non-)linear shrinking: A review and guide to covariance matrix estimation. Journal of Multivariate Analysis, *188*, Article 104844. https://doi.org/10.1016/j.jmva.2021.104844

López, V., Fernández, A., García, S., Palade, V., & Herrera, F. (2013). An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics. Information Sciences, *250*, 113-141.

Lu, Y., & Yin, Y. (2022). Applying logistic lasso regression for the diagnosis of atypical Crohn's disease. Computers in Biology and Medicine, *141*, Article 105151. https://doi.org/10.1016/j.compbiomed.2021.105151

Madera, J. (2008). Algoritmos evolutivos con estimación de distribuciones basados en pruebas de independencia [Evolutionary algorithms with estimation of distributions based on independence tests].

Murphy, K. P. (2022). Probabilistic machine learning: An introduction. MIT Press.

Ramentol, E., Caballero, Y., Bello, R., & Herrera, F. (2012). SMOTE-RSB*: A hybrid preprocessing approach based on oversampling and undersampling for high imbalanced data-sets using rough set theory. Knowledge and Information Systems, *33*(2), 245-265.

Richardson Ibáñez, J. (2017). Algoritmos evolutivos estimadores de distribución celulares para problemas de optimización continuos [Cellular estimation-of-distribution evolutionary algorithms for continuous optimization problems].

Soofi, A. A., & Awan, A. (2017). Classification techniques in machine learning: Applications and issues. Journal of Basic & Applied Sciences, *13*, 459-465.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), *58*(1), 267-288.

Zhang, C., Zhang, Y., & Zhang, Y. (2021). A survey on artificial intelligence for medical diagnosis. Artificial Intelligence Review, *54*(5), 3597-3645.

Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), *67*(2), 301-320.

Published

2026-03-30

How to Cite

Leguen de Varona, I., Madera Quintana, J., Simon-Cuevas, A., & Rodríguez Guerra, M. A. (2026). Probabilistic Balancing Algorithms: Development and Evaluation of a Software System for Binary Classification. Revista Cubana De Transformación Digital, 7, e304:1-13. Retrieved from https://rctd.uic.cu/rctd/article/view/304