SMOTEMD: A MIXED-DATA BALANCING ALGORITHM FOR BIG DATA IN R.

Víctor Morales Oñate; Luis Moreta; Bolívar Morales-Oñate

doi:10.47187/perf.v1i24.75

Authors

Víctor Morales Oñate Banco Solidario, Riesgos, Data Analytics, Quito, Ecuador.
Luis Moreta Escuela Politécnica Nacional, Faculty of Sciences, Department of Quantitative Economics, Quito, Ecuador.
Bolívar Morales-Oñate Escuela Superior Politécnica de Chimborazo, Faculty of Sciences, Chemical Engineering / Data Science Research Group, Riobamba, Ecuador.

Keywords:

SMOTE, Classification, Unbalanced samples

Abstract

Modeling with unbalanced samples is challenging. A common case is a binary response variable in which one class makes up a very small proportion of the total. Binary responses are usually modeled with probability models such as logit or probit. However, these models run into problems when the sample is unbalanced and a confusion matrix is built to evaluate the model's predictive power. One technique for balancing the observed data is the SMOTE algorithm, which works exclusively with numerical data. This work extends SMOTE to mixed data (numerical and categorical). By using mixed data, the proposal also overcomes the barrier of 65536 observations that the R software has when computing distances on categorical data. A simulation study verifies the benefits of the proposed algorithm, SMOTEMD, for mixed data.
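To make the idea of SMOTE-style oversampling on mixed data concrete, the following is a minimal Python sketch, not the authors' R implementation of SMOTEMD. It rests on several assumptions that do not come from the abstract: a Gower-style distance (range-scaled numeric gaps averaged with 0/1 categorical mismatches), interpolation for numeric columns, and a majority vote among the nearest neighbours for categorical columns. The names `gower_to_all` and `smote_mixed` are hypothetical. Distances are computed one row at a time, so no full pairwise distance matrix is ever stored; this is a design choice of the sketch, not a statement about how SMOTEMD handles large samples.

```python
import numpy as np


def gower_to_all(i, X_num, X_cat, ranges):
    """Gower-style distance from minority row i to every minority row (assumed metric)."""
    d_num = np.abs(X_num - X_num[i]) / ranges   # numeric gaps scaled by column range
    d_cat = (X_cat != X_cat[i]).astype(float)   # 0/1 mismatch per categorical column
    return np.hstack([d_num, d_cat]).mean(axis=1)


def smote_mixed(X_num, X_cat, n_new, k=5, seed=0):
    """Generate n_new synthetic minority rows from numeric and categorical parts."""
    rng = np.random.default_rng(seed)
    X_num = np.asarray(X_num, dtype=float)
    X_cat = np.asarray(X_cat)
    n = X_num.shape[0]
    k = min(k, n - 1)                            # cannot have more neighbours than other rows
    ranges = X_num.max(axis=0) - X_num.min(axis=0)
    ranges[ranges == 0] = 1.0                    # avoid division by zero on constant columns

    new_num, new_cat = [], []
    for _ in range(n_new):
        i = rng.integers(n)                      # pick a random minority row
        d = gower_to_all(i, X_num, X_cat, ranges)
        d[i] = np.inf                            # exclude the row itself
        nn = np.argsort(d)[:k]                   # indices of the k nearest neighbours
        j = rng.choice(nn)                       # one neighbour drives the interpolation
        gap = rng.random()                       # uniform in [0, 1)
        new_num.append(X_num[i] + gap * (X_num[j] - X_num[i]))

        # Categorical columns: most frequent level among the k neighbours.
        levels = []
        for c in range(X_cat.shape[1]):
            vals, counts = np.unique(X_cat[nn, c], return_counts=True)
            levels.append(vals[np.argmax(counts)])
        new_cat.append(levels)

    return np.array(new_num), np.array(new_cat)
```

Calling `smote_mixed(X_num, X_cat, n_new=100)` on the minority-class rows returns two arrays that can be appended to the original sample. Each synthetic numeric value lies between two observed minority values, and each synthetic categorical value is a level already observed among the neighbours.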
