Solving Simulated Imbalanced Body Performance Data using A-SUWO and Tomek Link Algorithm
DOI:
https://doi.org/10.37385/jaets.v6i2.4738

Keywords:
A-SUWO, Body Performance Data, Data Sampling Techniques, Imbalanced, Tomek Link

Abstract
This research examines the impact of various sampling techniques on the performance of classification models in the context of imbalanced datasets, using the body performance dataset as a case study. Many studies in this field analyze the effect of sampling techniques on model performance; however, they often begin with already-imbalanced datasets, lacking a balanced baseline for comparison. This research addresses that gap by simulating an imbalanced dataset from an originally balanced one, providing a target reference point for evaluating the effectiveness of the sampling methods. The dataset is prepared in three versions: (1) the original, normally distributed data, (2) a simulated imbalanced distribution, and (3) synthesized datasets produced by various data sampling techniques: oversampling with Adaptive Semi-Unsupervised Weighted Oversampling (A-SUWO), undersampling with Tomek Link, and hybrid sampling combining both. The primary objective is to identify which sampling techniques, when combined with a classifier, most closely match the performance observed on the original balanced dataset. Across all experiments using Decision Tree, Random Forest, and K-Nearest Neighbors (KNN) as classifiers, both A-SUWO and Tomek Link led to overfitting, with a discernible gap between training and testing accuracy averaging 0.21304. Despite the overfitting and general performance issues, undersampling with Tomek Link achieved the highest average test accuracy (0.65023), outperforming A-SUWO (0.62883) and the hybrid approach (0.63568). These findings highlight the importance of selecting appropriate sampling techniques for optimizing model performance on imbalanced datasets.
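The Tomek Link undersampling step can be illustrated compactly. A Tomek link is a pair of mutually nearest neighbors belonging to opposite classes; undersampling removes the majority-class member of each such pair. The sketch below is illustrative only (it is not the paper's code), assumes plain NumPy arrays, and uses a brute-force distance matrix for clarity:

```python
import numpy as np

def tomek_links(X, y, majority_label):
    """Return indices of majority-class samples that sit in a Tomek link.

    A Tomek link is a pair (i, j) of mutually nearest neighbours whose
    class labels differ; removing the majority member cleans the
    class boundary, as in Tomek Link undersampling.
    """
    # Brute-force pairwise squared Euclidean distances (fine for a sketch).
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1).astype(float)
    np.fill_diagonal(d, np.inf)        # a point is not its own neighbour
    nn = d.argmin(axis=1)              # nearest neighbour of each point
    drop = set()
    for i, j in enumerate(nn):
        # Mutual nearest neighbours with different labels form a Tomek link.
        if nn[j] == i and y[i] != y[j]:
            if y[i] == majority_label:
                drop.add(i)
            if y[j] == majority_label:
                drop.add(int(j))
    return sorted(drop)

# Toy 1-D data: majority class 0, one minority point (index 3) whose
# nearest neighbour is the majority point at index 2.
X = np.array([[0.0], [0.1], [1.0], [1.05], [2.0], [2.1]])
y = np.array([0, 0, 0, 1, 0, 0])
print(tomek_links(X, y, majority_label=0))  # → [2]
```

In practice a library implementation such as `TomekLinks` from the `imblearn.under_sampling` module would be used instead of this brute-force version, which scales quadratically with the number of samples.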