Solving Simulated Imbalanced Body Performance Data using A-SUWO and Tomek Link Algorithm
DOI:
https://doi.org/10.37385/jaets.v6i2.4738

Keywords:
A-SUWO, Body Performance Data, Data Sampling Techniques, Imbalanced, Tomek Link

Abstract
This research examines the impact of various sampling techniques on the performance of classification models in the context of imbalanced datasets, using the body performance dataset as a case study. Many studies in this field analyze the effect of sampling techniques on model performance; however, they often begin with already-imbalanced datasets, lacking a balanced baseline for comparison. This research addresses that gap by simulating an imbalanced dataset from an originally balanced one, providing a target reference point for evaluating the effectiveness of the sampling methods. The dataset is prepared in three versions: (1) the original, normally distributed data, (2) a simulated imbalanced distribution, and (3) synthesized datasets produced by various data sampling techniques: oversampling with Adaptive Semi-Unsupervised Weighted Oversampling (A-SUWO), undersampling with Tomek Link, and hybrid sampling combining both. The primary objective is to identify which sampling techniques, when combined with a classifier, most closely match the performance observed on the original balanced dataset. Across all experiments using Decision Tree, Random Forest, and K-Nearest Neighbors (KNN) as classifiers, both A-SUWO and Tomek Link led to overfitting, with a discernible gap between training and testing accuracy averaging 0.21304. Despite the overfitting and general performance issues, undersampling with Tomek Link achieved the highest average test accuracy (0.65023), outperforming A-SUWO (0.62883) and the hybrid approach (0.63568). These findings highlight the importance of selecting appropriate sampling techniques for optimizing model performance on imbalanced datasets.
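The Tomek Link undersampling step can be illustrated compactly. A Tomek link is a pair of mutually nearest neighbors belonging to opposite classes; undersampling removes the majority-class member of each such pair. The sketch below is illustrative only (it is not the paper's code), assumes plain NumPy arrays, and uses a brute-force distance matrix for clarity:

```python
import numpy as np

def tomek_links(X, y, majority_label):
    """Return indices of majority-class samples that sit in a Tomek link.

    A Tomek link is a pair (i, j) of mutually nearest neighbours whose
    class labels differ; removing the majority member cleans the
    class boundary, as in Tomek Link undersampling.
    """
    # Brute-force pairwise squared Euclidean distances (fine for a sketch).
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1).astype(float)
    np.fill_diagonal(d, np.inf)        # a point is not its own neighbour
    nn = d.argmin(axis=1)              # nearest neighbour of each point
    drop = set()
    for i, j in enumerate(nn):
        # Mutual nearest neighbours with different labels form a Tomek link.
        if nn[j] == i and y[i] != y[j]:
            if y[i] == majority_label:
                drop.add(i)
            if y[j] == majority_label:
                drop.add(int(j))
    return sorted(drop)

# Toy 1-D data: majority class 0, one minority point (index 3) whose
# nearest neighbour is the majority point at index 2.
X = np.array([[0.0], [0.1], [1.0], [1.05], [2.0], [2.1]])
y = np.array([0, 0, 0, 1, 0, 0])
print(tomek_links(X, y, majority_label=0))  # → [2]
```

In practice a library implementation such as `TomekLinks` from the `imblearn.under_sampling` module would be used instead of this brute-force version, which scales quadratically with the number of samples.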