Ensemble Classification with Lazy Predict on Three Diabetes Datasets: A Comparative Study with Resampling Techniques

Afshan Hashmi, Md Tabrez Nafis, Sameena Naaz, Imran Hussain

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Millions of people worldwide suffer from diabetes mellitus, a chronic illness. Effective diabetes care and the avoidance of complications depend on early prediction and diagnosis. In this study, we present a thorough analysis of diabetes prediction using three distinct datasets: the PIMA India dataset, the NHANES dataset, and Mendeley’s diabetes dataset. Lazy Predict enables us to efficiently evaluate a wide range of classifiers on each dataset, providing valuable insights into model performance. The top-performing model on each dataset is selected as the best individual model. Furthermore, ensembles are created by combining the predictions of the top ten models, both without resampling and with resampling techniques. Random forest achieved the highest accuracy of 79% on the PIMA dataset, XGB achieved the highest accuracy of 99% on Mendeley’s dataset, and the dummy classifier attained the highest accuracy of 88% on the NHANES dataset. The ensembles without oversampling consistently outperformed their counterparts with resampling: the ensemble without oversampling exhibited the highest accuracy overall, followed by the ensemble with oversampling. This challenges the common notion that resampling always leads to improved performance.
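
The abstract does not state how the predictions of the top ten models are combined or which resampling technique is applied. As an illustration only, the sketch below assumes hard majority voting and random oversampling of the minority class. The function names (majority_vote, random_oversample, accuracy) are hypothetical and are not taken from the paper or from the Lazy Predict library.

```python
import random
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label lists (one list per model) by hard majority vote.

    Assumption: ties are broken in favour of the smallest label.
    """
    combined = []
    for sample_votes in zip(*predictions):
        counts = Counter(sample_votes)
        top = max(counts.values())
        combined.append(min(label for label, c in counts.items() if c == top))
    return combined


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of smaller classes until classes are balanced.

    Assumption: plain random oversampling; the paper may use another technique.
    """
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(rows))
            y_out.append(label)
    return X_out, y_out


def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

In such a setup, the ensemble without resampling would be built from models fitted on the original training data, and the ensemble with resampling from models fitted on the output of random_oversample; both would then be scored with accuracy on the same untouched test split.
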
Original language: English
Title of host publication: Third International Conference on Computing and Communication Networks. ICCCN 2023
Publisher: Springer Nature
Publication status: Published - 21 Jul 2024
