Comparison of Dataset Proportions in SVM and Random Forest Algorithms in Detecting Student Dependence on AI in Learning

Sardar Faroq Ahmd Khan; Pramudya Asoka Syukur; Andi Baso Kaswar; Marwan Edy Ramdhany

Authors

Sardar Faroq Ahmd Khan, Universitas Negeri Makassar
Pramudya Asoka Syukur, Universitas Negeri Makassar
Andi Baso Kaswar, Okayama University
Marwan Edy Ramdhany, Universitas Gadjah Mada

Keywords:

AI Dependency, Machine Learning, Random Forest, Support Vector Machine, Dataset Proportion

Abstract

Purpose – The rapid integration of artificial intelligence (AI) in education has raised concerns about excessive student dependence, potentially undermining critical thinking and learning autonomy. This study aims to identify the most effective machine learning algorithm for detecting AI dependency in learning activities and to examine the impact of training–testing data proportion on predictive performance.
Methods – This study employs the CRISP-DM framework and applies two supervised classification algorithms, Random Forest and Support Vector Machine (SVM), to a synthetic dataset of 10,000 AI-assisted learning sessions. The target variable, perceived AI assistance level, was discretised into three categories (low, medium, and high). Model performance was evaluated under four dataset split scenarios (60:40, 70:30, 80:20, and 90:10) using accuracy, AUC, precision, recall, and F1-score.
Findings – The results show that Random Forest consistently outperforms SVM across all dataset proportions and evaluation metrics. The highest performance was achieved by Random Forest with a 60:40 split, yielding an accuracy of 67.6% and an AUC of 80.8%. Although SVM demonstrated stable performance, it required larger training datasets and remained inferior to Random Forest.
Research limitations – The use of synthetic data and limited behavioural features restricts the generalisability of the findings. The moderate accuracy indicates that AI dependency is a complex construct not fully captured by the current model.
Originality – This study provides empirical evidence on the combined influence of algorithm selection and dataset proportion in detecting AI dependency, offering practical guidance for developing early-warning systems to support responsible AI use in education.
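
The evaluation protocol summarised in the abstract can be illustrated with a minimal sketch. It shows how many of the 10,000 sessions fall into the training and test sets under each of the four split scenarios, and how accuracy, precision, recall, and F1-score can be computed for a three-class target. The sketch makes several assumptions that are not stated in the abstract: the function names `split_sizes` and `macro_scores` are hypothetical, the per-class metrics are macro-averaged, and the toy labels are invented purely for illustration. It does not train Random Forest or SVM and does not compute AUC.

```python
from collections import Counter

# Class labels and train:test percentages as named in the abstract.
LABELS = ("low", "medium", "high")
SPLITS = ((60, 40), (70, 30), (80, 20), (90, 10))


def split_sizes(n_rows, train_pct):
    """Return (n_train, n_test) for a percentage split of n_rows records."""
    n_train = n_rows * train_pct // 100
    return n_train, n_rows - n_train


def macro_scores(y_true, y_pred, labels=LABELS):
    """Accuracy plus macro-averaged precision, recall and F1 (assumed averaging)."""
    pairs = Counter(zip(y_true, y_pred))  # (true, predicted) -> count
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = pairs[(label, label)]
        predicted = sum(c for (_, p), c in pairs.items() if p == label)
        actual = sum(c for (t, _), c in pairs.items() if t == label)
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    correct = sum(pairs[(label, label)] for label in labels)
    return {
        "accuracy": correct / len(y_true),
        "precision": sum(precisions) / len(labels),
        "recall": sum(recalls) / len(labels),
        "f1": sum(f1s) / len(labels),
    }


if __name__ == "__main__":
    # Split sizes for the 10,000 sessions reported in the abstract.
    for train_pct, test_pct in SPLITS:
        n_train, n_test = split_sizes(10_000, train_pct)
        print(f"{train_pct}:{test_pct} -> train={n_train}, test={n_test}")

    # Toy labels, invented for illustration only (not results from the study).
    y_true = ["low", "low", "medium", "medium", "high", "high"]
    y_pred = ["low", "medium", "medium", "medium", "high", "low"]
    print(macro_scores(y_true, y_pred))
```

In a full experiment, `y_pred` would come from a classifier fitted on the training portion of each split, and the same scoring function would be applied to every split so that the scenarios are compared on identical metrics.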
