Comparison of Dataset Proportions in SVM and Random Forest Algorithms in Detecting Student Dependence on AI in Learning
Keywords:
AI Dependency, Machine Learning, Random Forest, Support Vector Machine, Proportion Dataset
Abstract
Purpose - The rapid integration of artificial intelligence (AI) in education has raised concerns about excessive student dependence, potentially undermining critical thinking and learning autonomy. This study aims to identify the most effective machine learning algorithm for detecting AI dependency in learning activities and to examine the impact of training–testing data proportion on predictive performance.
Methods - This study employs the CRISP-DM framework and applies two supervised classification algorithms, Random Forest and Support Vector Machine (SVM), to a synthetic dataset of 10,000 AI-assisted learning sessions. The target variable, perceived AI assistance level, was discretised into three categories (low, medium, and high). Model performance was evaluated under four dataset split scenarios (60:40, 70:30, 80:20, and 90:10) using accuracy, AUC, precision, recall, and F1-score.
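The discretisation and the four split scenarios described above can be sketched in plain Python. This is an illustrative sketch, not the authors' pipeline: the cut points (1/3 and 2/3), the random seeds, the unstratified shuffle, and the helper names `discretise`, `split`, and `macro_f1` are all assumptions, and the random scores merely stand in for the 10,000 sessions. Fitting the Random Forest and SVM classifiers is deliberately left out.

```python
import random

def discretise(score, low_cut=1 / 3, high_cut=2 / 3):
    """Map a score in [0, 1] to a three-level label (cut points are assumed)."""
    if score < low_cut:
        return "low"
    if score < high_cut:
        return "medium"
    return "high"

def split(records, train_fraction, seed=42):
    """Shuffle a copy of the records and cut it at the requested train fraction."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def macro_f1(y_true, y_pred, labels=("low", "medium", "high")):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        scores.append(2 * precision * recall / total if total else 0.0)
    return sum(scores) / len(scores)

# Hypothetical stand-in for the 10,000 sessions: random scores, then labels.
rng = random.Random(0)
sessions = [discretise(rng.random()) for _ in range(10_000)]

# The four train:test scenarios named in the abstract.
split_sizes = {}
for train_fraction in (0.6, 0.7, 0.8, 0.9):
    train, test = split(sessions, train_fraction)
    split_sizes[train_fraction] = (len(train), len(test))
    print(f"{train_fraction:.0%} train -> {len(train)} train, {len(test)} test")
```

In a full experiment, each `train` and `test` pair would be passed to the two classifiers, and the metrics would be computed on `test` only. A stratified split may be preferable if the three classes turn out to be imbalanced.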
Findings - The results show that Random Forest consistently outperforms SVM across all dataset proportions and evaluation metrics. The highest performance was achieved by Random Forest with a 60:40 split, yielding an accuracy of 67.6% and an AUC of 80.8%. Although SVM demonstrated stable performance, it required larger training datasets and remained inferior to Random Forest.
Research limitations - The use of synthetic data and limited behavioural features restricts the generalisability of the findings. The moderate accuracy indicates that AI dependency is a complex construct not fully captured by the current model.
Originality - This study provides empirical evidence on the combined influence of algorithm selection and dataset proportion in detecting AI dependency, offering practical guidance for developing early-warning systems to support responsible AI use in education.