Evaluating the Effectiveness of Open-Source LLMs for Automatic Short Answer Scoring

Aminah, Siti and Hidayah, Indriana and Permanasari, Adhistya Erna (2025) Evaluating the Effectiveness of Open-Source LLMs for Automatic Short Answer Scoring. 2025 IEEE International Conference on Artificial Intelligence and Mechatronics Systems (AIMS).

Evaluating_the_Effectiveness_of_Open-Source_LLMs_for_Automatic_Short_Answer_Scoring.pdf - Published Version
Restricted to Registered users only

Abstract

Automated Short Answer Grading (ASAG) has become a critical challenge in educational technology, requiring models capable of understanding the semantics and context of student responses. This study investigates the effectiveness of three Large Language Models (LLMs), GPT-3.5 Turbo, LLaMA3-70B-8192, and Deepseek, for ASAG tasks by evaluating their accuracy, interpretability, feedback quality, and efficiency. Using the Mohler dataset, which contains 2,273 answers scored by human raters, the models were assessed using Quadratic Weighted Kappa (QWK) and Mean Absolute Error (MAE). Results show that LLaMA3-70B-8192 achieved the highest QWK (0.9484), while Deepseek obtained the lowest MAE (0.3), indicating superior precision. GPT-3.5 Turbo, though the fastest, had lower agreement with human scores. The study also highlights differences in feedback style and cost-efficiency, with open-source models offering scalable and customizable alternatives. This research contributes a comprehensive benchmark and introduces a novel comparative framework for selecting LLMs tailored to ASAG needs, balancing accuracy, cost, and pedagogical effectiveness. © 2025 IEEE.
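
For readers unfamiliar with the two metrics named in the abstract, the following is a minimal sketch of how QWK and MAE are commonly computed for paired human and model scores. It is not taken from the paper: the function names, the integer 0 to 5 score range, and the sample scores are assumptions made for illustration only.

```python
from collections import Counter


def mean_absolute_error(human, model):
    """Average absolute difference between paired scores."""
    if len(human) != len(model) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    return sum(abs(h - m) for h, m in zip(human, model)) / len(human)


def quadratic_weighted_kappa(human, model, min_score=0, max_score=5):
    """Quadratic Weighted Kappa for integer scores in [min_score, max_score].

    The score range is an assumption for this sketch; adjust it to the
    rubric actually in use.
    """
    if len(human) != len(model) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    n = len(human)
    k = max_score - min_score + 1      # number of score categories
    observed = Counter(zip(human, model))  # observed (human, model) pair counts
    human_hist = Counter(human)
    model_hist = Counter(model)

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            # Quadratic penalty: grows with the squared score distance.
            weight = (i - j) ** 2 / (k - 1) ** 2
            # Pair count expected if the two raters were independent.
            expected = human_hist[i] * model_hist[j] / n
            weighted_observed += weight * observed[(i, j)]
            weighted_expected += weight * expected

    if weighted_expected == 0:
        # Both raters used one identical score for every answer;
        # conventions vary, this sketch reports perfect agreement.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical scores, not data from the study.
human_scores = [5, 4, 3, 2]
model_scores = [5, 3, 3, 1]
mae = mean_absolute_error(human_scores, model_scores)
qwk = quadratic_weighted_kappa(human_scores, model_scores)
print(f"MAE={mae:.3f} QWK={qwk:.3f}")
```

A higher QWK indicates closer agreement with human raters after accounting for chance, while a lower MAE indicates smaller average score deviations, so the two metrics can rank models differently, as the abstract reports.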

Item Type: Article
Additional Information: Cited by: 0
Uncontrolled Keywords: Automation; Educational technology; Feedback; Information management; Open systems; Semantics; Automated short answer grading; Critical challenges; Deepseek; GPT-3.5 turbo; Language model; Large language model; LLaMA3-70b-8192; Mean absolute error; Open-source; Student response; Efficiency; Grading
Subjects: T Technology > T Technology (General) > Industrial research. Research and development; T Technology > TA Engineering (General). Civil engineering (General) > Engineering mathematics. Engineering analysis
Divisions: Faculty of Engineering > Electrical and Information Technology Department
Depositing User: Rita Yulianti Yulianti
Date Deposited: 23 Feb 2026 01:50
Last Modified: 23 Feb 2026 01:50
URI: https://ir.lib.ugm.ac.id/id/eprint/24759
