TY - GEN
T1 - From Natural Language to Interpretable Code
T2 - 18th International Conference on Agents and Artificial Intelligence, ICAART 2026
AU - Chen, Yuexi
AU - Vaidya, Gauri
AU - O’Connor, Alison N.
AU - Kshirsagar, Meghana
N1 - Publisher Copyright: © 2026 by SCITEPRESS-Science and Technology Publications, Lda.
PY - 2026
Y1 - 2026
N2 - This article presents a comparative evaluation of three large language models (LLMs), namely GPT-4o, Gemini 2.0 Flash, and Claude 3.5 Sonnet, examining their ability to automate key healthcare workflows while adhering to algorithmic constraints and supporting interpretability and fairness. The models were evaluated using Python, JavaScript, and Go under varying levels of prompt completeness across four healthcare tasks of increasing complexity: bed allocation, dynamic patient bed reallocation, ambulance dispatch, and patient triage. We introduce a multidimensional evaluation framework that captures model performance across task complexity, prompt completeness, and programming language, with an emphasis on generating functionally correct, transparent, and reliable code. This framework enables a systematic analysis of how effectively LLMs translate natural language specifications into executable logic under realistic, constraint-rich healthcare scenarios. Experimental results show that all three models generate constraint-compliant solutions for simpler tasks such as bed management. However, as task complexity increases and multiple constraints must be balanced, clear performance differences emerge. Claude 3.5 Sonnet consistently outperforms GPT-4o and Gemini 2.0 Flash by producing more robust, interpretable, and reliable code. These findings highlight Claude 3.5 Sonnet’s stronger potential for transparent and dependable automation of critical healthcare services using LLM-based code generation. The code is publicly available at: https://github.com/gauriivaidya/alter-automated-healthcare-tasks.
AB - This article presents a comparative evaluation of three large language models (LLMs), namely GPT-4o, Gemini 2.0 Flash, and Claude 3.5 Sonnet, examining their ability to automate key healthcare workflows while adhering to algorithmic constraints and supporting interpretability and fairness. The models were evaluated using Python, JavaScript, and Go under varying levels of prompt completeness across four healthcare tasks of increasing complexity: bed allocation, dynamic patient bed reallocation, ambulance dispatch, and patient triage. We introduce a multidimensional evaluation framework that captures model performance across task complexity, prompt completeness, and programming language, with an emphasis on generating functionally correct, transparent, and reliable code. This framework enables a systematic analysis of how effectively LLMs translate natural language specifications into executable logic under realistic, constraint-rich healthcare scenarios. Experimental results show that all three models generate constraint-compliant solutions for simpler tasks such as bed management. However, as task complexity increases and multiple constraints must be balanced, clear performance differences emerge. Claude 3.5 Sonnet consistently outperforms GPT-4o and Gemini 2.0 Flash by producing more robust, interpretable, and reliable code. These findings highlight Claude 3.5 Sonnet’s stronger potential for transparent and dependable automation of critical healthcare services using LLM-based code generation. The code is publicly available at: https://github.com/gauriivaidya/alter-automated-healthcare-tasks.
KW - Code Generation
KW - Healthcare Workflow Automation
KW - Large Language Models
KW - Operational Efficiency
KW - Programming Languages
UR - https://www.scopus.com/pages/publications/105035596941
U2 - 10.5220/0014717000004052
DO - 10.5220/0014717000004052
M3 - Conference contribution
AN - SCOPUS:105035596941
SN - 9789897587962
T3 - International Conference on Agents and Artificial Intelligence
SP - 829
EP - 840
BT - Proceedings of the 18th International Conference on Agents and Artificial Intelligence
A2 - Rocha, Ana Paula
A2 - Wahde, Mattias
A2 - van den Herik, H. Jaap
PB - Science and Technology Publications, Lda
Y2 - 5 March 2026 through 8 March 2026
ER -