Speaker
Description
Large Language Models (LLMs) have emerged as powerful tools for medical question answering, capable of assisting clinical decision-making by processing and synthesising vast amounts of medical knowledge. However, deploying LLMs in healthcare requires balancing accuracy, computational efficiency, and cost. This study investigates three inference strategies (single-call, ensemble, and episodic chain-of-thought, ECoT) and evaluates their impact on medical reasoning performance. Through extensive benchmarking on the MedQA USMLE-H dataset, we analyse the trade-offs inherent in a range of models, including GPT-4, Claude 3.5, Llama 3, Mixtral, Gemini, and GPT-3.5. Our results demonstrate that GPT-4 and Claude 3.5 achieve the highest accuracy but incur substantial computational costs. Llama 3 70B presents a cost-effective alternative with competitive accuracy, while Mixtral and Gemini offer moderate performance at lower cost. In our preliminary research, ECoT reasoning improves accuracy for some models without introducing computational overhead, though it requires further optimisation. These findings provide insights into selecting optimal inference strategies for deploying LLMs in medical applications.
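To make the contrast between the single-call and ensemble strategies concrete, the sketch below shows a minimal self-voting ensemble for multiple-choice medical questions. This is an illustrative assumption, not the study's actual pipeline: `answer_once` is a hypothetical stand-in for one LLM API call, here stubbed with a biased random choice so the example runs offline.

```python
from collections import Counter
import random

# Hypothetical stand-in for a single LLM call; a real deployment would
# send the question and options to a model API and parse its chosen letter.
def answer_once(question, options, rng):
    # Stub: noisy model biased toward the first option.
    weights = [3] + [1] * (len(options) - 1)
    return rng.choices(options, weights=weights, k=1)[0]

def answer_single_call(question, options, seed=0):
    """Single-call strategy: one model query, one answer."""
    return answer_once(question, options, random.Random(seed))

def answer_ensemble(question, options, n_calls=5, seed=0):
    """Ensemble strategy: sample several independent answers, majority-vote."""
    rng = random.Random(seed)
    votes = Counter(answer_once(question, options, rng) for _ in range(n_calls))
    return votes.most_common(1)[0][0]

question = "Which vitamin deficiency causes scurvy?"
options = ["Vitamin C", "Vitamin D", "Vitamin B12", "Vitamin K"]
print(answer_single_call(question, options))
print(answer_ensemble(question, options, n_calls=7))
```

The cost trade-off discussed in the abstract is visible here: the ensemble multiplies per-question inference cost by `n_calls` in exchange for reduced answer variance.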