Daily News
State-of-the-art AI models excel in complex neurology tests
Recent advancements mean that LLMs, with minor tweaks, could become key resources for clinical neurology applications
Published 2 years ago

In a recent research article published in JAMA Network Open, researchers conducted a comprehensive investigation of two large language models (LLMs) based on ChatGPT, which were prompted to answer queries from the American Board of Psychiatry and Neurology question bank. The performance of the two models, labeled LLM 1 (ChatGPT version 3.5) and LLM 2 (ChatGPT 4), on both lower- and higher-order questions was compared with that of human neurology students.
They are smarter!
The results revealed a significant distinction between the two models, with LLM 2 surpassing the average human score on the question set (85% versus 73.8%). This superior performance suggests that such LLMs could successfully navigate challenging entrance examinations in neurology. The findings underscore the recent strides in LLM development and suggest that, with slight adjustments, these models could emerge as pivotal tools in clinical neurology applications.
The increasing sophistication of artificial intelligence (AI), particularly through machine learning (ML) algorithms, is reshaping fields that were traditionally exclusive to human expertise. Sectors such as medicine, the military, education, and scientific research are witnessing a notable integration of these “smarter” AI models.
Recent strides in computing power and the evolution of advanced AI models, particularly transformer-based architectures trained on massive datasets exceeding 45 terabytes, have paved the way for their widespread utilisation in clinical neurology. These deep learning algorithms now contribute significantly to tasks ranging from neurological diagnosis to treatment and prognosis.

Transformer-based AI architectures, exemplified by ChatGPT versions 3.5 and 4 (LLM 1 and LLM 2), have been developed to address diverse needs in neurology. LLM 1 is computationally less demanding and offers swift data processing, while LLM 2 delivers greater contextual accuracy. Despite informal indications of their utility, scientific scrutiny of these models’ performance and accuracy has been limited. Previous assessments primarily focused on LLM 1 in examinations such as the United States Medical Licensing Examination (USMLE) and ophthalmology board-style tests, leaving LLM 2 unvalidated.
The present study sought to fill this gap by comparing the performance of LLM 1 and LLM 2 against human neurology students in board-like written examinations. Adhering to the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) guidelines, the study employed a neurology board examination as a benchmark for evaluating the models’ performance in complex human medical examinations.
Utilising questions from the publicly accessible American Board of Psychiatry and Neurology (ABPN) question bank, the study excluded 80 questions that relied on visual content. LLM 1 and LLM 2 were accessed via their online servers (ChatGPT 3.5 and 4, respectively) and were trained on data up to September 2021. Human performance benchmarks were derived from previous iterations of the ABPN board entrance examination.

Importantly, the pre-trained models underwent evaluations without access to online resources for fact-checking or improvement. No specific adjustments for neurology were implemented, and no fine-tuning occurred before testing.
The testing process involved subjecting the models to 1,956 multiple-choice questions categorised as lower- or higher-order according to the Bloom taxonomy. The study set a minimum passing grade of 70%, and the models were subjected to 50 independent repetitions of selected queries to assess answer reproducibility and self-consistency.
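The reproducibility check described above amounts to asking the same question repeatedly and measuring how often the model returns its most common answer. The study's actual code is not public; the following is a minimal, hypothetical Python sketch of that idea, with the response data invented for illustration.

```python
from collections import Counter

def self_consistency(answers):
    """Return the most common answer and the fraction of repetitions that agree with it."""
    counts = Counter(answers)
    top_answer, top_count = counts.most_common(1)[0]
    return top_answer, top_count / len(answers)

# Hypothetical example: 50 repeated queries for one multiple-choice question
responses = ["B"] * 46 + ["C"] * 3 + ["A"]
answer, consistency = self_consistency(responses)
print(answer, consistency)  # B 0.92
```

A model that answers "B" on 46 of 50 repetitions would score 0.92 on this measure; a perfectly self-consistent model would score 1.0.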
The study revealed that on the neurology board examination questions, the later model (LLM 2) outperformed both the earlier model (LLM 1) and human neurology students across question categories. Although it performed better on memory-based (lower-order) questions than on those requiring higher-order cognitive skills, these results highlight the potential of such models to support, or even substitute for, human medical experts in non-mission-critical roles.
Notably, the models were neither tailored for neurological purposes nor given access to real-time online resources, and addressing either factor could further improve their performance relative to human counterparts. In essence, the pace at which AI LLMs are advancing is unprecedented.
Shalini is an Executive Editor with Apeejay Newsroom. With a PG Diploma in Business Management and Industrial Administration and an MA in Mass Communication, she was a former Associate Editor with News9live. She has worked on varied topics - from news-based to feature articles.