
New research by a team at Mass General Brigham has found that across all medical specialties, ChatGPT displays 72% accuracy in clinical decision making and is 77% accurate in making a final diagnosis. The study, published today in the Journal of Medical Internet Research, shows that the chatbot performed equally well in primary care and emergency medical settings.

“Our paper comprehensively assesses decision support via ChatGPT from the very beginning of working with a patient through the entire care scenario, from differential diagnosis all the way through testing, diagnosis, and management,” said corresponding author Marc Succi, MD, associate chair of innovation and commercialization and strategic innovation leader at Mass General Brigham.

“No real benchmarks exist, but we estimate this performance to be at the level of someone who has just graduated from medical school, such as an intern or resident. This tells us that LLMs (large language models) in general have the potential to be an augmenting tool for the practice of medicine and support clinical decision making with impressive accuracy.”

To put ChatGPT through its paces, the investigators pasted successive portions of standardized, published clinical vignettes into the chatbot. It was then tasked with providing a set of possible diagnoses based on the patient’s initial information—age, gender, symptoms, and whether the case was an emergency—and later was provided with additional information and asked to make a care management decision along with a final diagnosis. This staged approach mirrored the trajectory of evaluating an actual patient.
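For readers curious how such a staged workflow might look in practice, the sketch below shows one plausible way to feed successive vignette sections to a chat model and ask for the corresponding decision at each stage. It is not the study’s actual pipeline; the vignette text, prompts, and model name are illustrative placeholders.

```python
# A minimal sketch (not the study's pipeline) of staged vignette prompting.
# Assumes the OpenAI Python SDK is installed and OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()

# Hypothetical vignette, revealed in stages as in the workflow described above.
vignette_stages = [
    ("differential diagnosis",
     "A 54-year-old man presents to the emergency department with acute chest pain."),
    ("diagnostic testing",
     "Additional information: the pain radiates to the left arm and began one hour ago."),
    ("final diagnosis and management",
     "Test results: ECG and cardiac biomarkers are now available (details omitted)."),
]

messages = [{
    "role": "system",
    "content": ("You are working through a clinical vignette. "
                "Answer each question using only the information provided so far."),
}]

for task, section in vignette_stages:
    # Reveal the next portion of the vignette, then request the corresponding decision.
    messages.append({"role": "user",
                     "content": f"{section}\n\nPlease provide your {task}."})
    response = client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
    answer = response.choices[0].message.content
    messages.append({"role": "assistant", "content": answer})
    print(f"--- {task} ---\n{answer}\n")
```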

The Brigham team then compared ChatGPT’s accuracy across a number of domains, including differential diagnosis, diagnostic testing, final diagnosis, and management, in a structured, blinded process. The investigators then scored the accuracy, awarding points for correct answers, and used linear regressions to assess the relationship between ChatGPT’s performance and the demographic information of each vignette.
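As a rough illustration of that analysis step, the snippet below regresses a per-vignette accuracy score on vignette demographics using ordinary least squares. It is a minimal sketch under assumed column names and made-up numbers, not the authors’ code or data.

```python
# Illustrative only: regress vignette-level accuracy on demographics with OLS.
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical per-vignette scores (share of points awarded) and demographics.
df = pd.DataFrame({
    "accuracy":    [0.80, 0.65, 0.72, 0.90, 0.55, 0.78],
    "patient_age": [34, 67, 45, 29, 71, 52],
    "gender":      ["F", "M", "F", "M", "F", "M"],
})

# Linear regression of accuracy on age and gender, analogous to testing
# whether performance varies with each vignette's demographic information.
model = smf.ols("accuracy ~ patient_age + C(gender)", data=df).fit()
print(model.summary())
```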

The results showed ChatGPT was best at making a final diagnosis (77% accurate), while it lagged significantly in making differential diagnoses (60%) and was accurate in only 68% of clinical management decisions, such as choosing the best medication for a patient after making a diagnosis.

These results suggest that the chatbot is not quite ready for prime time in the doctor’s office, and the researchers noted that more benchmark research is needed, as well as guidance from regulatory agencies for tools such as these.

“ChatGPT struggled with differential diagnosis, which is the meat and potatoes of medicine when a physician has to figure out what to do,” said Succi. “That is important because it tells us where physicians are truly experts and adding the most value—in the early stages of patient care with little presenting information, when a list of possible diagnoses is needed.”

Succi and team will continue studying how artificial intelligence tools can aid in patient management and will next examine whether AI tools can play a role in resource-constrained areas within the hospital to improve patient care and outcomes.
