
Researchers from the Mass General Brigham healthcare system have shown that one third of ChatGPT responses to questions about breast, prostate, and lung cancer treatment include recommendations that are at least partially nonconcordant with National Comprehensive Cancer Network (NCCN) guidelines.

“Our results showed that the model is good at speaking fluently and mimicking human language, but a challenging aspect of that for health advice is that it makes it hard to detect correct versus incorrect information,” says corresponding author Danielle Bitterman, MD, of the Department of Radiation Oncology and the Artificial Intelligence in Medicine (AIM) Program of Mass General Brigham.

She tells Inside Precision Medicine this could be problematic because “if a response states that curative treatment is an option but in reality it is not, [it] could cause distress and emotional harm” to the patients.

“In addition, one error mode we found was that the incorrect answers suggested newer, more expensive treatments that were not approved for that cancer,” notes Bitterman. “This could similarly cause harm by misleading [patients into thinking] that a treatment option is available when [in] reality it is not. I could imagine a situation where, if [a ChatGPT] recommendation conflicts with physician recommendations, it could cause some mistrust and impact the patient-clinician relationship.”

Many patients use the internet for self-education on medical topics. With ChatGPT now at their fingertips, Bitterman and team assessed how consistently the artificial intelligence (AI) chatbot provides recommendations for cancer treatment that align with NCCN guidelines.

They created four questions, or prompt templates, to input to the GPT-3.5-turbo-0301 model via the ChatGPT (OpenAI) interface. Each question was combined with 26 diagnosis descriptions that included the cancer type (breast, prostate, or lung), with or without further relevant details such as whether the disease was localized or advanced, giving a total of 104 prompts. The questions they asked were:

  • What is a recommended treatment for [diagnosis description] according to NCCN?
  • What is a recommended treatment for [diagnosis description]?
  • How do you treat [diagnosis description]?
  • What is the treatment for [diagnosis description]?
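
For a concrete sense of how such a prompt grid comes together, here is a minimal Python sketch. It is illustrative only: the diagnosis descriptions below are hypothetical placeholders rather than the 26 used in the study, and the researchers submitted their prompts through the ChatGPT web interface rather than running a script.

    # Minimal sketch (not the study's code): pairing each question template
    # with each diagnosis description to build the full prompt grid.
    TEMPLATES = [
        "What is a recommended treatment for {dx} according to NCCN?",
        "What is a recommended treatment for {dx}?",
        "How do you treat {dx}?",
        "What is the treatment for {dx}?",
    ]

    # Hypothetical examples only -- the actual 26 descriptions covered breast,
    # prostate, and lung cancer, with or without details such as disease stage.
    DIAGNOSES = [
        "localized prostate cancer",
        "metastatic non-small cell lung cancer",
        "early-stage breast cancer",
    ]

    prompts = [t.format(dx=dx) for dx in DIAGNOSES for t in TEMPLATES]
    for prompt in prompts:
        print(prompt)

Pairing every template with every description in this way is what yields the 4 × 26 = 104 prompts reported in the study.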

The chatbot’s recommendations were compared with the 2021 NCCN guidelines because the model version tested only included knowledge up to September 2021. Guideline concordance was assessed against five criteria by three of four board-certified oncologists. An output did not have to recommend all possible regimens to be considered concordant, and in cases of disagreement among the three oncologists, the fourth, who had not previously seen the output, adjudicated.

The researchers report in JAMA Oncology that the chatbot provided at least one recommendation for 102 (98%) of the 104 prompts. All outputs with a recommendation included at least one NCCN-concordant treatment, but 35 (34.3%) also recommended one or more nonconcordant treatments.

Furthermore, 13 (12.5%) of the 104 outputs included responses that were “hallucinated” (i.e., were not part of any recommended treatment). Hallucinations were primarily recommendations for localized treatment of advanced disease, targeted therapy, or immunotherapy.

Bitterman says that overall, the results “were what we expected.” She points out that “ChatGPT and many of the similar large language models [LLMs] are trained primarily to function as chatbots, but they are not specifically trained to reliably provide factually correct information.”

The investigators also found that the oncologists disagreed in their interpretation of 38% of the outputs. Disagreements tended to arise when the output was unclear, for example, when it did not specify which of multiple treatments should be combined. This highlights the challenges of interpreting descriptive LLM output, they say.

Although the study only evaluated one model at one particular time, the results highlight the fact that “generalist large language models are not trained to provide medical advice, and currently cannot be relied on to do so,” says Bitterman.

She adds: “There is so much excitement and potential of AI in healthcare, but we need to carefully evaluate our models at each step and optimize them for the high-stakes clinical domain. There is too much at stake if we get this wrong; patient safety is paramount and early errors will set the field back and slow the potential gains.”

Going forward, the group will continue to research the potential of LLMs in oncology. “We are working [on] new methods for task-specific models that are specialized for clinical tasks,” says Bitterman. “We are also evaluating ways to put safety guards on these large models so that we can benefit from their strengths while minimizing the safety risks regarding response reliability, factuality, and stability.”
