A new analysis of an artificial intelligence (AI) chest X-ray foundation model for disease detection has found both race and sex bias in its reported findings, resulting in uneven performance across patient groups and making it potentially inaccurate and unsafe for clinical use in all populations. The study, which appears in the journal Radiology: Artificial Intelligence, serves as a warning about the risks of using foundation models to develop AI tools in medical imaging.
“There’s been a lot of work developing AI models to help doctors detect disease in medical scans,” said lead researcher Ben Glocker, PhD, professor of machine learning for imaging at Imperial College London in the U.K. “However, it can be quite difficult to get enough training data for a specific disease that is representative of all patient groups.”
Foundation models are increasingly used across AI applications. They are developed with the promise of extracting data features for building prediction models without the need for large volumes of task-specific training data. Some in healthcare have embraced foundation models because of the challenges of collecting large, high-quality data from diverse populations. Foundation models can be applied to both text and medical image analysis.
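In practice, “extracting features to create prediction models” typically means treating the foundation model as a frozen feature encoder and training only a lightweight classifier, sometimes called a linear probe, on its embeddings. The following is a minimal, generic sketch of that pattern; the embedding array and labels are random placeholders rather than anything from the study.

```python
# Hypothetical sketch: a "linear probe" on frozen foundation-model features.
# `embeddings` stands in for features produced by a pretrained encoder on chest X-rays;
# the data here is random and purely illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
embeddings = rng.normal(size=(5000, 1024))   # placeholder for encoder outputs
labels = rng.integers(0, 2, size=5000)       # placeholder diagnostic labels

X_train, X_test, y_train, y_test = train_test_split(
    embeddings, labels, test_size=0.2, random_state=0
)

# The encoder stays frozen; only this lightweight classifier is trained.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
print("held-out accuracy:", probe.score(X_test, y_test))
```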
But there is a caveat: “Despite their increasing popularity, we know little about potential biases in foundation models that could affect downstream uses,” Glocker said.
For their analysis, Glocker and his team compared the performance of a recently published chest X-ray foundation model against a reference model they had built, evaluating 127,118 chest X-rays with corresponding diagnostic labels. The pre-trained foundation model was created using more than 800,000 chest X-rays from patients in the U.S. and India.
Next, the team conducted a performance analysis using data from 42,884 patients, representing subgroups that included Asian, Black, and white patients, to determine how well both models performed. The patients had a mean age of 63 years, and 23,623 (55%) were male. Bias analysis found large differences in the features related to disease detection across both biological sex and race, with significant differences detected between male and female patients and between Asian and Black patients.
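The study's exact analysis code is not reproduced here, but a feature-level bias check of this kind can be sketched as follows. The sketch assumes a NumPy array of model feature embeddings and a parallel array of subgroup labels, and uses a per-dimension Welch's t-test purely for illustration; it is not the authors' published method.

```python
# Hypothetical sketch: compare learned feature distributions across two subgroups.
# `features` (n_samples, n_dims) and `groups` (n_samples,) are assumed inputs.
import numpy as np
from scipy import stats

def feature_gap_report(features, groups, group_a, group_b):
    """Per-dimension Welch's t-test between two subgroups' feature values."""
    a = features[groups == group_a]
    b = features[groups == group_b]
    t, p = stats.ttest_ind(a, b, axis=0, equal_var=False)
    # Count dimensions with a significant shift, Bonferroni-corrected.
    n_significant = int(np.sum(p < 0.05 / features.shape[1]))
    return {"dims_with_significant_shift": n_significant,
            "total_dims": features.shape[1]}

# Example usage with random stand-in data:
rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 128))
groups = rng.choice(["female", "male"], size=1000)
print(feature_gap_report(features, groups, "female", "male"))
```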
Compared with the average model performance across all subgroups, classification performance on the “no finding” label dropped between 6.8% and 7.8% for female patients, and performance in detecting pleural effusion, a buildup of fluid around the lungs, dropped between 10.7% and 11.6% for Black patients, highlighting race and sex bias.
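The subgroup performance comparison described above can likewise be illustrated with a short sketch. It assumes per-image ground-truth labels, model scores, and a subgroup column in a pandas DataFrame, and uses AUC as a stand-in metric; the column names and the metric are assumptions, not the study's exact protocol.

```python
# Hypothetical sketch: per-subgroup classification performance vs. the overall mean.
import pandas as pd
from sklearn.metrics import roc_auc_score

def subgroup_performance_gaps(df, label_col="y_true",
                              score_col="y_score", group_col="subgroup"):
    per_group = df.groupby(group_col).apply(
        lambda g: roc_auc_score(g[label_col], g[score_col])
    )
    overall_mean = per_group.mean()
    # Negative values indicate a subgroup performing below the average.
    return (per_group - overall_mean).sort_values()

# Example usage (stand-in data):
df = pd.DataFrame({
    "y_true":  [0, 1, 1, 0, 1, 0, 1, 0],
    "y_score": [0.2, 0.9, 0.7, 0.1, 0.4, 0.6, 0.8, 0.3],
    "subgroup": ["A", "A", "A", "A", "B", "B", "B", "B"],
})
print(subgroup_performance_gaps(df))
```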
Glocker noted that the analysis shows that “dataset size alone does not guarantee a better or fairer model.”
He emphasized the need for data collection to focus on ensuring diversity, while also noting that new foundation models need to be fully accessible to the research community so they can be analyzed for potential risks before being used to drive clinical decision making. He said bias analyses such as the one provided by his team should be an integral step in the development of foundation models.
“AI is often seen as a black box, but that’s not entirely true,” Glocker noted. “We can open the box and inspect the features. Model inspection is one way of continuously monitoring and flagging issues that need a second look. As we collect the next dataset, we need to, from day one, make sure AI is being used in a way that will benefit everyone.”