Consideration of Real Life Factors Needed to Improve AI-Driven Diagnostics

Telling sick people that they’re healthy can happen when a human doctor sees a patient. It also happens when artificial intelligence (AI) learns to diagnose disease. But giving an algorithm a big penalty for false negatives results in much better precision. [Graphic design by Therese van Wyk, University of Johannesburg. Based on Pixabay images.]

Artificial intelligence has advanced enormously in recent years and has the potential to greatly benefit healthcare, but real-life perspectives from patients and doctors must be taken into account when designing medical programs and algorithms if these tools are to achieve their full potential.

Medicine is not ‘black and white’: many different factors influence a patient, their disease, how quickly and accurately it is diagnosed, and how well subsequent treatment progresses. Artificial intelligence (AI), however, is binary in nature and not naturally well equipped to deal with the many factors that can come into play in a hospital or clinic setting.

Machine learning, a branch of AI that allows an algorithm to ‘learn’ from experience, is attempting to solve this issue. However, the way accuracy is defined by most current systems can cause problems in a medical setting, because in most cases far more people test negative than positive, something known as a class imbalance.

“Let’s say there is a dataset about a serious disease. The dataset has 90 people who do not have the disease. But 10 of the people do have the disease,” says Ibomoiye Domor Mienye, a post-doctoral AI researcher at the University of Johannesburg.

“As an example, a machine learning algorithm says that the 90 do not have the disease. That is correct so far. But it fails to diagnose the 10 that do have the disease. The algorithm is still regarded as 90% accurate,” he says.
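
In code, the arithmetic behind that hypothetical is straightforward. The sketch below simply recreates Mienye’s illustrative figures (they are not data from the study) to show how plain accuracy hides the missed patients:

```python
# Mienye's hypothetical in numbers: 100 patients, 10 of whom have the disease.
# A model that labels everyone "healthy" gets every healthy patient right
# and every sick patient wrong.
total_patients = 100
sick_patients = 10

correct_predictions = total_patients - sick_patients  # the 90 healthy patients
sick_patients_found = 0                                # all 10 cases are missed

accuracy = correct_predictions / total_patients    # 0.90
sensitivity = sick_patients_found / sick_patients  # 0.00

print(f"accuracy:    {accuracy:.0%}")     # 90% - looks good on paper
print(f"sensitivity: {sensitivity:.0%}")  # 0% - not one sick patient is found
```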

To try to improve the accuracy of machine learning for medical use, Mienye and Yanxia Sun, a professor at the University of Johannesburg, recently carried out a study on four different medical datasets to assess how changes to the algorithms could affect outcomes. They used logistic regression, decision tree, XGBoost, and random forest algorithms, which learn to make binary ‘yes/no’ predictions from the data they are given.

As described in the journal Informatics in Medicine Unlocked this month, the researchers added something called cost sensitivity into the algorithms. This means that the algorithm is penalized more harshly for falsely giving someone who is sick a negative diagnosis than falsely telling a healthy person they are sick.
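
The article does not spell out the researchers’ exact cost settings, but one common way to apply this kind of penalty is through class weights, where errors on the minority “disease” class count more heavily during training. The sketch below is an illustration under assumptions (synthetic data, an arbitrary 9:1 weighting, and scikit-learn defaults rather than the study’s configuration), using a random forest, one of the four algorithm types in the paper:

```python
# Illustrative cost-sensitive training via class weights (not the study's exact setup).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score

# Synthetic stand-in for an imbalanced medical dataset: roughly 10% positives.
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Standard model: every misclassification costs the same.
plain = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Cost-sensitive model: missing a sick patient (class 1) is weighted
# nine times more heavily than flagging a healthy one (weights assumed for illustration).
weighted = RandomForestClassifier(class_weight={0: 1, 1: 9},
                                  random_state=0).fit(X_train, y_train)

for name, model in [("plain", plain), ("cost-sensitive", weighted)]:
    sensitivity = recall_score(y_test, model.predict(X_test))  # recall on the sick class
    print(f"{name}: sensitivity = {sensitivity:.2f}")
```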

They found that adding this penalty made the algorithms significantly better at diagnosing disease, with precision and recall scoring between 94% and 100% depending on the dataset and the disease being diagnosed.

“The research is important because medical datasets are unique and should not be treated like other datasets, and algorithms that are generally applied to other datasets need to be adequately modified to take into account the class imbalance that exists in the medical datasets,” Mienye told Clinical Omics.

“AI professionals need to treat medical datasets differently, in order to develop models that are not just highly accurate but models with high precision and sensitivity. These performance evaluation metrics are more important than accuracy in medical diagnosis because they focus on identifying sick patients.”
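
As a rough illustration of why those metrics matter more than accuracy, the short sketch below scores a hypothetical model on made-up labels (not study data): it misses most of the sick patients, yet its accuracy still looks respectable while its recall does not.

```python
# Precision and recall (sensitivity) for the positive "sick" class vs. plain accuracy.
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [0] * 90 + [1] * 10           # 90 healthy patients, 10 sick
y_pred = [0] * 90 + [1] * 3 + [0] * 7  # the model finds only 3 of the 10 sick patients

print("accuracy :", accuracy_score(y_true, y_pred))   # 0.93 - looks strong
print("precision:", precision_score(y_true, y_pred))  # 1.00 - no false alarms
print("recall   :", recall_score(y_true, y_pred))     # 0.30 - most cases missed
```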

A recent systematic review published in The Lancet Digital Health brought to light another potential accuracy issue that medical AI tool developers need to be aware of.

The study analysed 21 publicly available skin cancer image datasets including more than 100,000 images. These datasets are commonly used to train and feed algorithms designed to diagnose skin cancer based on image analysis.

Co-senior author of the study, Rubeta Matin, a consultant dermatologist based in Oxford in the UK, explained that they uncovered some concerning findings. “Diagnosis of skin cancer normally requires a photo of the worrying lesion as well as a picture taken with a special hand-held microscope, called a dermatoscope, but only two of the 21 datasets included images taken with both of these methods suggesting that these datasets were not representing real-life practice. The datasets were also missing other important information, such as how images were chosen to be included, and evidence of ethical approval or patient consent.”

She added that only a very small percentage of the images included information about skin color (2,436 images) or ethnicity (1,585 images), that in this small subset only 11 images were from people with brown or black skin, and that none of the images stating ethnicity were from people with an African, Afro-Caribbean or South Asian background.

Although skin cancer is less common in people with darker skin, it does occur, and there is evidence to suggest that those who develop it can be at increased risk of more severe disease or death. One reason may be delayed diagnosis in this group, which this kind of bias in public datasets could exacerbate.

“We found that for the majority of datasets, a significant amount of important information about the images and the patients from whom these images were taken was not reported… This has implications for the AI technologies which are developed from these images, as it is difficult to determine which groups of people they will be effective for and in which settings,” says Matin.

“This can potentially lead to the exclusion or even harm of groups of people who aren’t well represented in datasets. It also means that if an AI tool which was trained on these biased datasets is then used in the general population it is not possible to be confident that the performance of the tool is generalizable to that population.”

Matin and her colleagues are calling for the creation of international standards that make it clear what should be included in a dataset to make it diverse and representative. She suggests that AI designers need to think carefully about the population their tool is aimed at, use a dataset that represents this population during the development of any medical diagnostic, and make sure the data is collected prospectively to avoid bias.

They also suggest that stakeholders and the public should be involved to ensure that data is collected more transparently, comprehensively and fairly and that clinical researchers and AI developers work to raise awareness of “health data poverty and inequalities in data representation.”

Both Mienye and Matin are keen for the use of AI in medicine to continue, but they encourage those involved to take all relevant clinical factors into account and to foster cross-disciplinary discussions between healthcare staff and researchers, patient representatives, and those working to develop this new generation of medical tools.

“The use of AI in healthcare is a good thing because not only does it help the patient or consumer to get highly precise healthcare, it also helps healthcare professionals better understand their patient’s condition, which enables them to provide better healthcare,” says Mienye.

“There are many areas of medicine that are very repetitive and could be improved by being standardized, and AI has the potential to address this,” agrees Matin, although she cautions that there is still work to be done. “The tools are developed in the absence of early input from clinicians and patients, and although they are scientifically interesting, they are not actually addressing an unmet clinical need.”