An AI model trained on sequential health information from electronic health records (EHRs) detected individuals with a 25-fold risk of developing pancreatic cancer within 3 to 36 months, according to results presented by Bo Yuan, a PhD candidate at Harvard University, at the American Association for Cancer Research (AACR) Annual Meeting 2022 (Abstract LB550).
Davide Placido, the lead author of the abstract, is a fellow at the University of Copenhagen. The team also included researchers from Dana-Farber Cancer Institute, Harvard T.H. Chan School of Public Health, Harvard Medical School, and the Broad Institute of MIT and Harvard.
Pancreatic cancer is a leading cause of cancer-related deaths worldwide and is increasing in incidence. The worldwide market for this condition is currently estimated at more than $3.71 billion. More than 60,000 Americans alone die from the disease each year.
Early diagnosis of pancreatic cancer is a major challenge, but patients who present with early-stage disease can be cured by a combination of surgery, chemotherapy, and radiotherapy. Thus, a better understanding of the risk factors for pancreatic cancer and detection at early stages have the potential to improve patient survival and reduce overall mortality.
In this study, researchers applied machine learning (ML) to the time sequence of clinical events in order to predict the risk of cancer occurrence over a multi-year time interval.
This investigation was initially carried out using the Danish National Patient Registry (DNPR), one of the world’s oldest nationwide hospital registries, which covers 41 years (1977 to 2018) of clinical records for 8.6 million patients. About 40,000 of these people had received a diagnosis of pancreatic cancer.
To maximize the predictive information extracted from these records, the researchers tested a range of ML methods, from regression methods and ML with or without time dependence to time series methods such as gated recurrent unit (GRU) and Transformer models. They trained machine learning models on the time sequence of diseases in patient clinical histories and tested their ability to predict cancer occurrence in time intervals of 3 to 60 months after risk assessment.
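As a rough illustration of this setup, the sketch below turns one patient's dated disease codes into a training example: the input is the sequence of codes recorded up to the risk assessment date, and the label marks whether cancer was diagnosed within a chosen horizon afterward. This is not the researchers' pipeline. The function names, the example ICD-10 codes, the dates, and the 30.44-day month approximation are all assumptions made for illustration.

```python
from datetime import date

def months_between(start: date, end: date) -> float:
    """Approximate months from start to end, assuming 30.44-day months."""
    return (end - start).days / 30.44

def make_example(events, cancer_date, assessment_date, horizon_months):
    """Build one (code sequence, label) pair for a given assessment date.

    events: list of (date, disease_code) tuples, in any order.
    cancer_date: date of cancer diagnosis, or None if never diagnosed.
    """
    # Only events recorded on or before the assessment date are visible.
    history = sorted((d, c) for d, c in events if d <= assessment_date)
    codes = [code for _, code in history]
    # Positive label: cancer diagnosed after assessment, within the horizon.
    label = 0
    if cancer_date is not None and cancer_date > assessment_date:
        if months_between(assessment_date, cancer_date) <= horizon_months:
            label = 1
    return codes, label

# Hypothetical patient history (illustrative codes and dates only).
patient_events = [
    (date(2010, 5, 1), "E11"),   # type 2 diabetes
    (date(2008, 3, 15), "E66"),  # obesity
    (date(2013, 1, 10), "K85"),  # acute pancreatitis, after the assessment
]
cancer_dx = date(2014, 6, 1)
assessment = date(2012, 1, 1)

codes_36, label_36 = make_example(patient_events, cancer_dx, assessment, 36)
codes_12, label_12 = make_example(patient_events, cancer_dx, assessment, 12)
```

With the same history, the 36-month horizon yields a positive label and the 12-month horizon a negative one, which is why performance is reported separately for each prediction interval.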
For cancer occurrence within 36 months, the performance of the best model (AUROC=0.88; OR=47.5 for 20% recall and OR=159.0 for 10% recall) substantially exceeded that of a model without time information, even when disease events within a three-month window before cancer diagnosis were excluded from training (AUROC[3m]=0.84). Independent training and testing on the Boston dataset reached comparable performance (AUROC=0.87; OR=112.0 for 20% recall and OR=162.4 for 10% recall).
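For readers unfamiliar with these metrics, the following minimal sketch computes an AUROC and an odds ratio at a fixed recall from a list of risk scores and true labels. The abstract does not spell out how the odds ratio was calculated, so the 2x2-table definition used here (flagged versus not flagged, cancer versus no cancer) is an assumption, as are the function names and the toy data.

```python
import math

def auroc(scores, labels):
    """Probability that a random positive outranks a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Count pairwise wins; ties count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def odds_ratio_at_recall(scores, labels, recall):
    """Odds ratio of the 2x2 table when the top-scored patients are flagged
    until the requested fraction of true cases is captured.
    Tied scores are not treated specially in this sketch."""
    n_pos = sum(labels)
    target_tp = math.ceil(recall * n_pos)
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    tp = fp = 0
    for _, y in ranked:
        if tp >= target_tp:
            break
        if y == 1:
            tp += 1
        else:
            fp += 1
    fn = n_pos - tp
    tn = (len(labels) - n_pos) - fp
    if fp == 0 or fn == 0:
        return math.inf  # the ratio is unbounded when a cell is empty
    return (tp * tn) / (fp * fn)

# Toy data: ten patients, four of whom develop cancer (label 1).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 0, 1, 0, 1, 0, 0, 1, 0, 0]

toy_auroc = auroc(scores, labels)
toy_or = odds_ratio_at_recall(scores, labels, 0.5)
```

Lowering the recall target restricts the flagged group to the highest-scoring patients, which is consistent with the larger odds ratios reported at 10% recall than at 20% recall.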
The team also extracted from the AI model an estimate of how much individual disease features, e.g., obesity and diabetes, contributed to the prediction.
The researchers said, “These results raise the state-of-the-art level of performance of cancer risk prediction on real-world data sets and provide support for the design of future screening trials for high-risk patients.”
They added that “AI on real-world clinical records has the potential to shift focus from treatment of late-stage to early-stage cancer, benefiting patients by improving lifespan and quality of life.”
The team anticipates increases in prediction accuracy with the availability of new data beyond disease codes, including prescriptions, laboratory values, and images.