A simple model, developed by U.K. researchers using machine learning, can accurately predict lung cancer risk and identify people who need screening using just three variables: age, smoking duration, and pack–years.
The University College London lung cancer death (UCL-D) and incidence (UCL-I) models performed as well as or better than current lung cancer risk prediction tools and offer a “novel approach that could simplify the implementation of risk-based lung cancer screening in multiple settings,” write Thomas Callender, from the Department of Respiratory Medicine at UCL, and co-authors in PLOS Medicine.
Callender tells Inside Precision Medicine that although lung cancer is the most common cause of cancer death worldwide and screening with low-dose computed tomography reduces mortality by around 20%, current approaches for risk prediction can be cumbersome.
The U.S. Preventive Services Task Force (USPSTF) recommends dichotomous criteria to select screening participants: age 50 to 80 years, at least 20 pack–years smoked, and, for former smokers, fewer than 15 quit–years.
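Because the USPSTF criteria are simple yes-or-no cut-offs, they can be written as a few lines of code. The following is a minimal sketch based only on the thresholds quoted above; the function name and the convention of passing `None` for current smokers are assumptions made for illustration, not part of any official tool.

```python
from typing import Optional


def meets_uspstf_2021(age: float, pack_years: float,
                      quit_years: Optional[float]) -> bool:
    """Apply the dichotomous criteria as described in this article.

    quit_years is None for a current smoker (a convention assumed here).
    """
    if not 50 <= age <= 80:
        return False
    if pack_years < 20:
        return False
    # Current smokers pass; former smokers must have quit < 15 years ago.
    return quit_years is None or quit_years < 15


print(meets_uspstf_2021(age=62, pack_years=30, quit_years=None))  # True
print(meets_uspstf_2021(age=62, pack_years=30, quit_years=20))    # False
```

A rule like this returns only "eligible" or "not eligible," whereas a risk model returns a probability that can be compared against a chosen threshold.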
In the U.K., eligibility for lung cancer screening pilot schemes is based on the PLCOm2012 and Liverpool Lung Project risk models, which together require 17 unique variables, most of which are not available in electronic health records and so not automatable. If a person meets a threshold on either of these two models, then they would be eligible for lung cancer screening.
“We worked out that if it took 5 minutes to ask a person these 17 questions, you would need about 48 staff working full-time for a whole year to do risk assessments on 1 million people,” says Callender. “In the U.K. we have approximately 7 million potentially eligible for screening [so] you’re asking clinical staff to collect lots of additional data, this becomes unfeasible unless we radically simplify or automate.”
He adds: “With our models, we consequently hope that we can make it more feasible to implement a risk-based program for lung cancer.”
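The staffing estimate Callender quotes can be checked with back-of-the-envelope arithmetic. The sketch below assumes a 37.5-hour working week and 46 working weeks per year; neither figure comes from the researchers, and both were chosen only as plausible values for a full-time post.

```python
def full_time_staff_needed(people: int,
                           minutes_per_person: float = 5.0,
                           hours_per_week: float = 37.5,   # assumed, not from the study
                           working_weeks: float = 46.0     # assumed, not from the study
                           ) -> float:
    """Full-time staff needed to risk-assess `people` within one year."""
    total_hours = people * minutes_per_person / 60.0
    hours_per_staff_year = hours_per_week * working_weeks
    return total_hours / hours_per_staff_year


print(round(full_time_staff_needed(1_000_000), 1))  # 48.3
```

Under these assumptions the result is close to the "about 48 staff" quoted for 1 million people, and it scales linearly with the number of people assessed.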
Callender and team used an ensemble machine learning method to create their final models. This method combines several different models, on the premise that each makes different kinds of mistakes, so their errors tend to cancel one another out. The combined model should therefore perform better than any individual one.
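The idea that errors cancel out can be illustrated with a toy simulation. The sketch below is not the researchers' method: it uses simple averaging, simulated risks, and five hypothetical "models" whose errors are independent random noise, all of which are assumptions made purely for illustration.

```python
import random
from statistics import fmean


def mean_squared_error(predictions, truth):
    return fmean((p - t) ** 2 for p, t in zip(predictions, truth))


def average_ensemble(model_predictions):
    # zip(*...) groups together the predictions made for the same person.
    return [fmean(per_person) for per_person in zip(*model_predictions)]


rng = random.Random(0)  # fixed seed so the simulation is repeatable

# Simulated "true" 5-year risks for 1,000 hypothetical people.
true_risk = [rng.uniform(0.0, 0.10) for _ in range(1000)]

# Five hypothetical models, each equal to the truth plus its own independent noise.
model_predictions = [
    [risk + rng.gauss(0.0, 0.02) for risk in true_risk]
    for _ in range(5)
]

ensemble = average_ensemble(model_predictions)
individual_errors = [mean_squared_error(p, true_risk) for p in model_predictions]
ensemble_error = mean_squared_error(ensemble, true_risk)

print("best single model error:", min(individual_errors))
print("ensemble error:         ", ensemble_error)
```

In this setup the averaged prediction has a lower error than any single model because the independent noise partly cancels. Real models tend to make correlated mistakes, so the gain in practice is usually smaller.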
UCL-D and UCL-I were developed to predict the 5-year risk for lung cancer death and lung cancer incidence, respectively, using data from 216,714 ever-smokers from the UK Biobank prospective cohort and 26,616 high-risk ever-smokers from the U.S. National Lung Screening randomized controlled trial.
During the development process, the team found that age, smoking duration, and pack–years of smoking were driving the predictions, with little value to additional predictors.
Using these three variables, the models “achieved or exceeded parity in discrimination, overall performance, and net benefit with comparators currently in use, despite requiring only one-quarter of the predictors,” the researchers report.
Indeed, in external validation among 49,593 participants of the U.S. Prostate, Lung, Colorectal and Ovarian (PLCO) Screening Trial, both UCL models had higher sensitivity for predicting the 5-year risk for lung cancer incidence or death than the USPSTF-2021 criteria, at an equivalent specificity.
Specifically, for UCL-I at a 5-year risk threshold of 1.17%, the sensitivity for predicting lung cancer incidence was 83.9% versus 77.7% with USPSTF-2021, a 6.2 percentage point gain.
With UCL-D at a 5-year risk threshold of 0.68%, sensitivity for lung cancer death was 85.5% compared with 77.5% with USPSTF-2021, a 7.9 percentage point increase.
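For readers unfamiliar with these terms, sensitivity is the share of people who went on to have the event and were flagged for screening, while specificity is the share of people without the event who were not flagged. The sketch below shows how both follow from a risk threshold. The six risks and outcomes are made-up values, and treating a risk equal to the threshold as "flagged" is an assumption.

```python
def sensitivity_specificity(risks, outcomes, threshold):
    """Return (sensitivity, specificity) when flagging risk >= threshold."""
    tp = fn = tn = fp = 0
    for risk, had_event in zip(risks, outcomes):
        flagged = risk >= threshold
        if had_event and flagged:
            tp += 1
        elif had_event:
            fn += 1
        elif flagged:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical predicted 5-year risks and observed outcomes (True = event occurred).
risks = [0.030, 0.015, 0.008, 0.020, 0.005, 0.002]
outcomes = [True, True, True, False, False, False]

sens, spec = sensitivity_specificity(risks, outcomes, threshold=0.0117)
print(sens, spec)
```

Lowering the threshold flags more people, which raises sensitivity at the cost of specificity. That trade-off is why the comparison above is made at an equivalent specificity.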
Furthermore, the researchers calculated that, at all risk thresholds, the net benefit of the UCL models was greater than screening using the USPSTF-2021 criteria and equivalent to or greater than the PLCOm2012 and Liverpool Lung Project risk models.
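Net benefit, as used in decision curve analysis, weighs true positives against false positives, with the penalty for a false positive set by the risk threshold. The sketch below uses the standard formula, net benefit = TP/n − (FP/n) × pt/(1 − pt). The counts in the usage example are hypothetical and are not taken from the study.

```python
def net_benefit(true_positives: int, false_positives: int,
                n: int, threshold: float) -> float:
    """Net benefit of a screening rule at risk threshold pt (0 < pt < 1)."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must be strictly between 0 and 1")
    # Odds of the threshold: how many false positives one true positive is "worth".
    weight = threshold / (1.0 - threshold)
    return true_positives / n - (false_positives / n) * weight


# Hypothetical counts for two selection rules applied to the same 10,000 people.
rule_a = net_benefit(true_positives=84, false_positives=3000, n=10_000, threshold=0.0117)
rule_b = net_benefit(true_positives=78, false_positives=3000, n=10_000, threshold=0.0117)
print(rule_a > rule_b)  # True: more cases found for the same number of false positives
```

A rule that screens no one has a net benefit of zero, so a positive value means the rule does better than not screening at that threshold.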
Callender believes that screening is the solution to diagnosing lung cancer earlier and improving survival, but “to capture the benefits, we need to screen the right people at the right point,” he says. “By making a model easier to use, we think that it will allow more avenues to design and democratize screening. The models could be done on an app, via text message, or on an online portal, such that the public can work with their clinicians to get the preventive care they need.”
The investigators are now looking at whether their simplified modeling approach could be applied to other conditions such as cardiovascular disease, diabetes, and chronic kidney disease “to support the implementation at scale of multiple concurrent risk-stratified prevention and early detection programs for major causes of morbidity and mortality,” they write.