Targeted cancer therapies can only be used in patients whose type of cancer can be identified, according to FDA rules. That leaves the 3% to 5% of cancer patients diagnosed with cancer of unknown primary (CUP) with few options and poor outcomes. Researchers at the Dana-Farber Cancer Institute developed an AI-based tool that uses tumor gene sequencing data to predict the primary source of a patient’s cancer, with hopes of improving diagnosis for this group of patients. The new tool is described in a study published in Nature Medicine.
“A core goal of the work was to identify CUP subtypes that had clinically meaningful differences in terms of survival and/or treatment response,” says Alexander Gusev, PhD, a Dana-Farber researcher and senior author of the paper.
Gusev and his colleagues not only used the tool, called OncoNPC, to identify subtypes, but also conducted a retrospective study of clinical data, showing that those patients who received therapies that aligned with the subtype predicted by the model had better outcomes than those who did not. “We saw this as an important gap in the prior studies, which have shown that cancers can be classified based on genomic sequencing alone but have generally not investigated whether these subtypes have an influence on patient prognoses,” Gusev adds.
In addition, the team found that the OncoNPC predictions would enable approximately 2.2 times as many CUP patients to be matched to approved targeted medicines. “This could open the door to more precision treatment for these patients,” Gusev says.
OncoNPC is short for Oncology NGS-based Primary cancer type Classifier. The research team trained and validated the classifier using the medical records of 36,445 patients with known primary tumors from three major cancer centers. The records contained tumor genetic sequencing data and clinical information for each patient. When tested on a subset of cases that had not been used as training data, OncoNPC correctly predicted the origin of about 80% of tumors with known types, including metastatic tumors. The model made high-confidence predictions for 65% of the tumors, meaning it assessed its prediction as having a high probability of being correct. Those predictions were 95% accurate.
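The relationship between overall accuracy, the share of high-confidence predictions, and accuracy within that share can be illustrated with a minimal Python sketch. This is not the study's code: the function names, the 0.6 confidence cutoff, and the five toy tumors are all assumptions chosen for illustration, and the resulting numbers are unrelated to the figures reported for OncoNPC.

```python
from typing import Dict, List, Optional, Tuple


def top_prediction(probs: Dict[str, float]) -> Tuple[str, float]:
    """Return the most probable cancer type and its probability."""
    label = max(probs, key=probs.get)
    return label, probs[label]


def summarize(
    predictions: List[Dict[str, float]],
    true_labels: List[str],
    threshold: float,  # hypothetical cutoff for "high confidence"
) -> Dict[str, Optional[float]]:
    """Compute overall accuracy, the high-confidence fraction,
    and accuracy restricted to high-confidence predictions."""
    total = len(predictions)
    correct = 0
    high_conf = 0
    high_conf_correct = 0
    for probs, truth in zip(predictions, true_labels):
        label, confidence = top_prediction(probs)
        is_correct = label == truth
        correct += is_correct
        if confidence >= threshold:
            high_conf += 1
            high_conf_correct += is_correct
    return {
        "overall_accuracy": correct / total,
        "high_conf_fraction": high_conf / total,
        # Undefined when no prediction clears the threshold.
        "high_conf_accuracy": (high_conf_correct / high_conf) if high_conf else None,
    }


# Toy held-out set (made-up probabilities, not data from the study).
toy_predictions = [
    {"lung": 0.90, "breast": 0.10},
    {"lung": 0.30, "breast": 0.70},
    {"lung": 0.55, "breast": 0.45},
    {"colorectal": 0.80, "lung": 0.20},
    {"lung": 0.52, "breast": 0.48},
]
toy_truth = ["lung", "breast", "breast", "colorectal", "lung"]

result = summarize(toy_predictions, toy_truth, threshold=0.6)
print(result)
```

In this toy set, restricting attention to predictions above the cutoff raises accuracy at the cost of covering fewer tumors, which is the same trade-off behind reporting a high-confidence subset separately.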
The researchers then applied OncoNPC to a separate database of 971 CUP tumors from patients seen at Dana-Farber, where a team of experts had already made a substantial effort to identify the primary source of each tumor. OncoNPC was able to predict the tumor’s origin with high confidence in 400 of the 971 cases (41.2%).
Gusev explains that the vision for OncoNPC is for it to be used in conjunction with the conventional pathology workflow for CUP, not as a replacement. “There is a lot of molecular information and clinical intuition that goes into these workups which is not captured in the somatic data. However, we believe the somatic data also offers the potential for additional refinement when the workup is not enough,” he says.
As in many genetic studies, the data used likely do not represent the true population of cancer patients: patients with more advanced disease are overrepresented (as these are the patients who get sequencing), while patients from minority populations or with limited healthcare access are underrepresented. “This is a critical challenge to generalizability and, in follow-up work, we are developing new algorithms that will train specifically to ensure consistent accuracy across any subpopulation,” Gusev notes.
Future work also includes incorporating additional data into OncoNPC, a step made possible by recent breakthroughs in large language modeling. “We hypothesize that by integrating unstructured data like pathology images and clinical notes, the AI can achieve a more holistic understanding of tumors,” Gusev says.