
There may be new hope in the hunt for reliable ways to detect cancer early on, an area where traditional diagnostic methods often prove inadequate.

By analyzing cell-free DNA (cfDNA) end-motifs, AI can now distinguish between cancer patients and healthy individuals, according to a research article published in npj Precision Oncology. A deep-learning model called end-motif inspection via transformer (EMIT) was created using thousands of samples from various studies that included cfDNA sequencing for hepatocellular carcinoma, colorectal cancer, non-small cell lung cancer, and esophageal carcinoma. Tested on whole-exome sequencing data from both lung cancer and non-cancer patients, EMIT demonstrated strong classification capabilities. Developed by researchers at Tianjin Medical University, EMIT is an advance toward a standardized deep-learning method for identifying cfDNA fragmentome end-motifs.

Using cfDNA for cancer diagnosis

Linear cfDNA fragments, which are approximately 167 bp in size, exhibit non-random fragmentation patterns that reflect the physiological conditions of cancer. The preferential distribution of cfDNA throughout the genome can be investigated to detect cancer via liquid biopsy. However, developing computational methods for cfDNA analysis remains a significant obstacle, and one that cfDNA-based cancer diagnosis urgently needs to overcome. The traditional bioinformatics approach involves steps such as read mapping, detecting copy number changes, and analyzing fragmentome characteristics, which can be tedious and error-prone. The complexity of this pipeline significantly hinders its broad adoption.

Several cancer types can be identified by training machine learning models on the genome-wide fragmentation properties of cfDNA sequenced with low-coverage whole-genome sequencing. End-motif profiling of plasma cfDNA, for example, is emerging as a marker in hepatocellular carcinoma (HCC), following research indicating that end-motifs differ between HCC and non-HCC patients. Prior studies suggest that people with HCC have a more diverse set of plasma cfDNA end-motifs, and that cfDNA from the liver is more likely to end at specific genomic positions than cfDNA from other sources. Patients with HCC showed distinct nonrandom distributions of cfDNA at specific genomic coordinates compared to liver transplant recipients and hepatitis B patients.
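To make "a more diverse set of end-motifs" concrete, the sketch below scores a motif frequency profile by its normalized Shannon entropy. This is an illustrative assumption about how diversity could be measured, not necessarily the metric used in the research described above; the function name `motif_diversity` and the default of 256 possible motifs (all 4-base combinations) are hypothetical choices.

```python
import math

def motif_diversity(freqs, n_motifs=256):
    """Normalized Shannon entropy of an end-motif profile.

    freqs: dict mapping motif -> count or frequency (hypothetical input format).
    n_motifs: number of possible motifs (assumed 4**4 = 256 for 4-base motifs).
    Returns a value between 0 (one motif dominates) and 1 (all motifs equally common).
    """
    total = sum(freqs.values())
    if total <= 0:
        raise ValueError("profile must contain at least one observation")
    entropy = 0.0
    for value in freqs.values():
        p = value / total
        if p > 0:  # motifs that never occur contribute nothing
            entropy -= p * math.log(p)
    return entropy / math.log(n_motifs)

# A profile dominated by one motif scores low; an even profile scores high.
skewed = {"CCCA": 97, "CCAG": 2, "CCTG": 1}
even = {f"m{i}": 1 for i in range(256)}
print(round(motif_diversity(skewed), 3), round(motif_diversity(even), 3))
```

Under this kind of score, a sample with a broader spread of end-motifs would receive a higher value than one concentrated on a few motifs.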

End-motifs encode cancer features

To improve early detection across cancer types, co-lead authors Hongru Shen, Meng Yang, and Jilei Liu developed a deep-learning-based end-to-end method that simplifies cfDNA analysis. As shown in this study, EMIT uses a conceptually simple and empirically powerful self-supervised method to learn representations of cfDNA end-motifs, which lets it represent diverse genomes from different sequencing platforms. EMIT was designed to streamline analysis by limiting its inputs to end-motif rankings, which can be computed efficiently from raw sequencing data. Consequently, tedious steps such as sequence mapping, evaluating changes in copy number, and identifying mutations become unnecessary.
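A minimal sketch of how end-motif rankings might be computed directly from raw reads, without any mapping step. It is an illustration under stated assumptions, not EMIT's actual preprocessing: the motif length of 4 bases, the use of the first bases of each read as the fragment end, and the function name `end_motif_ranking` are all hypothetical.

```python
from collections import Counter
from itertools import product

def end_motif_ranking(reads, k=4):
    """Rank all 4**k possible end-motifs by how often they start a read.

    reads: iterable of read sequences (strings); k=4 is an assumed motif length.
    Returns (ranked_motifs, frequencies).
    """
    all_motifs = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter()
    for read in reads:
        motif = read[:k].upper()
        # Skip reads that are too short or contain ambiguous bases such as N.
        if len(motif) == k and set(motif) <= set("ACGT"):
            counts[motif] += 1
    total = sum(counts.values())
    freqs = {m: (counts[m] / total if total else 0.0) for m in all_motifs}
    # Most frequent first; ties are broken alphabetically so the result is deterministic.
    ranked = sorted(all_motifs, key=lambda m: (-freqs[m], m))
    return ranked, freqs

# Toy reads: the fourth contains an N and the fifth is too short, so both are skipped.
reads = ["CCCAGTTA", "CCCATTGA", "CCAGGGTT", "NCCATTTT", "ACG"]
ranked, freqs = end_motif_ranking(reads)
print(ranked[:2], round(freqs["CCCA"], 3))
```

Because the only inputs are the leading bases of each read, a profile like this can be built in a single pass over the sequencing file.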

EMIT was created using data from 4,606 plasma cfDNA samples collected using various sequencing methods. Although EMIT was developed using only end-motif frequencies and no cancer state information, Shen, Yang, and Liu found that its representations encode cancer-discriminatory features. When applied to six datasets produced by various sequencing methods, EMIT demonstrated excellent classification performance in cancer detection. Linear projections of EMIT representations also performed well at identifying lung cancer in a separate cfDNA testing set from whole-exome sequencing.

One disadvantage of using only end-motif rankings as input to EMIT is that it disregards other information shown to aid cancer detection, such as size profile, aberrant coverage, preferred end coordinates, and somatic mutations. In addition, tumor-derived cfDNA is scarce, particularly in early-stage cancer patients: once cancerous material enters the bloodstream and mixes with signals from healthy cells, the cancer signal is greatly diminished. Enriching tumor-derived cfDNA by excluding background cfDNA based on the distribution of size profiles may help boost tumor signals.
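The size-based enrichment idea can be sketched as a simple length filter. The 90 to 150 bp window below is an assumed, illustrative cutoff rather than a value from this study, and `select_short_fragments` is a hypothetical helper name.

```python
def select_short_fragments(fragment_lengths, min_bp=90, max_bp=150):
    """Keep only fragments whose length falls inside an assumed size window.

    fragment_lengths: iterable of fragment sizes in base pairs.
    min_bp, max_bp: illustrative bounds, not taken from the EMIT study.
    """
    return [n for n in fragment_lengths if min_bp <= n <= max_bp]

# Toy fragment sizes: most background cfDNA sits near 167 bp and is filtered out.
lengths = [167, 145, 166, 120, 330, 89, 150]
kept = select_short_fragments(lengths)
print(kept, f"{len(kept) / len(lengths):.0%} of fragments retained")
```

A filter like this trades sequencing depth for a higher share of the fragments of interest, so the window would need to be tuned against real data.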
