The Age of Analytics: Sequencing’s New Frontier Is Clinical Interpretation

David FitzPatrick’s typical patient is a toddler with global developmental delay who is small for her age, has epilepsy, and cannot speak or understand speech. Often there is no family history of similar symptoms, and the parents report no problems with the pregnancy or birth. The child will have undergone all the standard testing for metabolic disease, had a brain MRI, and been tested for chromosomal anomalies. All test results are normal. FitzPatrick and his colleagues will then turn to next-generation sequencing (NGS) to try to find an underlying cause of the child’s condition. And for some of these patients, they’ll succeed.

While those types of patients have continued to come through his door regularly, FitzPatrick has seen one major change over time. “We’re able to diagnose a lot more of them,” he said. “Now, I even have patients I’ve been seeing for 15 years without being able to give them an answer. We take the original samples, retest them, and we finally have a diagnosis for them.” FitzPatrick, Ph.D., is a professor in the Medical Research Council Institute of Genetics and Molecular Medicine at the University of Edinburgh, and he sees patients at the Royal Hospital for Sick Children in that city. Currently, his team can perform a whole-exome analysis in 3–4 weeks, and “we have a 50% chance of finding a full explanation for any one of these children’s problems,” he added. That’s compared to about a 30% chance just a couple of years ago.

It’s a tipping point in the clinical application of genetic sequencing. Sequencing is now much faster and easier to perform, and its cost has dropped dramatically. Prenatal testing for genetic diseases has exploded and is estimated to become a $5 billion global market by 2027. Approximately 250 new rare diseases are identified each year. In oncology, likewise, the number of clinically meaningful variants has risen to more than 500. More genes that raise the risk of complex diseases are also being discovered.

Today, more genomic data is being generated, at a faster rate, than ever before. And those on the frontier of this field are trying to make sure that data is as useful as possible. “Embedding genomics into clinical practice is our goal, vision, and shared passion,” Garret Hampton, senior vice president of clinical genomics at Illumina, said recently.

While the surge in sequencing has benefited many patients, the genomic data avalanche has caused its own problems.

Now, many experts see data analysis and interpretation as the biggest challenge in genomics-guided precision medicine. This is particularly true when researchers are trying to do large studies, rather than looking at one or two specific genes. “While the hypothesized $1,000, or more recently $100, whole genome continues to get a lot of attention within our industry—there is little discussion of the reality of the $10,000 interpretation that accounts for the true level of effort associated with analyzing, interpreting, and reporting out of insightful and actionable results for large datasets,” said Sean P. Scott, vice president of business development at QIAGEN.

And yet, doing those large-scale studies is exactly what is needed to power the next clinical advances. Heidi Rehm, Ph.D., explained: “The infrastructure and resources we need to analyze an entire genome with 5 million variants is very different from what is needed to find and interpret a couple of variants in a single gene.” Rehm is the medical director of the genomics platform at the Broad Institute and chief genomics officer in the department of medicine at Massachusetts General Hospital.

It’s not just a matter of making sure the correct diagnoses are made; it’s also about ensuring that analysis is done in a standard way, from the point where the data leaves the sequencer to when the diagnosis is determined. Ideally, standards and methods would also be consistent across diverse organizations, labs, and clinical sites. That way, whatever is gleaned can be as useful as possible, both to those generating it and to others who may be able to learn from it.

Pipeline progress

“We see continued variability in secondary analysis pipelines—especially research-oriented pipelines—because these workflows are developed using a variety of public and proprietary data sources, and individual labs can set different quality thresholds for the variant filtering, which impacts the final variant list,” Scott noted. QIAGEN, he pointed out, was one of the first companies to standardize and commercialize secondary analysis pipelines for both research and clinical applications, providing robust pipelines and tool sets that labs can still customize. Those products are CLC Biomedical Workbench for research and QIAGEN Clinical Insight (QCI) Analyze for clinical applications.

There are also plenty of open-source tools. The Broad Institute, for example, maintains the Genome Analysis Toolkit (GATK) for variant calling, which is commonly paired with the Burrows-Wheeler Aligner (BWA) for aligning short reads. The University of Maryland produced another short-read aligner, Bowtie. But without bioinformatics expertise, using open-source tools can be daunting. Commercial makers of such tools typically provide support and, as a result, their products have gained in popularity. The advent of some industry standards (e.g., the ACMG variant-interpretation guidelines) has also helped.
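
To make that workflow concrete, here is a minimal sketch, in Python, of the kind of open-source secondary-analysis pipeline these tools support: align reads with BWA-MEM, sort with samtools, and call germline variants with GATK’s HaplotypeCaller. The file names and read-group string are placeholders, and a production clinical pipeline would add duplicate marking, base-quality recalibration, and extensive quality control.

```python
# Minimal secondary-analysis sketch: BWA-MEM alignment followed by GATK
# HaplotypeCaller. All file names are placeholders; the reference FASTA is
# assumed to be indexed (bwa index, samtools faidx) and to have a GATK
# sequence dictionary (.dict).
import subprocess

REF = "GRCh38.fa"
R1, R2 = "sample_R1.fastq.gz", "sample_R2.fastq.gz"
BAM = "sample.sorted.bam"
VCF = "sample.vcf.gz"

# 1. Align paired-end reads and coordinate-sort the output.
#    The @RG read group is required downstream by GATK.
bwa = subprocess.Popen(
    ["bwa", "mem", "-R", r"@RG\tID:sample\tSM:sample\tPL:ILLUMINA", REF, R1, R2],
    stdout=subprocess.PIPE,
)
subprocess.run(["samtools", "sort", "-o", BAM, "-"], stdin=bwa.stdout, check=True)
bwa.stdout.close()
if bwa.wait() != 0:
    raise RuntimeError("bwa mem failed")

# 2. Index the BAM and call germline small variants.
subprocess.run(["samtools", "index", BAM], check=True)
subprocess.run(["gatk", "HaplotypeCaller", "-R", REF, "-I", BAM, "-O", VCF], check=True)
```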

Accurate variant interpretation during tertiary analysis requires high-quality curated content and is one of the most challenging steps in delivering a diagnosis. Two of the most popular sources for such content are the Human Gene Mutation Database (HGMD) and ClinVar. “These are the gold standards for interpretation,” said Scott. Because of this, many commercial interpretation platforms incorporate both databases, including QIAGEN Clinical Insight, which also draws on QIAGEN’s Ingenuity Knowledge Base.
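
As a rough illustration of how such curated content gets used (a sketch, not any vendor’s implementation), the snippet below joins called variants against a local copy of the ClinVar VCF by chromosome, position, reference, and alternate allele, and pulls out the CLNSIG clinical-significance field; the example coordinates are hypothetical.

```python
# Sketch: look up ClinVar clinical significance (CLNSIG) for called variants
# by exact (chrom, pos, ref, alt) match against a local clinvar.vcf.gz.
# Real interpretation platforms also normalize variants and layer on
# additional curated sources such as HGMD.
import gzip

def load_clinvar(path="clinvar.vcf.gz"):
    """Return a dict mapping (chrom, pos, ref, alt) -> CLNSIG string."""
    significance = {}
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue
            chrom, pos, _vid, ref, alt, _qual, _filt, info = line.rstrip("\n").split("\t")[:8]
            fields = dict(kv.split("=", 1) for kv in info.split(";") if "=" in kv)
            if "CLNSIG" in fields:
                for allele in alt.split(","):
                    significance[(chrom, int(pos), ref, allele)] = fields["CLNSIG"]
    return significance

clinvar = load_clinvar()
# Hypothetical called variant; prints the ClinVar assertion if one exists.
print(clinvar.get(("17", 43045712, "G", "A"), "not_in_ClinVar"))
```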

Fabric Genomics offers another analysis platform that has gained traction in the past couple of years. It can take data all the way from raw sequence to a physician-ready report, but the entry point “depends on the customer,” said Francisco M. De La Vega, senior vice president of research and development. “Sometimes the customer wants to assemble the reads and identify genomic variants themselves, and then go to our secure cloud platform to use our proprietary genome interpretation tools for the rest of the process.”

Fabric has compiled a database of 50,000 genomes and exomes from private sources to train the algorithms that perform its analysis. The company’s leading algorithms are VAAST and Phevor, which were co-developed with Mark Yandell, Ph.D., professor of human genetics at the University of Utah. VAAST ranks genes based on how likely they are to cause disease. Phevor then uses Human Phenotype Ontology (HPO) terms describing the patient’s phenotype to re-rank those genes. “Over 50 publications have found that VAAST can find new genes and Phevor improves the prioritization of genes involved in any monogenic disease phenotype,” he said.
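
The general idea behind that kind of phenotype-driven re-ranking can be sketched in a few lines. The toy example below is not VAAST or Phevor themselves, just an illustration that weights a variant-based gene score by the overlap between a patient’s HPO terms and the terms annotated to each gene; all scores, gene names, and annotations are invented.

```python
# Toy phenotype-driven re-ranking (illustrative only, not VAAST/Phevor):
# combine a variant-based gene score with Jaccard overlap between the
# patient's HPO terms and terms annotated to each candidate gene.

gene_scores = {"GENE_A": 0.80, "GENE_B": 0.75, "GENE_C": 0.60}  # hypothetical variant scores

gene_hpo = {  # hypothetical gene-to-phenotype annotations
    "GENE_A": {"HP:0001250"},                    # seizure
    "GENE_B": {"HP:0001250", "HP:0001263"},      # seizure, global developmental delay
    "GENE_C": {"HP:0000252"},                    # microcephaly
}

# Patient phenotype expressed as HPO terms (seizure, delay, delayed speech).
patient_hpo = {"HP:0001250", "HP:0001263", "HP:0000750"}

def rerank(scores, annotations, phenotype):
    """Re-rank genes by variant score weighted by phenotype overlap."""
    ranked = []
    for gene, score in scores.items():
        terms = annotations.get(gene, set())
        overlap = len(terms & phenotype) / len(terms | phenotype) if terms else 0.0
        ranked.append((gene, score * (1.0 + overlap)))
    return sorted(ranked, key=lambda x: x[1], reverse=True)

# GENE_B overtakes GENE_A once the phenotype match is taken into account.
for gene, combined in rerank(gene_scores, gene_hpo, patient_hpo):
    print(f"{gene}\t{combined:.2f}")
```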

Clinicians are also finding utility in the analysis of structural variants (SVs), including deletions, insertions, duplications, inversions, and translocations at least 50 base pairs long. “We used to use microarrays and other technologies to find structural variants, but now whole-genome sequencing is proving to be more sensitive and robust,” said Stephen Kingsmore, M.D., president and CEO of Rady Children’s Institute for Genomic Medicine. But groups are using different algorithms to call structural variants. “There is no gold standard yet,” he added. “But the genomics community is working to help establish the right standards and benchmarks.”

A recent paper in Genome Biology seeks to tame that Wild West. Researchers at the RIKEN Center for Integrative Medical Sciences in Yokohama, Japan, led by first author Shunichi Kosugi, published a “Comprehensive evaluation of SV detection algorithms for whole genome sequencing,” in which they evaluated 69 existing structural variant detection algorithms on multiple simulated and real whole-genome sequencing datasets. Their conclusion was that “careful selection of the algorithms for each type and size range of SVs is required for accurate calling of SVs.”
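
The flavor of such benchmarking can be conveyed with a small sketch: match called deletions against a truth set using 50% reciprocal overlap and compute precision and recall. Both call sets below are invented, and real evaluations, including the RIKEN study, also stratify by SV type, size range, and breakpoint accuracy.

```python
# Sketch of SV benchmarking: match called deletions to a truth set by
# 50% reciprocal overlap, then report precision and recall.

def reciprocal_overlap(a, b, threshold=0.5):
    """True if intervals a and b (start, end) overlap by >= threshold of each."""
    ov = min(a[1], b[1]) - max(a[0], b[0])
    return ov > 0 and ov / (a[1] - a[0]) >= threshold and ov / (b[1] - b[0]) >= threshold

truth = [(10_000, 12_500), (50_000, 50_800), (90_000, 95_000)]   # known deletions (hypothetical)
calls = [(10_050, 12_400), (50_100, 50_900), (70_000, 70_300)]   # one algorithm's output (hypothetical)

tp = sum(any(reciprocal_overlap(c, t) for t in truth) for c in calls)
precision = tp / len(calls)
recall = sum(any(reciprocal_overlap(t, c) for c in calls) for t in truth) / len(truth)
print(f"precision={precision:.2f} recall={recall:.2f}")
```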

But most of these analysis tools are not easy to use. “Most physicians cannot do this on their own,” Rehm says. “A lot of expertise is needed to go from the sequence data to interpreting and validating your results.” That requires bioinformaticians and geneticists with specialized knowledge.

“The folks who help us do the analysis are called clinical genomic analysts,” said Shimul Chowdhury, clinical laboratory director at Rady Children’s. “They pass the analysis to the laboratory directors, who are board-certified geneticists. We also have multiple genetic counselors on our team.” Each case is handled by two to three people before it reaches the treating physician. The team combs the literature to learn about implicated variants and then creates a report that is “digestible” to non-specialists, as Chowdhury puts it.

Another helpful advance has been cloud-based platforms, which allow companies to scale up the amount of data they work with and to collaborate securely with partners anywhere. DNAnexus offers an end-to-end genome informatics solution with access to tools and datasets for clients to build and run their own analysis workflows. Researchers can stream NGS data from their instruments to the cloud and work securely there with collaborators around the globe. “We built the infrastructure to take the data from the sequencer, process and analyze it, and get it through tertiary analysis,” said Richard Daly, CEO of DNAnexus.

A wide range of tools, including commercial products and home-grown solutions, can operate in this cloud environment, which becomes increasingly important as the amount of data being generated grows. “The unit volume is currently highest in prenatal testing,” Daly said. But as NGS becomes more commonplace, platforms like DNAnexus’ will come into wider use. The company is currently working with more than 30 organizations, many of whom, Daly reports, are using thousands of servers to process tens of thousands of samples per year.

Other companies have invested in building knowledgebases that can inform the process of reaching a diagnosis. Fabric, as mentioned above, has a database of more than 50,000 genomes and exomes to train its algorithms. And at QIAGEN, the company has been building for more than 20 years what Scott describes as “the industry’s largest knowledge base of biomedical knowledge, evidence and reference sets to support a broader and deeper understanding of complex biology systems and evidence-based analysis and interpretation.” This effort is tied to the company’s belief that evidence-based “interpretation is, and will continue to be, the rate limiting factor for broad adoption of high-throughput sequencing platforms.”

Foundation Medicine, which offers specialized cancer screening panels, sequences thousands of patients per week. Using these data, the company has created what it describes as “one of the largest consolidated genomic profiling knowledgebases in the country.” Its FoundationCORE database contains more than 300,000 cases and is continually updated. “Our tests are designed to be as evergreen as possible,” says David Fabrizio, vice president of product development. That means markers will be added to and dropped from its panels based on accrued evidence. At this point, “about a third of cases who get tested with the CDx panel end up with a match to an approved therapy,” Fabrizio says. “About 80% overall get some type of recommendation.” Those with no actionable genetic marker are usually better candidates for traditional chemotherapy.

An expanding horizon

While advances in sequencing instruments will continue to impact the sequencing field, “there has been a major industry shift from a focus on instrumentation to insight, with more emphasis and resources going towards the informatics phase of NGS,” said Scott.

One of the major challenges for NGS will be addressing complex disorders. Most inherited complex diseases, such as autism or arthritis, involve more than one gene and have variable phenotypes. With NGS, researchers can test a large number of genes simultaneously in a cost-effective manner, whether by using targeted gene panels or by performing whole-exome sequencing (WES) or whole-genome sequencing (WGS). However, even as the cost of sequencing drops, large numbers of patients will likely be needed to conduct such studies successfully.

But the use of sequencing in the clinic is expected to continue growing in prenatal screening, rare diseases, and oncology as its use becomes more accepted and broadly adopted. A recent study from Color, for example, found that a large proportion of patients at risk of cancer are not being tested because they don’t meet current guideline criteria. From a pool of more than 23,000 patients tested, the researchers found that more than 2,500 individuals had positive results for cancer risk, although they did not meet clinical guidelines for testing. Most of those results were pathogenic variants in BRCA1 and BRCA2. (Neben et al. Journal of Molecular Diagnostics, 2019)

Researchers and clinicians are also incorporating more types of data alongside sequencing results. SOPHiA for Radiomics analyzes medical images and combines them with biological and clinical data to predict tumor evolution. It’s designed to go beyond the usual RECIST and PERCIST criteria, to deliver finer biomarkers and support clinicians in matching cancer patients to treatments. SOPHiA’s artificial intelligence platform (see below) uses techniques such as statistical inference, pattern recognition, and machine learning.

Foundation Medicine, meanwhile, partnered with Flatiron Health to create a clinico-genomic database. A study recently published in JAMA (May 28, 2019) based on that partnership demonstrated the potential of real-world data for improving personalized cancer care. It looked at associations between tumor genomics and patient characteristics, as well as clinical outcomes, in 4,064 patients with non-small cell lung cancer (NSCLC).

Rehm is a principal investigator for the NIH-funded ClinGen and a vice chair of the Global Alliance for Genomics and Health (GA4GH). She continues to focus on data sharing, both through centralized databases such as ClinVar and through federated platforms. “ClinVar allows groups to share and compare interpreted variants, whereas evidence, such as patient data, is best shared through federated platforms where we access data through APIs using standards being developed by GA4GH.” Like ClinVar, gnomAD is another collaborative database with information about variants. ClinGen convenes experts around the world to develop standards and consensus in the interpretation of genes and variants.
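
To give a sense of what federated, API-based access looks like, here is a sketch of an allele-presence query in the style of the GA4GH Beacon API: the remote node answers whether it has observed a given allele without exposing patient-level data. The endpoint URL is hypothetical, the coordinates are arbitrary, and a real deployment adds authentication and dataset scoping.

```python
# Sketch of a federated, Beacon-style allele query (GA4GH). The endpoint is
# hypothetical; a real node would require authentication and return richer
# metadata alongside the boolean answer.
import json
import urllib.parse
import urllib.request

BEACON_URL = "https://beacon.example.org/query"   # hypothetical node

params = {
    "assemblyId": "GRCh38",
    "referenceName": "17",
    "start": 43045711,            # 0-based position (arbitrary example)
    "referenceBases": "G",
    "alternateBases": "A",
}

with urllib.request.urlopen(f"{BEACON_URL}?{urllib.parse.urlencode(params)}") as resp:
    result = json.load(resp)

# A Beacon answers with an `exists` flag rather than raw genotypes.
print("allele observed:", result.get("exists"))
```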

There’s also hope that wider sharing will improve the ethnic diversity of the data. A number of ethnic populations are poorly represented in current databases, even as evidence grows of substantive genetic differences between populations that drive diseases and disease progression. “There’s a debate about whether we should have multiple reference genomes to deal with regions with major structural differences,” Rehm said. Kingsmore strongly favors more sequencing of these underrepresented populations. “Thus far, the majority of sequencing has been disproportionately performed on the Caucasian population,” he said. “Ideally, we need to make sure that all populations have access to this testing and we need multiple reference genomes that include diverse populations to make sure the analysis is robust for individuals of all backgrounds.”

More advanced analytical tools are also coming into use. Artificial intelligence (AI) and machine learning (ML) are two of the most talked-about approaches right now. SOPHiA DDM integrates SOPHiA AI, which detects, annotates, and pre-classifies all types of genomic variants, including SNVs, indels, CNVs, amplifications, and fusions, in a single experiment. Its clinical-grade solutions are used to detect and characterize variants associated with multiple conditions, including cancers and hereditary diseases. The company reports that more than 970 healthcare organizations from around the world have adopted SOPHiA’s platform, which “learns” from all the data contributed by user-members while securing data privacy.

But the ultimate goal is push-button diagnosis. “In the future all this will be automated,” said Kingsmore. His team recently published a new rapid-fire approach to genetic diagnosis (Clark et al. Science Translational Medicine, 2019). The Rady team applied automated machine learning and clinical natural language processing (CNLP) to reduce the need for labor-intensive manual analysis of genomic data. This work was done in collaboration with technology and data-science developers Alexion, Clinithink, Diploid, Fabric Genomics, and Illumina.
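
The phenotype-extraction step can be illustrated with a deliberately simple sketch: map free-text phrases from a clinical note to HPO terms with a keyword lookup. The Rady pipeline relies on a commercial CNLP engine for this; the tiny lexicon and note text below are invented and only convey the idea.

```python
# Toy CNLP illustration: keyword lookup from clinical text to HPO terms.
# Real engines handle synonyms, negation, and context; note how this naive
# lexicon misses "small head circumference" (microcephaly).

HPO_LEXICON = {                      # tiny, hand-built synonym table (invented)
    "seizure": "HP:0001250",
    "developmental delay": "HP:0001263",
    "microcephaly": "HP:0000252",
    "cannot speak": "HP:0000750",    # delayed speech and language development
}

note = ("2-year-old girl with global developmental delay, "
        "daily seizures, small head circumference; cannot speak.")

terms = {hpo for phrase, hpo in HPO_LEXICON.items() if phrase in note.lower()}
print(sorted(terms))   # phenotype terms handed to downstream variant ranking
```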

Kingsmore and others believe that a driving force for the continued maturation of clinical genome analysis will be Illumina’s acquisition of Edico Genome in 2018. Edico has been a leading provider of data analysis solutions for NGS. Its algorithms are expected to help improve and accelerate the analysis process for Illumina customers. “Today people have to knit together all these public and commercial tools and it still takes weeks for most of them to generate a diagnosis,” Kingsmore says. He envisions a day when something that sits on the sequencer automatically delivers the diagnosis. “That is where Illumina is evolving,” he said. “They produce results, not sequences.”

Rady is ahead of the curve, having set up its lab specifically to do speedy genetic analysis. The group has delivered genetic diagnoses to approximately 1,000 children since 2016. “It wasn’t long ago when it took us weeks to do just the interpretation step,” said Chowdhury, Kingsmore’s colleague. “Now we can do a tertiary analysis in a single day.”

That’s what patients all over the world are waiting for—a time when anyone with a condition that is related to a genetic cause can walk into a doctor’s office and get a dependable diagnosis within days rather than weeks, months, or even years.

Building a New Data Tool

As the applications and use of NGS have expanded, new analysis and interpretation tools have been needed. For example, FitzPatrick and his collaborators built VEP-G2P, an extension of the Ensembl Variant Effect Predictor (VEP), a popular, publicly available program used to predict the likely effects of a given variant. VEP-G2P was built specifically to help diagnose patients with genetically heterogeneous clinical presentations, like those FitzPatrick sees in the clinic. Those are particularly hard cases. “The main problem is that each of us has several thousand variants,” he explained. “The challenge is filtering out the irrelevant ones.”

This project was powered by data from the Deciphering Developmental Disorders (DDD) study, which recruited more than 13,000 patients with previously undiagnosed severe developmental disorders (DD) from the U.K. and the Republic of Ireland. Those patients, and their parents, were all sequenced. Next, a database of all known loci causing DDs was created, continually updated, and used in studies of DD. FitzPatrick and colleagues used the basic architecture and processes employed in building that database to create VEP-G2P and its associated tools. Essentially, they filter out variants that are also found in healthy individuals and then predict which of the remaining variants are likely to be pathogenic, narrowing the list from a few thousand variants to two to four candidates. In a recent report in Nature Communications, the program showed high sensitivity and precision compared with other public tools.
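
A stripped-down sketch of that filtering logic (not the actual VEP-G2P code) might look like the following: keep only rare, protein-altering variants in genes on a curated developmental-disorder list such as DDG2P. The variant records, gene names, and allele-frequency cutoff are all illustrative assumptions.

```python
# Illustrative rare-disease variant filter (not VEP-G2P itself): restrict to
# curated disease genes, damaging consequences, and low population frequency.

MAX_POP_AF = 0.0001        # assumed rarity threshold for a dominant model
DAMAGING = {"stop_gained", "frameshift_variant", "missense_variant", "splice_donor_variant"}
DD_GENES = {"GENE_A", "GENE_B"}   # stand-in for a curated panel such as DDG2P

variants = [  # (gene, VEP consequence, population allele frequency) -- all hypothetical
    ("GENE_A", "missense_variant", 0.0),
    ("GENE_B", "synonymous_variant", 0.0),
    ("GENE_A", "missense_variant", 0.02),
    ("GENE_Z", "stop_gained", 0.0),
]

candidates = [
    v for v in variants
    if v[0] in DD_GENES and v[1] in DAMAGING and v[2] <= MAX_POP_AF
]
print(candidates)   # -> [('GENE_A', 'missense_variant', 0.0)]
```
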
—Malorye Branca