Looking to significantly reduce the error rate inherent in the assembly of genomes using high-throughput sequencing (HTS) methods, Google this week launched DeepVariant, a deep-learning technology that its creators contend will “reconstruct the true genome sequence from HTS sequencer data.”
“DeepVariant is the first of what we hope will be many contributions that leverage Google's computing infrastructure and ML expertise to both better understand the genome and to provide deep learning-based genomics tools to the community,” wrote Mark DePristo and Ryan Poplin of the Google Brain Team in a blog post. Development of the tool took roughly two years and was a collaborative effort between Google Brain and sister Alphabet company Verily Life Sciences.
The problem the DeepVariant team sought to solve is the error rate inherent in the 100-base short reads of HTS, which can range anywhere from 0.1% to as high as 10%. To address it, the developers at Google eschewed the more common statistical methods of variant calling and instead performed this reconstruction using image classification, leveraging the company’s expertise in using neural networks for image recognition.
To help “train” DeepVariant, the Google team used reference genomes from the Genome in a Bottle consortium (GIAB). “Using multiple replicates of these genomes, we produced tens of millions of training examples in the form of multi-channel tensors encoding the HTS instrument data, and then trained a TensorFlow-based image classification model to identify the true genome sequence from the experimental data produced by the instruments,” DePristo and Poplin wrote. Using this visual approach, DeepVariant can automatically identify small insertions and deletions, and single-base-pair mutations from raw sequencing data.
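To give a rough sense of what “multi-channel tensors encoding the HTS instrument data” could look like, the Python sketch below stacks a reference sequence and the reads aligned to it into an image-like grid, with one row per read, one column per position, and several channels per cell. The `Read` type, the `encode_pileup` function, the four-channel layout, and the numeric scaling are all assumptions made for illustration. They are not DeepVariant's actual encoding, which the blog post does not spell out.

```python
from typing import List, NamedTuple

# Hypothetical scaling of bases to pixel-like intensities (illustrative only).
BASE_CODE = {"A": 0.25, "C": 0.5, "G": 0.75, "T": 1.0}
N_CHANNELS = 4  # assumed channels: base, quality, strand, mismatch


class Read(NamedTuple):
    start: int        # 0-based column of the read's first base in the window
    bases: str        # read sequence
    quals: List[int]  # Phred quality scores, one per base
    reverse: bool     # True if the read maps to the reverse strand


def encode_pileup(reference: str, reads: List[Read], max_qual: int = 40):
    """Return a nested list shaped [rows][columns][channels].

    Row 0 holds the reference; each following row holds one read.
    Cells not covered by a read stay at zero in every channel.
    """
    width = len(reference)
    tensor = [
        [[0.0] * N_CHANNELS for _ in range(width)]
        for _ in range(len(reads) + 1)
    ]
    for col, base in enumerate(reference):
        tensor[0][col][0] = BASE_CODE.get(base, 0.0)
        tensor[0][col][1] = 1.0  # treat the reference as full confidence
    for row, read in enumerate(reads, start=1):
        for offset, (base, qual) in enumerate(zip(read.bases, read.quals)):
            col = read.start + offset
            if not 0 <= col < width:
                continue  # base falls outside the window
            cell = tensor[row][col]
            cell[0] = BASE_CODE.get(base, 0.0)
            cell[1] = min(qual, max_qual) / max_qual
            cell[2] = 1.0 if read.reverse else 0.0
            cell[3] = 1.0 if base != reference[col] else 0.0
    return tensor


# Toy example: the second read disagrees with the reference at column 3.
reference = "ACGTAC"
reads = [
    Read(0, "ACGT", [40, 40, 40, 40], False),
    Read(2, "GAAC", [30, 20, 40, 40], True),
]
tensor = encode_pileup(reference, reads)
print(len(tensor), len(tensor[0]), len(tensor[0][0]))
```

In a setup like this, a grid of such values would be handed to an image classifier, which learns to tell a true variant (many high-quality reads disagreeing with the reference at one column) from a sequencing error (an isolated, low-quality mismatch).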
A year ago, DeepVariant won the award for the highest SNP accuracy at the precisionFDA Truth Challenge, despite having no specific or specialized knowledge of genomics and HTS. Since then, the development team has continued to train the system and has decreased its variant-calling error rate by another 50%.
DeepVariant is being made available to the genomics community as an open-source tool on the Google Cloud Platform.
For co-developer Verily Life Sciences, the tool promises to play a significant role in the company’s ongoing clinical studies, notably Project Baseline, a collaboration with Duke University School of Medicine and Stanford Medicine. Project Baseline will recruit 10,000 people to participate in one of the broadest longitudinal studies of human health, one that will collect medical information, genomic data, and patient-generated health and behavioral data in an effort to more precisely understand the nature of health and disease development.
“With recent advances at the intersection of science and technology, we have the opportunity to characterize human health with unprecedented depth and precision,” said Jessica Mega, M.D., chief medical officer of Verily, at the study’s launch earlier this year. “The Project Baseline study is the first step on our journey to comprehensively map human health.”