A team of Princeton University-led researchers has used an artificial intelligence algorithm to identify mutations in “junk” DNA that can cause autism. The findings, reported in Nature Genetics, are the first to functionally link mutations in regulatory DNA with a complex disease such as autism, and suggest that the alterations affect the expression of genes in the brain, including those responsible for neuron migration and development.
The researchers say the approach could also be used more widely to study the role of noncoding mutations in disorders such as cancer or heart disease. “This is the first clear demonstration of noninherited, noncoding mutations causing any complex human disease or disorder,” commented Olga Troyanskaya, PhD, a Princeton professor of computer science and genomics. “This method provides a framework for doing this analysis with any disease.” Troyanskaya is senior author of the team’s published report, which is titled “Whole-genome deep-learning analysis identifies contribution of noncoding mutations to autism risk.”
Great progress has been made in understanding the genetics of autism spectrum disorder (ASD), the authors commented, but studies suggest that mutations in protein-coding genes account for only about 30% of spontaneous cases of autism, where there is no known family history. In fact, only about 1–2% of the human genome comprises genes that code for proteins. The remaining 98–99% of the genome is noncoding DNA, historically known as “junk” DNA, but which we now know contains regulatory regions that play important roles in controlling where and when genes are expressed.
While a potential role for noncoding mutations in ASD has “long been speculated,” the authors continued, it hasn’t been possible to sift through the entire genome to identify alterations in regulatory DNA and predict how these changes might contribute to ASD or other complex diseases. Any one individual may have dozens of noncoding mutations, many of which will be unique to them, and identifying common mutations among individuals with the same disorder hasn’t been possible. “Analysis of contribution of noncoding mutations to ASD is challenging due to the difficulty of assessing which noncoding mutations are functional and, of these, which contribute to the disease phenotype,” the authors stated.
Previous analysis approaches have found it challenging to detect any difference in the number of regulatory mutations in people with autism compared with unaffected people. Historically, the only way to achieve this level of insight would be through an experimental approach that tests each mutation against more than 2,000 protein interactions, and across cell types and tissues, which would amount to hundreds of millions of experiments. Even more recent machine learning approaches have only been able to look at targeted sections of the genome.
Taking a different approach, Troyanskaya and colleagues trained a machine learning model to predict how a variation in a stretch of noncoding DNA might affect gene expression. They applied the model to the Simons Simplex Collection, an autism cohort that comprises the whole genomes of nearly 2,000 family “quartets,” each including a child with autism, an unaffected sibling, and their unaffected parents. In each family there had been no prior history of autism, so mutations found in the affected child but not in the parents must have arisen spontaneously, and may have played a role in the disorder.
“The design of the Simons Simplex Collection is what allowed us to do this study,” said study co-author Jian Zhou, PhD, at the Flatiron Institute’s Center for Computational Biology (CCB). “The unaffected siblings are a built-in control.” The Simons Simplex Collection is maintained by the Simons Foundation, which is the parent organization of the Flatiron Institute. Troyanskaya is also deputy director for genomics at the Flatiron Institute’s CCB.
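To illustrate what a “built-in control” makes possible, here is a minimal Python sketch that compares predicted impact scores for mutations in affected children against those in their unaffected siblings. It uses a simple one-sided permutation test, chosen here only for illustration and not necessarily the statistical test used in the paper. The function name `permutation_test` and the score values are hypothetical.

```python
import random
from statistics import mean
from typing import Sequence, Tuple


def permutation_test(
    proband: Sequence[float],
    sibling: Sequence[float],
    n_perm: int = 10_000,
    seed: int = 0,
) -> Tuple[float, float]:
    """One-sided permutation test: are proband scores higher on average?

    Returns (observed mean difference, estimated p-value).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = mean(proband) - mean(sibling)
    pooled = list(proband) + list(sibling)
    n = len(proband)
    hits = 0
    for _ in range(n_perm):
        # Shuffle the group labels: under the null hypothesis,
        # "proband" and "sibling" are interchangeable.
        rng.shuffle(pooled)
        if mean(pooled[:n]) - mean(pooled[n:]) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value above zero.
    return observed, (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Made-up scores, for illustration only.
    proband_scores = [0.9, 0.8, 0.7, 0.85]
    sibling_scores = [0.1, 0.2, 0.15, 0.3]
    diff, p = permutation_test(proband_scores, sibling_scores)
    print(f"mean difference = {diff:.3f}, p = {p:.4f}")
```

Because each sibling shares parents and environment with the affected child, a difference that survives label shuffling is harder to attribute to family background alone.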
The team applied the computer algorithm to data on 1,790 quartets in the collection. The system learned patterns in the genome, taught itself how to identify biologically relevant sections of DNA, and predicted whether alterations in noncoding regions might play a role in any of the 2,000-plus protein interactions that affect gene regulation. The algorithm in effect “slides along the genome,” analyzing each base pair in the context of the 1,000 base pairs around it, and scanning all the mutations. The computational tool can then predict the effect of any mutation in the genome, and come up with a prioritized list of DNA sequences that are likely to regulate genes, and of mutations that are likely to affect that regulation.
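The idea of scoring a base pair in its sequence context can be sketched in a few lines of Python. This is a minimal illustration, not the team’s deep-learning model: `toy_model` is a hypothetical stand-in that returns two made-up feature scores, and splitting the roughly 1,000 base pairs of context evenly on either side of the position is an assumption. The effect of a mutation is taken as the difference between the model’s output for the altered window and for the reference window.

```python
from typing import Callable, List

FLANK = 500  # assumption: ~1,000 bp of context, split evenly on both sides


def context_window(genome: str, pos: int, flank: int = FLANK) -> str:
    """Return the sequence centered on `pos`, padded with 'N' at the edges."""
    left = max(0, pos - flank)
    right = min(len(genome), pos + flank + 1)
    pad_left = "N" * (flank - (pos - left))
    pad_right = "N" * (flank - (right - 1 - pos))
    return pad_left + genome[left:right] + pad_right


def toy_model(window: str) -> List[float]:
    """Hypothetical stand-in for a trained network: one score per feature.

    Feature 0: density of the motif 'GATA'; feature 1: GC fraction.
    """
    n = len(window)
    gata = window.count("GATA") / n
    gc = (window.count("G") + window.count("C")) / n
    return [gata, gc]


def mutation_effect(
    genome: str,
    pos: int,
    alt: str,
    model: Callable[[str], List[float]] = toy_model,
    flank: int = FLANK,
) -> List[float]:
    """Predicted change in each feature when the base at `pos` becomes `alt`."""
    ref_window = context_window(genome, pos, flank)
    # The mutated base always sits at index `flank`, the window's center.
    alt_window = ref_window[:flank] + alt + ref_window[flank + 1:]
    ref_scores = model(ref_window)
    alt_scores = model(alt_window)
    return [a - r for r, a in zip(ref_scores, alt_scores)]


if __name__ == "__main__":
    genome = "AAGATAAA"
    # Changing the A at index 3 to C destroys the 'GATA' motif.
    print(mutation_effect(genome, pos=3, alt="C", flank=3))
```

Sliding along the genome then amounts to calling `mutation_effect` for every observed mutation, each with its own surrounding window.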
“What our paper really allows you to do is take all those possibilities and rank them,” commented Christopher Park, PhD, who is currently a visiting scientist at the Lewis-Sigler Institute for Integrative Genomics, and a researcher at the Flatiron Institute. “That prioritization itself is very useful, because now you can also go ahead and do the experiments in just the highest priority cases.” The computational system ultimately generates a “disease impact score,” an estimate of how likely a mutation is to have an effect on disease.
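The ranking step Park describes can be sketched as follows. The article does not say how the disease impact score is calculated, so the formula here (a weighted sum of absolute predicted effects passed through a logistic function) is purely an assumption, as are the names `Mutation`, `impact_score`, and `prioritize` and all of the numbers.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Mutation:
    name: str
    effects: Sequence[float]  # predicted change per regulatory feature


def impact_score(
    effects: Sequence[float], weights: Sequence[float], bias: float = 0.0
) -> float:
    """Hypothetical score in (0, 1): logistic of weighted absolute effects."""
    z = bias + sum(w * abs(e) for w, e in zip(weights, effects))
    return 1.0 / (1.0 + math.exp(-z))


def prioritize(
    mutations: Sequence[Mutation], weights: Sequence[float], top_k: int
) -> List[Mutation]:
    """Return the `top_k` mutations with the highest score, best first."""
    ranked = sorted(
        mutations,
        key=lambda m: impact_score(m.effects, weights),
        reverse=True,
    )
    return ranked[:top_k]


if __name__ == "__main__":
    # Made-up mutations and weights, for illustration only.
    candidates = [
        Mutation("a", [0.1, 0.0]),
        Mutation("b", [-0.9, 0.2]),
        Mutation("c", [0.0, 0.0]),
    ]
    for m in prioritize(candidates, weights=[2.0, 1.0], top_k=2):
        print(m.name, round(impact_score(m.effects, [2.0, 1.0]), 3))
```

Whatever the actual scoring formula, the practical payoff is the same: only the top of the list needs to go to the bench.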
“This is a shift in thinking about genetic studies that we’re introducing with this analysis,” noted Chandra Theesfeld, PhD, a research scientist in Troyanskaya’s lab. “In addition to scientists studying shared genetic mutations across large groups of individuals, here we’re applying a set of smart, sophisticated tools that tell us what any specific mutation is going to do, even those that are rare or never observed before.”
Interestingly, the results indicated that mutations in the noncoding regions affected similar genes and functions to those that had previously been linked with autism through studies focused on coding genes. “Notably, our study reveals important biological convergences among the genetic dysregulations associated with ASD,” the authors wrote. “Our analyses of the disease impact of mutations with effects on DNA and RNA point to similar sets of impacted genes and pathways, indicating that the effects of regulatory mutations are convergent. Furthermore, high-impact noncoding regions that we find in ASD probands affect the same genes previously found to be impacted by LoF [loss of function] coding mutations in ASD … This convergence provides support for a causal contribution of noncoding regulatory mutations to ASD etiology.”
“This is consistent with how autism most likely manifests in the brain,” stated Park. “It’s not just the number of mutations occurring, but what kind of mutations are occurring.” Troyanskaya added, “Right now, 98% of the genome is usually being thrown away. Our work allows you to think about what we can do with the 98%.”