A massive effort led by the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium has produced the most comprehensive map of cancer genomics to date. The work was published in a series of papers appearing in a special issue of Nature entitled, “Cancer Catalogued: Whole-genome sequences for 38 types of tumour.” The consortium analyzed 38 tumor types, sequencing 2,658 whole-cancer genomes (including 1,188 transcriptomes) alongside matched non-cancerous cell samples.
The scale of the PCAWG project (also referred to as Pan-Cancer) was unlike previous studies. It is “the breadth and rigor of the conducted inferential analyses” that marks the significance of this project, according to Marcin Cieslik, Ph.D., assistant professor of computational medicine & bioinformatics at the University of Michigan Medical School and co-author of the accompanying News & Views in Nature. Prior large-scale genomic studies to identify cancer-causing variants, he asserted, focused largely on detecting and quantifying genetic events. In contrast, Cieslik explained, the accurate measurement of genetic aberrations “was only the starting point” in the PCAWG study. Because virtually no aspect of cancer genetics has been left untouched, he asserted, PCAWG has “established a reference methodology and baseline for most of the difficult inferential analyses.”
The second most frequent cause of death in the world, cancer is caused by mutations that lead to the proliferation of cells. PCAWG’s analysis of these mutations, which come in many complex arrangements, was broken up into six primary papers, all published in the special issue of Nature.
This body of work represents “a major forward step towards unravelling the vast genetic complexities of cancer,” said Ana Vivancos, Ph.D., PI of Cancer Genomics at Vall d’Hebron Institute of Oncology, Barcelona, Spain. This “superb collection of papers,” noted Vivancos, describes the structural variant landscape in cancer, retrotransposon-driven rearrangements, activated telomere maintenance mechanisms, as well as timings and patterns of tumor evolution, among several other important aspects.
“Each tumor type is different, representing its own jigsaw puzzle to be reconstructed,” said Peter Campbell, M.D., Ph.D., head of Cancer, Ageing and Somatic Mutation, and senior group leader at the Wellcome Sanger Institute. What Pan-Cancer shows, he continued, is that “we have all the pieces now—all the tools ready to analyze each tumor type.” This process has begun for all of the common tumor types, he asserted, and many of the rare ones too.
Vivancos told Clinical OMICs that, by navigating previously uncharted territory and providing the community at large with massive data on whole cancer genomes (as opposed to coding regions alone), these analyses “gift researchers with a whole bunch of findings on the non-coding genome.” In addition, she pointed out, 30% of the samples have associated transcriptome data.
One arm of the consortium investigated the evolutionary trajectories of cancer, reconstructing the “life history and evolution of mutational processes and driver mutation sequences of 38 types of cancer.”
Another team took a deep dive into the role of RNA alterations in cancer. Using whole-genome sequence data, the team “associated several categories of RNA alterations with germline and somatic DNA alterations and probable genetic mechanisms.” The PCAWG Transcriptome Core Group functionally linked DNA and RNA, finding associations between DNA variants and gene expression of neighboring (or closely located) genes.
The genetic drivers present in non-coding DNA were also investigated. As Cieslik and Chinnaiyan write in their News & Views, this was an “ambitious undertaking” because of the degree of difficulty in detecting mutations in non-coding regions, which required “careful modelling” to execute the task. Results from this study were surprising, according to Cieslik, because of the “dearth of functional (driver) non-coding mutations.”
Mutations that provide a selective advantage, so-called “driver mutations,” were found at an average rate of four to five (when combining coding and non-coding genomic elements) in cancer genomes. The researchers identified drivers in more than 90% of cases. However, in 181 tumors, there were no drivers found. This could be the result of multiple factors, either biological or technical, the team explains. Some tumor types had higher fractions of samples without identified drivers, such as chromophobe renal cell carcinoma and pancreatic neuroendocrine cancers. This suggests the possibility that other genomic events may trigger these cancers. Not being able to find drivers in 5% of cancers could also indicate that cancer driver discovery is an unfinished process.
To Campbell, the most striking finding is “just how different one person’s cancer genome is from another person’s.” There are thousands of different combinations of mutations that cause the cancer, and more than 80 different underlying processes generate those mutations, some reflecting the wear-and-tear of aging, some reflecting inherited causes, and others reflecting lifestyle factors. Nonetheless, he explains, one of the most exciting themes to emerge from Pan-Cancer is that we can begin to discern recurring patterns amid all this enormous complexity.
“The most immediately applicable finding,” noted Campbell, “is that we can identify the tumor type of a cancer just from its pattern of mutations (the ‘signatures’) and which genes are affected.”
The identification and characterization of signatures, or characteristic patterns of DNA changes, was an important finding of this research. Two groups in the PCAWG used the large dataset to identify a total of 97 signatures. One group characterized 84,729,690 somatic mutations to identify 81 signatures comprising small mutations such as single base substitutions, doublet base substitutions, clustered base substitutions, and small insertions and deletions.
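To make the idea of a mutation pattern concrete, the sketch below tallies single base substitutions into the six classes conventionally used when each change is reported relative to a pyrimidine base (C or T). It is an illustration only and not the consortium’s pipeline; the function names and the toy input are hypothetical.

```python
from collections import Counter

# Complement table, used to express every substitution relative to a pyrimidine (C or T).
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def substitution_class(ref: str, alt: str) -> str:
    """Return the substitution class (e.g. 'C>T'), normalized to a pyrimidine reference base."""
    ref, alt = ref.upper(), alt.upper()
    if ref not in COMPLEMENT or alt not in COMPLEMENT or ref == alt:
        raise ValueError(f"not a single base substitution: {ref}>{alt}")
    if ref in ("A", "G"):
        # Purine reference: report the equivalent change on the opposite strand.
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}>{alt}"


def substitution_profile(mutations):
    """Tally (ref, alt) pairs into the six substitution classes."""
    return Counter(substitution_class(ref, alt) for ref, alt in mutations)


# Toy input (hypothetical): five substitutions from one imaginary tumor.
toy_mutations = [("C", "T"), ("G", "A"), ("T", "C"), ("A", "G"), ("C", "A")]
profile = substitution_profile(toy_mutations)
print(dict(profile))
```

A per-tumor tally of this kind is only the raw material; deriving signatures from many such tallies requires statistical decomposition that this sketch does not attempt.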
But small-scale mutations are only a portion of the DNA mutations that have a role in the etiology of cancer. Indeed, structural variants (SVs), created by larger exchanges of genetic material including inversions, balanced translocations, insertions and deletions, remain less well understood. This lack of understanding is, in part, due to the limitations of the current technology. Because SVs are larger DNA rearrangements, their detection hinges on the sequencing of long stretches of the genome. But, since short reads are the dominant method of DNA sequencing, SVs have remained a challenge to detect.
The focus on signatures of structural variation was groundbreaking, Cieslik noted, due to “its conceptual and methodological advances.” The development of new methods to group, classify and describe somatic SVs allowed the team to uncover reproducible SV signatures. From this work, 16 SV signatures emerged. The group concluded that a “wide variety of rearrangement processes are active in cancer, which generate complex configurations of the genome upon which selection can act.” Going beyond just signature identification, the team also could obtain functional insights into the signatures’ roles in cancer.
Using their system, the group detected and described the frequency of SV classes among tumor types. A significant finding was that extra copies of genomic templates are inserted during the rearrangement process. This includes, the authors note, “simple events such as tandem duplications, as well as a range of more-complex events with duplications and triplications that are rearranged locally as well as inserted distantly.”
Roughly 1% to 5% of cancers cannot be classified accurately with conventional diagnostic methods (so-called carcinomas of unknown primary), Campbell points out. “Our data show that we can solve the mystery for many of these just by sequencing its genome.”
Where does the map lead?
The authors write that this work “has brought us closer to a comprehensive narrative of the causal biological changes that drive cancer phenotypes” and that we must now “translate this knowledge into sustainable, meaningful clinical treatments.”
In order for patients to benefit from the information laid out by the PCAWG, linking the genetic findings to clinical phenotypes and outcomes is imperative. PCAWG, in addition to prior WGS studies, demonstrated the complexity of structural variation in cancer genomes. Cieslik noted that it will be important “to link those patterns of genomic rearrangements to patient clinical data in order to demonstrate the need for structural variation profiling in the clinic.” He added that “the relative lack of non-coding driver mutations argues against unselected sequencing of non-coding regions in the clinical setting.” To him, this suggests that “more efficient (in terms of cost, data volume, etc.) structural-variant detection technologies will be needed.”
The enormous complexity, explained Campbell, means that ultimately, especially for the common tumor types, “we will need thousands of cancer genomes for each type of cancer to understand it fully.” This will not be funded by the academic research sequencing done in this work, he said, rather “it will require us developing frameworks for accessing and analysing cancer genomes generated for patients as part of their routine clinical care.”
While there are existing national sequencing programs in England, the Netherlands, South Korea, North America, and elsewhere aimed at translating these technologies into routine clinical diagnostics, the Pan-Cancer study provides a blueprint for these programs, illustrating all the ways a cancer genome can be analyzed and the insights that emerge. It describes how to make data analysis pipelines portable, stable and reproducible, and how to build a comprehensive knowledge bank of cancer genomes that will be the foundation for further data sharing and international collaboration.
“Regarding the likely clinical impact of these studies, it really is too early to say,” Vivancos said. While we are “far from picking high-hanging fruit, these insights are seeding new hypotheses and inspiring more refined research.” She told Clinical OMICs that we must now “collectively build on these efforts by looking above, between and beyond the sprouting shoots.”
Computing, Bioinformatics Capabilities Kept Pan-Cancer Project Humming
When Nature published a collection of research findings in early February from the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium, the focus of the scientific community was, rightfully, on digging into the data, which covered 38 different tumor types and provided the most comprehensive map of cancer to date.
While the research community begins to unpack this new treasure trove of information that is sure to impact cancer care almost immediately, perhaps an underrated aspect of the years-long effort is the contribution made by the computing platforms and bioinformatics packages that supported the sequencing and analysis of the more than 2,600 whole-cancer genomes that were part of the research.
At the European Molecular Biology Laboratory (EMBL), one area of focus was to ensure that the cloud infrastructure needed to coordinate the research activities of roughly 1,100 scientists, dispersed across 37 countries, stayed up and running. While cloud computing seems tailor-made for allowing collaboration from many geographically dispersed research hubs, it is also prone to system crashes when run at scale. System downtime can cause significant delays and, with the delays, a financial impact.
To address this issue, EMBL’s Jan Korbel, group leader and senior scientist, and Sergei Yakneen, now chief operating officer at SOPHiA Genetics, developed a bioinformatics workflow system they call Butler. Unlike typical cloud workflow tools, Butler constantly collects metrics on the health and operation of all system components in the cloud environment, including the central processing unit (CPU), memory, and disk space. It also includes self-healing modules that use this information to determine whether anything has gone wrong and then automatically take steps to restart failed services or machines.
Before the advent of Butler, system checkups and maintenance required trained crews to ensure system health or to make the necessary fixes in the event of a system crash. With Butler, the only time a human operator is required is when the system can’t heal itself. In those cases, Butler sends a notification of the problem via email or Slack.
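The monitor, restart and notify cycle described above can be sketched in a few lines. This is a minimal illustration of the general self-healing pattern, not Butler’s actual code or API: the thresholds, function names, and the simulated host are all assumptions made for the example.

```python
from typing import Callable, Dict, List

Metrics = Dict[str, float]  # percent usage, e.g. {"cpu": 42.0, "memory": 63.5, "disk": 71.0}

# Hypothetical limits, chosen only for illustration.
LIMITS: Metrics = {"cpu": 95.0, "memory": 90.0, "disk": 85.0}


def breaches(sample: Metrics, limits: Metrics = LIMITS) -> List[str]:
    """Return the names of the metrics that exceed their limit."""
    return [name for name, limit in limits.items() if sample.get(name, 0.0) > limit]


def check_and_heal(
    host: str,
    read_metrics: Callable[[str], Metrics],
    restart: Callable[[str], None],
    notify: Callable[[str], None],
    max_restarts: int = 2,
) -> str:
    """Check one host, restart it if unhealthy, and alert a human only if restarts fail."""
    problems = breaches(read_metrics(host))
    if not problems:
        return "healthy"
    for _ in range(max_restarts):
        restart(host)
        problems = breaches(read_metrics(host))
        if not problems:
            return "healed"
    notify(f"{host}: still unhealthy after {max_restarts} restarts ({', '.join(problems)})")
    return "escalated"


# Simulated host (hypothetical): memory usage stays too high until the first restart.
state = {"restarts": 0}
alerts: List[str] = []


def fake_read(host: str) -> Metrics:
    memory = 97.0 if state["restarts"] == 0 else 55.0
    return {"cpu": 40.0, "memory": memory, "disk": 60.0}


def fake_restart(host: str) -> None:
    state["restarts"] += 1


result = check_and_heal("worker-1", fake_read, fake_restart, alerts.append)
print(result, alerts)
```

In a real deployment the `notify` callback would post to email or a chat channel, and the check would run on a schedule across every machine; both are left out here to keep the sketch self-contained.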
The result, according to the EMBL team, is fewer crashes and accelerated research timelines. “It is indeed very rewarding that these large-scale analyses can now take place in a few months instead of years,” said Korbel.
Butler was an integral part of PCAWG. It processed a 725-terabyte cancer genome dataset using 1,500 CPU cores, 5.5 terabytes of RAM, and approximately one petabyte of storage.