William Haseltine

The severe acute respiratory syndrome (SARS)-coronavirus (CoV)-1 caused an epidemic in 2002 with 8437 cases in 30 countries. Its most recent relative, SARS-CoV-2, is responsible for a pandemic that started in 2019 and continues to linger, driven by viral variants, with 300 million cases and over 4 million deaths. The ability of SARS-CoV-2 to evolve rapidly into different variants has not ceased to surprise us, as exemplified by the Omicron variant, whose spike protein carries over 30 mutations that seem to have come out of left field. This prompted us to launch a systematic effort to better understand how the virus genomes change. Here we examine the ends, or termini, of the SARS-CoV-2 genome, which are also known as the 5’ and 3’ untranslated regions (UTRs).

Circularization of the viral genome involving complementary sequences in its ends

The replication and transcription of the positive-sense single-stranded RNA genome of all coronaviruses, including SARS-CoV-1 and 2, require both continuous and discontinuous negative-strand synthesis, with differing stoichiometry of nucleic acid and protein components and without splicing (Figure 1). To create the nested set of messenger RNAs, the elongating negative-sense strand jumps from one of a set of body transcription regulatory sequences (TRS-Bs) to a similar leader transcription regulatory sequence (TRS-L) located in the 5’ genomic terminus. This unique mode of messenger RNA synthesis has prompted speculation that the 5’ and 3’ termini must be in proximity during the synthesis of subgenomic negative-sense strands. We recently reported the presence of complementary terminal segments in SARS-CoV-1 and 2 that could mediate such circularization 1.

Figure 1.

The finding of potential circularization sequences, which as far as we can determine are unique to SARS-CoV-1 and 2 and to closely related viruses of bat and other animal origin, raises the question of how other human coronaviruses replicate without such sequences. Others have proposed that a protein bridge composed of cap-binding protein, eIF4E, eIF4G, and poly(A)-binding protein mediates circularization 2,3. We speculate that the presence of the circularization sequences in SARS-CoV-1 and 2 helps to stabilize such protein bridges. It remains to be determined which viral and cellular factors contribute to the replication complexes and how the putative RNA-RNA interaction described here facilitates or stabilizes the RNA-protein and protein-protein interactions involved in subgenomic RNA synthesis.
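The kind of terminal complementarity discussed above can be screened for computationally. The following is a minimal Python sketch, not the method used in the cited work: it reverse-complements the 3’ terminal sequence and reports the longest stretch that could form uninterrupted Watson-Crick pairs with the 5’ terminal sequence. The function names and the toy sequences are hypothetical, and G-U wobble pairs, mismatches, and bulges are ignored for simplicity.

```python
# Watson-Crick pairing partners for RNA (assumption: no G-U wobble pairs).
PAIRS = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna: str) -> str:
    """Return the reverse complement of an RNA sequence written 5' to 3'."""
    return "".join(PAIRS[base] for base in reversed(rna))


def longest_complementary_segment(five_prime: str, three_prime: str):
    """Find the longest segment of `five_prime` that can pair, without gaps,
    with a segment of `three_prime`.

    Returns (segment, start_in_five_prime, start_in_three_prime), with
    0-based start positions. Ties are resolved in favor of the first hit.
    """
    target = reverse_complement(three_prime)
    best_len = best_i = best_j = 0
    # Classic longest-common-substring dynamic programming, row by row.
    prev = [0] * (len(target) + 1)
    for i, a in enumerate(five_prime, start=1):
        cur = [0] * (len(target) + 1)
        for j, b in enumerate(target, start=1):
            if a == b:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len = cur[j]
                    best_i = i - cur[j]
                    best_j = j - cur[j]
        prev = cur
    segment = five_prime[best_i:best_i + best_len]
    # Convert the position in the reverse complement back to 3' coordinates.
    three_start = len(three_prime) - (best_j + best_len)
    return segment, best_i, three_start


# Toy sequences, made up for illustration only.
toy_five_prime = "CCAUGGCUAGAUCC"
toy_three_prime = "UUUAUCUAGCCUUU"
print(longest_complementary_segment(toy_five_prime, toy_three_prime))
```

For the toy input, the sketch reports the 8-nucleotide segment GGCUAGAU starting at position 4 of the 5’ sequence, paired with the segment starting at position 3 of the 3’ sequence. A realistic analysis would also score partial complementarity and folding energy, which this sketch does not attempt.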

Structural flexibility at the ends of SARS-CoV-2 and related viruses

Evolution of SARS-CoV-2 variants involves point mutations, small insertions and deletions, and recombination, in which segments of the viral genome are moved around or derived from another source. Tracking of variation in SARS-CoV-2 has focused on mutations, insertions, and deletions in viral structural proteins, most notably the spike protein. Our study of the genomic termini of SARS-CoV-2 revealed unanticipated structural flexibility in these regions. This flexibility is an inherent source of variation that affects the essential functions of the genomic termini in viral replication, transmission, gene expression, pathogenicity, and immune evasion, as well as the functions of other regions of the viral genome.

By analyzing thousands of sequences of SARS-CoV-2 isolates from around the world including cruise ships and from closely related coronaviruses from bats in Laos, Thailand, and Great Britain, we recently reported numerous previously undescribed duplications, inversions, and translocations within the genomic termini and into or from coding regions of the viral genome 4. As illustrated in Figure 2, our analysis revealed a striking variation in the length and composition of 5’ terminal sequences driven by the presence at the proximal end of the viral genome of exact inverted duplications of 5’-UTR sequences of various lengths (~20 to over 100 nucleotides) relative to the Wuhan reference strain.
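One simple way to screen a sequence for an exact inverted duplication at its proximal end is sketched below in Python. This is an illustration under stated assumptions, not the analysis pipeline behind the findings above: the function names, the 20-nucleotide minimum length (taken from the lower bound of the range mentioned above), and the toy 40-nucleotide reference are hypothetical, and only exact matches are detected.

```python
# Watson-Crick pairing partners for RNA.
PAIRS = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna: str) -> str:
    """Return the reverse complement of an RNA sequence written 5' to 3'."""
    return "".join(PAIRS[base] for base in reversed(rna))


def inverted_prefix(isolate: str, reference_utr: str, min_len: int = 20):
    """Look for the longest prefix of `isolate` that is an exact inverted
    copy (reverse complement) of a segment of `reference_utr`.

    Returns (prefix_length, start_in_reference) with a 0-based start,
    or None when no prefix of at least `min_len` nucleotides qualifies.
    """
    longest = min(len(isolate), len(reference_utr))
    for k in range(longest, min_len - 1, -1):
        pos = reference_utr.find(reverse_complement(isolate[:k]))
        if pos != -1:
            return k, pos
    return None


# Toy reference, made up for illustration; it is not a real 5'-UTR.
toy_reference = "ACGGAUUCAGCCUAGGACUUGCAAGUCCGAUAGCUUACGA"
# Simulate an isolate that carries an inverted copy of reference
# positions 5..29 in front of the otherwise unchanged sequence.
toy_isolate = reverse_complement(toy_reference[5:30]) + toy_reference

print(inverted_prefix(toy_isolate, toy_reference))
print(inverted_prefix(toy_reference, toy_reference))
```

For the simulated isolate the sketch reports a 25-nucleotide inverted prefix copied from position 5 of the toy reference, and it reports nothing for the unmodified reference. Real data would call for tolerance to mismatches and for alignment of the isolate to the reference before the comparison.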

Figure 2.

Changes in the length of viral genomic termini via insertions may serve an adaptive purpose and reflect a compensatory mechanism that addresses a common problem of linear genomes, namely, that they fray at both ends. Rather than acquiring segments from cellular mRNA as influenza viruses do, SARS-CoV-2 and related bat coronaviruses appear more parsimonious: they derive the insertions from their own genomes and, in the case of the 5’-UTR, from the negative-sense strand. The insertions change how the RNA folds, resulting in long double-stranded stems that encompass sequences which would otherwise have folded into individual stem-loop structures involved in interactions with RNA and proteins. The loop of stem-loop (SL) 1, in either plus-sense or minus-sense orientation, and the loops of SL5, each consistently present in one copy, appear to be needed, while the loops of SL2 to SL4.5, along with their conserved stems, can become part of a long double-stranded stem. These observations call for a reexamination of the biochemical fundamentals of coronavirus replication and gene expression as we know them, and for correlation of these structural changes with the pathogenicity, immune evasion, and infectiousness of these variants.

5’-UTR intragenic insertions in SARS-CoV-2 variants

A further search for potential insertions of 5’-UTR sequences revealed duplication and translocation of 5’ terminus sequences not only within the 5’-UTR but also into coding regions of the viral genome. We detected an insertion of a 27-nucleotide positive-sense strand segment of the 5’-UTR at the end of the ORF8 gene in a SARS-CoV-2 variant isolated in Minnesota, USA. The modified gene encodes an ORF8X protein with an altered carboxyl terminus, in which the last 5 amino acids are replaced by 10 amino acids (Figure 3). It was previously noted that a longer, overlapping 57-nucleotide segment of the 5’-UTR was duplicated and translocated in place of an 882-nucleotide deletion within the coding portion of the genome of a SARS-CoV-2 variant (lineage B.1.36.27) isolated from 3 patients in Hong Kong. That variant lacks ORF7a, ORF7b, and ORF8 and encodes a C-terminally modified ORF6 product, termed ORF6X 6.

Figure 3.

We note that in both translocations, the end of the inserted 5’-UTR sequence corresponds to the leader transcription regulatory sequence (TRS-L), which has been associated with recombination hotspots, and that the insertion occurs at the same site, immediately proximal to the nucleocapsid (N) and ORF9b genes, thereby altering gene expression regulatory sequences at this location 7.

Analysis of the Omicron (B.1.1.529) variant (exemplified by OL672836.1 in Figure 4) revealed a 15-nucleotide match, with only one differing nucleotide, between the negative-sense strand of the 5’-UTR of SARS-CoV-2 Omicron and the region of Omicron’s spike (S) gene that carries the insertion of the amino acids EPE at position 214 (ins214EPE) and the proximal deletion of an asparagine at position 210 (N210del) relative to the Wuhan reference strain. The match spans up to 16 nucleotides in 44 SARS-CoV-2 isolates from around the globe (exemplified by OV045104 in Figure 4). Translation of the matching 5’-UTR antisense sequence generated an amino acid sequence almost identical to that present in Omicron’s S protein (valine [V] and leucine [L] are conservative substitutions), including N210del and the first glutamic acid (E) in ins214EPE. The origin of the remaining 5 nucleotides is unknown. We think it unlikely that this insertion originates from either cellular or other viral RNAs.
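Translating the antisense strand of a stretch of RNA, as described above, amounts to taking its reverse complement and reading it in codons. The Python sketch below shows that step in all three reading frames using the standard genetic code. It is only an illustration: the function names are hypothetical, and the 9-nucleotide input is a toy sequence constructed so that its antisense strand encodes E-P-E in the first frame; it is not offered as the sequence found in any isolate.

```python
# Watson-Crick pairing partners for RNA.
PAIRS = {"A": "U", "U": "A", "G": "C", "C": "G"}

# Standard genetic code, built from the conventional UCAG ordering.
BASES = "UCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"  # codons starting with U
    "LLLLPPPPHHQQRRRR"  # codons starting with C
    "IIIMTTTTNNKKSSRR"  # codons starting with A
    "VVVVAAAADDEEGGGG"  # codons starting with G
)
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(rna: str) -> str:
    """Return the reverse complement of an RNA sequence written 5' to 3'."""
    return "".join(PAIRS[base] for base in reversed(rna))


def translate(rna: str) -> str:
    """Translate complete codons from the first base; '*' marks a stop."""
    return "".join(CODON_TABLE[rna[i:i + 3]] for i in range(0, len(rna) - 2, 3))


def antisense_translations(rna: str) -> list:
    """Translate the antisense strand of `rna` in its three reading frames."""
    antisense = reverse_complement(rna)
    return [translate(antisense[frame:]) for frame in range(3)]


# Toy sense-strand input; its reverse complement is GAGCCAGAA.
toy_sense = "UUCUGGCUC"
print(antisense_translations(toy_sense))
```

For the toy input the three frames read EPE, SQ, and AR. Comparing such translations with a protein sequence, and deciding which substitutions count as conservative, would be separate steps that this sketch leaves out.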

Figure 4.

We also reported two instances of duplication, and/or inversion and translocation of coding sequences at the end of the nucleocapsid (N) gene and/or the beginning of ORF10 to the distal end of the 3’-UTRs of two bat coronaviruses. These insertions can form stem-loop structures that may affect 3’ terminus-mediated regulation of gene expression, minus strand synthesis, and viral RNA stability and turnover as well as viral evolution.

Deletions within the genomic termini

Another source of variation comes from deletion rather than insertion of sequences in the genomic termini. For instance, the 3’ genomic terminus of SARS-CoV-2 shares an approximately 40-nucleotide-long stem-loop-like motif (s2m) with other positive-sense single-stranded RNA viruses, including beta, gamma, and delta coronaviruses, picornaviruses, and astroviruses from various animals. This sequence is recognized by a human microRNA, hsa-miR-1307-3p, as is a similar one in influenza A virus H1N1, which has caused epidemics of severe disease. A single point mutation in the target region of influenza A virus H1N1 adversely affects the binding of hsa-miR-1307-3p, thereby weakening the host attack on the virus 5. The s2m motif is deleted in the 3’-UTR of the B.1.640.1 variant from Congo and France but not in that of its close relative B.1.640.2 (IHU variant), which is now spreading in southern France after originating in Cameroon, against the background of a fast-spreading Omicron variant.


The relevance of genomic termini to viral evolution, replication, and pathogenicity highlighted here calls for careful tracking of the ends of the SARS-CoV-2 genome. This is made difficult by the choice of sequencing primers, which in several cases leaves out the first and last 100 nucleotides of the viral genome. Most publicly available sequences of the Omicron variant lack detailed information on the extreme termini of the genome. The genomic termini, via their regulatory sequences, contribute to the overall transmissibility, pathogenicity, and immune evasion of the virus, and study of their variation will continue to shed light on all these clinically relevant areas.
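Whether a given sequence actually covers the genomic termini can be checked before it is used for this kind of analysis. The Python sketch below counts unresolved positions at each end of a sequence that has already been aligned to a reference. It rests on assumptions that are not part of the work described here: missing positions are taken to be written as N or as the gap character -, and the function names and the toy sequence are hypothetical.

```python
def missing_termini(aligned: str, missing: str = "N-"):
    """Count unresolved positions at the 5' and 3' ends of an aligned sequence.

    `missing` lists the characters treated as unresolved (assumed here to be
    'N' and the alignment gap '-'). Returns (missing_at_5_end, missing_at_3_end).
    """
    seq = aligned.upper()
    lead = len(seq) - len(seq.lstrip(missing))
    trail = len(seq) - len(seq.rstrip(missing))
    return lead, trail


def has_complete_termini(aligned: str) -> bool:
    """True when neither end of the aligned sequence is missing positions."""
    return missing_termini(aligned) == (0, 0)


# Toy aligned sequence, made up for illustration only.
toy_aligned = "NNN--ACGUNACGNN"
print(missing_termini(toy_aligned))
print(has_complete_termini(toy_aligned))
```

For the toy sequence the sketch reports 5 missing positions at the 5’ end and 2 at the 3’ end; the internal N is deliberately not counted, because only the termini are of interest here.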

Whether the structural rearrangements reviewed here provide an advantage or disadvantage to the viral variants remains to be determined, and correlations with viral infectivity, pathogenicity, and immune evasion are warranted. As we consider the potential of future variants, we must be mindful of the structural flexibility of genomic termini as an inherent source of variation. In view of the flexibility of the SARS-CoV-2 genome, and as more therapeutic agents become available, it will become increasingly important to determine the sequence of the SARS-CoV-2 variant affecting a patient in order to inform the choice of the most appropriate combination of antiviral drugs, as is currently done for HIV-1.



1. Patarca R, Haseltine WA. Circularization via complementary sequences in the 5’ and 3’ termini may facilitate replication of SARS coronaviruses. Authorea. January 04, 2022. DOI: 10.22541/au.164132044.46753705/v1

2. Tarun SZ Jr, et al. 1997. Proc Natl Acad Sci USA 94:9046

3. Spagnolo JF, Hogue BG. 2000. J Virol 74:5053

4. Patarca R, Haseltine WA. Structural flexibility of the SARS-CoV-2 genome relevant to variation, replication, pathogenicity, and immune evasion.

5. Chan AP, et al. 2020. mSphere 5:e00754

6. Tse H, et al. 2021. J Inf Dis 73:1696

7. Thorne LG, et al. 2021. Nature, Dec 23. DOI: 10.1038/s41586-021-04352-y


William R. Haseltine, PhD, is chair and president of the think tank ACCESS Health International, a former Harvard Medical School and School of Public Health professor, and founder of the university’s cancer and HIV/AIDS research departments. He is also the founder of more than a dozen biotechnology companies, including Human Genome Sciences.

Roberto Patarca, MD, PhD, is Chief Medical Officer at ACCESS Health International and a former pharmaceutical and medical device company executive and faculty member at Harvard Medical School and the University of Miami. His research has focused on diagnostics, pathophysiology, pharmacogenomics, and immunotherapy of infectious and other diseases.
