RESOURCE AVAILABILITY
Lead Contact
Further information and requests for resources and reagents should be directed to and will be fulfilled by the Lead Contact, S. Rouskin ([email protected])
Materials Availability
This study did not generate new unique reagents.
Data and Code Availability
The source code for the data processing and analyses is available at http://dreem.wi.mit.edu/static/dreem.zip and http://dreem.wi.mit.edu/static/DREEM_Manual.pdf
The sequencing data are deposited into NCBI Gene Expression Omnibus (GEO), (accession number pending).
EXPERIMENTAL MODEL AND SUBJECT DETAILS
SARS-CoV-2 total viral RNA was extracted from Vero cells (ATCC CCL-81) cultured in DMEM (Gibco) supplemented with 10% FBS (Gibco) plated into 100 mm dishes and infected at a MOI of 0.01 with 2019-nCoV/USA-WA1/2020 (Passage 6). Infected cells were incubated at 37 °C, 5% CO2 and harvested 2 days post infection either with or without DMS treatment. Infected cell pellets were centrifuged at 5000xg for 5 min at 4 °C and resuspended in Trizol (Ambion).
METHOD DETAILS
DMS modification of SARS-CoV-2 RNA in infected cells
200 μl DMS (or 2% v/v) was added dropwise to the plated Vero cells 48 h post SARS-CoV-2 infection and incubated for 4 min at 37 °C. DMS was neutralized by adding 15 ml PBS (ThermoFisher Scientific) with 30% β-mercaptoethanol. The cells were centrifuged at 1,000g for 5 min at 4 °C. The cells were washed twice by resuspending the pellet with 15 ml PBS with 30% β-mercaptoethanol and centrifugation to pellet then just once with 15 ml PBS. After washes, the pellet was resuspended in 1 ml Trizol (ThermoFisher Scientific) and RNA was extracted following the manufacturer’s specifications. Total RNA was purified using RNA Clean and Concentrator -25 kit (Zymo).
DMS modification of in vitro-transcribed RNA
gBlocks were obtained from IDT for the SARS-CoV-2 92nt and 283nt FSE which corresponds to nucleotides 13460-13546 and nucleotides 13,342-13,624 based on 2019-nCoV/USA-WA1/2020. The regions of interest were amplified by PCR with a forward primer that contained the T7 promoter sequence (TAATACGACTCACTATAGGGTT). The PCR product was used for T7 Megascript in vitro transcription (ThermoFisher Scientific) according to manufacturer’s instructions with a 16 h incubation time at 37 °C. Subsequently, 1 μl Turbo DNase I (ThermoFisher Scientific) was added to the reaction and incubated at 37 °C for 15 min. The RNA was purified using RNA Clean and Concentrator -5 kit (Zymo). 10 μg of RNA in 10 μl H2O was denatured at 95 °C for 1 min then placed on ice. On the basis of the DMS concentration used in the next step, 300 mM sodium cacodylate buffer (Electron Microscopy Sciences) with 6 mM MgCl2+ (refolding buffer) was added so that the final volume was 100 μl. (e.g. for 2.5% final DMS concentration: add 87.5 μl refolding buffer and 2.5 μl DMS) Then, 2.5 μl was added and incubated at 37 °C for 5 min while shaking at 500 r.p.m. on a thermomixer. The DMS was neutralized by adding 60 μl β-mercaptoethanol (Millipore-Sigma). The RNA was purified using RNA Clean and Concentrator -5 kit.
DMS modification of full-length SARS-CoV-2 RNA in vitro
Full-length SARS-CoV-2 RNA was extracted from the supernatant of infected Vero cells (as described above), resuspended in 1 ml Trizol (ThermoFisher Scientific) and RNA was extracted following the manufacturer’s specifications. The RNA was purified using RNA Clean and Concentrator -5 kit (Zymo) and DMS modified as described above.
Human rRNA subtraction of total cellular RNA
15 μg of total RNA per reaction was used as the input for rRNA subtraction. First, 1 μl rRNA subtraction mix (15 μg/μl) and 2 μl 5× hybridization buffer (end concentration: 200 mM NaCl, 100 mM Tris-HCl, pH 7.4) were added to each reaction, and final volume was then adjusted with water to 10 μl. The samples were denatured at 95 °C for 2 min and then temperature was reduced by 0.1 °C/s until the reaction was at 45 °C. Next, 10 μl RNase H buffer and 2 μl hybridase thermostable RNase H (Lucigen) preheated to 45 ° were added. The samples were incubated at 45 °C for 30 min. The RNA was cleaned with RNA Clean and Concentrator -5, following the manufacturer’s instructions and eluted in 45 μl water. Then, 5 μl Turbo DNase buffer and 3 μl Turbo DNase (ThermoFisher Scientific) were added to each reaction and incubated for 30 min at 37 °C. The RNA was purified with RNA Clean and Concentrator -5 (Zymo) following instructions.
RT–PCR and sequencing of DMS-modified RNA
For reverse transcription, 1.5 μg of rRNA subtracted total RNA or 10 μg of in vitro-transcribed RNA was added to 4 μl 5× first strand buffer (ThermoFisher Scientific), 1 μl 10μM reverse primer, 1 μl dNTP, 1 μl 0.1M DTT, 1 μl RNaseOUT and 1 μl TGIRT-III (Ingex). The reverse-transcription reaction was incubated at 60 °C for 1.5 h. 1 μl 4M NaOH was then added and incubated at 95 °C for 3 min to degrade the RNA. The cDNA was purified with Oligo Clean and Concentrator -5 (Zymo) following instructions. PCR amplification was done using Advantage HF 2 DNA polymerase (Takara) for 30 cycles according to the manufacturer’s specifications. The PCR product was purified by DNA Clean and Concentrator -5 (Zymo) following manufacturer’s instructions. RNA-seq library for 150 bp insert size was constructed following the manufacturer’s instruction (NEBNext Ultra™ II DNA Library Prep Kit). The library was loaded on ISEQ-100 Sequencing flow cell with ISEQ-100 High-throughput Sequencing Kit and the library was run on ISEQ-100 (paired-end run,151 x 151 cycles).
Library generation with DMS-modified SARS-CoV-2 RNA
After rRNA subtraction (described above), extracted DMS-modified RNA from SARS-CoV-2 infected Vero cells was fragmented using the RNA Fragmentation kit (ThermoFisher Scientific). 1.5 μg of rRNA subtracted total RNA was fragmented at 70 °C for 2.5 min. The fragmented RNA was mixed with an equal volume 2× Novex TBE-urea sample buffer (ThermoFisher Scientific) and run on a 10% TBE-urea gel (ThermoFisher Scientific) at 200V for 1 h 15 min for size selection of RNA that is ~150nt. To dephosphorylate and repair the ends of randomly fragmented RNA, 2 μl 10x CutSmart buffer (New England Biolabs), 10 μl shrimp alkaline phosphatase (New England Biolabs), 2 μl RNaseOUT (ThermoFisher Scientific) and water were added to a final volume of 20 μl and 37 °C for 1 h. Next, 4 μl 50% PEG-800 (New England Biolabs), 4 μl 10× T4 RNA ligase buffer (New England Biolabs), 4 μl T4 RNA ligase, truncated KQ (England Biolabs) and 2 μl linker were added to the reaction and incubated for 18 h at 22 °C. The RNA was purified with RNA Clean and Concentrator -5, following the manufacturer’s instructions for recovery of all fragments and eluted in 10 μl water. Excess linker was degraded by adding 2 μl 10× RecJ buffer (Lucigen), 1 μl RecJ exonuclease (Lucigen), 1 μl 5′ deadenylase (New England Biolabs) and 1 μl RNaseOUT, then incubating for 1 h at 30 °C. The RNA was purified with RNA Clean and Concentrator -5, following the manufacturer’s instructions and eluted in 11 μl water.
For reverse transcription, 1.5 μg of rRNA subtracted total RNA or 10 μg of in vitro-transcribed RNA was added to 4 μl 5× first strand buffer (ThermoFisher Scientific), 1 μl 10μM reverse primer, 1 μl dNTP, 1 μl 0.1M DTT, 1 μl RNaseOUT and 1 μl TGIRT-III (Ingex). The reverse-transcription reaction was incubated at 60 °C for 1.5 h. 1 μl 4M NaOH was then added and incubated at 95 °C for 3 min to degrade the RNA. The reverse-transcription product was mixed with an equal volume 2× Novex TBE-urea sample buffer (ThermoFisher Scientific) and run on a 10% TBE-urea gel (ThermoFisher Scientific) at 200V for 1 h 15 min for size selection of cDNA that is ~250nt. The size-selected and purified cDNA was circularized using CircLigase ssDNA ligase kit (Lucigen) following manufacture’s protocol. 2 μl of the circularized product was then used for PCR amplification using Phusion High-Fidelity DNA Polymerase (NEB) for a maximum of 16 cycles. The PCR product was run on an 8% TBE gel at 180V for 1 h and size-selected for products ~300 nt. The product was then sequenced with iSeq100 (Illumina) to produce either 150×150-nt paired-end reads.
Dual-luciferase frameshift reporter assay
92nt and 3000nt FSEs which corresponds to nucleotides13460-13546 and nucleotides 12686-15609 based on 2019- nCoV/USA-WA1/2020 were inserted into dual luciferase reporter bewteen firefly luciferease (F-Luc) coding sequence and renilla luciferase (R-Luc) coding sequence in -1 frame. Insertion of 0-frame stop codon between FLuc and FSE element is used as negative control construct whilst a construct of matching length in which F-Luc and R-Luc were translated continuously without frameshifting is used as a positive control.
Frameshifting reporter as well as positive and negative control mRNAs were in vitro transcribed and polyadenylated using HiScribe T7 mRNA kit (New England Biolabs) according to the manufacturers’ instructions. Purified mRNAs were transfected in HEK293T cells in 24-well plates using Lipofectamine MessengerMAX (ThermoFisher). 24 hours after transfection, cells were washed once with phosphate-buffered saline (PBS), and lysed in Glo Lysis Buffer (Promega) at room temperature for 5 min. 10 µL of lysate was diluted with 30 µL PBS before being mixed with 40 µL Dual-Glo FLuc substrate (Promega). After 10 min, FLuc activity was measured in a GloMax 20/20 luminometer (Promega). Subsequently, 40 µL Dual-Glo Stop & Glo reagent was added to the mixture, incubated for 10 min, and measured for RLuc luminescence. The ratio between RLuc and FLuc activities minus the negative control background luminescence and normalized to positive control luminescence was calculated as frameshift efficiency.
QUANTIFICATION AND STATISTICAL ANALYSIS
Mapping and quantification of mutations
Fastq files were trimmed using TrimGalore (github.com/FelixKrueger/TrimGalore) to remove Illumina adapters. Trimmed paired reads were mapped to the genome of SARS-CoV-2 isolate SARS-CoV-2/human/USA/USA-WA1/2020 (GenBank: MN985325.1) (Harcourt et al., 2020) using Bowtie2 (Langmead and Salzberg, 2012) with the following parameters: --local --no-unal --no-discordant --no-mixed -L 12 -X 1000. Reads aligning equally well to more than one location were discarded. SAM files from Bowtie2 were converted into BAM files using Picard Tools SamFormatConverter (broadinstitute.github.io/picard).
For each pair of aligned reads, a bit vector the length of the reference sequence was generated using DREEM (Tomezsko et al., 2020). Bit vectors contained a 0 at every position in the reference sequence where the reference sequence matched the read, a 1 at every base at which there was a mismatch or deletion in the read, and no information for every base that was either not in the read or had a Phred score <20. We refer to positions in a bit vector with a 0 or 1 as “informative bits” and all other positions as “uninformative bits.”
For each position in the reference sequence, the number of bit vectors covering the position and the number of reads with mismatches and deletions at the position were counted using DREEM. The ratio of mismatches plus deletions to total coverage at each position was calculated to obtain the population average mutation rate for each position.
Filtering bit vectors
In cases indicated below, bit vectors were discarded if they had two mutations closer than 4 bases apart, had a mutation next to an uninformative bit, or had more than an allowed total number of mutations (greater than 10% of the length of the bit vector and greater than three standard deviations above the median number of mutations among all bit vectors). The average mutation rate for each position was computed from the filtered bit vectors in the same way as described above.
Normalizing the mutation rates
The mutation rates for all of the bases in the RNA molecule were sorted in numerical order. The greatest 5% or 10% of mutation rates (specified where relevant in the main text) were chosen for normalization. The median among these signals was calculated. All mutation rates were divided by this median to compute the normalized mutation rates. Normalized rates greater than 1.0 were winsorized by setting them to 1.0 (Dixon, 1960).
Computing genome coverage and mutation rates
Genome-wide coverage (Figure 1C) was computed by counting the number of unfiltered bit vectors from the in-cell library that contained an informative bit (0 or 1) at each position. Signal and noise plots (Figure 1D) were generated from the unfiltered population average mutation rate. A total of 103 (0.34%) positions across the genome were discarded for having a noise mutation rate greater than 1% in the untreated sample (likely due to endogenous modifications or “hotspot” reverse transcription errors). The signal and noise were computed every 100 nt, starting at nucleotide 51. For each of these nucleotides, the average mutation rate was computed over the 100 nt window starting 50 bases upstream and ending 49 bases downstream. The “signal” was defined as the average mutation rate of A and C, while the “noise” was defined as the average mutation rate of G and U.
The correlation of mutation rates between biological replicates genome-wide (Figure 1B) was computed using the unfiltered bit vectors. The correlation of mutation rates between different conditions of the FSE (Figure 4B) was computed using the filtered bit vectors. The correlation of mutation rates between clusters and biological replicates for the FSE (Figure 5B) was computed using the filtered bit vectors after clustering into two clusters. For all correlation plots, the Pearson correlation coefficient is given. A total of 6 (0.02%) outliers with >30% mutation rate were removed to prevent inflating the Pearson correlation coefficients.
Folding the entire SARS-CoV-2 genome
The unfiltered population average mutation rate was obtained from the in-cell library reads. The 29,882 nt genome of SARS-CoV-2 was divided into ten segments, each roughly 3 kb the boundaries of which are predicted to be open and accessible by RNAz (Rangan, Zheludev and Das, 2020). For each segment, the population average mutation rate was normalized. The segment was then folded using the Fold algorithm from RNAstructure (Mathews, 2004) with parameters -m 3 to generate the top three structures, -md to specify a maximum base pair distance, and -dms to use the normalized mutation rates as constraints in folding. All mutation rates on G and U bases were set to -999 (unavailable constrains). Connectivity Table files output from Fold were converted to dot bracket format using ct2dot from RNAstructure (Mathews, 2004). The ten dot bracket structures were concatenated into a single genome-wide structure.
The data-structure correlation index (DSCI)
The data-structure correlation index (DSCI) quantifies how well a secondary structure model is supported by DMS or SHAPE reactivity data, under the assumption that genuinely unpaired bases are more reactive than paired bases. Given a secondary structure model in which every base is designated as paired or unpaired, and reactivity values for all or for a subset of bases in the model, the DSCI is defined as the probability that a randomly chosen unpaired base will have greater reactivity than a randomly chosen paired base. It is equal to the following:
where p is the set of reactivities for all m paired bases (indexed by i) and u is the set of reactivities all n paired bases (indexed by j). Bases without reactivity information (such as Gs and Us for DMS data, and any problematic base) are excluded from p and u.
The DSCI is closely related to the Mann-Whitney U statistic (Mann and Whitney, 1947), which is obtained from the above equation without dividing by mn (assuming no ties in reactivities). The calculation is implemented in Python using the SciPy Stats MannWhitneyU function (Virtanen et al., 2020), and dividing the result by mn. If min(m, n) < 5, then we return a missing value to avoid biases caused by very low numbers of paired or unpaired bases.
The modified Fowlkes-Mallows index (mFMI)
Given two RNA structures of the same length (L) in dot-bracket notation, all base pairs in each structure were identified. Each base pair was represented as a tuple of (position of 5’ base, position of 3’ base). The number of base pairs common to both structures (P12) as well as the number of base pairs unique to the first structure (P1) and to the second structure (P2) were computed. Given these quantities, the Fowlkes-Mallows index (a measure of similarity between two binary classifiers) is defined as
As the Fowlkes-Mallows index does not consider positions at which the structures agree on bases that are unpaired, the index needed to be modified; otherwise regions with few base pairs would tend to score too low. Thus, the number of positions at which both sequences contained an unpaired base (U) was computed. Two variations of the modified Fowlkes-Mallows index (mFMI) were tested that differed in their treatment of externally paired bases, defined as bases paired to another base outside of the region of the structure being compared. The version of mFMI excluding external base pairs counted all externally paired bases as unpaired when computing U. The number of positions containing a paired base (P) was computed as P = L – U. In this case, mFMI was defined as mFMI = U/L = P/L x FMI, which weights the Fowlkes-Mallows index by the fraction of paired bases and adds the fraction of unpaired bases (U/L), as the structures agree at all unpaired positions.
To include external base pairs, any position containing an externally paired base was not counted in U. The number of positions at which both structures contained an externally paired base with the same orientation (i.e. both facing in the 5’ or 3’ direction) was computed as the number E. The number of positions at which at least one structure contained a base that was paired, but not externally, was computed as P. Then, the mFMI was defined as mFMI = U/L + E/L + P/L x FMI, which weights the Fowlkes-Mallows index by the fraction of positions containing a paired base and considers positions in which both bases are unpaired as in agreement, but only counts externally paired bases as agreeing if both structures contain an externally paired base at the same position and the base pairs have the same orientation.
Comparisons to previous in silico predictions
Excel files from the supplemental material of (Rangan, Zheludev and Das, 2020) were parsed to obtain the coordinates and predicted structures. For each predicted structure, agreement with the region of our structure with the same coordinates was computed using the mFMI, either including or excluding external base pairs (as specified in the text). Box plots of the agreement for each window (Figure 3B) show the minimum, first quartile, median, third quartile, and maximum; data lying more than 1.5 times the interquartile range from the nearest quartile are considered outliers and are plotted as individual points. The numbers of points in each box plot are given in the Results section for Figure 3B.
Folding the frameshift stimulating element
Reads from RT-PCR of a 283 nt segment of in-cell RNA spanning the FSE (nucleotides 13,342 - 13,624) were used to generate bit vectors. The bit vectors were filtered as described above, and the filtered average mutation rates were normalized. The RNA was folded using the ShapeKnots algorithm from RNAstructure (Hajdin et al., 2013) with parameters -m 3 to generate three structures and -dms to use the normalized mutation rates as constraints in folding. All signals on G and U bases were set to -999 (unavailable constrains). Connectivity Table files output from Fold were converted to dot bracket format using ct2dot from RNAstructure (Mathews, 2004).
Coronavirus sequence alignments
Accession numbers of curated sarbecovirus and merbecovrus genomes were obtained from (Ceraolo and Giorgi, 2020) and downloaded from NCBI. The sequences were aligned using the MUSCLE (Edgar, 2004) web service with default parameters. The region of the multiple sequence alignment spanning the two sides of Alternative Stem 1 was located and the sequence conservation computed using custom Python scripts.
For the alignment of all betacoronaviruses with genomes in NCBI RefSeq (O’Leary et al., 2016), all reference genomes of betacoronaviruses were downloaded from RefSeq using the query “betacoronavirus[organism] AND complete genome” with the RefSeq source database as a filter. The sequences were aligned using the MUSCLE (Edgar, 2004) web service with default parameters. The subgenus of betacoronavirus to which each virus belonged was obtained from the NCBI taxonomy database (Sayers et al., 2009).
Detecting alternative structures genome-wide
The reference genome (length = 29,882 nt) was partitioned into 373 regions of 80 nt each and one final region of 42 nt. For each region, reads were filtered out according to the criteria in “Filtering Bit Vectors” or if they did not overlap with at least 20% (16 nt) of the region. The reads were then clustered using the EM algorithm implemented previously (Tomezsko et al., 2020) using a maximum of two clusters per region, ignoring G and U residues, and setting all mutation rates less than 0.005 to 0.0.
After clustering, regions were filtered out if fewer than 100,000 reads mapped to the region (n = 42) or if either cluster contained a base with a mutation rate exceeding 30% (n = 16). For each remaining region with two clusters (n = 316), each cluster’s mutation rates (µ) were normalized by setting the base with the highest mutation rate to 1.0 and scaling the mutation rates of all other bases proportionally. For each base, the difference in DMS reactivities (∆DMS) between its mutation rate in cluster 1 (µ1) and cluster 2 (µ2) was calculated as . The coefficient of determination (R2) was also computed on the normalized DMS reactivities.
Detecting alternative structures of the FSE
The filtered bit vectors (the same used to fold the frameshift stimulating element) were clustered using the expectation maximization algorithm of DREEM to allow detection of a maximum of two alternative structures (Tomezsko et al., 2020).
Quantification of minus-strand reads
Mapped reads from the in-cell library were classified as minus-strand using a custom Python script if they had the following SAM flags (Li et al., 2009): PAIRED and PROPER_PAIR and ({READ1 and MREVERSE and not REVERSE} or {READ2 and REVERSE and not MREVERSE}) and not (UNMAP or MUNMAP or SECONDARY or QCFAIL or DUP or SUPPLEMENTARY).
Visualizing RNA structures
RNA structures were drawn using VARNA (Darty, Denise and Ponty, 2009). The bases were colored using the normalized DMS signals.