Larch RNA-sequencing and transcriptome assembly
The Siberian larch nuclear genome and autumn bud transcriptome were sequenced and assembled in the Laboratory of Forest Genomics of Siberian Federal University (Krasnoyarsk, Russia) [13]. RNA was isolated from autumn buds of a reference Siberian larch tree using the Qiagen RNeasy Mini Kit (Qiagen, Hilden, Germany). The RNA-seq library was prepared using the TruSeq RNA Sample Preparation Kit v2 (Illumina Inc., San Diego, CA, USA). Paired-end sequencing of the library was carried out on the Illumina MiSeq platform using the MiSeq Reagent Kit v2 (300 cycles) (Illumina Inc., San Diego, CA, USA). The quality of the sequencing data was evaluated with FastQC v. 0.11.9. The raw reads were processed with Trimmomatic v. 0.39 [34] (9-bp headcrop, minimum read quality of Q = 23, and minimum read length of 35 bp). Ribosomal RNA was removed with SortMeRNA v. 4.0.0 [35], and sequencing errors were corrected with Rcorrector [36]. Finally, the Larix sibirica transcriptome was assembled de novo with Trinity v. 2.9.1 [37].
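The effect of the three trimming parameters on a single read can be illustrated with a minimal sketch. It is not the Trimmomatic implementation: the text does not state which Trimmomatic quality step was used, so the sketch assumes that Q = 23 is a threshold on the mean read quality, that qualities are Phred+33 encoded, and that the head crop is applied first. The function names are hypothetical.

```python
def phred33(quality_string):
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(char) - 33 for char in quality_string]


def trim_read(seq, quals, headcrop=9, min_mean_q=23, min_len=35):
    """Return the trimmed (seq, quals) pair, or None if the read is discarded.

    Assumptions (not stated in the text): the head crop comes first, and
    the quality threshold is applied to the mean quality of the cropped read.
    """
    # Remove the first `headcrop` bases and their quality values.
    seq, quals = seq[headcrop:], quals[headcrop:]
    # Discard reads that are empty or too short after cropping.
    if not seq or len(seq) < min_len:
        return None
    # Discard reads whose mean quality falls below the threshold.
    if sum(quals) / len(quals) < min_mean_q:
        return None
    return seq, quals
```

Under these assumptions, a 50-bp read of uniform quality Q40 is kept as a 41-bp read, whereas a 40-bp read is dropped because only 31 bp remain after the head crop.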
Search and identification of REs
To assess the relative abundance of previously characterized repeat families, RepeatMasker was run on the whole genome assembly of Siberian larch (12.3 Gbp).
To search for REs, we used RepeatModeler v.1.0.11, which is based on the de novo RE detection programs RepeatScout and RECON [38, 39]. Because RepeatScout analyses only a randomly selected subset of the scaffolds or contigs rather than all of them, the analysis was restricted to the 2,869 scaffolds of the larch genome assembly that are longer than 100 Kbp (Table 3).
Table 3 Siberian larch genome assembly and scaffolds longer than 100 Kbp selected for REs search and identification using RepeatModeler v.1.0.11
Assemblies | Number of scaffolds | Total length, bp | Max length, bp
Genome assembly | 11,325,800 | 12,342,093,815 | 354,326
Selected scaffolds longer than 100 Kbp | 2,869 | 360,016,106 | 354,326
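The scaffold selection summarized in Table 3 can be sketched as a simple length filter over a FASTA file. This is only an illustration, not the procedure actually used: it assumes that 100 Kbp means 100,000 bp and that "longer than" is a strict comparison, and all function names are hypothetical.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the identifier, i.e. the first word of the header.
            name, chunks = (line[1:].split() or [""])[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def select_long_scaffolds(records, min_length=100_000):
    """Keep scaffolds strictly longer than min_length (assumed 100 Kbp = 100,000 bp)."""
    return [(name, seq) for name, seq in records if len(seq) > min_length]


def summarize(records):
    """Report the quantities listed in Table 3 for a set of scaffolds."""
    lengths = [len(seq) for _, seq in records]
    return {
        "count": len(lengths),
        "total_bp": sum(lengths),
        "max_bp": max(lengths, default=0),
    }
```

Passing an open file handle to `read_fasta` and the result to `select_long_scaffolds` and `summarize` would give the number of scaffolds, total length, and maximum length of the selected subset.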
The de novo library derived with RepeatModeler was augmented with consensus sequences obtained by clustering frequently occurring reads from the whole-genome sequencing data. Clusters of reads were assembled with Inchworm from TrinityRnaSeq v2.2.0, yielding consensus sequences that should represent highly repeated regions of the larch genome. Unrecognized elements from the de novo repeat library generated by RepeatModeler and from the frequently occurring reads were classified with TEclass v2.1.3, which classifies transposons using Support Vector Machines (SVM) and an LVQ neural network [40]. The combined library used for the sequence similarity search comprised the RepeatModeler-derived library classified with TEclass, the RepBase library (Edition 2017.01.27), the MIPS Repeat Element Database library (Nussbaumer et al. 2013), the Custom Plant Repeat Database (Wegrzyn et al. 2013), and the Pine Interspersed Repeats Resource library PIER v1.0 (Wegrzyn et al. 2013; Neale et al. 2014).
Leucine-rich repeats (LRRs)
LRRs were searched for in the ORFs of the autumn bud transcripts of Siberian larch and Sitka spruce (NCBI GenBank accession number GACG00000000.1). ORFs were identified using Transdecoder v.5.5.0 (https://github.com/TransDecoder). The ORF sequences were scanned with HMMER 3.2.1 [41] against the Pfam models LRR-1 (ID PF00560), LRR-2 (ID PF07723), LRR-3 (ID PF07725), LRR-4 (ID PF12799), LRR-5 (ID PF13306), LRR-6 (ID PF13516), LRR-8 (ID PF13855) and LRR-9 (ID PF14580). All LRR models were obtained from the Pfam 32.0 database and belong to the LRR clan (ID CL0022). The LRR clan also contains the families LRR-10, LRR-11, LRR-12, FNIP, DUF285, Recep_L_domain, and TTSSLRR, but these were excluded because they represent bacteria, animals, and myxomycetes [42]. Statistics of the ORFs in the transcriptome assemblies are presented in Table 4.
Table 4 ORFs identified in the Siberian larch and Sitka spruce transcriptomes studied in autumn buds
Species | Number | Total length, bp | Maximum length, bp | N50, bp | N90, bp
Siberian larch | 22,116 | 4,315,585 | 1,980 | 207 | 110
Sitka spruce | 10,106 | 1,827,092 | 706 | 192 | 116
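For reference, the N50 and N90 statistics in Table 4 follow the usual definition: the length of the shortest sequence in the smallest set of longest sequences that together cover at least 50% (or 90%) of the total length. The sketch below illustrates that definition. The tool used to compute the published values is not stated, and the function name `nx` is hypothetical.

```python
def nx(lengths, percent):
    """Return the Nx statistic (e.g. percent=50 for N50) of a list of lengths.

    Sequences are taken from longest to shortest until their cumulative
    length reaches `percent` % of the total; the length of the last
    sequence added is the result.
    """
    if not lengths:
        raise ValueError("at least one sequence length is required")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer arithmetic avoids floating-point rounding at the threshold.
        if running * 100 >= percent * total:
            return length
```

For example, for ORF lengths of 10, 9, 8, 7, 6 and 5 bp (45 bp in total), the three longest sequences sum to 27 bp, which exceeds half of the total, so `nx(lengths, 50)` is 8; all six sequences are needed to reach 90%, so `nx(lengths, 90)` is 5.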
A search for NBS R-genes (NB-ARC; Pfam 32.0 model PF00931) was additionally performed to check whether any of the LRR-containing sequences belong to R-genes.
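The cross-check between the LRR and NB-ARC searches amounts to intersecting two sets of ORF identifiers. The sketch below shows one way to do this. It assumes, although the text does not say so, that the HMMER results were saved in the per-domain tabular format produced by hmmsearch (ORF in the first column, model name in the fourth, independent E-value in the thirteenth). The E-value cutoff of 1e-3 and the function names are hypothetical.

```python
def parse_domtblout(lines, max_ievalue=1e-3):
    """Map each ORF identifier to the set of models that hit it.

    Assumes hmmsearch per-domain tabular output: target (ORF) name in
    column 1, query (model) name in column 4, independent E-value in
    column 13. The cutoff is an illustrative assumption.
    """
    hits = {}
    for line in lines:
        # Skip blank lines and comment lines starting with '#'.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 22:
            continue  # not a complete per-domain record
        orf_id, model = fields[0], fields[3]
        if float(fields[12]) <= max_ievalue:
            hits.setdefault(orf_id, set()).add(model)
    return hits


def lrr_with_nbarc(lrr_hits, nbarc_hits):
    """Return the sorted ORF identifiers that have both LRR and NB-ARC hits."""
    return sorted(set(lrr_hits) & set(nbarc_hits))
```

ORFs returned by `lrr_with_nbarc` would be candidates for NBS-LRR R-genes, while ORFs present only in the LRR table carry LRR domains without a detectable NB-ARC domain under the chosen cutoff.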
Most of the computing was performed on a 96-core server with symmetric multiprocessing (IBM x3950 X6) and 3 TB of RAM. The computing cluster also included an IBM dx360 M4 hybrid computational server with two NVIDIA Tesla K20 GPUs, as well as an IBM Storwize V3700 storage subsystem (48 TB). The cluster runs SUSE Linux Enterprise Server 11 with the IBM GPFS parallel file system, the Ganglia monitoring system, and the Torque batch processing system.
Gene ontology (GO) analysis
OmicsBox [43] was used for BLAST searches, GO mapping, annotation and statistical analysis. Gene ontology (GO) terms associated with the obtained BLAST hits were extracted, and the GO annotation was then evaluated. The annotation step was reduced because graph construction was not possible with sequences extracted from the total sequence. Enzyme codes were inferred by mapping from equivalent GO terms, while InterPro motifs were queried directly at the InterProScan web service. The GO annotation was visualized by reconstructing the structure of the GO relationships and pathways.