Earl Grey is available from a Github repository (https://github.com/TobyBaril/EarlGrey) and can be run on Linux distributions, such as Ubuntu. It can be installed on a local system or a HPC. Earl Grey is parallelised and makes use of multiple CPU threads to reduce runtime. Given that storage input/output speed can be a limiting factor when annotating TEs, analysing individual genome assemblies back-to-back (instead of running multiple genome analyses at once) can be most time efficient.
Earl Grey runs in Conda or Docker environments to avoid conflicts between tool versions and to streamline the installation procedure. Users who do not have RepeatMasker (Smit, Hubley and Green, 2013) and RepeatModeler2 (Flynn et al., 2020) installed, and who do not wish to do this, can install Earl Grey in an interactive Docker container, which will install and configure all dependencies automatically and provide a virtual machine in which to perform all Earl Grey analyses. For a Conda installation (which uses the Anaconda Distribution Platform), it is necessary to have RepeatMasker and RepeatModeler2 pre-installed. This is because the Conda implementations of these packages are difficult to configure with the expanded Dfam and RepBase libraries, and our aim is to minimise the difficulty of configuration for the user. Instructions for the installation of these dependencies are provided for users who do not currently have them installed. As a minimum requirement, RepeatMasker must be configured to use the Dfam database of repetitive DNA elements (Hubley et al., 2016) (tested with all versions from release 3.2 onwards). If users have access to RepBase RepeatMasker edition libraries (Jurka et al., 2005; Kapitonov and Jurka, 2008), they are encouraged to configure RepeatMasker with these in addition to Dfam. A configuration script provided with Earl Grey will automatically check for dependencies and configure the Conda environment and related tools to enable the user to run Earl Grey.
Once installed and configured, Earl Grey will run on a given input genome assembly in FASTA format with a single command. Prior to running Earl Grey, the “earlGrey” conda environment must be activated ‘conda activate earlGrey’. Once the conda environment is active, Earl Grey can be called with the command ‘earlGrey’. There are four required options, and 5 optional parameters (Table 1). For example, a run on the Homo sapiens genome, with the genome located in the current directory, could be started with the minimum parameters: ‘earlGrey -g homoSapiens.fasta -s homoSapiens -o ./homoSapiens_outputs/ -r 9606’.
Command Flag
|
Description
|
Required?
|
-g
|
Path to genome FASTA file
|
Y
|
-s
|
Name for results files (cannot contain spaces)
|
Y
|
-o
|
Path to output directory (. represents current directory).
|
Y
|
-r
|
RepeatMasker species search term used for the initial mask of known elements, in string (“Arthropoda”) or NCBI taxonomy ID (6656) format.
|
Y
|
-t
|
Number of threads (default: 1)
|
N
|
-l
|
Path to FASTA file containing a custom library for initial mask of known elements (e.g if the user has some previous manually-curated elements to use)
|
N
|
-i
|
Number of iterations to run the BLAST, Extract, Extend process (Default: 5)
|
N
|
-f
|
Number of flanking bases to add in each round of the BLAST, Extract, Extend process (Default: 1000)
|
N
|
-d
|
Maximum distance between two TEs to define clusters
|
N
|
Table 1. Parameters for Earl Grey.
Earl Grey runs through a multi-step TE curation and annotation pipeline to annotate a given genome assembly with all intermediate results saved in their respective directories (Figure 1, Table 2). Logs are printed to ‘stdout’ (the console) and saved to a log file in the Earl Grey output directory. The steps involved in the Earl Grey TE annotation procedure are outlined below:
1. The first step of Earl Grey prepares the input genome for analysis. Some tools used in the pipeline are sensitive to long header names. To prevent associated issues, header names are stored in a dictionary and replaced with generic headers using the naming convention ‘ctg_n’, where ‘ctg’ is a short generic name for each contig, and ‘n’ is a unique integer for each entry in the FASTA file. Ambiguous nucleotide IUPAC codes are replaced with “N” due to incompatibility with some tools, including the search engines used by RepeatMasker. The original input genome file is backed up and compressed, with the prepared input genome version saved under the same file name appended with the extension ‘.prep’.
2. Known repeats are identified and masked using RepeatMasker and a user-specified subset of the TE consensus libraries (i.e. RepBase and/or Dfam depending on RepeatMasker configuration). A sensitive search is performed ignoring low-complexity repeats and small RNA genes ‘-s -nolow -norna’, and a hard masked version of the input genome is produced with nucleotides within known TEs replaced with ‘N’.
3. The hard masked version of the genome is subsequently analysed with RepeatModeler2 for de novo TE identification. The optional LTR identification step included as part of RepeatModeler2 is not used, as we implement a separate LTR curation step later in the Earl Grey pipeline during the RepeatCraft stage, as this is a requirement for RepeatCraft (see step 12). RepeatModeler2 outputs a library of de novo consensus sequences using the following naming convention: “rnd-n_family-n#TE_Classification” (e.g rnd-1_family-256#LINE/R2-Hero).
4. The success of the RepeatModeler2 run is verified as failures can occur when annotating certain genome assemblies. For example, when annotating a genome where enough unsampled nucleotides remain to initiate a new round of RepeatModeler2, but where there are not enough unsampled long sequences for the additional round to run successfully, this leads to a program failure (e.g. https://github.com/Dfam-consortium/RepeatModeler/issues/118). If this occurs, Earl Grey will automatically restart the RepeatModeler2 run with a reduced maximum stage number to ensure it runs successfully.
5. To reduce redundancy, the set of de novo consensus sequences are clustered using cd-hit-est (Li and Godzik, 2006; Fu et al., 2012) with parameters satisfying the TE family definition of Wicker et al. (2007), implemented as described by Goubert et al. (2022) ‘-d 0 -aS 0.8 -c 0.8 -G 0 -g 1 -b 500 -r 1’. This step is required due to RepeatModeler2 generating multiple redundant consensus sequences for a single TE family.
6. To generate maximum-length de novo TE consensus sequences, the de novo TE library is processed with an automated implementation of the “BLAST, Extract, Extend” (BEE) process described by Platt et al. (2016). Each iteration consists of three main steps: (i) BLAST. A BLASTn (Camacho et al., 2009) search is performed with a de novo TE majority rule consensus sequence (generated by RepeatModeler2) as the query, and the input genome as the subject, to obtain up to the top 20 hits for each TE consensus sequence. Up to the top 20 hits are taken to ensure only the highest quality TE sequences of the TE family are used to generate new consensus sequences, as including more hits introduced a higher number of ambiguous positions in new consensus sequences during testing. Top hits are selected based on BLASTn scoring parameters: bitscore, percentage identity, and length of match, where hits with the highest bitscores (indicating statistical significance of an alignment between the query TE sequence and subject genome assembly), highest percentage identities (indicating the proportion of identical positions shared in the query TE sequence and subject genome assembly), and longest lengths of match are selected. (ii) Extract. Genomic sequences corresponding to the selected BLASTn hits are extracted from the input genome. (iii) Extend. The extracted hits are extended by adding 1kb of flanking sequence to both the 5’ and 3’ ends (other flank lengths can be defined by the user). The resultant output files each contain sequences corresponding to a single TE family sequence with extended flanks. Each set of TE sequences is aligned using MAFFT (version 7.487) (Katoh and Standley, 2013) with the ‘--auto’ flag to generate a multiple sequence alignment of the genomic TE sequences with their extended flanks. Resultant alignments are trimmed with trimAl (version 1.4.rev22) (Capella-Gutiérrez, Silla-Martínez and Gabaldón, 2009) to retain high quality positions in the alignment (‘-gt 0.6’). Updated consensus sequences are generated with EMBOSS (version 6.6.0) (Rice, Longden and Bleasby, 2000) cons (‘-plurality 3’), resulting in new TE consensus sequences with extended ends.
7. The updated consensus sequence is then used as the initial query in a new BEE process, until five iterations have been performed, but up to ten iterations can be specified by the user.
8. Following the BEE process, redundancy is again removed by clustering sequences likely to represent the same family using cd-hit-est with the same parameters applied in the previous clustering step. It is important to include this post-BEE clustering step to reduce TE library redundancy following the generation of extended TE consensus sequences, as the process of generating extended TE consensus sequences can reveal homology between library sequences.
9. The resulting TE consensus sequences then undergo a TE sequence reclassification step, where similarity between the de novo consensus sequences and known TEs is identified using a RepeatMasker run. If sufficient similarity is not detected, the repeat family is designated as ‘Unclassified’.
10. The curated de novo consensus sequences are combined with the known TE library subset used during the initial RepeatMasker step to produce a combined TE library.
11. This combined TE library is used to annotate TEs in the input genome using RepeatMasker applying a conservative score threshold of 400 (‘-cutoff 400’) to exclude poor matches unlikely to be true TE sequences.
12. The input genome is then analysed with LTR_Finder (version 1.07) (Xu and Wang, 2007), using the LTR_Finder parallel wrapper (Ou and Jiang, 2019), to identify full-length LTR elements.
13. The TE annotations from the final RepeatMasker run are defragmented and combined with the LTR_Finder results using the loose merge ‘-loose’ process in RepeatCraft (Wong and Simakov, 2018) (https://github.com/niccw/repeatcraftp), which produces a modified GFF file containing the refined TE annotations.
14. Following defragmentation, Earl Grey removes overlapping TE annotations using a custom R script employing GenomicRanges (Lawrence et al., 2013), which ignores strand information and retains the longest TE of overlapping pairs.
15. Finally, to decrease the incidence of spurious hits unlikely to be true TE sequences, all TE annotations less than 100bp in length are removed before the final set of annotated TEs are quantified.
16. Summary figures are generated. Earl Grey automatically produces figures providing a general overview of TEs in the input genome (Figure 2): (i) A pie chart illustrating the proportions of the genome assembly annotated with the main TE classifications and non-TE sequence; (ii) A repeat landscape plot, which illustrates the genetic distance between each identified TE and their respective consensus sequences (calculated using ‘calcDivergenceFromAlign.pl’ utility of RepeatMasker), broadly indicative of patterns of TE activity (recently active TE copies are assumed to have low levels of genetic distance to their respective family consensus). Consistent colour keys are used for the pie chart and repeat landscape to facilitate comparability.
17. The final TE annotation file is analysed with bedtools cluster to identify clusters of TEs. By default, TEs are considered to be part of the same cluster if they are separated by <200bp, although this can be user-defined.
18. Upon completion, main results are stored within the summary files directory. This directory contains: (i) TE annotation coordinates in both GFF3 and Bed format; (ii) The combined TE library used for the final RepeatMasker annotation; (iii) The de novo TE library containing all de novo TE consensus sequences processed using the BEE methodology; (iv) Summary figures.
Path to Directory
|
Contents
|
outputDirectory/{speciesName}EarlGrey
|
All Earl Grey outputs. The EarlGrey directory is created within the directory designated as the output directory by the user, and where {speciesName} is set with the -s flag.
|
All directories described below are found within {speciesName}EarlGrey/
|
/{species}_RepeatMasker
|
Results from the initial RepeatMasker run.
|
/{species}_Database
|
Hard masked input genome database required for RepeatModeler2.
|
/{species}_RepeatModeler
|
Results of the RepeatModeler2 run, including the raw and clustered de novo TE consensus families.
|
/{species}_BLASTN
|
Results of the first BLASTn search to identify copies of de novo TEs in the input genome.
|
/{species}_ExtractAlign
|
Results of the iterative BEE process, including the improved de novo TE consensus library with reduced redundancy.
|
/{species}_Curated_Library
|
Subset of known TE sequences used in initial RepeatMasker run, improved de novo TE consensus library, and a file of the two combined. All in FASTA format.
|
/{species}_Masked_de_novo_Repeats
|
Data for the reclassification of TEs in the de novo library.
|
/{species}_RepeatMasker_Against_Custom_Library
|
Results of the final RepeatMasker run using the combined known and de novo repeat libraries.
|
/{species}_RepeatLandscape
|
.divsum file generated by RepeatMasker. Used to calculate percentage divergence from consensus for production of repeat landscape plot.
|
/{species}_mergedRepeats
|
Results of TE defragmentation process.
|
/{species}_summaryFiles
|
Main results of Earl Grey: TE loci in input genome (GFF3 and bed), combined TE consensus library, processed de novo TE consensus library, TE proportion pie chart, and TE landscape plots.
|
/{species}_clusTErs
|
Results of TE clustering step
|
Table 2. Directories created by Earl Grey, and the outputs generated in each.