Figure 3 summarizes how the different DNA elements needed for assembly are obtained from the original target gene sequence, as well as how primers and synthesized DNA fragments are designed to put together the final construct.
sgRNA design
sgRNA design criteria
The first step in the design algorithm is sgRNA selection. This is done by searching for instances of the chosen CRISPR enzyme's Protospacer Adjacent Motif (PAM) sequence, which defines possible sgRNAs. GeneTargeter supports PAM sequences for a variety of CRISPR enzymes, including Streptococcus pyogenes Cas9 (SpCas9) with a 3'-NGG PAM [50, 51] and Acidaminococcus sp. BV3L6 Cas12a (AsCas12a, previously known as Cpf1) with a 5'-TTTV PAM [52, 53].
Potential sgRNA sequences are evaluated based on their GC composition, presence of homopolymers, on-target scores and off-target scores. sgRNA GC richness is associated with better editing activity [54, 55]. Homopolymers have been reported as occasionally reducing sgRNA efficiency [54–56]. On-target scores, which gauge an sgRNA's predicted probability of cleaving the intended site, can be calculated with one of two algorithms: (1) the original Rule Set 2 score developed by Doench et al. for Cas9 sgRNAs [57] and updated as Azimuth (github.com/MicrosoftResearch/Azimuth); or (2) CINDEL scores, developed for use with Cas12a sgRNAs [53]. Finally, off-target activity scoring algorithms available include scores developed by Hsu et al. [58] and Cutting-Frequency Determination (CFD) scores developed by Doench et al. [57]. Both are based on aggregating the scores of pairwise comparisons between the sgRNA of choice and all other possible sgRNAs found in the target genome, with pairwise scores being determined using a coefficient matrix that takes mismatch position within the sgRNA into account, as well as mismatch identity in the case of CFD scores. Since the coefficient matrices for both scores were derived from SpCas9 data in murine and human cells, an enzyme-agnostic score based on unweighted coefficients in the position mismatch matrix is provided for use as a proxy score with Cas12a sgRNAs. At the time of writing, we are aware of no off-target score specifically trained on data of Cas12a sgRNA activity.
sgRNA search sequence space
- The selection process begins by compiling a list of sgRNA candidates from a given sgRNA search space. The way this search space is defined depends on whether the desired construct is intended to modify the 3' or 5' UTR for knockdown, or achieve knockout of the target gene (Fig. 3a).
- Knockdown via 3' UTR modification. A sequence search space is defined by taking the last 450 bp of the target gene's coding sequence and 125 bp downstream of its stop codon. The limits of the search space can be modified by the user. The search space is then examined to identify possible PAMs. Preference is given to PAMs downstream of the stop codon, as this allows for straightforward omission of the sgRNA target site within homologous regions of the donor vector and no required modification of coding sequence. The software excludes sgRNA sequences found within genes downstream of the one being targeted.
- Knockdown via 5' UTR modification. A sequence search space is defined by taking 125 bp upstream of the target gene's start codon followed by the first 450 bp of coding sequence. The limits of the search space can be modified by the user. The search space is examined to identify possible PAMs. Preference is given to PAMs upstream of the start codon due to practicality, as explained above. The software excludes sgRNA sequences found within genes upstream of the one being targeted.
- Knockout. To maximize the probability of successful knockout and homologous recombination, potential sgRNAs in this case are selected from a search range centered around the midpoint of the coding sequence.
All potential sgRNA sequences within the search space are extracted based on their PAM to create a list of candidates ready for filtering.
sgRNA candidate filtering
Once the list of potential sgRNAs is defined, it is subjected to a primary filter step, wherein each criterion can be user-modified (Fig. 2a). Using default settings, a potential sgRNA is discarded if it has: 1) less than 25% GC content; 2) more than 10 consecutive A or T bases, which can noticeably reduce cleavage efficiency [55, 56]; or 3) an on-target score below a user-defined minimum threshold value, set to 35% and 20%, respectively, for Azimuth and CINDEL scores.
All sgRNAs passing the primary filtering step are subjected to a secondary series of tests, which include checking for any 4-homopolymers or triple T sequences [54, 55], and filtering according to off-target activity scores. Threshold off-target values empirically determined as useful are a minimum aggregated score of 20% and a maximum score of 50% for any pairwise hit in the case of CFD scores, and a 75% minimum aggregate score and 5% maximum pairwise hit score for Hsu et al. scores, all of which are user-adjustable. An sgRNA is discarded from the main candidate list if: 1) it contains 4-homopolymers or triple T sequences; 2) its aggregated score is below the set threshold; or 3) any pairwise score for a specific off-target sequence in the genome is above the set threshold. If an sgRNA fails this second series of tests, it is kept as a backup sgRNA.
Once three valid sgRNAs are found or the search area is exhausted, the sgRNA search process terminates. If no sgRNAs (valid or backup) were found, an error is written to the output message file and the vector design process aborts. If no valid sgRNAs are found, but there are backups, the backup sgRNA with the highest GC content is selected as the main sgRNA. In this case, a warning is written to the message file and the process continues. If there is at least one valid sgRNA, selection of the main sgRNA is as follows (Fig. 3a). If a valid sgRNA would not require a recoded region because it is outside the coding sequence, only valid and backup sgRNAs outside the coding sequence are kept. When there are no valid sgRNAs outside the coding sequence, the procedure to select sgRNAs depends on which end of the gene is being targeted.
- Knockdown via 3' UTR modification. The valid sgRNA closest to the stop codon is selected. All valid downstream sgRNAs and any up to 50 bp (user-adjustable) upstream of the selected sgRNA are kept. Backup sgRNAs that are downstream of the most upstream valid sgRNA are kept.
- Knockdown via 5' UTR modification. The valid sgRNA closest to the start codon is selected. All valid upstream sgRNAs and any up to 50 bp (user-adjustable) downstream of the selected sgRNA are kept. Backup sgRNAs that are upstream of the most downstream valid sgRNA are kept.
- Knockout. The valid sgRNA with highest GC content is selected. All valid and backup sgRNAs are kept.
The remaining valid sgRNAs are ranked and labeled (sgRNA 1, 2, 3) according to GC content, with the highest rank assigned based on highest GC content (Fig. 3a). All other sgRNAs are left unlabeled and not taken into account for the rest of the design process. Multiple sgRNAs are annotated and taken into account in order to facilitate simple exchange of the target sgRNA in a finished plasmid vector, in case the first choice proves to be experimentally unsuitable.
LHR design
GeneTargeter selects LHR and RHR regions to maximize the ease of PCR amplification and DNA assembly, while avoiding alterations to the gene coding sequence and its splicing patterns, as well as any adjacent coding and non-coding sequences.
LHR selection begins by selecting the 3' endpoint of the LHR. This occurs differently according to the type of vector being built (Fig. 3b):
- Knockdown via 3' UTR modification. The 3' end of the LHR is initially assigned to be immediately upstream of either the stop codon or the most upstream sgRNA target site, whichever is more upstream.
- Knockdown via 5' UTR modification. The 3' end of the LHR is assigned to either the start codon or the most upstream sgRNA, whichever is most downstream.
- Knockout. LHR selection follows that of 5' UTR-modifying knockdowns, providing an LHR close to the 5' end of the gene.
If the most upstream sgRNA target site is within an intron, the 3' end of the LHR is moved to the 3' end of the immediately upstream exon, and trimmed 3'→5' until the 3'-proximal 40 bp region of the LHR has a melting temperature ≤55 ºC and is not in a repeat region (Fig. 3b). Melting temperature is calculated using nearest neighbor thermodynamics through BioPython [59]. If the 3' end of the LHR reaches ≤500 bp from the point where the search started and the melting temperature criterion is not satisfied, the LHR 3' end defaults to the initially assigned position, and a warning included with the returned design.
Next, the 5' end of the LHR is initially assigned to be 500 bp upstream of the selected 3' end. The 5' end of the LHR is moved upstream until the 5'-proximal 40 bp sequence of the LHR has a ≤55 ºC and is not in a repeat region (Fig. 3b). If this melting temperature condition is not satisfied and the LHR exceeds 750 bp, the 5' end is instead moved downstream until the melting temperature criterion is satisfied or the LHR reaches 400 bp. If this also fails, the 3' end position will be moved upstream to the next suitable end point, and the 5' end position search is repeated from the beginning at this new location. This process iterates until both start and end positions meet melting temperature criteria or until the LHR search space is exhausted, as explained above. If a solution satisfying the melting temperature criterion is not found prior to exceeding the limits of the sequence space available, the initially assigned 5' end is selected and a warning is issued.
RHR design
GeneTargeter selects LHR and RHR regions to maximize the ease of PCR amplification and DNA assembly, while avoiding alterations to the gene coding sequence and its splicing patterns, as well as any adjacent coding and non-coding sequences.
- Knockdown via 3' UTR modification. The 5' end is defined to be immediately downstream of the target gene's stop codon or the most downstream sgRNA, whichever is more downstream.
- Knockdown via 5' UTR modification. The 5' end is defined to be immediately downstream of the most downstream sgRNA.
- Knockout. RHR selection follows that of 3' UTR-modifying knockdowns, providing an RHR close to the 3' end of the gene to delete as much of the gene coding sequence as possible.
If the most upstream sgRNA target site is within an intron, the 5' end of the RHR is moved to the 5' end of the immediately downstream exon, and trimmed 5'→3' until the 5'-proximal 40 bp region of the RHR has a melting temperature melting temperature ≤55 ºC and is not in a repeat region (Fig. 3c). If the 5' end of the RHR reaches ≤500 bp from the point where the search started and the melting temperature criterion is not satisfied, the RHR 5' end defaults to the initially assigned position, and a warning included with the returned design.
The 3' end of the RHR is defined and adjusted in the same way the as the LHR 5' end. The 3' end of the RHR is initially assigned to be 500 bp downstream of the selected 5' end. The 3' end of the RHR is moved downstream until the 3'-proximal 40 bp sequence of the RHR has a melting temperature ≤55 ºC and is not in a repeat region (Fig. 3c). If this melting temperature condition is not satisfied and the RHR exceeds 750 bp, the 3' end is instead moved upstream until the melting temperature criterion is satisfied or the RHR reaches 400 bp. If this also fails, the 5' end position will be moved downstream to the next suitable end point, and the 3' end position search is repeated from the beginning at this new location. This process iterates until both start and end positions meet melting temperature criteria or until the RHR search is exhausted, as explained above. If a solution satisfying the melting temperature criterion is not found prior to exceeding the limits of the sequence space available, the initially assigned 3' end is selected and a warning is issued.
Recoded region design
Recoded regions are used to supplement missing coding sequences for certain knockdown constructs. They are defined differently according to vector type (Fig. 3d):
- Knockdown via 3' UTR modification. If the end of the LHR is within a coding region, GeneTargeter creates a recoded region spanning the end of the LHR to the end of the gene (excluding the stop codon).
- Knockdown via 5' UTR modification. If the start of the RHR is within a coding region, GeneTargeter creates a recoded region spanning the start of the gene (excluding the start codon) to the start of the RHR.
- Knockout. No recoded regions are necessary to knock out a gene.
If there are intronic regions within this sequence, the corresponding cDNA is used to synthesize a more compact recoded region. This is recodonized to preserve reading frame using T. gondii codon frequencies as the default to scramble sgRNA recognition sites and exclude restriction sites relevant during donor vector assembly. T. gondii, an apicomplexan parasite like P. falciparum, is selected because of its higher GC content, which facilitates both ease of synthesis and sequencing. Target sites of all ranked sgRNAs are recodonized to achieve pairwise CFD scores below 10%. This allows for modular exchange of sgRNAs in the final vector without the need to redesign and reassemble a new recoded region. Repetitive or homopolymeric sequences that may complicate assembly are recodonized. If the combined length of the recoded region and Gibson homology flanking sequences is below a default of 250 bp, the recoded region is extended until this minimum size requirement is met for compatibility with standard commercial synthesis.
Plasmid vector design
GeneTargeter virtually assembles a complete donor vector using a selected base plasmid with predetermined insertion sites for each component. Plasmid backbone pSN054 is used for 3' knockdowns, while plasmid pSN150 is used for both 5' knockdown and knockout constructs [18]. The software designs Gibson Assembly compatible primers for PCR amplification of the LHR, RHR and recoded regions. Designs are initialized with a preferred 40 bp homology on either side of the start and end positions of the region. If the initial primer has a melting temperature <50 ºC, it is redesigned by increasing the length of the primer until the maximum primer length is achieved. The user is warned if the temperature difference between primer pairs is ≥5 ºC. If needed, a recoded region gene fragment with 40 bp of homology to the plasmid backbone and either the LHR (for 3' UTR modification) or RHR (5' UTR modification) is designed, along with PCR primers for its amplification (Fig. 3e). All targeting elements and oligonucleotides are annotated both in the completed donor vector and edited gene locus output files.
Advanced methods and parameters
GeneTargeter can produce designs with minimal user intervention using only gene sequences input as GenBank files. However, the user may choose to manually specify one or multiple sgRNA sequences within the gene for the software to prioritize, or even the LHR or RHR sequences to use. In these cases, the software requires additional annotations for user-selected sgRNAs. The sgRNAs should contain “gRNA” in the annotation label and a number indicating prioritization order. Additionally, user-chosen annotations for the LHR and RHR (containing the capitalized words “LHR” and “RHR” in the labels, respectively) can be added to the file, in which case the software uses these custom regions instead of automatically selecting LHR and RHR. The user must specify the name of the annotation corresponding to the gene's coding sequence, if this name is different from the name of the file.
There are a variety of additional design parameters with default values that can be modified by the user (Table S4). At a global level, the user may select which plasmid to use, with options for gene knockout, 5' and 3' conditional knockdown. The user can also choose whether or not to include 5' epitope tags, and to do so only if the encoded protein does not have a predicted N-terminal trafficking signal. The user can also specify targets as “non-coding RNA” and exclusively search for editing sites flanking such regions.
Other adjustments can be made to homologous region selection, Gibson Assembly primer design and creation of recoded regions. The size of all homologous regions can be altered by manually adjusting minimum and maximum limits. Homologous region boundaries can be adjusted by changing the maximum allowable distance between the end of the LHR and start of the first sgRNA, and between the gene's stop codon and start of the RHR. The user may also define a range around the start and stop positions of each homologous region for the software to scan when searching for optimal positions, once a valid homologous region is identified. Furthermore, the sequence length at the start and end of each homologous region considered during melting temperature analysis and the melting temperature thresholds are both user-adjustable.
For Gibson Assembly primers, the user can define minimum, maximum and preferred lengths of homology between fragments. Each primer is twice this length, with half the primer binding to the amplified region and the other acting as an overhang designed to add the region of homology with the next fragment in the assembly. Other primer parameters that can be adjusted, including minimum and maximum melting temperature, and allowable differences in melting temperature between primer pairs.
Customization of sgRNA selection can be achieved by specifying the minimum GC content of gRNAs, the CRISPR cutting enzyme and PAM sequence being used, on-target and off-target scoring algorithms, and their respective threshold values. GeneTargeter allows for two different families of CRISPR enzymes, Cas9 and Cas12, with variants covering fourteen different PAM sequences. On-target scoring algorithms include the Rule Set 2, Azimuth, and CINDEL scores, while off-target scores include Hsu (weighted and unweighted) and CFD algorithms.
The codon frequency table used for recodonization can be selected from a list including: apicomplexan T. gondii (default), P. falciparum, and P. vivax; non-apicomplexan model organisms Escherichia coli, Saccharomyces cerevisiae, and Rattus norvegicus; and a uniform-frequency, “codon scrambling” table. Two different codon optimization algorithms are available: Codon Adaptation Index (CAI) Maximization, in which each codon is replaced by its most common synonym according to the chosen frequency table [60], and Codon Sampling, in which codons are sampled randomly from their list of synonyms according to their individual relative frequency. In addition, the user can set threshold values for the minimum length of the recoded region.
Lastly, the user may specify the way GeneTargeter handles designs when the recoded region is under the minimum set threshold. The default course of action (i.e. extension of the recoded region) can be replaced by an alternative strategy in which oligonucleotides required for the preparation of the sgRNA sequence, along with 40 bp homology to be used in Gibson Assembly, through a primer annealing-extension reaction. If the length of the recoded region is less than 20 bp under this setting, the recoded region is included within the LHR reverse Gibson primer. The region will be designated for gene fragment synthesis regardless of the setting if its length exceeds the minimum gene fragment size.
Platforms and availability
GeneTargeter can run both as a web and command line application. In both cases, the software is composed of a number of separate, modular components handling different tasks (Fig. S1). At the core of the software is a Python script that processes input genes and produces output files. This core script draws on a series of both pre-existing and custom-built Python libraries, as well as external databases. These databases contain the empty vector constructs, sgRNA scoring databases [53, 57, 58], and codon frequency tables [61]. The user can interact with the software through a Web-based graphical user interface (GUI), available at genetargeter.mit.edu. This Web application was built using HTML5, CSS3, and JavaScript web technologies linked to a Python-run server. The server was built using Flask and Flask-SocketIO, and houses the main GeneTargeter scripts. Alternatively, the main GeneTargeter script can be run locally from a command line interface (CLI), useful if the user wishes to integrate the software within a larger bioinformatics pipeline or parallelize large batches of designs. The code for both the CLI and the web-based software can be found at github.com/pablocarderam/GeneTargeter. Input gene files for the CLI may be downloaded in bulk from PlasmoDB plasmodb.org (Supplementary File 3) for the case of the P. falciparum 3D7 genome.
Construct assembly and validation
DNA assemblies for pSN054-type constructs were carried out as explained in Nasamu et al. [18]. LHR and RHR DNA segments were obtained by PCR from genomic DNA isolated from the P. falciparum NF54 strain, while recoded regions and sgRNA cassettes were obtained by commercial synthesis. The individual fragments were introduced using Gibson Assembly [22] in two steps: first, for the RHR and the sgRNA cassette, and second for the LHR and recoded region. Assembly was confirmed by restriction enzyme digestions designed to reveal individual features of the plasmid, including the RHR and sgRNA cassette (AscI + I-SceI), LHR and Recoded Region (FseI + AsiSI), 10x TetR aptamer array alone (ApaI + XmaI), and 10x TetR aptamer array with PfHSP86 3' UTR (ApaI + XbaI). Digestion products were analyzed using 1% agarose gel electrophoresis. Construct assembly was additionally validated through Sanger DNA sequencing. All DNA primers and fragments required for Gibson assembly were designed by GeneTargeter (Supplementary Table S1).