All processing steps, including the input and output of files and parameters, are visualized as a flowchart in Figure 1. As the entire pipeline is based on multiple sequence alignments, their quality is of great importance. Therefore, the parameters of MAFFT [1] are adjusted to meet the requirements of each alignment step. This is of particular importance when aligning the short primer sequences for visualization. In this case, the `--addfragments` parameter of MAFFT is used to properly align the short primers to their origin. Another important parameter is `--adjustdirection`, which enables the automated detection and adjustment of the strand direction in which sequences are provided, as well as the mapping of the reverse primers.
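The two kinds of MAFFT calls described above can be sketched as command lists. This is only an illustration and not the pipeline's actual invocation: the function names, the file names, and the use of `--auto` for the global alignment are assumptions, as is the combination of `--adjustdirection` with `--addfragments` in a single call.

```python
from typing import List


def mafft_global_cmd(fasta: str, adjust_direction: bool = True) -> List[str]:
    """Build a MAFFT command for a global alignment of all input sequences.

    Hypothetical helper: `--auto` is an assumed strategy choice.
    """
    cmd = ["mafft", "--auto"]
    if adjust_direction:
        # Let MAFFT detect and correct the strand direction of each sequence.
        cmd.append("--adjustdirection")
    cmd.append(fasta)
    return cmd


def mafft_add_primers_cmd(primers_fasta: str, alignment_fasta: str) -> List[str]:
    """Build a MAFFT command that adds short primers to an existing alignment.

    Hypothetical helper: the exact option set used by the pipeline may differ.
    """
    return [
        "mafft",
        "--addfragments", primers_fasta,  # short sequences to be added
        "--adjustdirection",              # also map reverse primers
        alignment_fasta,                  # existing alignment
    ]
```

MAFFT writes the resulting alignment to standard output, so a caller would typically capture it, for example with `subprocess.run(cmd, capture_output=True, text=True, check=True).stdout`.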
To avoid distortions of the consensus scores due to overrepresented sequences, identical and partial sequences are first removed from the alignment. The removal can be controlled via the `--gapthreshold` parameter, which accepts a value between 0 and 1. The default value of 0.2 results in the removal of all sequences of the alignment that consist of more than 20% gap symbols. To identify ideal consensus oligos, the consensus sequence is needed. Therefore, the pipeline uses MAFFT to align all input sequences in a global multiple sequence alignment and identifies the most common nucleotide for every position in the alignment. In addition, a consensus score is calculated for every alignment position as the ratio of the count of the most common nucleotide or gap symbol (-) at that position to the total number of sequences. All letters other than A, T, G, and C are treated as gaps. A perfectly conserved region, in which all sequences are identical at every position, is thus assigned a consensus score of 1. The pipeline allows the user to control the quality of the consensus sequence used for primer prediction via the `--consensusthreshold` parameter. The default value of 0.95 ensures that the most abundant nucleotide occurs in at least 95% of the sequences at the given position. In addition, a region above the threshold must span at least 20 contiguous nucleotides. All regions that do not meet these criteria are excluded from the subsequent primer prediction.
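The gap filter and the per-position consensus score can be illustrated with a minimal sketch. The function names are hypothetical, the input is assumed to be a list of equally long aligned sequences, and the gap filter is assumed to count only the `-` symbol; the pipeline's own implementation may differ in these details.

```python
from collections import Counter
from typing import List, Tuple


def filter_by_gaps(aligned: List[str], gap_threshold: float = 0.2) -> List[str]:
    """Keep sequences whose fraction of gap symbols does not exceed the threshold."""
    return [seq for seq in aligned if seq.count("-") / len(seq) <= gap_threshold]


def consensus_with_scores(aligned: List[str]) -> Tuple[str, List[float]]:
    """Return the consensus sequence and one consensus score per column."""
    total = len(aligned)
    symbols, scores = [], []
    for column in zip(*aligned):
        # Every letter that is not A, T, G or C is treated as a gap.
        normalized = [c.upper() if c.upper() in "ATGC" else "-" for c in column]
        symbol, count = Counter(normalized).most_common(1)[0]
        symbols.append(symbol)
        # Score = count of the most common symbol / number of sequences.
        scores.append(count / total)
    return "".join(symbols), scores
```

For the four aligned sequences `ATG-`, `ATGC`, `ACGC`, and `ATGC`, this sketch yields the consensus `ATGC` with scores 1.0, 0.75, 1.0, and 0.75. With the default threshold of 0.95, the second and fourth positions would fall below the threshold.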
Before the consensus regions are identified for primer design, all gaps are removed from the consensus sequence, together with the corresponding values of the consensus scores. This is necessary because gaps do not correspond to nucleotides and are therefore not relevant for primer design. Gaps in the consensus sequence are caused by insertions in one or more sequences. From this “gapless consensus sequence”, the regions relevant for primer design are identified.
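A minimal sketch of this step is given below, assuming the consensus is available as a string with one score per position. The function names are hypothetical, and regions are reported as half-open `(start, end)` index pairs, which is a choice made here for illustration only. The region search applies the consensus threshold and the minimum length of 20 nucleotides described earlier.

```python
from typing import List, Tuple


def strip_gaps(consensus: str, scores: List[float]) -> Tuple[str, List[float]]:
    """Drop gap positions from the consensus and the matching scores."""
    kept = [(c, s) for c, s in zip(consensus, scores) if c != "-"]
    return "".join(c for c, _ in kept), [s for _, s in kept]


def conserved_regions(
    scores: List[float], threshold: float = 0.95, min_length: int = 20
) -> List[Tuple[int, int]]:
    """Find contiguous runs with score >= threshold and length >= min_length."""
    regions = []
    start = None
    for i, score in enumerate(scores):
        if score >= threshold:
            if start is None:
                start = i  # a new run begins here
        else:
            if start is not None and i - start >= min_length:
                regions.append((start, i))
            start = None
    # Close a run that extends to the end of the sequence.
    if start is not None and len(scores) - start >= min_length:
        regions.append((start, len(scores)))
    return regions
```

Removing the gaps before searching for regions means that the reported coordinates refer directly to the gapless consensus sequence, which is the sequence later passed on for primer prediction.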
As Primer3 [2] searches for a primer pair in a contiguous sequence section, the pipeline does not specify the area in which primers are to be searched (SEQUENCE_INCLUDED_REGION and SEQUENCE_INTERNAL_INCLUDED_REGION). Instead, it excludes all areas in which primers are not to be searched (SEQUENCE_EXCLUDED_REGION and SEQUENCE_INTERNAL_EXCLUDED_REGION). This allows the prediction of primers in non-consecutive sequence segments. Furthermore, the gapless consensus sequence is automatically written into the Primer3 parameter file (SEQUENCE_TEMPLATE). All other parameters, such as melting temperatures or primer lengths, are taken from the user-defined Primer3 parameter file. For a detailed parameter description, see the Primer3 manual (https://primer3.org/manual.html). From this input, Primer3 predicts the optimal consensus primers and writes the results to a plain text file. The ConsensusPrime pipeline reads the Primer3 output and creates a comprehensive report in .html format that includes all details of the previous filter steps. The predicted primers are added to a final alignment that can be visualized using ClustalX [3].
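The conversion from conserved regions to excluded regions can be sketched as follows. The function names are hypothetical, and the conserved regions are assumed to be half-open, 0-based `(start, end)` pairs on the gapless consensus sequence. Primer3 expects excluded regions as space-separated `start,length` pairs in its `TAG=VALUE` input format, with a line containing only `=` terminating a record; 0-based positions assume that PRIMER_FIRST_BASE_INDEX is left at 0.

```python
from typing import List, Tuple


def excluded_regions(
    included: List[Tuple[int, int]], seq_length: int
) -> List[Tuple[int, int]]:
    """Return the complement of the included regions as (start, length) pairs."""
    excluded = []
    pos = 0
    for start, end in sorted(included):
        if start > pos:
            excluded.append((pos, start - pos))
        pos = max(pos, end)
    if pos < seq_length:
        excluded.append((pos, seq_length - pos))
    return excluded


def primer3_sequence_tags(template: str, excluded: List[Tuple[int, int]]) -> str:
    """Format the sequence-specific tags of a Primer3 input record."""
    lines = [f"SEQUENCE_TEMPLATE={template}"]
    if excluded:
        value = " ".join(f"{start},{length}" for start, length in excluded)
        lines.append(f"SEQUENCE_EXCLUDED_REGION={value}")
        lines.append(f"SEQUENCE_INTERNAL_EXCLUDED_REGION={value}")
    lines.append("=")  # record terminator
    return "\n".join(lines) + "\n"
```

In a complete input record, the tags from the user-defined parameter file (for example melting temperatures and primer lengths) would be placed before the terminating `=` line; this sketch only covers the tags that the text describes as being set automatically.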