Pipeline implementation
The Predector pipeline runs a range of commonly used effector and secretome prediction bioinformatics tools on a complete predicted proteome, accepted as input in FASTA format (Table 3), and combines all raw and summarised outputs into newline-delimited JSON, tab-delimited text, and GFF3 formats. The pipeline is implemented in Nextflow (version >20)53, and a conda environment and a Docker container are available for easy installation of dependencies, with scripts to integrate user-downloaded proprietary software into these environments. Predector is available from https://github.com/ccdmb/predector.
Datasets
The training and evaluation datasets consisted of: confirmed fungal effectors, fungal proteins with confirmed subcellular localisation, and an ‘unlabelled’ fungal protein set derived from whole proteomes of well-annotated, model fungal species. The experimentally confirmed effector protein dataset was curated from the literature, PHI-base38, and the EffectorP30,54 training datasets (Supplementary Table 2). Effector homologues were also identified from the literature (Supplementary Table 2) and by searching the UniRef90 fungal proteins (UniProtKB query: taxonomy:"Fungi [4751]" AND identity:0.9, UniProt version 2020_01, downloaded 2020-06-01) using MMSeqs2 version 11-e1a1c55, requiring a minimum reciprocal coverage of 70% and a maximum e-value of 10⁻⁵ (-e 0.00001 --start-sens 3 -s 7.0 --sens-steps 3 --cov-mode 0 -c 0.7). Fungal proteins with experimentally annotated subcellular localisation were downloaded from UniProtKB/SwissProt (version 2020_01, downloaded 2020-06-01) and were labelled “secreted” (non-transmembrane) or “non-secreted” (membrane-associated, endoplasmic reticulum-localised, Golgi-localised, or glycosylphosphatidylinositol (GPI)-anchored). UniProtKB download queries are provided in Supplementary Table 2. The ‘unlabelled’ whole-proteome dataset was derived from well-studied pathogens, with at least one representative chosen from a range of trophic phenotypes56: monomertrophs/biotrophs: Blumeria graminis f. sp. hordei57, Blumeria graminis f. sp. tritici58, Melampsora lini59, Melampsora larici-populina60, and Puccinia graminis f. sp. tritici61; polymertrophs/necrotrophs: Parastagonospora nodorum43,62, Pyrenophora tritici-repentis63, Pyrenophora teres f. teres64, and Pyrenophora teres f. maculata64; mesotrophs/hemibiotrophs: Leptosphaeria maculans41, Zymoseptoria tritici65,66, Passalora fulva67, and Dothistroma septosporum67; wilts/vasculartrophs: Fusarium oxysporum f. sp. lycopersici68,69 and Fusarium oxysporum f. sp. melonis70; and a saprotroph (or opportunistic monomertroph/biotroph): Neurospora crassa71 (Supplementary Table 2). Fourteen of the 24 proteomes above were retained as a separate dataset for final evaluation (Supplementary Table 2). The remaining datasets were combined, and redundant sequences were removed to prevent the undue influence of conserved or well-studied sequences with multiple records. Redundancy was reduced by clustering proteins with MMSeqs2 version 11-e1a1c55, requiring a minimum reciprocal coverage of 70% and a minimum sequence identity of 30% (--min-seq-id 0.3 --cov-mode 0 -c 0.7 --cluster-mode 0). A single sequence was chosen to represent each cluster of redundant sequences, prioritised based on supporting information (in order of preference): known effector, SwissProt secreted, SwissProt non-secreted, proteome/effector homologue, and longest member of the cluster. Clusters corresponding to known effectors from the EffectorP 230 training and test datasets were automatically assigned to the training and test datasets in this study. A randomly selected subset of 20% of the remaining cluster representatives was also assigned to the test dataset. Data and scripts for generating the datasets are available at https://github.com/ccdmb/predector-data.
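The cluster-representative selection described above can be sketched as a simple priority sort, preferring the best-supported label and breaking ties by sequence length. This is an illustrative reconstruction, not the published scripts; the label names and tuple layout are assumptions.

```python
# Priority used to pick one representative per MMSeqs2 cluster
# (lower number = more preferred). Label names are illustrative only.
PRIORITY = {
    "effector": 0,
    "swissprot_secreted": 1,
    "swissprot_non_secreted": 2,
    "proteome_or_homologue": 3,
}

def choose_representative(cluster):
    """Pick a representative from a cluster of redundant sequences.

    `cluster` is a list of (sequence_id, label, length) tuples.
    Prefer the best-supported label; break ties by the longest sequence.
    """
    return min(cluster, key=lambda member: (PRIORITY[member[1]], -member[2]))[0]
```

With this ordering, a known effector always wins over a longer but less well-supported cluster member.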
Manual effector and secretion prediction scoring
Predicted proteins were ranked using the sum of several weight-adjusted scores derived from a range of software and methods (Table 3, Supplementary Table 3). A protein was annotated as “multiple_transmembrane” if it was assigned more than one transmembrane (TM) domain by either TMHMM or Phobius, and as “single_transmembrane” if it was assigned exactly one TM domain by TMHMM or Phobius (and neither predicted more than one). For the TMHMM “single_transmembrane” annotation, we added the additional constraint that if a signal peptide was predicted (by any method), the number of expected TM residues in the first 60 amino acids had to be less than ten. A protein was annotated as “secreted” if it was predicted to have a signal peptide by any method and was not annotated as a multiple-transmembrane protein.
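The membrane and secretion rules above can be expressed as a short decision function. This is a sketch of the stated logic, not Predector's source code; the argument names are illustrative.

```python
def annotate_membrane_and_secretion(tmhmm_tm_count, phobius_tm_count,
                                    tmhmm_first_60_exp_aa, any_signal_peptide):
    """Apply the TM-domain and secretion rules described in the text.

    Returns a set of annotation labels for one protein. Argument names
    are illustrative, not Predector's actual output column names.
    """
    labels = set()
    if tmhmm_tm_count > 1 or phobius_tm_count > 1:
        labels.add("multiple_transmembrane")
    else:
        # Phobius single-TM rule: exactly one predicted TM domain.
        phobius_single = phobius_tm_count == 1
        # TMHMM single-TM rule, with the extra signal-peptide constraint:
        # if any tool predicts a signal peptide, fewer than ten expected
        # TM residues are allowed in the first 60 amino acids.
        tmhmm_single = tmhmm_tm_count == 1 and (
            not any_signal_peptide or tmhmm_first_60_exp_aa < 10
        )
        if phobius_single or tmhmm_single:
            labels.add("single_transmembrane")
    # Secreted: any signal peptide prediction, and not multi-TM.
    if any_signal_peptide and "multiple_transmembrane" not in labels:
        labels.add("secreted")
    return labels
```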
Protein matches to PHI-base were summarised based on the experimental phenotypes of the matched proteins. Proteins were marked as a “phibase_effector_match” if they had any matches with the “Loss of pathogenicity”, “Increased virulence (Hypervirulence)”, or “Effector (plant avirulence determinant)” phenotypes; as a “phibase_virulence_match” if they had any matches with the “Reduced virulence” phenotype and not any of the effector phenotypes; and as a “phibase_lethal_match” if they had any matches with the “Lethal” phenotype. Proteins were also labelled as “effector_match”, “pfam_match”, or “dbcan_match” if they had a significant match to a custom database of known effectors, selected virulence associated Pfam HMMs, or selected virulence associated dbCAN HMMs, respectively (Supplementary Table 2).
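The PHI-base phenotype summarisation amounts to a small set of precedence rules, sketched below under the assumption that phenotypes are compared case-insensitively; this is an illustration of the stated rules, not the pipeline's code.

```python
# Phenotypes treated as effector evidence (lower-cased for comparison).
EFFECTOR_PHENOTYPES = {
    "loss of pathogenicity",
    "increased virulence (hypervirulence)",
    "effector (plant avirulence determinant)",
}

def summarise_phibase(phenotypes):
    """Summarise the PHI-base match phenotypes for one protein.

    `phenotypes` is the set of lower-cased phenotypes over all of the
    protein's PHI-base matches. The virulence label is only applied
    when no effector phenotype is present; the lethal label is
    independent of the other two.
    """
    labels = set()
    if phenotypes & EFFECTOR_PHENOTYPES:
        labels.add("phibase_effector_match")
    elif "reduced virulence" in phenotypes:
        labels.add("phibase_virulence_match")
    if "lethal" in phenotypes:
        labels.add("phibase_lethal_match")
    return labels
```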
Each protein was given two manually designed scores to evaluate effector or secreted protein candidates, based on the values and weights in Supplementary Table 3. The secretion score is the sum of the products of the value and weight for the transmembrane, secreted, signalp3_hmm, signalp3_nn, phobius, signalp4, deepsig, targetp, and deeploc parameters. The effector score is the secretion score plus the sum of the products of the values and weights for the EffectorP and homology parameters (effector match, virulence match, and lethal match).
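Both manual scores are weighted sums of per-tool values, which can be sketched as below. The feature names and weights here are placeholders for illustration; the actual values and weights are those given in Supplementary Table 3.

```python
def weighted_score(values, weights):
    """Sum of value x weight products over the weighted parameters.

    `values` maps parameter names to per-protein values; `weights`
    maps the same names to their weights (Supplementary Table 3).
    """
    return sum(values[name] * weights[name] for name in weights)

# Hypothetical weights and per-protein values, for illustration only.
secretion_weights = {"signalp4": 1.0, "deeploc": 1.0, "transmembrane": -2.0}
effector_weights = {"effectorp1": 2.0, "effector_match": 3.0}

values = {"signalp4": 1.0, "deeploc": 0.8, "transmembrane": 0.0,
          "effectorp1": 0.9, "effector_match": 1.0}

secretion_score = weighted_score(values, secretion_weights)
# Effector score = secretion score + weighted effector/homology terms.
effector_score = secretion_score + weighted_score(values, effector_weights)
```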
Learning to rank model training
A “learning to rank” pairwise machine learning method based on LambdaMART72 was developed using XGBoost73 to prioritise effectors. Effector homologues in the training dataset were held out as an informal validation set; known effector proteins were considered relevant (priority 2), and all other proteins in the training dataset were considered irrelevant (priority 1). To mitigate issues caused by unbalanced class sizes, effectors in the training data were given a weight of #irrelevant / #relevant and unlabelled proteins a weight of #relevant / #irrelevant. A subset of features output by the Predector pipeline, and model constraints on the direction of effect (indicated in brackets as + or - where a constraint was applied, with + indicating that increasing values of the feature can only contribute positively towards effector prediction), were selected based on the distributions of parameters (Supplementary File 1: 3-40): molecular weight, proportion of cysteines, proportion of tiny AAs (Gly, Ala, Ser, and Pro), proportion of small AAs (Thr, Asp, and Asn), proportion of non-polar AAs, proportion of basic AAs, EffectorP 1 probability (+), EffectorP 2 probability (+), ApoplastP probability (+), TMHMM TM count (-), TMHMM expected TM residues in the first 60 AAs, Phobius TM count (-), DeepLoc membrane probability (-), DeepLoc extracellular probability (+), DeepSig signal peptide (SP) prediction (+), Phobius SP prediction (+), SignalP 3 neural network D-score (+), SignalP 3 HMM S-score (+), SignalP 4 D-score (+), SignalP 5 SP probability (+), and TargetP secreted probability (+). The hyperparameters max_depth, min_child_weight, gamma, lambda (L2 regularisation), subsample (row subsampling), colsample_bytree, eta (learning rate), and num_boost_round (number of boosted trees) were optimised by maximising the normalised discounted cumulative gain (NDCG)74 for the highest 500 ranked proteins (NDCG@500) in 5-fold cross-validated training.
The final model was trained using the optimised hyper-parameters.
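The class weighting and the monotone-constraint setup described above can be sketched as follows. The feature order, constraint signs, and hyperparameter values shown are placeholders (the tuned values are not reproduced here); the parameter dictionary follows XGBoost's learning-to-rank conventions, where directional constraints are passed as a "(1,0,-1,...)" string matching the feature order.

```python
def instance_weights(labels):
    """Per-instance weights balancing class sizes, as described above:
    effectors (label 1) get #irrelevant / #relevant and unlabelled
    proteins (label 0) get #relevant / #irrelevant.
    """
    n_relevant = sum(1 for y in labels if y == 1)
    n_irrelevant = len(labels) - n_relevant
    return [n_irrelevant / n_relevant if y == 1 else n_relevant / n_irrelevant
            for y in labels]

# Direction-of-effect constraints in feature order (illustrative subset):
# +1 = can only contribute positively, -1 = negatively, 0 = unconstrained.
constraints = {"effectorp1": 1, "molecular_weight": 0, "tmhmm_tm_count": -1}
monotone = "(" + ",".join(str(sign) for sign in constraints.values()) + ")"

# Hyperparameters tuned by 5-fold CV maximising NDCG@500; the values
# below are placeholders, not the published optimum.
params = {
    "objective": "rank:pairwise",
    "eval_metric": "ndcg@500",
    "monotone_constraints": monotone,
    "max_depth": 6,
    "min_child_weight": 1,
    "gamma": 0.0,
    "lambda": 1.0,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "eta": 0.1,
}
# Training would then proceed with xgboost, e.g.:
#   dtrain = xgboost.DMatrix(X, label=labels,
#                            weight=instance_weights(labels))
#   model = xgboost.train(params, dtrain, num_boost_round=200)
```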
Model and score evaluation
The learning to rank model, manually designed scores, and EffectorP pseudo-probabilities were evaluated using rank summarisation statistics from the scikit-learn library75: the coverage error (the rank of the lowest-scoring effector), the label ranking average precision (LRAP; the average proportion of correctly labelled samples with a lower score than each position in the sorted results), the label ranking loss (the average number of results that are incorrectly ordered), and the normalised discounted cumulative gain (NDCG; the sum of all ranking priorities divided by the log2 of the rank position in the sorted list (DCG), normalised by the best theoretically possible DCG score)74. NDCG, LRAP, and label ranking loss were also evaluated for the top 50, 500, and 5000 proteins (indicated with the suffixes @50, @500, and @5000). Additionally, to compare the classification performance of the learning to rank model with the combined EffectorP and secretion prediction decisions, a decision threshold of 0 was set for the learning to rank model (with > 0 indicating an effector prediction), and the following classification metrics were calculated: precision (the proportion of predicted effectors that are labelled as true effectors), recall (the proportion of known effectors that are predicted to be effectors), accuracy (the fraction of correct predictions), balanced accuracy (the arithmetic mean of sensitivity and specificity, which is less affected by unbalanced datasets than accuracy), F1 score (the harmonic mean of precision and recall), and the Matthews correlation coefficient (MCC). For unbalanced datasets like the training set of effectors and non-effectors, MCC is considered a more reliable indicator of model performance than the other metrics mentioned above48.
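The four rank summarisation statistics are available directly in scikit-learn, where each takes binary relevance labels and the corresponding scores as 2D arrays (one row per ranked list). A minimal example on a toy ranking of five proteins, two of which are "effectors":

```python
import numpy as np
from sklearn.metrics import (coverage_error,
                             label_ranking_average_precision_score,
                             label_ranking_loss, ndcg_score)

# One ranked list: 1 marks a known effector, 0 everything else.
y_true = np.array([[1, 1, 0, 0, 0]])
# A perfect scoring puts both effectors ahead of all other proteins.
y_score = np.array([[0.9, 0.8, 0.3, 0.2, 0.1]])

cov = coverage_error(y_true, y_score)    # rank of the lowest-scoring effector
lrap = label_ranking_average_precision_score(y_true, y_score)
loss = label_ranking_loss(y_true, y_score)
# Truncated NDCG via the k parameter (cf. NDCG@50/@500/@5000 in the text).
ndcg_at_2 = ndcg_score(y_true, y_score, k=2)
```

For this perfect ordering the coverage error is 2 (both effectors are covered within the top two ranks), LRAP and NDCG@2 are 1, and the ranking loss is 0.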
Additionally, to evaluate the performance at different decision thresholds, the precision, recall, and MCC were calculated for 100 score thresholds along the range of each score, and the receiver operating characteristic (ROC) curves were plotted.
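The threshold sweep can be sketched as follows, using scikit-learn's classification metrics over evenly spaced cut-offs spanning each score's range; this is an illustrative reconstruction, not the published evaluation code.

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef, precision_score, recall_score

def threshold_sweep(y_true, scores, n_thresholds=100):
    """Precision, recall, and MCC at evenly spaced score thresholds.

    Returns a list of (threshold, precision, recall, mcc) tuples; a
    score strictly above the threshold counts as an effector prediction.
    """
    thresholds = np.linspace(scores.min(), scores.max(), n_thresholds)
    rows = []
    for t in thresholds:
        predicted = (scores > t).astype(int)
        rows.append((t,
                     precision_score(y_true, predicted, zero_division=0),
                     recall_score(y_true, predicted, zero_division=0),
                     matthews_corrcoef(y_true, predicted)))
    return rows
```

The resulting per-threshold values can then be plotted alongside the ROC curves.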
For the effector ranking scores, only known effectors were used as the relevant (positive) set, with the irrelevant (negative or unlabelled) set consisting of the secreted, non-secreted, and proteome datasets. Because EffectorP is intended to be run on secreted proteins, ranking statistics were calculated only for the subset of proteins predicted to have a signal peptide (by any method) and with fewer than two predicted TM domains (by either Phobius or TMHMM); classification statistics were calculated both on this secreted subset and for a combined classifier (secretion and EffectorP prediction) on the whole datasets. For the secretion ranking score, the positive set consisted of the known effectors and the SwissProt secreted set, and the negative set consisted of the SwissProt non-secreted proteins.