neoANT-HILL requires a variant list for potential neoantigen prediction. Our pipeline can handle either a previously generated VCF file (single- or multi-sample) or tumor transcriptome sequencing data (RNA-seq), in which case somatic mutation calling is performed following the GATK best practices [14–15] with Mutect2 [16] in tumor-only mode. In the current implementation, neoANT-HILL supports VCF files generated using the human genome version GRCh37. The variants are annotated by snpEff [17] to identify non-synonymous mutations (missense, frameshift and inframe).
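The selection of non-synonymous variants from snpEff output can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: it assumes the standard snpEff `ANN` INFO field (pipe-separated, with the effect in the second subfield, the gene name in the fourth, the feature ID in the seventh and the protein change in the eleventh), and the function name `nonsynonymous_annotations` and the set `KEEP_EFFECTS` are hypothetical.

```python
# Sequence Ontology terms used by snpEff for the three variant classes
# named in the text (missense, frameshift, inframe). The exact set used
# by neoANT-HILL is an assumption.
KEEP_EFFECTS = {
    "missense_variant",
    "frameshift_variant",
    "conservative_inframe_insertion",
    "disruptive_inframe_insertion",
    "conservative_inframe_deletion",
    "disruptive_inframe_deletion",
}


def nonsynonymous_annotations(info):
    """Yield (gene, feature_id, protein_change) for each retained ANN entry.

    `info` is the INFO column of one VCF record annotated by snpEff.
    """
    for field in info.split(";"):
        if not field.startswith("ANN="):
            continue
        for annotation in field[len("ANN="):].split(","):
            parts = annotation.split("|")
            if len(parts) < 11:
                continue  # malformed or truncated annotation
            # snpEff joins multiple effects on one annotation with '&'.
            effects = set(parts[1].split("&"))
            if effects & KEEP_EFFECTS:
                yield parts[3], parts[6], parts[10]
```

Synonymous and non-coding annotations are simply skipped, so a record with no retained annotation yields nothing.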
Once the VCF files have been annotated, the resulting altered amino acid sequences are inferred from the NCBI Reference Sequence database (RefSeq) [18]. For frameshift mutations, the altered amino acid sequence is inferred by translating the resulting cDNA sequence. Each altered epitope (neoepitope) is represented as a 21-mer sequence with the altered residue at the center. If the mutation is at the beginning or at the end of the transcript, the neoepitope sequence is built by taking the 20 following or preceding amino acids, respectively. The neoepitope sequence and its corresponding wild-type sequence are stored in a FASTA file. Non-overlapping neoepitopes can be derived from frameshift mutations.
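The 21-mer construction can be sketched as a window that is centered on the altered residue and shifted back inside the protein when the mutation lies near either end. The function name `build_window` and the handling of positions that are close to, but not exactly at, the termini are assumptions made for illustration.

```python
def build_window(protein, pos, flank=10):
    """Return up to 2*flank+1 residues around the 0-based index `pos`.

    With the default flank of 10 the window is a 21-mer. Near the start
    or end of the protein the window is shifted so that it still spans
    21 residues where the sequence is long enough.
    """
    if not 0 <= pos < len(protein):
        raise ValueError("pos is outside the protein sequence")
    size = 2 * flank + 1
    # Center the window, then clamp its start to the valid range.
    start = max(0, min(pos - flank, len(protein) - size))
    return protein[start:start + size]
```

Applying the same call to the wild-type and the altered protein gives the paired sequences that would be written to the FASTA file.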
A list of HLA haplotypes is also required. If these data are not provided by the user, neoANT-HILL includes the OptiType algorithm [19] to infer class I HLA molecules from RNA-seq data. The subsequent step is the binding affinity prediction between the predicted neoepitopes and HLA molecules. It can be executed on a single sample or on multiple samples, using parallelization with custom-configured parameters. The corresponding wild-type sequences are also submitted to this stage, which allows calculation of the fold change between the wild-type and neoepitope binding scores, also known as the differential agretopicity index (DAI) [20].
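As a minimal sketch of the fold-change calculation, assume the binding scores are predicted IC50 values in nM (lower means stronger binding) and that the ratio is taken as wild-type over neoepitope. Both the score type and the direction of the ratio are assumptions, and `dai_fold_change` is a hypothetical name.

```python
def dai_fold_change(wt_ic50_nm, mut_ic50_nm):
    """Fold change between wild-type and neoepitope binding scores.

    Assumes both scores are positive IC50 values in nM. A result above 1
    means the neoepitope is predicted to bind more strongly than its
    wild-type counterpart.
    """
    if wt_ic50_nm <= 0 or mut_ic50_nm <= 0:
        raise ValueError("IC50 values must be positive")
    return wt_ic50_nm / mut_ic50_nm
```

For example, a wild-type peptide predicted at 500 nM and a neoepitope predicted at 50 nM give a fold change of 10 under these assumptions.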
neoANT-HILL employs seven binding prediction algorithms from the Immune Epitope Database (IEDB) [21], including NetMHC (v. 4.0) [22–23], NetMHCpan (v. 4.0) [24], NetMHCcons [25], NetMHCstabpan [26], PickPocket [27], SMM [28] and SMMPMBEC [29], as well as the MHCflurry algorithm [30], for HLA class I. The user can specify the neoepitope lengths for which binding predictions are performed. Each neoepitope sequence is parsed using a sliding window. Our pipeline also employs four IEDB algorithms for HLA class II binding affinity prediction, including NetMHCIIpan (v. 3.1) [31], NN-align [32], SMM-align [33], and Sturniolo [34].
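The sliding-window parsing can be illustrated with a short generator that enumerates every peptide of a requested length from a neoepitope sequence. The name `sliding_kmers` and the optional `must_include` filter, which keeps only peptides covering a given residue index, are hypothetical additions and not a description of the pipeline's code.

```python
def sliding_kmers(seq, k, must_include=None):
    """Yield (start, peptide) for every length-k window over `seq`.

    If `must_include` is given (0-based index), only windows that cover
    that position are yielded, e.g. the altered residue of a 21-mer.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    for start in range(len(seq) - k + 1):
        if must_include is None or start <= must_include < start + k:
            yield start, seq[start:start + k]
```

With a 21-mer and a length of 9, the unfiltered window yields 13 peptides, of which 9 cover the central residue at index 10.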
Moreover, when raw RNA-seq data are available (in FASTQ format), neoANT-HILL can quantify the expression levels of genes carrying potential neoantigens. Our pipeline uses the Kallisto algorithm [35], and the output is reported in transcripts per million (TPM). Potential neoantigens arising from genes with an expression level below 1 TPM are excluded. In addition, neoANT-HILL can estimate, via deconvolution, the relative fractions of tumor-infiltrating immune cell types through the use of quanTIseq [36].
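The expression filter can be sketched as below. The column names `target_id` and `tpm` follow the tab-separated abundance table that Kallisto writes; the candidate record layout (a dict with a `transcript` key) and the function names are hypothetical, and the table is matched directly by identifier here, with no gene-level aggregation shown.

```python
import csv


def read_tpm(lines):
    """Map each target_id to its TPM from a Kallisto-style abundance table.

    `lines` is any iterable of text lines, such as an open file handle.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    return {row["target_id"]: float(row["tpm"]) for row in reader}


def filter_expressed(candidates, tpm_by_id, min_tpm=1.0):
    """Keep candidates whose source identifier reaches `min_tpm`.

    Identifiers missing from the table are treated as not expressed.
    """
    return [
        c for c in candidates
        if tpm_by_id.get(c["transcript"], 0.0) >= min_tpm
    ]
```

Candidates below the 1 TPM threshold, or absent from the abundance table, are dropped from the returned list.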
Our software was developed under a pre-built Docker image. The required dependencies are packaged together, which simplifies the installation process and avoids possible incompatibilities between versions. As previously described, several analyses are supported and each one relies on different tools. Several scripts were implemented in Python to fully automate the execution of these individual tools and the integration of their data.