Data Collection and Pre-processing
Publicly available data from the Sequence Read Archive (SRA) database under accession number PRJNA328248 were used for this study. The FASTQ files were downloaded directly from the European Nucleotide Archive browser (https://www.ebi.ac.uk/ena/browser/home), and read quality was checked using FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) [15]. Fig 1 depicts the workflow.
Read Alignment and Transcript Assembly
After the quality check, the reads were aligned to the human reference genome (GRCh38) using hierarchical indexing for spliced alignment of transcripts (HISAT2 v2.1.0) with default settings. HISAT2 is a splice-aware aligner based on the Burrows-Wheeler transform and the Ferragina-Manzini (FM) index (http://www.ccb.jhu.edu/software/hisat/) [16]. The HISAT2 output (SAM format) was converted to a BAM file and sorted. The sorted BAM file served as input for transcript assembly with StringTie, which produces accurate, complete reconstructions of genes and estimates the expression levels of the transcripts [17].
Identification of Novel lncRNA
The merged transcript assembly from StringTie was used to identify novel lncRNAs. First, transcripts longer than 200 nucleotides with strand information were selected, and this filtered subset was compared with the hg38 annotation file using Gffcompare. Transcripts with the class codes “i”, “u”, and “x”, which represent non-coding regions, were retained for subsequent steps [18]. Transcripts with an open reading frame (ORF) identified by TransDecoder were discarded. The remaining transcripts were checked for coding potential using CPAT (Coding Potential Assessment Tool) and PLEK (Predictor of Long Non-coding RNAs and Messenger RNAs Based on an Improved k-mer Scheme; https://sourceforge.net/projects/plek/files/) [19]. CPAT is an alignment-free logistic regression model that identifies non-coding regions of transcripts [20]. It performs better in terms of sensitivity, specificity, and accuracy than other prediction tools such as CPC and PhyloCSF. Transcripts with a score of less than zero were carried forward for further analysis. These transcripts were searched with standalone BLASTX against the Swiss-Prot database to check for false positives. Transcripts with an alignment E-value > 10^-5 were removed. The BLASTX output was then subjected to BLASTN against the LNCipedia and NONCODE databases to obtain the novel lncRNAs.
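The first two filters described above (length and strand, then Gffcompare class code) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Transcript` record and its field names are hypothetical, not a Gffcompare output format, and the two filters are combined into one predicate for brevity, although the text applies them in sequence.

```python
from typing import NamedTuple

# Class codes retained in the text: "i" (intronic), "u" (intergenic), "x" (antisense).
NONCODING_CLASS_CODES = {"i", "u", "x"}
MIN_LENGTH = 200  # transcripts must be longer than this many nucleotides


class Transcript(NamedTuple):
    # Hypothetical record used only for this sketch.
    transcript_id: str
    length: int      # transcript length in nucleotides
    strand: str      # "+", "-", or "." when the strand is unknown
    class_code: str  # Gffcompare class code


def is_candidate(t: Transcript) -> bool:
    """Return True if the transcript passes the length, strand, and class-code filters."""
    return (
        t.length > MIN_LENGTH
        and t.strand in {"+", "-"}
        and t.class_code in NONCODING_CLASS_CODES
    )


def select_candidates(transcripts):
    """Keep only the transcripts that pass every filter, preserving input order."""
    return [t for t in transcripts if is_candidate(t)]
```

The later steps (ORF detection, coding-potential scoring, and the BLAST searches) depend on external tools and are not reproduced here.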
GC Content Analysis
EMBOSS geecee is an online tool that calculates the G+C content of nucleic acid sequence(s). It sums the number of G and C bases and writes the result to a file as a fraction in the interval 0.0 to 1.0 [21].
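The calculation itself is simple enough to show directly. The sketch below is not the EMBOSS implementation; it assumes the fraction is taken over the full sequence length and that input is case-insensitive.

```python
def gc_fraction(sequence: str) -> float:
    """Return the fraction of G and C bases in a nucleotide sequence (0.0 to 1.0)."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    # Sum the G and C counts, then divide by the total length (an assumption
    # about the denominator; ambiguous bases such as N count toward the length).
    gc_count = seq.count("G") + seq.count("C")
    return gc_count / len(seq)
```

For example, `gc_fraction("ATGC")` evaluates to 0.5, since two of the four bases are G or C.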
Differential Gene Expression Analyses
The BAM files were passed to the Subread package to generate the gene expression count matrix [22]. The count matrix was processed with DESeq2 to identify differentially expressed genes in EBOV-, RESTV-, and LPS-treated cells [23]. Genes meeting a logFC threshold of ±1.5 and an adjusted p-value < 0.05 were considered significant.
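The significance rule can be written as a small predicate. This sketch makes assumptions that the text does not state: that the threshold is inclusive (|logFC| >= 1.5), and that genes with a missing adjusted p-value are treated as not significant. The function and parameter names are illustrative.

```python
import math


def is_significant(log_fc, padj, lfc_cutoff=1.5, padj_cutoff=0.05):
    """Return True if a gene passes both the fold-change and adjusted p-value cutoffs."""
    # Treat missing values (None or NaN) as not significant (an assumption).
    for value in (log_fc, padj):
        if value is None or math.isnan(value):
            return False
    # Assumption: the +/-1.5 threshold is inclusive on both the up and down side.
    return abs(log_fc) >= lfc_cutoff and padj < padj_cutoff
```

A gene with logFC = -2.0 and adjusted p-value = 0.01 passes, while one with logFC = 1.2 fails regardless of its p-value.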
DE-lncRNA Target Prediction and Functional Annotations
To better understand the functional role of the DE-lncRNAs, nearby genes located within 100 kb upstream and downstream were identified for further investigation. The nearby genes were extracted using BEDOPS [24] and BEDTools [25].
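The idea of the 100 kb window can be illustrated in plain Python. This is not how BEDOPS or BEDTools are invoked; it is a minimal sketch that assumes BED-style coordinates (0-based, half-open), ignores strand, and uses hypothetical tuple layouts for the lncRNA and gene records.

```python
WINDOW = 100_000  # 100 kb on each side of the lncRNA


def nearby_genes(lncrna, genes, window=WINDOW):
    """Return names of genes overlapping the lncRNA locus extended by `window` bases.

    lncrna: (chrom, start, end)
    genes:  iterable of (chrom, start, end, name)
    Coordinates are assumed to be 0-based, half-open (BED style).
    """
    chrom, start, end = lncrna
    lo = max(0, start - window)  # clamp at the chromosome start
    hi = end + window
    return [
        name
        for g_chrom, g_start, g_end, name in genes
        # Two half-open intervals overlap when each starts before the other ends.
        if g_chrom == chrom and g_start < hi and g_end > lo
    ]
```

Here a gene counts as nearby if any part of it falls inside the extended window; a stricter definition (for example, requiring the whole gene to lie within the window) would need a different comparison.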
Gene Enrichment Analysis
Gene enrichment analysis helps to identify genes and proteins of interest among those generated by high-throughput studies. WebGestalt (WEB-based GEne SeT AnaLysis Toolkit) is a widely used tool for gene enrichment analysis. Significant terms were selected based on a p-value < 0.05.