Isolation of CD4+ T helper (TH) cells and TH1 polarization
Peripheral blood mononuclear cells (PBMC) were isolated from blood donor derived buffy coats through gradient centrifugation (Lymphoprep, Axis shields diagnostics, Dundee, Scotland). Naive CD45RA+ CD4+ T cells were subsequently isolated with magnetic bead separation using the “Naive CD4+ T Cell Isolation Kit II, human“ (Miltenyi Biotec, Bergisch Gladbach, Germany). The cells were then activated and polarized towards TH1 using Dynabeads™ Human T-Activator CD3/CD28 (1 bead/cell) (Dynal AS, Lillestøm, Norway), 5 ng/µl recombinant human IL-12p70, 10 ng/µl recombinant human IL-2 and 5 µg/µl anti-IL-4 antibodies (clone MAB204) (all three from, Bio-Techne, Minneapolis, USA). All primary CD4-T-cell cultures were cultured and differentiated at 37 ֯C, with 5% CO2 in RPMI 1640 media containing L-glutamine, 10% FBS and 1% Penicillin/Streptomycin mixture (all from Gibco, Paisley, United Kingdom).
RNA-seq and proteomics sample collection
Naive CD45RA+ CD4+ T-cells were isolated, differentiated and sampled as above at baseline, 0.5h, 1h, 2h, 6h and 24h for RNA-seq, and baseline, 1h, 2h, 6h, 24h and 5 days for proteomics. RNA was isolated using a ZR-Duet DNA/RNA kit (Zymo Research, Irvine, USA) and stored at -80°C until transport. During the protein extraction, multiple samples were pooled from twelve different individuals to reach the necessary amount of material for the three biological replicates required for subsequent analysis steps.
Mass-spec proteomics
The cells were lysed by sonication. Proteins were digested with trypsin through an in-solution digestion protocol and desalted peptides were labelled with 6-plex TMT reagents (Thermofisher Scientific, Massachusetts, USA). Then, the labelled peptides were mixed and separated using high-pH reverse-phase liquid chromatography, each fraction of which was analysed on an Orbitrap Fusion Lumos Tribrid mass spectrometer (Thermofisher Scientific, Massachusetts, USA). The tandem mass spectrometry data were analysed using MaxQuant (v 1.6.0.1). T he false discovery rate (FDR) for peptide level was evaluated to 0.01 for removing false positive data. For highly confident quantifications of protein, protein ratios were calculated from two or more unique quantitative peptides in each replicate. Data was normalized and removed contaminant and razor peptide. To enrich differentially expressed proteins (DEPs), we analysed the quantitative ratios (as the Log2 value). The fold-change ratio cut off was more than 2 or less than 0.5 based on intensity of 0 min. Searched data went through statistical process with Perseus (v 1.5.1.6). Detailed experimental procedures are provided in the Supplementary information.
RNA-seq library preparation and sequencing
RNA library preparation and the subsequent RNA-sequencing were carried out by the Beijing Genomics Institute (https://www.bgi.com/global/). Library preparation was performed using the TruSeq RNA Library Prep Kit v2 (Illumina, San Diego, USA). Each sample was sequenced to the depth of 40 million reads per samples with pair end sequencing and a read length of 100bp on an Illumina 2500 instrument.
Bioinformatics
All RNA-seq data were processed using the following pipeline. Sample qualities were assessed with fastQC and the mRNA reads were subsequently aligned using STAR[48], with the parameter “--outSAMstrandField intronMotif” and “--outFilterIntronMotifs RemoveNoncanonical”, to the “Homo_sapiens.GRCh37.75.dna.primary_assembly.fa” from Ensemble. The resulting read alignment bam files were assembled into transcripts with StringTie, with default parameters, using the GRCh37.75 gtf annotation from Ensemble. To evaluate mRNA to protein relationship, mRNA reads were mapped to the mass spectrometry signal of protein abundance using the Homo.sapiens and Mus.musculus package in [49]. Correlations were calculated using Pearson correlations across gene expressions, i.e. one coefficient per gene.
Splice variant model construction
We hypothesized that protein abundance could be predicted using a linear combination of the corresponding splice variants. To predict protein abundance, we used the Sklearn[50] implementation of the LASSO[51], an L1-penalized linear regression model.
Here, the time series of one protein is denoted the vector Y, and the corresponding time series of the splice variants are denoted by the matrix X. The rate constant for each splice variant is contained in the vector β. Furthermore, the λ parameter regulates the influence of the L1 term, and was determined individually for each protein. The λ term was chosen to minimize the prediction error of a leave-one-out cross validation. In the Th1 dataset, the time points differed such that the mRNA abundance also had a measurement at t=30 minutes, while the protein data instead had a measurement of t=120h. For comparison, the protein data for 30 minutes was interpolated, while the 120h time point was omitted. The same procedure was performed using the regulatory T cells from Schmidt et al. 2018[14] were Treg induced by either TGF-β, TGF-β and ATRA, or TGF-β and butyrate. Lastly, the same procedure was performed for mice B-cells were B-cell differentiation was induced by the Ikaros transcription factor (GSE75417). Pipe-line and code available from https://gitlab.com/Gustafsson-lab/splice_protein_predictions.
Cross validation
To select the values of λ and τ, a double cross-validation was performed. First, one of the time points of the protein measurements was removed from the set, leaving only 5 data points. Secondly, a leave-one out cross-validation was performed on the remaining 5 time points, giving an estimate of the accuracy of the model approach given a time delay and a lambda value for the penalty term in the Lasso operator. We used the 200 time-delays ranging between 0 and 24h, and a varying set of lambda parameters (increased until all parameters equaled zero). Thirdly, the time delay and penalisation that generated the smallest average squared residuals between the second cross-validation and the data were chosen and used to predict the 6th data point from splice variants. Fourth, this double cross-validation procedure was repeated for all 6 data points.
Disease prediction
Disease relevance of the splice variant models was tested by re-analysis of deep RNA-sequenced case control material of samples containing total CD4 + T-cells, i.e. CD4 + T-cells with all its sub-types. We found T-cell prolymphocytic leukemia (T-PLL, GSE100882), asthma in obese children (GSE86430), and allergic rhinitis/asthma (GSE75011) studies through a Gene Expression Omnibus (GEO) repository search and multiple sclerosis (MS) through collaboration[20]. For each of the studies’ datasets we used the TH1 and Treg derived models on how to combine mRNA splice variants to predict protein abundance. The resulting sets of predicted protein levels were tested for differential expression between patients and controls using a non-parametric Kruskal-Wallis test. We also applied Kruskal-Wallis tests to the individual splice variants that were used by the models. We assessed model effects by measuring the increase of nominally differential expression from model predictions compared to ingoing splice variants into the model. For eleven different proteins from the two largest biomarker studies of MS we could find protein measurements which was compared with our predicted proteins. One study reported 36 out of 92 proteins as significant [32] and another study [33] reported the expression of four proteins whereof two were significant. We found that all our predicted proteins differential expression agreed with the two studies (9/9 negatively reported from first study and 1/1 negatively and 1/1 positively reported from second study) and the corresponding P-value was calculated as ((92-36)/92)9 x (2/4)2 =2.9x10-3.
Protein validation
Three of the proteins predicted differentially expressed (Annexin A1, sCD40L and sCD27) were measured in cerebrospinal fluid (CSF) from two different cohorts, one with of 41 patients with newly diagnosed MS and 23 healthy matched controls and a second with 16 patients with relapsing remitting MS before and after one year of treatment with Natalizumab (See Patient cohort above and Supplemental Table S5). Quantification of sCD27 was performed using the Human Instant ELISA™ kit from eBioscience according to the instructions provided by the manufacturer. The optical densities (O.D.) were read at 450 nm with a wavelength correction at 620 nm in a Sunrise™ microplate reader (Tecan, Shanghai, China). Data acquisition was performed using Magellan™ version 7.1 computer software. The lowest detection limit was 0.63 U/ml and values below the detection limit were given half the value of the detection limit. Statistical differences were determined using Mann-Whitney U-test or Wilcoxon matched-pairs signed rank test in Graphpad Prism. Annexin A1, measured by the human Annexin A1 ELISA kit, was undetectable in all analyzed samples (n=32, of whom n=16 were included before and 16 after one year of treatment with Natalizumab). Multiplex Bead Technology was used to measure soluble CD40L according to the manufacturer's description. The samples were analysed on a Luminex®200™ instrument (Invitrogen, Carlsbad, CA, USA) and data was collected using xPONENT 3.1™ and analysed using the MasterPlex® Reader Fit. The lowest detection limit was 1.6 pg/ml and values below the detection limit were given half the value of the detection limit. sCD40L concentration was below the lowest detection limit in 71 out of 96 samples (74% undetectable) and was therefore considered as undetectable.