Samples and entropy metrics
Forty human peripheral blood mononuclear cell (PBMC) samples were received from each of four groups: (i) Severely affected people with ME/CFS (either house- or bed-bound; MEsa); (ii) Mildly or moderately affected people with ME/CFS (MEmm); (iii) people diagnosed with Multiple Sclerosis (MS; disease controls); and, (iv) Healthy controls (HC). Samples were sourced from the CureME Biobank [21] from female donors aged 40-60 years, chosen to reduce possible age- or sex-dependent confounding effects, although their limited availability among MEsa required us to source samples from younger donors (Table 1).
Table 1. PBMC samples
Age at sample collection
|
18-29 years
|
30-39 years
|
40-49 years
|
50-60 years
|
Severely affected people with ME/CFS, MEsa
|
8
|
9
|
6
|
17
|
Mildly- or moderately-affected people with ME/CFS, MEmm
|
0
|
0
|
20
|
20
|
People with Multiple Sclerosis, MS
|
0
|
0
|
12
|
28
|
Healthy controls, HC
|
0
|
1
|
16
|
23
|
Some samples were removed from the analysis due to quality or process issues. Six CD8+ samples were discarded: one had insufficient PBMCs, and another five libraries had been cross-contaminated. Among CD4+ samples, 12 were discarded: two had insufficient PBMCs, four had insufficient DNA (<80ng), three failed the UMI ligation step, two had a barcoding error, and one failed the pooling stage. This left 154 CD8+ samples and 148 CD4+ samples. Analysis of variance (ANOVA) for the experimental parameters (Table 2) showed that the only association with sample group was CD8+ cell number (), although this was not significant after accounting for multiple tests.
Table 2. Mean values of cell count and purity, and numbers of CD8+ or CD4+ samples.
|
PBMCs ()
|
CD8+ sample
number
|
CD8+ cells ()
|
CD8+ Purity (%)
|
CD4+ sample
number
|
CD4+ cells ()
|
CD4+ Purity (%)
|
MEsa
|
5.27
|
38
|
0.632
|
70.7
|
35
|
0.665
|
72.0
|
MEmm
|
4.50
|
38
|
0.541
|
70.3
|
37
|
0.703
|
76.0
|
MS
|
4.62
|
39
|
0.473
|
68.0
|
39
|
0.652
|
73.5
|
HC
|
4.72
|
39
|
0.585
|
69.8
|
37
|
0.698
|
73.6
|
All samples
|
4.78
|
154
|
0.558
|
69.7
|
148
|
0.679
|
73.7
|
Human herpesvirus infection is proposed to be a trigger for ME/CFS [22]. Cytomegalovirus (CMV, a member of the Herpesviridae family) also affects T-cell clonal diversity [23, 24]. CMV infection is very common in the UK [25], and thus could affect TCR clonotype diversity analyses. Nevertheless, CMV seropositivity, available for 98 of the 160 samples, was not significantly associated with sample group status (, χ2-test; Table 3).
Table 3. Cytomegalovirus seropositivity/seronegativity results, shown by group. Equivocal samples were those with indeterminate IgG levels.
Seropositivity
|
MEsa
|
MEmm
|
MS
|
HC
|
Positive
|
14
|
4
|
11
|
12
|
Negative
|
19
|
9
|
14
|
13
|
Equivocal
|
2
|
0
|
0
|
0
|
We chose to adopt Rényi entropy as our T-cell receptor diversity metric. Many metrics have been proposed for this purpose, most adapted from standard measures used in ecology [26]. Although Shannon entropy [27] and Simpson's diversity [28] are commonly used in the T-cell literature, these metrics are often applied uncritically, without considering what is being measured [29]. To avoid biasing results by choosing a specific diversity metric prior to analysis, we decided to use the more general, and thus more appropriate, Rényi entropy (Methods). For each cell type and a/b/g-chain combination we defined a vector of clonotype counts in a sample. Here, a clonotype is defined as the CDR3 region, plus the full V, D and J gene segments without considering pairing between a-b or g-d chains. Note that -chain data is not included, because recombination of the locus, which occurs first, preferentially removes the TCR-δ locus [30]. Next, we constructed a matrix of Euclidean distances between each vector pair, ensuring that the pair contained the same number of recombinants by randomly down-sampling the more-populous sample 1,000-times (Methods). We needed to down-sample because the TCR recombinant count varied over two orders of magnitude among study samples. Distances were calculated over a pre-set optimised range of a, the order of Rényi entropy. Next, for cell type and a/b/g-chain combinations we partitioned samples by their clonotypes, adapting the machine learning approach described in Greiff et al. [31] (Methods). Once distances were precomputed, investigators were unblinded to the group identity (e.g. MEmm or MS) of each CD8+ sample. CD4+ data were acquired following unblinding.
Multidimensional scaling (MDS) plots
The distance matrix for all three CD8+ chain types (TCR a-, b-, and g-chains) considered together was visualised using a multidimensional scaling (MDS) plot (Figure 1). The four groups (MEsa, MEmm, MS and HC) are not clearly separated in this plot’s two dimensions. Nevertheless, we noted the presence of seven MEsa, two MEmm and three MS disease cases located away from the main cluster at high x-values (). Age, rather than disease status, likely accounts for these apparent outliers. This is because among 21 variables (covering group, age, PBMC count, CD8+ counts pre- and post-selection, purity, DNA amounts, fractions of a/b/g-chains and counts of a/b/g-coding junctions; Table 4), only age (binned by decade) was strongly negatively and significantly correlated with x-axis values (). This likely reflects a narrowing of immune repertoire diversity due to reduced thymic activity (thymopoiesis) as we age [32], which explains our finding that age group is a significant predictor () of CD8+ TREC count. Indeed, retrieving the ages at donation of the MEsa outliers (Figure 1) showed that most (four of seven) were from individuals then aged 18-29 years.
Table 4: Variables tested as potential confounders with MDS plot x-axis using a generalised linear model fit.
Variable
|
Estimate
|
Std. Error
|
t-value
|
|
Significance
|
(Intercept)
|
1.25E+11
|
2.22E+11
|
0.564
|
0.574
|
|
ME/CFS (mm)
|
1.89E+02
|
3.20E+02
|
0.590
|
0.556
|
|
ME/CFS (sa)
|
4.51E+01
|
3.68E+02
|
0.123
|
0.903
|
|
MS
|
4.73E+01
|
3.23E+02
|
0.147
|
0.884
|
|
Age
|
-1.36E+03
|
4.29E+02
|
-3.16
|
0.00199
|
**
|
PBMC Number
|
1.35E+02
|
1.02E+02
|
1.32
|
0.190
|
|
CD8+ % Input
|
-4.54E+01
|
2.87E+01
|
-1.58
|
0.116
|
|
CD8+ Cells Post-selection
|
-7.90E+02
|
7.85E+02
|
-1.01
|
0.316
|
|
Purity
|
6.47E+00
|
8.34E+00
|
0.776
|
0.439
|
|
DNA Extracted (μg)
|
-9.18E+01
|
1.51E+02
|
-0.608
|
0.544
|
|
DNA Yield (ng)
|
6.11E-03
|
3.90E-01
|
0.016
|
0.988
|
|
Demultiplexed Reads (641)
|
1.23E-04
|
5.47E-05
|
2.25
|
0.0263
|
*
|
PEAR Assembled (641)
|
1.19E+01
|
1.69E+01
|
0.701
|
0.485
|
|
Demultiplexed Reads (782)
|
-1.56E-05
|
4.27E-05
|
-0.366
|
0.715
|
|
PEAR Assembled (782)
|
-1.39E+01
|
1.87E+01
|
-0.741
|
0.460
|
|
a Coding Junctions
|
5.74E-01
|
2.38E-01
|
2.41
|
0.0174
|
*
|
b Coding Junctions
|
-4.18E-01
|
3.32E-01
|
-1.26
|
0.211
|
|
g Coding Junctions
|
-8.88E-01
|
3.07E-01
|
-2.90
|
0.00452
|
**
|
a %
|
-1.25E+11
|
2.22E+11
|
-0.564
|
0.574
|
|
b %
|
-1.25E+11
|
2.22E+11
|
-0.564
|
0.574
|
|
g %
|
-1.25E+11
|
2.22E+11
|
-0.564
|
0.574
|
|
Significance levels for the t-test statistic are shown: for (***), (**), and (*). ME/CFS(sa), ME/CFS(mm) and MS were testing whether being a member of the given subgroup was correlated with the axis, relative to healthy controls (HC). Age (binned by decade) was the linear term of the model t for that covariate. Other covariates: PBMC Number, total number of cells per sample; CD8+% Input, percentage of input PBMCs that were CD8+ cells; CD8+ cells post-selection, number of CD8+ cells after the SureSelect process; Purity, corresponding CD8+ purity at this stage; DNA Extracted (μg), total amount of DNA initially recovered from each sample; DNA Yield (ng), amount remaining after size selection; Demultiplexed Reads [Library], total number of reads recovered for the two Illumina flowcells of sufficient quality (library numbers 641 and 782); PEAR Assembled [Library], percentages assembled into clonotypes for the same libraries using the Paired-End Read Merger tool [40]; a/b/g Coding Junctions, total numbers of coding junctions for each chain type; a/b/g %, percentages of each chain type, relative to the other two types (such that the sum was 100%).
Table 5: Variables tested as potential confounders with MDS plot y-axis using a generalised linear model fit.
Variable
|
Estimate
|
Std. Error
|
t-value
|
|
Significance
|
(Intercept)
|
4.93E+10
|
8.91E+10
|
0.553
|
0.581
|
|
ME/CFS (mm)
|
-1.48E+01
|
1.29E+02
|
-0.115
|
0.908
|
|
ME/CFS (sa)
|
1.77E+02
|
1.48E+02
|
1.20
|
0.235
|
|
MS
|
1.50E+02
|
1.30E+02
|
1.16
|
0.249
|
|
Age
|
3.49E+01
|
1.72E+02
|
0.202
|
0.840
|
|
PBMC Number
|
-2.28E+01
|
4.11E+01
|
-0.554
|
0.580
|
|
CD8+ % Input
|
2.70E+00
|
1.15E+01
|
0.234
|
0.815
|
|
CD8+ Cells Post-selection
|
3.02E+02
|
3.15E+02
|
0.959
|
0.340
|
|
Purity
|
-4.29E+00
|
3.35E+00
|
-1.28
|
0.203
|
|
DNA Extracted (μg)
|
2.42E+00
|
6.07E+01
|
0.04
|
0.968
|
|
DNA Yield (ng)
|
1.88E-01
|
1.57E-01
|
1.20
|
0.232
|
|
Demultiplexed Reads (641)
|
5.19E-05
|
2.20E-05
|
2.36
|
0.0199
|
*
|
PEAR Assembled (641)
|
-5.77E+00
|
6.79E+00
|
-0.85
|
0.397
|
|
Demultiplexed Reads (782)
|
-6.02E-06
|
1.72E-05
|
-0.351
|
0.726
|
|
PEAR Assembled (782)
|
-1.04E+00
|
7.51E+00
|
-0.138
|
0.890
|
|
a Coding Junctions
|
1.56E-01
|
9.56E-02
|
1.63
|
0.105
|
|
b Coding Junctions
|
-4.50E-01
|
1.33E-01
|
-3.38
|
0.000998
|
***
|
g Coding Junctions
|
-2.47E-01
|
1.23E-01
|
-2.00
|
0.0474
|
*
|
a %
|
-4.93E+10
|
8.91E+10
|
-0.553
|
0.581
|
|
b %
|
-4.93E+10
|
8.91E+10
|
-0.553
|
0.581
|
|
g %
|
-4.93E+10
|
8.91E+10
|
-0.553
|
0.581
|
|
Significance levels are shown for (***), (**), and (*). Parameters are as for Table 4.
MDS y-axis values were negatively correlated with b (TRB) locus coding junction count (; Table 5). This observation motivated us to perform similar analyses that separated CD8+ recombinants by each of the three loci: TCR a-, b- or g-loci. Again, no clear separation was visible in two dimensions (Figure 2A,B,C), or when a third dimension was added, or when different a-ranges and step size were used (Figure 2D,E). Finally, we undertook this analysis for CD4+ (helper T) cells, separating by TCR a-, b-, or g-chain. Although some structure was apparent, again we observed no clear separation between groups in two dimensions (Figure 3).
Potential Support Vector Machine (P-SVM)
Whether clonotypes of one group (e.g. ME/CFS cases) clustered separately from another (e.g. healthy controls) was then assessed using the Potential Support Vector Machine (P-SVM) approach of Hochreiter et al. [33] using leave-one-out cross-validation. Classification boundaries between groups were defined by the P-SVM and classifications were tested against random expectation using permutation testing (up to label shuffles; Methods; Figure 4).
Our three primary aims were to test for a difference in the expansion of T-cell clonotypes inferred from TCR sequencing between: (i) ME/CFS cases (MEsa+MEmm) and healthy controls, or (ii) ME/CFS cases and MS controls, or (iii) MEsa and MEmm cases. Recognition that each of the three hypotheses was tested six times (both CD4+ and CD8+ cells; and each of a-, b- or g-chains) resulted in a Bonferroni multiple testing correction [34] of for the 18 primary tests. Our six secondary aims were to test for T-cell clonotype expansion differences between: healthy controls and (a) ME/CFS or (b) MEmm or (c) MEsa cases; or between MS cases and (d) healthy controls or (e) MEmm or (f) MEse cases for each of the two cell types and three TCR chains, i.e. tests. Accounting for this total of 54 tests, our Bonferroni-corrected threshold for testing secondary aims was .
Comparison of group classifications against random permutations yielded single-test p-values, each representing the likelihood that two groups had been separated by the P-SVM algorithm by chance alone. Four comparisons (1 of 18 testing a primary hypothesis, and 3 of 36 testing secondary hypotheses) achieved significance at (Table 6). Nevertheless, once the number of tests was accounted for (as above) no comparison remained significant after Bonferroni correction.
We conclude that none of the null hypotheses that we investigated should be rejected.
|
|
CD8+
|
CD4+
|
|
|
a
|
b
|
g
|
a
|
b
|
g
|
Primary Hypotheses
|
ME vs HC
|
0.224
|
1.00
|
1.00
|
0.137
|
0.030
|
0.053
|
ME vs MS
|
0.236
|
1.00
|
1.00
|
0.663
|
0.351
|
0.293
|
MEsa vs MEmm
|
0.395
|
0.224
|
0.433
|
0.396
|
0.875
|
0.735
|
Secondary Hypotheses
|
Cases vs HC
|
1.00†
|
1.00†
|
1.00†
|
0.301†
|
0.104†
|
0.024†
|
MS vs HC
|
0.998
|
0.715
|
0.709
|
0.379
|
0.827
|
0.140
|
MEmm vs HC
|
0.547
|
0.207
|
0.301
|
0.454
|
0.385
|
0.231
|
MEsa vs HC
|
0.020
|
0.597
|
0.133
|
0.064
|
0.193
|
0.318
|
MEmm vs MS
|
0.193
|
0.451
|
0.553
|
0.135
|
0.491
|
0.514
|
MEsa vs MS
|
0.129
|
0.597
|
0.267
|
0.662
|
0.020
|
0.588
|
Table 6. Uncorrected p-values from permutation testing for CD4+ and CD8+ cells, by a/b/g chain type. p-values are indicated in bold. The † symbol indicates that fewer than 1,000 permutations were used to generate the p-value due to computational constraints (<25 days computation time).