Population Cohorts
We built a reference ancestry principal component (PC) space using the combined 1000 Genomes Project and Human Genome Diversity Project (1000G + HGDP) whole-genome sequencing (WGS) dataset [11] that forms part of the gnomAD database [10] (N = 3,901 with gnomAD ancestry labels). We chose this dataset because it is the only subset of the gnomAD database for which individual-level data are available.
In developing and testing the ancestry pipeline, we used a cohort of 7,509 individuals with a monogenic disorder of insulin secretion who had undergone tNGS analysis at the Exeter Genomics Laboratory [12–14] (Exeter-MDIS cohort). These individuals were referred from 113 countries and represent a wide range of genetic backgrounds. WGS data were available for 381 of these individuals.
Development of the Classification Method
We used PLINK 1.9 [15] to filter the combined 1000G + HGDP reference dataset to 848,202 AIMs based on frequency (minor allele frequency > 0.05), linkage disequilibrium (window size 100bp, step size 5, LD threshold 0.5) and missingness (missing genotype rate < 0.01). This number of AIMs is higher than would typically be used in WGS-based ancestry analysis, but it maximises the number of sites that can be covered by random off-target reads. From this dataset, we calculated the first 10 PCs. Next, we used the LASER [16] tool to perform Procrustes analysis and place the 7,509 individuals in the Exeter-MDIS cohort into our reference ancestry space. Procrustes analysis is a form of statistical shape analysis, used here to identify the optimal translation, rotation, and scaling factors that map a PC space created using only the AIMs covered by on- and off-target tNGS reads onto the original reference space built using all 848,202 AIMs.
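The core of the Procrustes step can be illustrated with a minimal sketch. This is not the LASER implementation: it is a generic least-squares fit written with NumPy, and it assumes that both coordinate sets describe the same samples in the same number of dimensions. The function name `fit_procrustes` and the toy data are hypothetical.

```python
import numpy as np


def fit_procrustes(source, target):
    """Least-squares similarity transform mapping `source` onto `target`.

    Returns (scale, rotation, translation) such that
    scale * source @ rotation + translation approximates target.
    `rotation` is orthogonal and may include a reflection.
    """
    source = np.asarray(source, dtype=float)
    target = np.asarray(target, dtype=float)
    mu_s = source.mean(axis=0)
    mu_t = target.mean(axis=0)
    s0 = source - mu_s  # centre both coordinate sets
    t0 = target - mu_t
    # The SVD of the cross-product matrix gives the optimal orthogonal map.
    u, sing, vt = np.linalg.svd(s0.T @ t0)
    rotation = u @ vt
    scale = sing.sum() / (s0 ** 2).sum()
    translation = mu_t - scale * mu_s @ rotation
    return scale, rotation, translation


# Toy data (hypothetical): a "study" space that is a scaled, rotated and
# shifted copy of the "reference" space.
rng = np.random.default_rng(0)
study = rng.normal(size=(50, 3))
q, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # random orthogonal matrix
reference = 2.5 * study @ q + np.array([1.0, -2.0, 0.5])

scale, rotation, translation = fit_procrustes(study, reference)
aligned = scale * study @ rotation + translation
```

In this toy case the transform is recovered exactly because the two spaces differ only by a similarity transform; with real sequencing data the fit is approximate.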
Finally, we trained a random forest classifier on the reference PC data and the gnomAD-provided ancestry labels for individuals in the reference population. To improve the model's classification ability, we incorporated 10 rounds of self-training [17]. In each round, the model classified 500 individuals from an ancestrally diverse subset of the Exeter-MDIS cohort, selected based on kernel density across the PC space, and those classified with a confidence of > 0.9 were added to the training set for the next round. This step aimed to help the model better delineate the boundaries between population groups.
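A self-training loop of this kind can be sketched with scikit-learn as follows. This is an illustration under stated assumptions rather than the pipeline's code: the function name `self_train`, the number of trees, the toy data, and the choice to remove confidently classified individuals from the unlabelled pool once they join the training set are all assumptions. The kernel-density selection of the unlabelled subset is not shown.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def self_train(ref_pcs, ref_labels, cohort_pcs, rounds=10, threshold=0.9, seed=0):
    """Random forest with iterative self-training on unlabelled samples."""
    train_x = np.asarray(ref_pcs, dtype=float)
    train_y = np.asarray(ref_labels)
    pool = np.asarray(cohort_pcs, dtype=float)
    # n_estimators=200 is an arbitrary choice for this sketch.
    model = RandomForestClassifier(n_estimators=200, random_state=seed)
    model.fit(train_x, train_y)
    for _ in range(rounds):
        if len(pool) == 0:
            break
        proba = model.predict_proba(pool)
        confident = proba.max(axis=1) > threshold
        if not confident.any():
            break
        # Pseudo-label the confident samples and move them to the training set.
        pseudo = model.classes_[proba.argmax(axis=1)[confident]]
        train_x = np.vstack([train_x, pool[confident]])
        train_y = np.concatenate([train_y, pseudo])
        pool = pool[~confident]
        model.fit(train_x, train_y)
    return model, len(train_y)


# Toy data (hypothetical): two well-separated groups in a 2-PC space.
rng = np.random.default_rng(1)
ref_pcs = np.vstack([rng.normal(-5, 1, (40, 2)), rng.normal(5, 1, (40, 2))])
ref_labels = np.array(["A"] * 40 + ["B"] * 40)
cohort_pcs = np.vstack([rng.normal(-5, 1, (25, 2)), rng.normal(5, 1, (25, 2))])

model, n_train = self_train(ref_pcs, ref_labels, cohort_pcs)
```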
Method Assessment
We used correlation analysis to evaluate the effectiveness of the Procrustes step. For the 381 individuals with both tNGS and WGS data, we compared the PC values generated by the LASER Procrustes method on tNGS data with those generated by standard PLINK 1.9 [15] PC projection on WGS data.
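A per-PC comparison of this kind might look like the sketch below. The text does not specify the correlation coefficient, so Pearson's r is an assumption here, as are the function name `per_pc_correlation` and the simulated PC values.

```python
import numpy as np


def per_pc_correlation(pcs_a, pcs_b):
    """Pearson r between matching PC columns of two (samples x PCs) arrays."""
    a = np.asarray(pcs_a, dtype=float)
    b = np.asarray(pcs_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("both inputs must have the same shape")
    a0 = a - a.mean(axis=0)  # centre each PC column
    b0 = b - b.mean(axis=0)
    numerator = (a0 * b0).sum(axis=0)
    denominator = np.sqrt((a0 ** 2).sum(axis=0) * (b0 ** 2).sum(axis=0))
    return numerator / denominator


# Simulated PCs (hypothetical): 381 samples, 10 PCs, where the second set
# is the first plus a small amount of noise.
rng = np.random.default_rng(2)
wgs_pcs = rng.normal(size=(381, 10))
tngs_pcs = wgs_pcs + rng.normal(scale=0.1, size=(381, 10))

r_per_pc = per_pc_correlation(tngs_pcs, wgs_pcs)
```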
To assess the accuracy of the model, we used it to classify a subset of 976 individuals in the gnomAD reference dataset who had not been included in the model training stage. The subset was selected randomly from the original reference population in a population-stratified manner. We compared the classification output for the testing subset with the original population labels provided by gnomAD.
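The hold-out procedure can be sketched using only the Python standard library. The function names, the 25% test fraction (chosen only because 976 of 3,901 is roughly a quarter), and the toy labels are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_holdout(labels, test_fraction=0.25, seed=0):
    """Return sorted test indices drawn at random within each population label."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    rng = random.Random(seed)
    test_idx = []
    for label in sorted(by_label):
        members = by_label[label]
        # Sample the same fraction from every population group.
        n_test = round(len(members) * test_fraction)
        test_idx.extend(rng.sample(members, n_test))
    return sorted(test_idx)


def accuracy(predicted, truth):
    """Fraction of predicted labels that match the reference labels."""
    matches = sum(p == t for p, t in zip(predicted, truth))
    return matches / len(truth)


# Toy labels (hypothetical group sizes).
labels = ["AFR"] * 40 + ["EUR"] * 20 + ["EAS"] * 8
test_idx = stratified_holdout(labels)
test_labels = [labels[i] for i in test_idx]
```

Individuals whose indices are not in `test_idx` would form the training set, and `accuracy` would then compare the model's output for the held-out individuals with their reference labels.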
As an additional test of the model’s performance on unseen data, we performed population classification and UMAP clustering on the remaining 7,009 individuals in the Exeter-MDIS cohort who were not included in the model training. This allowed us to check that the classified groups separated into distinct clusters and that the model had not overfitted to the training data.