Samples and applied data
In this study we examined samples from 47 Danube Swabian individuals whose well-documented family histories go back 3–6 generations. Their unadmixed Swabian ancestry is supported by self-declared family histories and the resulting pedigree trees. The individuals live in the villages of Dunaszekcső and Bár, which lie along the Danube River in Southwest Hungary: 29 samples are from Dunaszekcső and 18 from Bár. The Swabian population of these villages has remained mostly isolated from other ethnicities to the present day, providing an opportunity to study their genetic makeup and relationship with other major Eurasian groups.
DNA was extracted from ethylenediaminetetraacetic acid (EDTA)-anticoagulated whole blood and genotyped on the Illumina Infinium Global Screening Array BeadChip platform, which contains 725 831 single-nucleotide polymorphisms (SNPs). Isolation, genotyping and preliminary quality control of the samples were carried out by the third-party service provider Human Genomics Facility (HUGE-F) in the Netherlands at the University of Rotterdam. Quality control and preparation of the marker data were carried out domestically using in-house scripts and the PLINK1.9 and 2.0 software packages [5, 6]. The data was filtered using Hardy-Weinberg equilibrium tests; in addition, SNPs with a missing genotype rate above 0.1 were removed from the dataset using PLINK with the ‘geno’ flag. All Swabian individuals passed these tests and 665 073 SNPs remained in the Swabian dataset.
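The two marker-level filters described above can be illustrated with a minimal Python sketch. It is not the PLINK implementation: the function names and the genotype encoding are assumptions made for illustration, and the Hardy-Weinberg check is shown as a simple chi-square statistic, whereas PLINK uses an exact test. The Hardy-Weinberg threshold applied in the study is not stated, so none is applied here.

```python
from typing import Dict, List, Optional

# Assumed encoding: 0, 1 or 2 copies of the A1 allele; None marks a missing call.
Genotype = Optional[int]


def missing_rate(calls: List[Genotype]) -> float:
    """Fraction of samples without a genotype call at one SNP."""
    return sum(c is None for c in calls) / len(calls)


def filter_by_missingness(
    snps: Dict[str, List[Genotype]], max_missing: float = 0.1
) -> Dict[str, List[Genotype]]:
    """Keep SNPs whose missing rate does not exceed max_missing."""
    return {
        snp_id: calls
        for snp_id, calls in snps.items()
        if missing_rate(calls) <= max_missing
    }


def hwe_chi_square(calls: List[Genotype]) -> float:
    """Chi-square statistic (1 d.f.) for deviation from Hardy-Weinberg
    proportions. Assumes at least one non-missing call."""
    observed = [sum(c == g for c in calls) for g in (0, 1, 2)]
    n = sum(observed)
    p = (2 * observed[2] + observed[1]) / (2 * n)  # A1 allele frequency
    q = 1.0 - p
    expected = [n * q * q, 2 * n * p * q, n * p * p]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
```

A SNP with exactly 10% missing calls is retained by this sketch, because only markers whose missing rate exceeds the threshold are removed.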
This study belongs to a series of investigations that were approved by the National Ethics Board (ETT TUKEB) and by the Regional Ethics Committee of Pécs, and it follows the principles expressed in the Declaration of Helsinki.
Genome-wide autosomal marker data from other open genotype databases were also used in the study. We used the Human Genome Diversity Project (HGDP) dataset, openly available from the server of Stanford University, and datasets from the open genome-wide marker data repository of the Estonian Biocentre [7–10]. From the HGDP dataset we mainly used populations from the European, Caucasus and South Asian regions; for outgroup purposes we also included the Uyghurs, Han Chinese and Yoruba. Estonian Biocentre populations included Hungarians, Romanians and Germans.
Principal component analysis-based population structure analysis
Population structure analysis and fixation index (Fst) matrix computation were performed using the SMARTPCA software of the EIGENSOFT 6.01 package [11]. A merged dataset of Swabian samples, HGDP populations and Estonian Biocentre data was analyzed with SMARTPCA. The included HGDP populations were French, French Basques, Orcadians, North Italians, Sardinians, Tuscans, Russians, Adygey, Balochi, Brahui, Burusho, Hazara, Kalash, Makrani, Pashtun, Sindhi, Uyghurs, Han Chinese and Yoruba; Hungarians, Romanians and Germans from the Estonian data were also used. The dataset contained n = 601 individuals and 110 733 SNPs. SNPs in strong background linkage disequilibrium (LD) were pruned out with the ‘indep-pairwise’ command of PLINK1.9, setting the \(r^{2}\) threshold to 0.3. This step is necessary before the analyses because strong background LD can bias both the PCA method and expectation maximization-based ancestry estimation algorithms. After pruning, 80 056 SNPs remained in our merged dataset. We used SMARTPCA with default settings, with the σ-threshold set to 6.0.
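The idea behind pairwise LD pruning can be sketched in a few lines of Python. This is a simplified greedy variant for illustration, not PLINK's sliding-window algorithm: the function names are hypothetical, and the window size of 50 SNPs is an assumption, since only the \(r^{2}\) threshold of 0.3 is reported above.

```python
from typing import List, Sequence


def r_squared(x: Sequence[float], y: Sequence[float]) -> float:
    """Squared Pearson correlation between two genotype dosage vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    return cov * cov / (var_x * var_y)


def greedy_ld_prune(
    dosages: List[Sequence[float]], window: int = 50, r2_max: float = 0.3
) -> List[int]:
    """Return the indices of SNPs kept after greedy pairwise pruning.

    Each SNP is compared with the already kept SNPs among the preceding
    `window` markers and is dropped if r^2 exceeds r2_max with any of them.
    """
    kept: List[int] = []
    for i, snp in enumerate(dosages):
        recent = [j for j in kept if i - j < window]
        if all(r_squared(snp, dosages[j]) <= r2_max for j in recent):
            kept.append(i)
    return kept
```

In this sketch the first SNP of a correlated pair is always retained, which is an arbitrary choice made to keep the example short.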
Maximum likelihood method-based ancestry estimation
Ancestry estimation was carried out with the ADMIXTURE 1.22 algorithm, a maximum likelihood estimation method using an expectation maximization approach [12]. The number of hypothetical ancestral populations (K) was evaluated for K values of 2 to 10, and cross-validation was performed to find the best-fitting K for the relationship of our investigated populations.
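Selecting K by cross-validation amounts to choosing the run with the lowest cross-validation error. The following Python sketch assumes that each ADMIXTURE run was started with cross-validation enabled and that its log contains a line of the form `CV error (K=3): 0.52345`. The helper names are hypothetical and the example values are invented purely for illustration; they are not results from this study.

```python
import re
from typing import Dict, Iterable

# Assumed format of the cross-validation line in an ADMIXTURE log.
CV_LINE = re.compile(r"CV error \(K=(\d+)\):\s*([0-9.]+)")


def cv_errors(log_lines: Iterable[str]) -> Dict[int, float]:
    """Collect the cross-validation error reported for each K."""
    errors: Dict[int, float] = {}
    for line in log_lines:
        match = CV_LINE.search(line)
        if match:
            errors[int(match.group(1))] = float(match.group(2))
    return errors


def best_k(errors: Dict[int, float]) -> int:
    """Return the K with the lowest cross-validation error."""
    return min(errors, key=errors.get)


# Invented log lines for illustration only (not study results).
example_log = [
    "CV error (K=2): 0.5400",
    "CV error (K=3): 0.5100",
    "CV error (K=4): 0.5250",
]
example_best = best_k(cv_errors(example_log))
```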
TreeMix was applied alongside the ADMIXTURE analysis to describe the relationship of these populations in a maximum-likelihood tree-based manner, complementing the stacked-column ancestry estimation [13]. The size of the SNP blocks (-k flag) was set to 1000, and the algorithm was set to search for 1–6 migration events in the data through multiple runs. For these investigations, we used the same pruned dataset that was created for PCA.
Formal test of admixture
To test the relationship between Swabians and the other investigated populations, we used a formal test of admixture, the 4-population test. The qpDstat program from the ADMIXTOOLS 4.1 package was used for this purpose; as its name suggests, the test was implemented here as D-statistics [14]. We tested unrooted phylogenetic trees containing Yoruba, Swabians, Hungarians and various populations from Europe and the Caucasus region: Germans, Russians, French, North Italians, Sardinians and Adygey. The setups of the ((W,X)(Y,Z)) unrooted trees were the following:
((Yoruba, Swabians)(Hungarians, Europeans)) and ((Yoruba, Hungarians)(Swabians, Europeans)). This test was intended to show the relationship of Swabians to the Hungarian host population, to the Germans and to various European populations. For these calculations, we used the unpruned version of the previously created dataset.
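For orientation, the D-statistic of a ((W,X)(Y,Z)) tree can be sketched from population allele frequencies. The Python snippet below assumes the commonly used allele-frequency form of the statistic, in which the numerator sums (w − x)(y − z) over SNPs and the denominator sums (w + x − 2wx)(y + z − 2yz). The function name is hypothetical, and the block jackknife that qpDstat uses to obtain standard errors and Z-scores is not reproduced.

```python
from typing import Sequence


def d_statistic(
    w: Sequence[float],
    x: Sequence[float],
    y: Sequence[float],
    z: Sequence[float],
) -> float:
    """D-statistic for the tree ((W,X)(Y,Z)).

    Each argument holds the frequency of the same reference allele in one
    population, with one entry per SNP and all four sequences aligned.
    """
    numerator = 0.0
    denominator = 0.0
    for pw, px, py, pz in zip(w, x, y, z):
        numerator += (pw - px) * (py - pz)
        denominator += (pw + px - 2 * pw * px) * (py + pz - 2 * py * pz)
    return numerator / denominator
```

A value close to zero is consistent with the proposed tree, whereas a clear deviation in either direction suggests excess allele sharing between one population of the first pair and one of the second pair.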
Identical by descent and homozygous by descent analyses
To assess the sources of ancestry in the investigated Swabian samples, we used the Refined IBD algorithm of Beagle 4.1 [15]. The software searches phased haplotype data for IBD segments between all pairs of individuals, which indicates the relative share of one population in the ancestry of the investigated population. Before the analysis, the data was converted to the format required by the software using PLINK1.9. The major alleles were set as the A1 allele and the dataset was converted to Variant Call Format 4.1 with the PLINK/SEQ software [16]. The minimum segment length was set to 3 centiMorgans and the IBD trim parameter to 10. The IBD scale parameter was calculated with the recommended formula \(\sqrt{n/100}\), since our data contained more than 400 individuals [15]. Using the inferred IBD segment data, we calculated the average pairwise IBD sharing between Swabians and various populations with the following formula according to Atzmon et al.:
$$\text{Average pairwise IBD sharing}=\frac{\sum_{i=1}^{n}\sum_{j=1}^{m}\mathrm{IBD}_{ij}}{n\cdot m}$$
\(\mathrm{IBD}_{ij}\) is the length of the IBD segment shared between individuals i and j, and n and m are the numbers of individuals in groups I and J [17].
We also calculated the average number and average length of IBD segments shared between Swabians and each of the investigated populations.
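The average pairwise IBD sharing defined above can be computed directly from a list of inferred segments. The Python sketch below assumes that each segment is given as a tuple of two individual identifiers and a length in centiMorgans, and that the two groups do not overlap. The tuple layout and the function name are illustrative assumptions and do not reflect the Refined IBD output format.

```python
from typing import Iterable, Set, Tuple

# Assumed layout: (individual 1, individual 2, segment length in cM).
Segment = Tuple[str, str, float]


def average_pairwise_ibd(
    segments: Iterable[Segment], group_i: Set[str], group_j: Set[str]
) -> float:
    """Total IBD length shared between the two groups, divided by n * m."""
    total = 0.0
    for first, second, length in segments:
        # A pair counts if its two members fall in different groups,
        # regardless of the order in which they are listed.
        if (first in group_i and second in group_j) or (
            first in group_j and second in group_i
        ):
            total += length
    return total / (len(group_i) * len(group_j))
```

Pairs that share no segment contribute zero to the numerator but still count in the denominator, so the value is an average over all n · m possible pairs.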
Besides IBD segments, Refined IBD simultaneously detects homozygous by descent (HBD) segments, which also allows us to infer the genome-wide autozygosity of the respective populations. This can indicate the degree of isolation and inbreeding of these groups. Therefore, the average length and number of HBD segments were also calculated.
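A summary of HBD segments per population can be sketched in the same way. The Python snippet below assumes that each HBD segment is given as a tuple of an individual identifier and a length in centiMorgans, and it takes the average number of segments to mean segments per individual. Both the input layout and this reading of the average are assumptions made for illustration.

```python
from typing import Iterable, Set, Tuple

# Assumed layout: (individual, segment length in cM).
HbdSegment = Tuple[str, float]


def hbd_summary(
    segments: Iterable[HbdSegment], individuals: Set[str]
) -> Tuple[float, float]:
    """Return (mean number of HBD segments per individual, mean length)."""
    lengths = [length for person, length in segments if person in individuals]
    mean_count = len(lengths) / len(individuals)
    mean_length = sum(lengths) / len(lengths) if lengths else 0.0
    return mean_count, mean_length
```

Individuals without any HBD segment are included in the denominator of the mean count, so that populations with few autozygous individuals are not overstated.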