General correlations among all selected geography parameters
Figure 1 shows the general correlations among all 34 selected geography/society parameters. The geographical factors have a distinct intra-correlation pattern. First, most geographical parameters fall into one cluster of 12 parameters that correlate positively with each other: aash, cula, gres, mrvr, area, road, gnei, airp, port, geft, rway, and agrc. Only one geographical parameter, fore, lies outside the cluster; it correlates negatively with all 12 of the above parameters. Second, within this cluster, the port parameter has the weakest correlation with the other parameters, which is somewhat unexpected. The negative correlation of fore with all other geographical parameters is easier to understand: the more forest a country has, the less human activity there is, and the lower the values of the geographical factors related to human activity.
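The pairwise correlations described above can be illustrated with a minimal sketch. It assumes Pearson correlation, which the text does not confirm as the measure behind Figure 1, and it uses invented values for five hypothetical countries rather than data from this study. Only the parameter names road, airp and fore are taken from the text.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values for five countries (NOT data from this study).
toy = {
    "road": [10.0, 20.0, 30.0, 40.0, 50.0],
    "airp": [1.0, 3.0, 2.0, 5.0, 6.0],
    "fore": [70.0, 55.0, 50.0, 30.0, 20.0],
}

names = list(toy)
# Full pairwise correlation table, keyed by (parameter, parameter).
corr = {(p, q): pearson(toy[p], toy[q]) for p in names for q in names}

# Print the table row by row with signed coefficients.
for p in names:
    print(f"{p:5s}", " ".join(f"{corr[(p, q)]:+.2f}" for q in names))
```

In this toy table, road and airp correlate positively with each other and both correlate negatively with fore, which is the same qualitative pattern reported for the geographical cluster.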
General correlations among all selected society parameters
Among the 21 society parameters, the correlation profiles fall into four clusters: cluster 1 (popu, army, pold, cnpl, indu, mepr, relg, mipr, weft, tova, aash), cluster 2 (hdi, agdp, pden), cluster 3 (mrta, fert, regi, rupo) and cluster 4 (crim, ceex, race). The 11 parameters in cluster 1 correlate positively with each other. Cluster 1 has little correlation with clusters 2 and 4, but a strong negative correlation with cluster 3. The negative correlation between cluster 2 and cluster 3 is even stronger. Cluster 2 contains hdi (Human development index) and agdp (Average GDP per person), so the higher the level of economic development, the lower the values of the parameters in cluster 3. This suggests that economic development can decrease the diversity of human society.
Correlations between geography and society parameters
The hdi and agdp parameters in cluster 2 have eight to nine negatively correlated factors: fore (Forest coverage), rupo (Rural population), regi (Country/regions that speak the same language), mrta (Mortality rate), fert (Fertility rate), crim (Country and region for importation), ceex (Country and region for exportation), race (Race in the country) and aash (Annual average rainfall), of which fore, regi and aash are geographical factors. Thus hdi and agdp also correlate negatively with some geographical parameters (though the correlations are not strong), which suggests that economic progress also diminishes the diversity of geographical elements.
Two parameters, port and fore, occupy special positions in all PCA diagrams (supplementary file 4). They are strongly negatively correlated with each other, and neither has strong positive correlations with most other geography/society parameters. Nine parameters correlate negatively with port: pden, fore, rupo, mrta, fert, regi, crim, ceex and race. All other parameters either correlate positively with port or show little correlation with it. The port parameter therefore influences the geography/society environment broadly, though not strongly.
Correlation between a specific gene SNP and geography/society parameters
PCA was performed for each of the 13 language genes in turn, together with the 34 geography/society parameters. Each language gene contributes around 10 different SNPs (see the SNP numbers in Table 2). All PCA results were quantified and are shown in Figure 2.
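How the PCA results were quantified is not specified in this section. As a minimal sketch only, the following code assumes a loading plot built from the correlation matrix, in which each variable (SNP or parameter) is placed in one of four quadrants according to the signs of its loadings on the first two principal components. The variable names SNP_A and SNP_B, the numeric values, and all helper functions are hypothetical. Power iteration is used here only to keep the example dependency-free; it is not claimed to be the method used in the study.

```python
import math

def standardize(col):
    """Scale a column to zero mean and unit (population) variance."""
    n = len(col)
    mean = sum(col) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
    return [(v - mean) / sd for v in col]

def correlation_matrix(columns):
    """Correlation matrix of already standardized columns."""
    n = len(columns[0])
    return [[sum(a * b for a, b in zip(ci, cj)) / n for cj in columns]
            for ci in columns]

def top_eigenpair(matrix, iterations=500):
    """Leading eigenvalue and eigenvector of a symmetric matrix (power iteration)."""
    size = len(matrix)
    vec = [1.0 / (i + 1) for i in range(size)]  # fixed start vector
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(size)) for i in range(size)]
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    # Rayleigh quotient of the converged unit vector gives the eigenvalue.
    value = sum(vec[i] * sum(matrix[i][j] * vec[j] for j in range(size))
                for i in range(size))
    return value, vec

def deflate(matrix, value, vec):
    """Remove the component along one eigenvector so the next one can be found."""
    size = len(matrix)
    return [[matrix[i][j] - value * vec[i] * vec[j] for j in range(size)]
            for i in range(size)]

def quadrant(x, y):
    """Quadrant number (1 to 4, counter-clockwise) of a point in the PC1/PC2 plane."""
    if x >= 0 and y >= 0:
        return 1
    if x < 0 and y >= 0:
        return 2
    if x < 0 and y < 0:
        return 3
    return 4

# Hypothetical data for six populations (NOT data from this study):
# two SNP allele frequencies and two geography/society parameters.
data = {
    "SNP_A": [0.10, 0.20, 0.35, 0.50, 0.60, 0.80],
    "popu":  [5.0, 12.0, 20.0, 33.0, 41.0, 60.0],
    "SNP_B": [0.90, 0.70, 0.65, 0.40, 0.35, 0.10],
    "fore":  [30.0, 60.0, 20.0, 50.0, 10.0, 40.0],
}

names = list(data)
R = correlation_matrix([standardize(data[k]) for k in names])

eig1, vec1 = top_eigenpair(R)
eig2, vec2 = top_eigenpair(deflate(R, eig1, vec1))

# Loading of each variable on PC1 and PC2: eigenvector entry times sqrt(eigenvalue).
loadings = {
    name: (vec1[i] * math.sqrt(max(eig1, 0.0)), vec2[i] * math.sqrt(max(eig2, 0.0)))
    for i, name in enumerate(names)
}

for name, (pc1, pc2) in loadings.items():
    print(f"{name:6s} PC1={pc1:+.2f} PC2={pc2:+.2f} quadrant={quadrant(pc1, pc2)}")
```

Because the sign of each principal component is arbitrary, the absolute quadrant numbers carry no meaning on their own. What is informative is whether two variables fall on the same side or on opposite sides of an axis.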
For each gene, most of its 1-10 SNPs are spread across all four quadrants of the four-quadrant PCA diagram, although the SNPs of TM1-7 appear only in the 2nd and 4th quadrants (Supplementary file-4). As expected, the SNPs of a single gene do not all aggregate in one corner of the PCA map (that is, in a single quadrant). All genes have passed through millions of years of mutual adaptation; if all SNPs of a single gene were aggregated, many factors or SNPs from other genes would have to counteract them, and such a gene would likely be lost during evolution.
The strongest positive correlations were seen for ATP-1~army, ATP-1~gres, ATP-1~pold, ATP-1~popu, and ATP-1~road. ATP-1 (rs78371901) is one of the SNPs of the language gene ATP2C2, which encodes the ATPase secretory pathway Ca2+ transporting 2 protein. Diseases associated with ATP2C2 include specific language impairment and some oral communication disorders. Of the five parameters involved, army (Active duty army), gres (Geographical resource), pold (Population aged 65 years or older), popu (Population in the sample country) and road (Road), only two are geographical factors: gres and road. The gres parameter mainly represents natural resources such as different types of mineral resources.
The strongest negative correlations were seen for NFX-6~area, NFX-6~army, NFX-6~gres, NFX-6~mrvr, and NFX-6~road. NFX-6 (rs1440228) is one of the SNPs of the NFXL1 gene, which encodes a nuclear transcription factor (X-box binding-like 1). Gene Ontology annotations related to this gene include DNA-binding transcription factor activity and proximal promoter DNA-binding transcription repressor activity (RNA polymerase II-specific). The gene is associated with specific language impairment. Of the five parameters involved, area (Area of the country), army (Active duty army), gres (Geographical resource), mrvr (Main river) and road (Road), four are geographical factors. Interestingly, gres and road are involved in both the strongest positive and the strongest negative correlations.
Four geography/society parameters showed the least correlation with language gene SNPs (Figure 2C): aash (Annual average rainfall), fore (Forest coverage), pden (Population density of the country) and rway (Runway traffic mode). Several other parameters showed the second least correlation with language gene SNPs: ceex (Country and region for exportation), crim (Country and region for importation), agrc (Agriculture, forestry, husbandry and fishery) and relg (Religion in the country).
Figure 2C shows another interesting point. For each geography/society parameter, the number of SNPs that correlate positively with it is almost the same as the number of SNPs that correlate negatively with it. This suggests that each parameter is coincidentally balanced by similar numbers of language gene SNPs with opposite correlations.
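The balance described above amounts to a per-parameter count of positively and negatively correlated SNPs. The following is a minimal sketch of such a count. The SNP labels, the correlation values and the optional `threshold` cut-off are all hypothetical, and the exact counting rule behind Figure 2C is not given in this section.

```python
def sign_balance(corr_by_snp, threshold=0.0):
    """Count SNPs whose correlation with one parameter is above +threshold
    (positive) or below -threshold (negative)."""
    positive = sum(1 for r in corr_by_snp.values() if r > threshold)
    negative = sum(1 for r in corr_by_snp.values() if r < -threshold)
    return positive, negative

# Hypothetical SNP-to-parameter correlations (NOT results from this study).
toy_corr = {
    "popu": {"SNP-1": 0.62, "SNP-2": -0.55, "SNP-3": 0.10, "SNP-4": -0.31},
    "fore": {"SNP-1": -0.05, "SNP-2": 0.20, "SNP-3": -0.40, "SNP-4": 0.33},
}

# (positive count, negative count) for each parameter.
balance = {param: sign_balance(values) for param, values in toy_corr.items()}

for param, (pos, neg) in balance.items():
    print(f"{param}: {pos} positive, {neg} negative")
```

Raising `threshold` restricts the count to stronger correlations, which is one way to check whether the balance holds beyond near-zero values.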