Here, we report a novel locus for late-onset NCD at a (GCC)-repeat in the 5′ UTR of the SMAD9 gene, and indications of natural selection at this locus at the inter- and intra-species levels. This locus harbors combinatory genotypes that specifically (unambiguously) predispose to or protect against moderate to severe late-onset NCD in humans. Interestingly, a number of those combinatory genotypes were detected repeatedly. The NCD patients harboring the specific genotypes encompassed a spectrum of possible diagnoses, including CVD and AD. Although the percentage of individuals harboring such genotypes was modest (approximately 4% of the genotypes in each group), those genotypes represent an underappreciated feature that may broaden the understanding of pathogenesis in disorders that appear to be complex and yet may be linked to unambiguous genotypes at certain STR loci.
The primary importance of (GCC)-repeats stems from a possible link between this type of STR and natural selection, for two main reasons. Firstly, (GCC)-repeats are specifically enriched in exons. Secondly, CpG-rich sequences are mutation hotspots [27] and are frequently interrupted by single nucleotide substitutions resulting from C to T transitions, which is also the likely origin of the (GCT)-residue in the immediate downstream flanking sequence of the human SMAD9 (GCC)-repeat. The presence of an intact (GCC)-repeat block in primates, and not in any other order, supports a selective advantage of this repeat in primates.
We found a significant excess of the (GCC)7 allele in the NCD group and, in this group only, genotypes that included (GCC)7 but not (GCC)9. Conversely, in the control group only, and not in the NCD group, we found genotypes that included (GCC)9 but not (GCC)7. Based on these findings, we propose that the (GCC)7 allele may function as a risk factor for late-onset NCD, whereas (GCC)9 may be protective. Similar to our findings in SMAD9, we have previously reported another predominantly biallelic (GCC)-repeat, of 8 and 9 repeats, in the 5′ UTR of the human SBF1 gene, in which an excess of the shorter allele was detected in the NCD group [28].
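The comparison described above, identifying genotypes that occur in one group only and asking whether they carry (GCC)7 without (GCC)9 or the reverse, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the analysis pipeline used in this study: the function names are hypothetical, and the genotype lists are invented toy data that do not reproduce the study counts.

```python
def classify(genotype):
    """Label a genotype (pair of GCC repeat counts) by its 7/9 allele content."""
    alleles = set(genotype)
    if 7 in alleles and 9 not in alleles:
        return "7_without_9"
    if 9 in alleles and 7 not in alleles:
        return "9_without_7"
    return "other"


def exclusive_genotypes(ncd, controls):
    """Return (NCD-only, control-only) genotypes, ignoring allele order."""
    ncd_set = {tuple(sorted(g)) for g in ncd}
    ctrl_set = {tuple(sorted(g)) for g in controls}
    return sorted(ncd_set - ctrl_set), sorted(ctrl_set - ncd_set)


# Invented toy genotypes (NOT study data): each tuple is one individual's
# two alleles, given as GCC repeat counts.
ncd_group = [(7, 9), (7, 7), (9, 7), (7, 8)]
control_group = [(7, 9), (9, 9), (9, 10)]

ncd_only, control_only = exclusive_genotypes(ncd_group, control_group)
print([(g, classify(g)) for g in ncd_only])
print([(g, classify(g)) for g in control_only])
```

In this toy example the shared 7/9 genotype drops out, leaving 7/7 and 7/8 as NCD-only and 9/9 and 9/10 as control-only. Whether such group-exclusive genotypes exceed chance expectation would still need a formal statistical test, which this sketch does not perform.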
Searching the Genome Aggregation Database (gnomAD; https://gnomad.broadinstitute.org) for the human SMAD9 (GCC)-repeat yielded inconclusive data for the annotated alleles and genotypes. This is most likely due to the frequent failure of general whole-exome sequencing methods to capture GC-rich sequences. Successful PCR amplification of the human SMAD9 gene is challenging and requires stringent conditions and special GC-rich buffer preparations, as described in the Methods. Furthermore, this imperfect STR is interrupted by T nucleotides at its 3′ end, as revealed by the (GCT)-residue. Consequently, conventional fragment analysis is not an efficient method for scoring (GCC)-repeats at this locus, and every sample needs to be sequenced to obtain accurate data.
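The limitation of fragment analysis noted above can be illustrated with a short Python sketch that scores the number of uninterrupted (GCC) units and the adjacent (GCT) units directly from a sequence read. It is a simplified example under explicit assumptions: the flanking sequences are hypothetical placeholders rather than the real SMAD9 sequence, the allele combinations are illustrative, and the function name `score_locus` is ours.

```python
import re

# An uninterrupted run of GCC units followed by zero or more GCT units.
LOCUS_PATTERN = re.compile(r"((?:GCC)+)((?:GCT)*)")


def score_locus(read):
    """Return (gcc_units, gct_units) for the longest uninterrupted GCC run."""
    best = (0, 0)
    for match in LOCUS_PATTERN.finditer(read.upper()):
        gcc_units = len(match.group(1)) // 3
        gct_units = len(match.group(2)) // 3
        if gcc_units > best[0]:
            best = (gcc_units, gct_units)
    return best


# Hypothetical flanks, NOT the real SMAD9 flanking sequence.
LEFT, RIGHT = "TTAG", "AAGT"
read_a = LEFT + "GCC" * 7 + "GCT" * 3 + RIGHT
read_b = LEFT + "GCC" * 9 + "GCT" * 1 + RIGHT

print(len(read_a) == len(read_b))  # True: same fragment length
print(score_locus(read_a))         # (7, 3)
print(score_locus(read_b))         # (9, 1)
```

The two reads have identical length and would be indistinguishable by fragment sizing, yet they carry different (GCC) alleles; only the sequence itself separates them.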
SMAD9 is predominantly expressed in the brain and skeletal tissues [18, 29, 30], and the protein encoded by this gene can translocate into the nucleus and affect the transcriptional regulation of target genes. Higher order brain functions and skeletal phenotypes (characteristics that have significantly diverged in primates versus other orders of animals) may be selective forces for the expansion of this STR in primates. Various (GCC/GGC)-repeats of a length range similar to that of the human SMAD9 STR can alter gene expression activity [31, 32]. Our bioinformatics analysis revealed that the number of (GCC)-repeats may change the RNA secondary structure (stem-loops) and accessibility (unpaired RNA bases) of, at least, exons 1 and 2 of human SMAD9 (Fig. 5). Structurome data reveal a widespread association of RNA stem-loops with protein binding sites [33], which may, in turn, alter the processes linked to transcription and translation.
Another interesting feature at this locus, as in a number of previously reported instances, is the presence of low-frequency alleles, which might have been subject to negative natural selection [8, 14, 34]. In human SMAD9, examples of those alleles are (GCC)8 and (GCC)10. Two genotypes consisting of those rare alleles, i.e., 7/8 and 8/10, were detected in the NCD group only. Genotypes consisting of low-frequency alleles at a (GCC) locus were also detected in NCD patients at the RASGEF1C and SBF1 gene loci [14, 28]. Although the allele and genotype distributions of the (GCT)-residue were not skewed in the NCD group versus controls, the combination of (GCT)1 and (GCT)3 with (GCC)7 was detected in two NCD patients and not in any controls.
Reported instances of STR allelic natural selection at non-coding loci in humans are rare [8, 14, 34], and the SMAD9 (GCC)-repeat provides a potentially valuable locus to further test this phenomenon.
Sequencing of this STR locus in larger samples and across a spectrum of neurological disorders is warranted. The mechanisms underlying allele and genotype selection should also be examined in future functional studies.