Study samples
All the individuals recruited were of African ancestry from the 2 participating countries (Ghana and Nigeria). As per the established AfriCRAN protocol developed by Butali and colleagues, recruitments of infants with OFC and their parents were done during the evaluation of the affected child for surgical repair of their clefts. We recruited children affected with nsCL/P and the unaffected parents (father and mother) i.e., case-parent trios for this genetic study. In some situations, we recruited just mother and affected child i.e., dyads and in rare occasions, we recruited other family members like siblings and grand-parents. For the current study, only case-parent trios were included (Figure 1a). Before a trio was recruited, the parents must have reported no family history of any major birth defect. Following the study design, ethical approvals were obtained at the local institution review boards (IRBs) at the participating sites: Lagos University Teaching Hospital (ADM/DCST/HREC/VOL.XV/321), Obafemi Awolowo University Teaching Hospital (ERC/2011/12/01), Kwame Nkrumah University of Science and Technology (CHRPE/RC/018/130) and University of Iowa (IRB ID #: 201101720). The methods used in the recruitment at the different centers were carried out in accordance with statutory guidelines and regulations. Informed consent was obtained from all subjects included in this study.
A case-parent trio was recruited following deep phenotyping of the type of cleft and ruling out other congenital anomalies. This ensured the case (affected child) had a nonsyndromic cleft phenotype while the parents were unaffected. A standardized phenotyping protocol was used by the surgeons during the physical examination, taking clinical photographs and detailing the cleft phenotypes in a clinical database as reported in our previously published works 10,53. Echocardiography was used to rule out congenital cardiac defects. For each trio, the cleft status describing the type of cleft was recorded. Table 1 shows the number of trios, site of recruitment and their cleft status. The distribution of the cases based on cleft types is as shown in Figure 1b.
Saliva samples were collected from the parents and the affected child using the Oragene saliva tool kits. Each case-parent trio was assigned a unique identifier number and their epidemiological and clinical information were remotely uploaded into a secure REDCap database. Following de-identification, the saliva samples were shipped to the Butali laboratory at the University of Iowa for processing.
DNA Extraction and XY Genotyping
Saliva samples received from the recruitment centers were labeled with their unique identifier (UNID) number. The DNA was isolated from the saliva samples using the Oragene DNA extraction protocol. Extracted DNA samples were quantified using Qubit (http://www.invitrogen.com/site/us/en/home/brands/Product-Brand/Qubit.html; ThermoFisher Scientific, Grand Island, NY). Stocks and working aliquots of each DNA samples were made for future testing.
We confirmed the reported sex using the TaqMan XY genotyping. Confirmation of the sex is an inhouse quality control (QC) used in the Butali laboratory. Working aliquots (25μl) passing QC with DNA concentration ≥250ng were shipped to the Broad Institute for whole genome sequencing supported by the Gabriella Miller Kids First program.
Whole-Genome Sequencing and Variant Calling
Our nsCL/P case-parent trios’ DNA samples were part of the cohorts sequenced under the Gabriella-Miller Kids First (GMKF) Pediatric Research Consortium (https://kidsfirstdrc.org/) This consortium was established and funded to address the knowledge gaps in the understanding role of the genetics in the etiology of structural birth defects and pediatric cancers. The WGS was conducted at the Broad Institute with entire genome sequenced an average of 30 times (30x WGS). The binary alignment map (BAM) and sequence alignment map (SAM) files were obtained after the sequence data were aligned to the Human genome assembly GRCh38 (hg38). Alternate alleles (i.e., variants from the reference genome), were called when present using the GenomeAnalysisToolKit (GATK) pipelines at the Broad Institute (https://software.broadinstitute.org/gatk/best-practices/workflow). Briefly, these variants include single nucleotide variants (SNVs) and Insertions or deletions (Indels), were called using the HaplotypeCaller in GVCF mode and GenotypeGVCFs for single-sample variant calling and the multiple-sample joint variant calling respectively. Variants were stored in a variant call format (VCF) file which was used for further analyses.
Quality Control
The quality control of the case-parent nsCL/P African trios WGS data was done using PLINK v.1.9. Each individual in a case-parent trio were evaluated on variety of quality metrics. Individuals with missingness > 10%, inconsistency between the sex reported and the average homozygosity of X-chromosome or Hardy-Weinberg Equilibrium (HWE) < 1E-06 were dropped. Also, trios showing deviation from the expected degree of relatedness between the case (offspring) and parents, or case-parent trios with Mendelian errors outside three standard deviations from the mean were dropped. Additionally, individuals with variant calls beyond 4 standard deviations from mean heterozygote/homozygote ratio were dropped.
A case-parent trio was retained only when the offspring and parent samples meet these quality control thresholds. If at least one sample in a trio does not meet these thresholds, the entire case-parent trio was dropped. After the quality control, 130 out of 150 case-parent trios were retained for downstream analyses.
Analyses for De novo Mutations (DNMs) contributing to risk of nsCL/P
Following the variant calling, we filtered for high confidence protein-altering DNMs using the data filtration pipeline in Figure 1C. Variants were first filtered based on genotype quality (GQ) ≥ 20 and a read depth (DP) ≥ 10. The high-quality variants were then filtered for mutations present in the affected case but absent in the unaffected parents (DNMs). Potential DNMs passing these filtering steps were then examined for high impact/ protein-altering mutations. Such mutations are within the coding region of the genes and the selected consequences are loss of function (LOF) and missense mutations creating altered gene products.
Following the identification of these coding DNMs, we filtered for those variants with minor allele frequency (MAF) ≤ 1% (0.01). We did this by comparing the identified DNMs to variants reported in the 1000 Genome database (https://www.internationalgenome.org/), Exome Variant Server database (https://evs.gs.washington.edu/EVS/) and Genome Aggregation Database (https://gnomad.broadinstitute.org/). Allelic frequencies in these public databases contains whole genome sequencing data from over 7000 African and African-American controls including individuals from Ghana and Nigeria.
We then identified those genes with DNMs with some evidence of involvement in human craniofacial development. This was achieved by mining the DECIPHER database (https://www.deciphergenomics.org/) to identify those genes with copy number variants (CNVs), indels and SNVs reported in individuals with craniofacial anomalies. We prioritized those genes recognized as associated with lip and palate anomalies or anomalies in other craniofacial structures. In a bid to identify the contribution of these DNMs to the risk of nsCL/P, we also mined the Mouse Genome Informatics (MGI) database. We focused on genes among our list with cleft phenotype in mouse knockouts.
Next, we predicted the functional consequence of these DNMs on protein structure and functions using the bioinformatic tools such as Sorting Intolerant From Tolerant, SIFT (http://sift.jcvi.org/)54, Polymorphism Phenotyping, PolyPhen2 (http://genetics.bwh.harvard.edu/pph2/ ) 55 and Combined Annotation Dependent Depletion, CADD (https://cadd.gs.washington.edu/) 56. We identified DNMs predicted to be deleterious, damaging or among the topmost deleterious mutations in the human genome.
Furthermore, we investigated the effect of missense DNMs on the protein structure and function. We used the bioinformatic tool, Help you Protein Explained: HOPE (https://www3.cmbi.umcn.nl/hope/) 57 to predict effects of the amino acid change on the protein structure and function. Additional computational methods were used to predict the structural effects of the DNM on one of the most reported cleft candidate gene products. Starting from the predicted protein structures generated by AlphaFold2, we locally optimized the structure to relax its backbone torsions and performed sidechain optimization (i.e., sidechain repacking) to find the most favorable position for each sidechain and improve MolProbity scores 58. Both optimizations were done with the AMOEBA polarizable force field59,60. We then used the optimized protein structure to calculate the protein stability change due to the amino acid changes. Folding free energy changes (i.e., protein stability changes) were measured using NAnoscale Molecular Dynamics (NAMD) by calculating the free energy change due to mutation for a folded and unfolded state of the wildtype and mutant protein61. The protein stability change (∆∆G) is defined as ∆∆G = ∆Gfolded - ∆Gunfolded. Generally, protein stability changes that are greater than 1 kcal/mol are more likely to cause disease. This analysis offered insight into the effect of the DNMs on protein functionality.
Sanger-Sequencing Validation
To eliminate false positive DNMs, we conducted Sanger-sequencing validations of selected high impact DNMs discovered through WGS in our case-parent trios. Briefly, we designed primers around the DNMs by including 500 base pairs upstream and downstream of the mutation locus. The primers were designed using primer3 (https://primer3.ut.ee/) and optimized for the application of the regions containing the DNMs. The optimized primers were used to amplify these loci in DNA samples using a DNA concentration of 4 ng/μl in a 10 μl polymerase chain reaction. A YRI HapMap sample was added to the plate as a negative control. Details of the primers and annealing temperatures are available from the Butali laboratory upon request. The PCR products were sent to Functional Biosciences, Inc., Madison, WI (https://functionalbio.com/) for sequencing. Sequence data were investigated to confirm the DNMs.
Gene Set Enrichment Analysis (GSEA) and SysFACE based craniofacial gene expression analysis
We did a gene set enrichment analysis to identify those processes significantly enriched within our DNMs gene set. We used the Database for Annotation, Visualization and Integrated Discovery (DAVID) to identify the biological processes significantly enriched within gene sets with the DNMs. The list of the genes with DNMs were entered in as a query in the DAVID Bioinformatics Resources 6.8 (https://david.ncifcrf.gov/) and the analysis was run. The biological processes with at least a nominal p-value (p < 0.05) were selected, among others. Among those with suggestive significance value (p < 0.05), we identified those processes involved in lip and palatal development where genes on our list were involved.
To gain biological insights on the candidate genes among the DNM gene lists, we examined the expression in relevant craniofacial tissues using SysFACE (Systems tool for craniofacial expression-based gene discovery), as previously published 16,62. Mouse craniofacial transcriptomics microarray data for maxilla, frontonasal and palate at embryonic (E) and post-natal (P) stages was used for examining gene expression. Transcriptomics data from public databases such as FaceBase (https://www.facebase.org) and NCBI GEO (https://www-ncbi-nlm-nih-gov.udel.idm.oclc.org/geo/) was meta-analyzed as previously described62. The following FaceBase datasets (FB00000352, FB00000353, FB00000107, FB00000254, FB00000264, FB00000468.01, FB00000474.01, FB00000477.01, FB00000905) and NCBI Gene Expression Omnibus (GEO) datasets (GSE7759, GSE55965, GSE22989, GSE31004, GSE11400) were considered in the analysis. Craniofacial tissue expression, presented in fluorescence intensity units, were used to generate heatmap representation.