A pan-tissue map of accessible chromatins in maize
We profiled the accessible chromatins for 12 major tissues or cell types in maize (Supplementary Table 1). Extending from the earlier data of ear2,18, tassel18, leaf2, seedling14,15, root16 and mesophyll17, we newly generated the ATAC-seq data of embryo, endosperm, pericarp, shoot apical meristem (SAM), silk and stem (Fig. 1a). Each data had two biological replicates that exhibited high reproducibility, canonical enrichment near gene TSS (transcription start sites), and clustered well in both principal component and heatmap analysis (Supplementary Fig. 1 and Supplementary Fig. 2). Therefore, the data of two biological replicates were combined for subsequent analysis. There were 26,743 to 39,561 ACRs (Fig. 1b) identified in each tissue, covering approximately 0.67–1.0% of the B73 reference genome (Fig. 1b). As expected, ACRs were mainly distributed in euchromatin regions (Fig. 1c), aggregating into some hotspots that either conserved or specifically existed in different tissues (Supplementary Fig. 3). We further classified ACRs into genic (36–52%), proximal (19–25%) and distal (29–46%) ACRs according to their intersection with genes (Fig. 1d). Due to the proliferation of maize genome, a considerable number of dACRs were located in “gene deserts” regions (100 kb away from the nearest gene, Supplementary Fig. 4a). Interestingly, these ultra-distal ACRs usually exhibited high conservation across the tissues we profiled, and could be independently verified by maize single-cell ATAC-seq data19 and ChIP-seq data2 of various active histone modifications (Supplementary Fig. 4a and 4b). They were also found to be entwined with chromatin interactions, some pinpointed to genes in 3D genomic data2 (Supplementary Fig. 4b), indicating the cis-regulatory role of maize ACRs in both local and distal manners.
In contrast to the bloated maize genome that primarily composed by TEs (~ 85%), only ~ 25% of ACRs overlapped with TEs (Fig. 1e), suggesting the enrichment of ACRs for low-copy sequences in maize. Specifically, dACRs (~ 40%) had higher TE composition over pACRs (~ 23%) and gACRs (~ 15%). Regarding to TE superfamilies, Gypsy accounted for over half of TEs in dACRs, while Helitrons enhanced substantially in pACRs and gACRs which was consistent with their strong preference for gene-rich regions in maize26 (Supplementary Fig. 5a). It was worth noting that the insertion time of LTRs associated with ACRs seemed to be more ancient than the remaining non-association LTRs, suggesting the purification of recent TE insertions that may disrupt ACRs’ function (Supplementary Fig. 5b). Unlike TEs, simple repeats were reported to be enriched in CREs in humans27 and also found to be overrepresented in maize ACRs (Supplementary Fig. 5c). Specifically, they exhibited a notable valley in the core region of ACRs while elevated in flanking regions, in line with their presumed role to assist the binding of TFs27.
By integrating the data of PlantTFDB28, JASPAR29 and DAP-seq2,30, the DNA binding motifs of 306 maize TFs from 36 families were curated (Supplementary Table 2). We discovered the motifs of 243 TFs within 50 bp sequences flanking the summit of ACRs, with each tissue having 157 to 191 TFs (Supplementary Table 3). Similar to a report in cotton31, the motifs of BRR-BPC, C2H2, Dof, ERF and TCP families, which were fundamentally important for plant growth and development, were highly enriched in all profiled tissues (Fig. 1f and Supplementary Fig. 6). Specifically, the enrichment of SBP families was high in the ACRs of ear, tassel and SAM, in consistent with one of their major roles in establishing meristem boundaries within inflorescence tissues32. Additionally, we also found the motifs of some WRKY family genes that overrepresented in roots and mesophyll, a phenomenon that was accorded with their significantly enhanced transcription in similar tissues33,34.
We further merged the ACRs of all tissues into 80,365 pan-tissue ACRs (~ 61.4 Mb, ~ 2.9% of the maize genome), accounting for 2.8 to 4.4-fold of the ACR length in a single tissue. Our estimation approached the maize single-cell ATAC-seq assays (~ 4%) that had integrated the data of 56,573 nuclei from 6 major tissues19. However, our ACRs were apparently not saturated, representing approximately 85% of saturation according to Michaelis-Menten equation analysis (Fig. 1g), indicating the further necessity to incorporate more specialized tissues or cell-types. Pan-tissue ACRs were further classified into core, variable and tissue-specific ones (Fig. 1g). The contrasting numbers of variable (n = 46,483) and core ACRs (n = 6,475) again suggested the high tissue specificity of accessible chromatins in maize. To further reveal the selective pressure on ACRs, we estimated the SNP density within and flanking ACRs using data of maize HapMap335. As expected, the core ACRs had the lowest SNP density in their central region (Supplementary Fig. 7), followed by variable and private ACRs that were accorded with their conservation degree. Interestingly, the flanking regions of core and variable ACRs shown significantly higher SNP density than private ACRs, possibly due to their correlation with previously reported higher genetic diversity of genic and proximal regions in maize36. In summary, we generated a pan-tissue map of over 80k maize ACRs that could allow their evolutionary characterization in subsequent sections.
Evolutionary constraint of maize ACRs in other grass genomes
Comparative genomic studies in plants have commonly characterized the conservation or innovation of protein-coding genes, yet the evolutionary constraint of more widespread regulatory sequences remains less investigated. Based on the pan-tissue ACRs, we employed a Cactus-based comparative approach (Supplementary Fig. 8) to trace their evolutionary trajectory in 34 Poaceae species and 4 outgroups (Supplementary Fig. 9 and Supplementary Table 4). To demonstrate the robutness of our method, we identified the regions in sorghum genome that highly constrained (alignment coverage > = 90%) with maize ACRs, which could successfully reconstruct the two maize subgenomes (Supplementary Fig. 8b). When expanding to other Poaceae species, the number of highly constrained ACRs was negatively correlated with their pairwise divergence (Fig. 2a). For example, there were 7,117, 3,592 and 1,421 ACRs that highly constrained in sorghum (diverged ~ 11.1 Mya), foxtail millet (~ 22.8 Mya) and rice (~ 49.1 Mya) respectively, which exhibited a good consistency with their overall phylogeny (Fig. 2a and Supplementary Table 4). In general, the constraint of ACRs situated between that of protein-coding sequences and randomly selected DNA sequences (Fig. 2a), suggesting an accelerated evolutionary rate of ACRs over protein-coding sequences in crops.
We then projected ACRs into a planar map (N1-N2, Fig. 2b) according to the number of highly constrained species (N1) and the number of species without constraint (N2, alignment coverage < = 10%), which exhibited a similar triangle as the map generated by comparing human CREs against hundreds of mammal genomes21. It was worth noting that only 1% (n = 808) of the maize ACRs were highly constrained in most Poaceae species (N1 > = 20) (Fig. 2b), contrasting with approximately half of human CREs that were constrained in primates. In addition, there were a substantial number of ACRs (~ 34.6%) that specifically existed in maize (N2 > = 34) (Fig. 2b). These data implied unprecedented high genetic diversity of regulatory sequences in grasses, and further prompted us to explore the driving forces of maize specific ACRs. We found a profound proportion (~ 70%) of maize specific ACRs located in distal regions (Fig. 2c), while both highly and moderately constrained ACRs were exclusively located (> 90%) in genic regions. Meanwhile, nearly half of the sequences in maize specific ACRs were annotated as TEs (Fig. 2d), particularly Gypsy elements, contrasting with very rare TE composition in highly and moderately constrained ACRs. These data highlighted the major role of TEs, mostly resided in regions distal to genes, that involved in the innovation of maize specific ACRs.
To further characterize the interspecific constraint of ACRs, we applied the Uniform Manifold Approximation and Projection (UMAP) algorithm on all ACRs based on their alignment coverage in each of the non-maize genomes (Fig. 2e). This analysis resulted in a UMAP resembling ‘a bunch of flowers’ with constrained (N1 > = 10) ACRs enriched in stalks, while the maize specific ACRs formed two distinct groups, one (class I, n = 7,897) dispersed within the ‘petal’ region, and the other (class II, n = 19,780) formed an ‘isolated island’. We found almost no sequence homology of class II ACRs in other Poaceae genomes as compared with sparse homology of class I ACRs. Although class II ACRs exhibited higher affinity with TEs over class I ACRs, there were still more than half of both class II and class I ACRs that not originated from TEs (Fig. 2d). These non-TE maize specific ACRs might result from sequence mutations that accumulated due to genome redundancy after whole genome duplication of maize.
Genes regulated by highly constrained or species-specific ACRs are likely to encode for traits of conservation or innovation, respectively21. Using our previously generated Chromatin Interaction Analysis by Paired-End Tag Sequencing (ChIA-PETs) data37, we identified these two different sets of genes (n = 2,041 and 10,779), many of which that unable to be identified by the simple principle of proximity (Supplementary Fig. 10). Genes targeted by highly constrained ACRs were found to be enriched in photosynthesis pathways including NADPH: quinone reductase activity and response to light stimulus (Supplementary Fig. 11a). They were also participated in the transmembrane transport of some polyols (Supplementary Fig. 11a) among which mannitol served as a major product of photosynthesis38, as well as Xylose which could play a crucial role during the formation of plant cell walls39,40 to mediate plant response to abiotic stresses (Supplementary Fig. 11a). On the other hand, genes (n = 10,779) regulated by maize specific ACRs were primarily enriched in plant immune or defense response pathways as were commonly observed for PAV genes41,42 (Supplementary Fig. 11b). They include jasmonic acid and salicylic acid metabolic process (Supplementary Fig. 11b), which could regulate plant resistance against pathogen invasion43 during the continuous evolution of plant immune system44. Meanwhile, genes regulated by maize specific ACRs were also enriched in flower development and morphogenesis pathways (Supplementary Fig. 11b). One special example was ramosa1 (ra1), a pivotal gene that could regulate maize inflorescence branching architecture45 and specifically expressed in immature tassel and ear in our profiled tissues (Supplementary Fig. 12). It was notable that only one ACR was identified in both the gene body and 20-kb flanking regions of ra1 (Fig. 2f), which was very likely to be a decisive regulatory locus of ra1. This ACR was maize-specific since it had no constraint in even sorghum and other Poaceae species, therefore it might function through ra1 to contribute to the substantial changes of inflorescence architecture in maize compared to other grasses.
We further investigate the expression profiles of aforementioned two set of genes to find more clues regarding their roles in maize speciation. Although genes regulated by highly constrained ACRs exhibited overall higher expression level (Supplementary Fig. 13a), those regulated by maize specific ACRs, by contrast, exhibited significantly higher tissue specificity (Fig. 2g). This result was consistent with previous findings that tissue-specific expressed genes usually evolved faster46 and were more likely to be affected by speciation and gene duplication47. We further utilized an independent dataset that applied the phylogenetic approach to estimate the age of genes in Poaceae48, and found a higher proportion of young genes among genes that regulated by maize specific ACRs (Supplementary Fig. 13b). Collectively, we demonstrated the massive number of maize specific ACRs that primarily derived from the activity of TEs, and genes they regulated might involve in the innovation of some maize specific traits such as inflorescence development and immune response.
Intraspecific constraint of ACRs and their association with maize domestication
Recent release of diverse maize reference genomes14,49,50 offers us an opportunity to further explore the intraspecific constraint of maize ACRs. By adopting the same comparative approach, we identified highly constrained ACRs in a total of 50 non-B73 maize genomes (Fig. 3a and Supplementary Table 5), finding approximately half of ACRs were fixed in nearly all maize genomes (Supplementary Fig. 14a). This ratio was comparable to the estimation of intraspecific constraint of CREs (~ 50%) in primates22, and accorded with a previous estimation of genetic diversity in maize that was comparable or even higher than the divergence between human and chimpanzee51. On the other hand, only ~ 7% (n = 5,239) of ACRs were constrained in less than 10 maize genomes (Fig. 3a), among which 342 were specifically found in B73. It seemed that the conservation of ACRs we estimated here was stronger than genes, since a pan-genome analysis in maize NAM population had reported a higher degree of gene variability14 (~ 70% were dispensable or private genes). To make clarification, we simultaneously estimated the constraint of ACRs and genes in our panel. We found apparently stronger constraint of genes over ACRs, and their constraint were highly correlated (r = 0.98) with only a few exceptions (e.g., LH244 and PH207) that the constraint of ACRs was slightly stronger than genes (Supplementary Fig. 14b). Interestingly, ACRs that constrained intraspecifically characterized quite different genomic properties with interspecifically constrained ACRs, for example, TEs were common in intraspecifically constrained ACRs while rare in interspecifically constrained ones (Fig. 3b). In addition, a considerable proportion of intraspecifically constrained ACRs were found in distal regions (Fig. 3c), while almost no distal ACRs was found in interspecifically constrained ones. These findings underscored the importance of historical TEs that shared in the common maize ancestors, which might reside in regulatory regions and subsequently be fixed after maize diversification.
Genome comparison between wild teosinte and maize had uncovered hundreds to thousands of candidate genes associated with maize domestication5,7, yet there was still relatively few information regarding the regulatory sequences involved in this process. Based on the classical case of tb1, which was a ‘gain-of-function’ Hopscotch insertion that fixed in maize while absent in teosintes5, we hypothesized that ACRs which were highly constrained in most maize genomes while absent in two wild teosintes (Zea mays ssp. parviglumis and Zea mays ssp. mexicana)52 should play a pivotal role in rewiring transcriptional networks during maize domestication. Under this scenario, we could successfully validate the ACR that colocalized with tb153 (Supplementary Fig. 15a). On genome-wide scale, we totally identified 497 ACRs that potentially involved in maize domestication. Interestingly, as compared with all highly constrained ACRs in maize, this small group of highly-constrained and domestication-associated ACRs shown uncanonically higher TE composition (Fig. 3c), as well as a profound ratio within distal non-genic regions. These results again highlighted the prevalence and importance of TEs, particularly those located distal to genes during domestication of maize. We were interested in the TFs that involved in regulating these domestication associated ACRs and further predicted their TFBSs. We found significant motif enrichment of TCP family (Fig. 3d), which was accorded with the master regulator of maize domestication gene tb1. In addition, we also discovered the enriched motifs of C2H2 family including ramosa1 gene, as well as ERF and bZIP families that were fundamentally important during plant development.
To offer more functional clues into these domestication-associated ACRs, we again used the ChIA-PETs data to find a total of 262 genes that entangled with these ACRs through chromatin interaction. Notably, we noticed significant enrichment of these genes in another independent domestication gene list (Supplementary Fig. 15b) that identified by scanning SNPs between teosinte and maize populations54. Moreover, among the differentially expressed genes (DEGs) identified in three major tissues between teosintes and maize55, we found a significantly higher ratio (17.6%, Supplementary Table 6) of DEGs within these interacting genes over randomly selected genes without interactions (12.7%, Fig. 3e). As an example, we identified an ACR in ~ 51 kb upstream of bHLH68, also known as phytochrome-interacting factor1 or ZmPIF1 that could play versatile roles in regulating plant growth, development and drought tolerance in maize56. Chromatin interactions between this ACR and bHLH68 can be found not only in our ChIA-PETs data but also in other 3D genomic datasets (Fig. 3f ), and this ACR was fixed in multiple maize lines while absent in teosintes, which correlated well with its up-regulation in maize over teosintes (Supplementary Fig. 13). A similar case was also found for a domestication-associated ACR and its interacting gene cmu2 (Supplementary Fig. 15c), which encoded a chorismate mutase with elevated expression in maize ear and leaf and might be involved in plant pathogen infection resistance57 (Supplementary Fig. 13). In summary, our results demonstrated the great promise of analyzing evolutionary constraint of ACRs to identify candidate regulatory loci that associated with crop domestication and subsequent adaptation processes.
Constraint and dominance of ACRs between two maize subgenomes
It was well documented that genes in one maize subgenome always dominated expression over genes in the other one58, a phenomenon known as subgenome dominance which was still less investigated for chromatin accessibility. Using 3,719 high-quality genes pairs with exact 2:1 synteny to sorghum genes (Supplementary Fig. 16), we reconstructed two maize subgenomes (m1, 1,181 Mb; m2, 726 Mb) and classified ACRs into m1 (n = 47,666) and m2 (n = 28,442), respectively (Supplementary Table 7). It was notable that the numbers of highly constrained ACRs in m1 (n = 4,647, 9.7%) and m2 (n = 2,326, 8.1%) when compared to sorghum were higher than the highly constrained ACRs when mutually compared between m1 and m2 (~ 1,200, Fig. 4a and 4b), a phenomenon accorded with the well-known reciprocal subgenome fractionation observed for genes59. We then characterized key metrics of ACRs between two subgenomes, finding both the number and density of ACRs in m1 were higher than m2, although their overall differences were subtle (Fig. 4c and Supplementary Fig. 17a). Accordingly, the transcriptional differentiation was also weak when profiled across two subgenomes (Supplementary Fig. 18a). However, when we focused on these 3,719 pairs of syntenic genes, the dominance of both chromatin accessibility (Fig. 4d and Supplementary Fig. 17b-c) and gene transcription was significantly enhanced (Supplementary Fig. 18b). Meanwhile, the chromatin accessibility of these conserved regions was also much stronger than the remaining less-conserved regions (Fig. 4c and 4d). These results suggested the importance of chromatin accessibility that could function to differentiate duplicated gene loci to introduce subsequent gene expression differentiation60.
To further explore accessible chromatin in specific duplicated loci, we focused on teosinte branched1 (tb1, m1) and its m2 paralog tcp6, both of which encoded the TCP family transcription factors. Although gene density and chromatin accessibility were fairly high in up- and downstream regions of tcp6, no ACR identified in its gene body and 10-kb flanking regions (Fig. 4f), which was consistent with its silenced expression in major maize tissues (Supplementary Fig. 19a). In contrast, tb1 was exclusively activated in ear, which was likely resulted from the functional Hopscotch insertion that shown colocalization with two independent ACRs in our data (Fig. 4f). We also investigated two other syntenic genes pairs, ub2 (m1) and ub3 (m2), gt1 (m1) and Zm00001d048172 (m2), both were crucial in regulating plant architecture61,62. Cases of ub3 and gt1 were similar to tb1 that their causative insertions, named KRN47 and prol1.162, exhibited accessible chromatin in most of our profiled tissues (Supplementary Fig. 19b and 19c). However, no parallel functional insertion was found near ub2 and Zm00001d048172, and consequently their transcription was weakened or completely silenced (Supplementary Fig. 19a). These data suggested that subgenome-specific functional sequence insertions, mostly TEs, might be one of the driving forces underlying the subgenome differentiation of chromatin accessibility and gene expression.
We next sought to decipher the divergence of TFBSs between two maize subgenomes. Unlike genes and ACRs that exhibited strong fractionation63 (Fig. 4a), approximately 80% of the TFBSs (69.1–86.3% across our profiled tissues) can be identified in both two subgenomes (Fig. 4e and Supplementary Fig. 20a). However, the divergence of TFBSs enhanced in promoter and distal ACRs (Fig. 4e and Supplementary Fig. 20b and 20c), e.g., only one-third of the TFBSs in distal ACRs of seedling were shared between two subgenomes. Therefore, the strong conservation of TFBSs between two subgenomes might come from the complementation of ACRs from different functional regions. Additionally, we found a higher number of subgenome-specific TFBS in m1 than m2, which was accorded with the general principal of m1 dominance over m2 (Fig. 4e and Supplementary Fig. 20a-c). For instance, the TFBS of GLK family, which could regulate some earliest photosynthesis steps by activating genes encoding chloroplast-localized and photosynthesis-related proteins64, were exclusively found in m1 (Supplementary Fig. 21). Interestingly, genes that encoding these GLK TFs were also existed in m1, indicating their regulatory functions that preferentially exerted within the same subgenome. These results indicated that the regulomes of two maize subgenome were largely shared except for a few TF families that specifically functioned within a single subgenome to precisely regulate the development of particular tissues.
Trait-associated SNPs are enriched in ACRs that highly constrained in maize
In species with giant and complex genomes, for example in maize65 and humans66, trait-associated SNPs (TASs) identified by Genome Wide Association Studies (GWAS) often fall into non-coding regulatory regions67. We anticipated a better annotation of maize TASs using our pan-tissue map of ACRs. Therefore, we firstly performed GWAS for 15 complex traits in a maize association mapping panel (n = 260) that genotyped across HapMap3 SNPs, resulting in 110 to 363 TASs (minor allele frequency > 0.05, P-value < 10− 5) for each trait (Supplementary Fig. 22 and Supplementary Table 8). We then integrated all TASs (n = 2,984) to improve their statistical power when assessed with ACRs. As expected, TASs were overrepresented in ACRs with their density being 1.64 (cob-weight) to 6.49-fold (leaf-length) higher over the remaining non-ACR regions (Fig. 5a). In addition, the density of TASs in m1 ACRs was overall higher than m2 ACRs (Fig. 5b), indicating a predominant role of m1 subgenome in the genetic regulation of maize complex traits. Interestingly, although all the traits we investigated here were believed to be classical quantitative traits, six out of them (for example, upper-leaf-angle and days-to-tassel) exhibited non-canonical results that the TASs density was higher in ACRs of m2 subgenome (Fig. 5b). The existence of hotspots with an excessive of TASs (Supplementary Fig. 23a and b), which usually entangled with strong LD and can be mapped within m1 or m2 subgenomes for different traits, might be one of the appropriate explanations to these observations.
Researches in humans have revealed that TASs of diseases or other complex traits are more likely to be enriched in interspecific constrained CREs21,22. However, due to low and limited number of interspecifically constrained ACRs in Poaceae (Fig. 2b), the power to investigate maize TASs in ACRs with a varying degree of interspecific constraint was statistically weak. However, we observed that TASs were overrepresented in ACRs with higher intraspecific constraint (Fig. 5c). Specifically, the density of TASs in ACRs that highly constrained in most maize genomes exhibited 1.4-fold increase compared with ACRs with weak constraint in maize. We further verified this result (Supplementary Fig. 24) by using another independent TASs dataset (n = 2,360) that also mapped on maize B73 V4 genome and tested in an ultra-large maize population with more than 1,000 lines68. These findings suggested that complex traits in maize are primarily driven by genetic variants in ACRs, especially for those that fixed in diverse maize population.
We finally tried to annotate some new GWAS loci that resided in ACRs. As an illustration, we identified a lead SNP (chr3: 215,791,183) that significantly associated with both days-to-silk and days-to-tassel, which fell into a non-coding region that was ~ 5.3 kb away from nearest gene. Notably, this lead SNP was colocalized with an intraspecifically highly constrained ACR that accessible in three reproductive tissues (ear, tassel, and SAM) (Fig. 5d), suggesting their potential role of regulating genes involved in flowering. Further incorporation of 3D chromatin interaction data, for example the HiChIP data (Fig. 5d), could further pinpoint to a list of candidate genes that were likely interacting with this ACR, although the exact function of these GWAS loci and interacting genes needed further experimental validations. To end up our work, we have integrated the pan-tissue ATAC-seq data with our recently established HEMU online database (https://shijunpenglab.com/HEMUdb/databrowse)69–71 to facilitate easy visualization and analysis of any loci that recorded by B73 V4 genome coordinates.