Sample collection and culture
Coral colonies of A. muricata, M. foliosa, and P. verrucosa were collected from the Xisha Islands (15°40′–17°10′ N, 111°–113° E) at water depths from 5 to 10 m. The colonies were transferred to laboratory aquariums for acclimation and culture. M. capricornis colonies were obtained from Haiyi Aquatic Breeding Co., Ltd. All coral branches sampled were from the same colony.
DNA isolation and sequencing libraries
Genomic DNA was extracted from coral branches using the DNeasy Blood & Tissue Kit (Qiagen, Cat. No. / ID: 69504) following the manufacturer's instructions. The DNA quantity and quality were assessed using Qubit, Nanodrop, and Femto Pulse machines. Four distinct sequencing technologies were employed to acquire the genome sequences. These included paired-end sequencing libraries with a 350 bp insert size, 10X Genomics linked-read libraries with a 600 bp insert size, Hi-C libraries with a 350 bp insert size, and single-molecule real-time sequencing (SMRT) bell libraries with a 30 kb insert size. The libraries were constructed following protocols specific to each sequencing strategy and subsequently sequenced on the Illumina HiSeq X Ten platform (for paired-end, 10X Genomics, and Hi-C libraries) and the PacBio Sequel platform (for PacBio libraries).
Genome sequencing and assembly workflow
The workflow for stony coral genome sequencing and assembly consisted of nine steps: 1) genome survey, 2) V1 Genome—genome assembly was performed using Canu software (v. 2.2) with sequencing data from the PacBio Sequel platform, 3) V2 Genome—sequence errors were corrected using NextPolish (v. 1.3.1) software, based on sequencing data from the Illumina HiSeq X Ten System and the PacBio Sequel platform, 4) V3 Genome—heterozygous contigs were removed using Purge Haplotigs software (v. 1.0.2), 5) V4 Genome—exogenous sequence contaminations of metagenomic data were removed using BlobTools software (v. 1.0), with reference to the NCBI Nucleotide (Nt) database, 6) V5 Genome—exogenous sequence contaminations of endosymbiotic Symbiodiniaceae and symbiotic algae were removed using BLAST software (v. 2.2.26), guided by relevant sections of BioProject PRJNA544778 (Supplementary Table 1), 7) V6 Genome—contigs of V5 genome were linked into scaffolds using fragScaff software (v. 140324.1), guided by 10X Genomics linked-read sequencing data, 8) V7 Genome—scaffolds of V6 genome were anchored into a pseudochromosome-level genome using ALLHiC software (v. 0.9.13), with guidance from Hi-C sequencing data, and 9) V8 Genome—gaps in the V7 genome were closed using TGS-GapCloser software (v. 1.2.1), utilizing sequencing data from the PacBio Sequel platform.
Genome survey
The genome sizes were assessed by analyzing the K-mer spectrum derived from the paired-end Illumina sequencing data. The frequency of each K-mer (k = 17) was determined using Jellyfish (v. 2.1.3). K-mer analysis (K = 17) was employed to estimate the genome size and heterozygote frequency of the four studied stony corals. GC-content separation analysis was utilized to identify any exogenous sequence contamination.
Symbiotic algae isolation
The Symbiodiniaceae and other symbiotic algae were isolated and enriched from stony coral tissues using stepwise filtration and density gradient centrifugation in sterile filtered seawater (FSW). After 24 hours of antibiotic treatment (penicillin, streptomycin sulfate), the cells were cultured in 96-well plates and 24-well plates using modified L1 medium. Single clones of algae were obtained through single cell isolation and culture. The species were identified using internal transcribed spacer 2 (ITS2) sequences. All cultured Symbiodiniaceae were mixed to meet the sample quantity requirements for next-generation sequencing, which was used for subsequent decontamination of the stony coral genomes.
Building contaminant sequence dataset
Since stony corals from specific ecological niches have unique types of endosymbiotic Symbiodiniaceae and symbiotic algae16, solely relying on the NCBI NT database for removing their sequence contaminations is not sufficient. To address this, we constructed a dataset by isolating the endosymbiotic Symbiodiniaceae and symbiotic algae from the four studied stony corals and sequencing their genomic DNA using the Illumina HiSeq X Ten System. This dataset comprised a subset that included a mix of Symbiodiniaceae from all four stony corals, nine subsets of symbiotic algae, and one subset of crustose coralline algae (Supplementary Table 1). Relevant removal procedures were subsequently carried out, guided by this dataset.
Illumina and PacBio sequencing
The raw sequencing data generated by the Illumina HiSeq X Ten platform, which included adapters, more than 10% of unknown nucleotides (N), and 50% of low-quality bases (Q-value <= 10), were trimmed to obtain high-quality reads. The subreads of the PacBio sequencing data were filtered using default parameters.
10X Genomics sequencing
The Chromium library was prepared following the protocols provided by 10X Genomics, using the Genome Reagent Kit v2 (10x Genomics, San Francisco, California, United States). Sample quantity and quality were assessed using Qubit, Nanodrop, and Femto Pulse machines. For each library, approximately 10 ng of high molecular weight (HMW) genomic DNA (with a mean fragment length greater than 65 kb) was utilized. In the microfluidic Genome Chip, a library of Genome Gel Beads was combined with HMW template gDNA, master mix, and partitioning oil to generate Gel Bead-In-Emulsions (GEMs) within the Chromium apparatus. Each Gel Bead was then functionalized with millions of copies of a 10x™ barcoded primer. The Genome Gel Bead dissolution inside the GEM released primers containing (i) an Illumina R1 sequence (Read 1 sequencing primer), (ii) a 16 bp 10x Barcode, and (iii) a 6 bp random primer sequence. The R1 sequence and the 10x™ barcode were added to the molecules during the GEM incubation, while P5 and P7 primers, R2 sequence, and Sample Index were added during library construction. The library was then subjected to 10 cycles of PCR amplification. In total, four 10X Genomics linked-read libraries were constructed and sequenced using an Illumina HiSeq X Ten platform.
Hi-C sequencing
To perform Hi-C sequencing on stony coral branch tissues, we adjusted the concentration of PBS buffers to 0.035 M in order to match the osmotic pressure of the seawater in which stony corals reside. One Dovetail Genomics Hi-C library was prepared using the whole body of an individual stony coral branch that had been treated with 75% alcohol. The procedure involved fixing the chromatin in the nucleus with a 2% formaldehyde solution in MS buffer (10 mM potassium phosphate, pH 7.0; 50 mM NaCl; 0.1M sucrose) at room temperature for 30 minutes under vacuum, followed by extraction. The fixed chromatin was then digested with 400 U of HindIII restriction enzyme (NEB) at 37 °C for 16 hours. DNA ends were labeled with biotin and incubated at 37 °C for 45 minutes, with the enzyme subsequently inactivated using a 20% SDS solution. DNA ligation was carried out by adding T4 DNA ligase (NEB) and incubating at 16 °C for 4-6 hours. After ligation, proteinase K was added to reverse cross-linking during an overnight incubation at 65 °C. The DNA fragments were purified and dissolved in 86 μL of water. Purified DNA was treated to remove biotin that was not internal to ligated fragments and then sheared to an average fragment size of approximately 350 bp. DNA fragments labeled with biotin were finally separated using Dynabeads® M-280 Streptavidin (Life Technologies). The Hi-C libraries were quality-controlled and sequenced on an Illumina HiSeq X Ten platform.
Total RNA isolation
The coral branch samples were ground manually into fine powder using a frozen mortar and pestle in liquid nitrogen. TRIzol® LS Reagent (Thermo Fisher Scientific, 10296028, Waltham, MA, USA) was added to the ground samples at a ratio of approximately 1:3 (sample to reagent), and the mixture was further homogenized. After centrifugation, the supernatant was collected and combined with BCP reagent (Molecular Research Center, BP 151, Cincinnati, OH, USA). The resulting solution was then centrifuged, and the supernatant was mixed with Isopropanol (Amresco, 0918-500ML, Radnor, PA, USA) and allowed to stand overnight at -20 °C. After another round of centrifugation, the supernatant was discarded, and the RNA pellet was washed with 75% Ice Ethyl alcohol, Pure (Sigma-Aldrich, E7023-500ML, Taufkirchen, München, Germany). The RNA samples were then ready for transcriptome sequencing.
RNA-seq
The PacBio Sequel platform was used for full-length transcriptome sequencing, while the Illumina HiSeq X Ten platform was employed for short-read transcriptome and small RNA (miRNA) sequencing.
BUSCO assessment
The BUSCO (v. 5) assessment in this study was conducted according to the methodology described in a previous study44.
Gene annotation
The basic genome annotation workflow in this study consists of three pipelines: repetitive sequence annotation, gene annotation (structure and function annotations), and non-coding RNA (ncRNA) annotation (Supplementary Tables 1, 2). To enhance annotation quality, both full-length and short-read transcriptomic data were employed (Supplementary Tables 3-5). After ensuring that the full-length transcriptome data had reached saturation (Supplementary Table 3), we used full-length transcriptomic data, short-read transcriptomic data, and our reference genomes to conduct a quantitative analysis on the gene structures of transcriptomic isoforms, alternative splicing (AS) genes, fusion genes, and LncRNAs (Supplementary Tables 3, 4). Additionally, employing short-read transcriptomic data and our reference genomes, we performed a quantitative analysis of miRNAs (Supplementary Table 5).
Chromosome macrosynteny analysis
Focusing on the characteristics of orthologous coding genes at chromosome loci and protein sequences, we employed the General Feature Format version 3 (GFF3) files (longest.filter.gff3.gz and pep.fa) (Supplementary Table 1) to conduct an integrated visual chromosome synteny analysis across genomes, utilizing Multiple Collinear Scanning toolkits (mcscanx)45 based on the protein sequences derived from the investigated genomic data6, 49, 50.
Homeobox gene analysis
Homeodomain domain sequences of each Hox protein subclass were obtained from the homeobox gene database (HomeoDB2) (Supplementary Table 6)46. The homeodomain domain sequences of each Hox protein subfamily were aligned using MAFFT software (v. 7.158) with default parameters, forming a multiple sequence alignment for each subfamily. Based on the multiple sequence alignment of each Hox protein subfamily, a probabilistic model (profile HMM) for each subfamily was constructed using the hmmbuild software (v. 3.2.1) with default parameters, to query sequence databases and find homologous sequences. Pfam domains of all protein sequences in each species were predicted using the PfamScan (v. 1.6) software with default parameters and the Pfam database (v. 32). Protein sequences with a predicted Homeodomain (PF00046.29) domain were processed in a manner similar to the HMM training set using the hmmsearch software (v. 3.2.1). The resulting hits for each Hox subfamily were ranked according to E-value with a cut-off of 1e-10, and the protein sequence ranked first was identified as the corresponding protein of the respective Hox subfamily (Supplementary Table 6).
Comparative genomics analysis
Genomes and annotation files of 20 cnidarians and one A. queenslandica were downloaded from the NCBI database, excluding the four genomes assembled in this study. For genes with multiple alternative isoforms, only the longest transcript was chosen to represent the gene in each species. The proteins of the 25 species were aligned using an all-against-all search algorithm with an E-value threshold of 1e-7. The orthomcl markov clustering program was then utilized with the parameter "-inflation 1.5" to cluster gene families. Few single-copy genes were identified, limiting the species tree reconstruction. Orthologous genes from the 25 species were assigned to address this. Gene families with at least one gene per species were collected based on the blastp results obtained from the gene family clustering analysis. For gene families with multiple copies, the member with the best blast match to the genes of A. queenslandica was selected as the orthologue for analysis, with the longest orthologue in A. queenslandica designated as the reference gene. In total, 1,554 orthologous genes were gathered for subsequent phylogenetic analysis. For specific parameters, see Supplementary Table 7.
To analyze the expansion (Set4) and contraction (Set5) of orthologous gene families (Supplementary Table 7), the CAFÉ tool (v. 4) was employed. This tool utilized a stochastic birth and death model with variable lambda values to study changes in gene families along each lineage in the phylogenetic tree. A probabilistic graphical model was introduced to calculate the probability of transitions in gene family size from parent to child nodes. The corresponding P-values were calculated for each lineage based on the conditional likelihood. A threshold of 0.05 was used to identify significantly expanded or contracted gene families.
To identify genes undergoing positive selection (Set6) along the Scleractinia stony coral lineage (foreground branch), a comparison was made with other species from the phylum Cnidaria (background branch). Multiple sequence alignment of 2,092 single-copy genes was conducted using MUSCLE (v. 3.8.31). The branch-site model in the codeml program, which is part of the PAML package (v. 4.7), was employed to detect signals of positive selection.
Pan-genomic analysis
A pan-genomic map was created for the selected stony corals, including core genes, dispensable genes, and private genes. Core genes are present in all individuals (Set1-Set3, Supplementary Table 7), dispensable genes are present in some individuals (at least two), and private genes are found in a single individual. The genes were clustered using OrthoMCL software (v. 1.4) to identify single-copy gene families, multi-copy gene families, and species-specific gene families.
Insertion times of TEs.
To understand the temporal dynamics of TE activities during the evolution of stony corals, divergences of TEs from the consensus sequences extracted from RepeatMasker (v. 4.1.5) results were adjusted for multiple substitutions using the Jukes-Cantor formula K = −300/ 4× Ln (1 − D × 4/300), where D represents the distance between the fragmented repeat and the consensus sequence. Insertion times of TEs were estimated using the equation T = K/2r, where T is the insertion time, and r is the nucleotide substitution rate for cnidaria, 4.0e-9 referred to in previous study4.
PMSC analysis
The PMSC analysis in this study was conducted according to the methodology described in a previous study4.
scRNA-seq
To prepare the single-cell suspension, a 0.035 M PBS with a pH of 8.0 was created. The samples were washed three times with this buffered solution. A digestion solution was prepared by diluting 1‰ Tyrisin (Sigma-Aldrich, T1426-500MG, St. Louis, MO, USA) in the 0.035 M PBS. The branches of A. muricata, P. verrucosa, M. capricornis, and M. foliosa underwent digestion for 15 minutes, 45 minutes, 2 hour, and 3 hours, respectively. Next, the solution was filtered using 40 μm FisherbrandTM Sterile Cell Strainers (Thermo Fisher Scientific, 22-363-547, Waltham, MA, USA) and centrifuged at 1,000 rpm for 5-6 minutes. The resulting supernatant was collected, and the digestion was stopped by adding BSA (Sangon Biotech, A600332-0005, Shanghai, China) to a final concentration of 0.4 mg/ml. Living cell counting and assessment were performed using an automated cell counter (TC20TM, Bio-Rad Laboratories, Inc., Hercules, CA, USA). After washing and resuspension, the qualified cells were prepared into a single-cell suspension at an appropriate concentration for the library construction of the BD Rhapsody™ Single-Cell Analysis System. Sequencing was then performed on the Illumina NovaSeq 6000 System.
Single-cell transcriptome analysis
The Umi-tools pipeline (https://umi-tools.readthedocs.io/en/latest/index.html) was used to analyze the raw data with default parameters. The filtered_gene_bc_matrices output from umi-tools was loaded into the Seurat package (v. 3.0). To remove dead and low-quality cells, genes expressed in more than three single cells were kept, and each cell was required to have at least 200 expressed genes. Additionally, DoubletFinder was used to identify and remove doublets. After filtering, the remaining cells were used for downstream analysis (Supplementary Table 9).
The scaled data were first normalized using 'LogNormalize'. The 'FindVariableFeatures' function with mean.cutoff (0.0125-5) and dispersion.cutoff (1, Inf) parameters was used to identify highly variable genes for clustering analysis. For principal component analysis (PCA), the scaled data were reduced to 50 approximate principal components (PCs) based on the highly variable genes. The number of PCs used for t-SNE analysis was determined using the Elbowplot function. Clusters were identified using the 'FindClusters' function with selected PCs and a resolution of 0.6. Cluster-enriched genes were identified using the 'FindAllMarkers' function with a 'wilcox' test, using parameters of 'min.pct = 0.25' and 'logfc.threshold = 0.25' to define enriched genes. The single-cell data structures and trajectories were visualized and explored using t-SNE (using the 'RunTSNE' function) and UMAP (using the 'RunUMAP' function with parameters 'metric = correlation', 'min.dist = 0.3', and 'umap.method = umap-learn').
The Seurat (v. 3.0) was utilized for data normalization, dimensionality reduction, clustering, and differential expression analysis. Specifically, the Seurat alignment method, canonical correlation analysis (CCA), was employed for integrated analysis of datasets. To perform clustering, highly variable genes were selected and principal components based on these genes were used to construct a graph, which was then segmented with a resolution of 0.6.
To determine the cell clusters of A. muricata, the marker genes of each cell type5,30 and cluster-enriched genes were utilized (Supplementary Table 8). The Wilcox algorithm was employed to calculate statistical significance. Cluster-enriched genes meeting the following criteria were considered signature genes: (1) adjusted p-value < 0.05 after Bonferroni correction using all features in the dataset, (2) log fold-change of the average expression > 2.0, and (3) pct.1 > 0.5 (pct.1 refers to the percentage of cells where the feature is detected in the first group) (Supplementary Table 9).
For the cell cluster determination of M. foliosa, M. capricornis, and P. verrucosa, supervised clustering was performed on the single-cell sequencing data based on marker genes (Supplementary Table 8)5,30. The R package scSorter v0.0.2 (category) was used for initial clustering47, followed by further clustering correction and dimensionality reduction using the R package ssingleCellNet v0.1.0 (group) to generate visualization (Supplementary Table 10)48.
SIDT1 analysis
The structures of the amu-miR-100 precursor and SIDT1 protein tetramer were constructed using AlphaFold210. For details on the gene analysis, preparation of simulation systems, molecular dynamics simulations, pulling molecular dynamics simulations, potential of mean force simulations, and calculations of relative permeability coefficients, please refer to Supplementary Table 8.
References:
44. Waterhouse, R. M. et al. BUSCO applications from quality assessments to gene prediction and phylogenomics. Mol. Biol. Evol. 35, 543–548 (2018).
45. Wang, Y. et al. MCScanX: a toolkit for detection and evolutionary analysis of gene synteny and collinearity. Nucleic Acids Res. 40, e49 (2012).
46. Zhong, Y.-F. & Holland, P. HomeoDB2: functional expansion of a comparative homeobox gene database for evolutionary developmental biology. Evol. Dev. 13, 567–568 (2011).
47. Guo, H. & Li, J. scSorter: assigning cells to known cell types according to marker genes. Genome Biol. 22, 69 (2021).
48. Tan, Y. & Cahan, P. SingleCellNet: A computational tool to classify single cell RNA-seq data across platforms and across species. Cell Syst. 9, 207-213.e2. (2019).
49. Chapman, J. A. et al. The dynamic genome of Hydra. Nature 464, 592–596 (2010).
50. Li, Y. et al. Chromosome-level reference genome of the jellyfish Rhopilema esculentum. Gigascience 9, giaa036 (2020).