Plant materials and genome sequencing
The N. benthamiana LAB strainwas routinely grown in a custom soil mix (potting mix, clay and vermiculite) with a day/night cycle of 16 h/8 h at a constant temperature of 25°C in greenhouse of Peking University Institute of Advanced Agricultural Sciences, Weifang, China. After two weeks, the fresh young leaves were collected and used for genome sequencing and optical mapping. Tissues of roots, leaves, stems, flowers and seeds at two days post anthesis were used to perform Illumina RNA-sequencing on Novaseq 6000 platform and full-length transcriptome sequencing on PacBio Sequal IIe instrument. The high-molecular-weight (HMW) genomic DNA were extracted and used for the construction of sequencing library44. To achieve telomere-to-telomere assembly of N. benthamiana, we performed 110× PacBio HiFi sequencing on Sequel II platform, 40× ONT ultralong sequencing on GridION X5/PromethION sequencer, and 100× Illumina paired-end sequencing on Novoseq 6000 platform. Then we constructed Hi-C sequencing libraries using HindIII restriction enzyme, which were subsequently sequenced on Illumina NovaSeq 6000 to obtain 150× paired-end short reads. The Hi-C data were classified as valid or invalid interactions using HiC-Pro (v3.1.049)45, and only valid interaction pairs were retained for subsequent analysis. Details of genome and transcriptome sequencing are described in the Supplementary Methods.
Bionano optical map generation
The fresh young leaves were collected from N. benthamiana and processed in Grand Omics (Wuhan, China) for construction of optical maps. The high-molecular-weight DNA was extracted using the Prep Plant DNA Isolation Kit (Bionano Genomics), and labelled using the DLE-1 enzyme following the DLS protocol (Bionano Genomics). The labeled DNA was then loaded onto the Saphyr system (Bionano Genomics) for the imaging analysis. A total of 329.6× coverage of effective molecules was generated with an average label density of 14.96/100 kb. The DNA molecules were subsequently assembled using Bionano Solve (v3.7) and the Bionano images of the assembly were visualized using Bionano Access (v1.7.1). Details of Bionano data processing are described in the Supplementary Methods.
Genome assembly
To obtain a high contiguity assembly, Hifiasm (v0.19.5)16 was employed combining the HiFi and ultralong ONT reads with the parameters of “-l 0 --n-hap 2 --hom-cov=111”. The well-assembled HiFi & ONT assembly was in a total length of 2.85 Gb with contig N50 of 144.7 Mb, leaving only three gaps. Two gaps located in 45S rDNA repeat arrays of Chr06 and Chr19 were resolved by following NOR assembly (Supplementary Methods). The last gap, corresponding to NUPT insertion in Chr12, was closed by ultralong ONT reads and confirmed by coverage depth in the IGv46. After gap filling, we generated an assembly comprising 19 contigs with 38 telomeres. Then Hi-C reads were used to anchor all contigs using the pipeline of BWA (v0.7.17)47, Juicer (v1.5)48 and 3D-DNA (v180419)49. We manually checked the assembly through chromatin interaction patterns in Juicebox (v1.11.08)50. An mis-assembly in Chr03 was corrected and the adjustment was then confirmed in the IGv46. Finally, we obtained a T2T genome assembly of N. benthamiana with total length of 2.85 Gb and contig N50 of 146.4 Mb.
Validation of the T2T genome assembly
To assess the quality of the genome assembly, we performed extensive validation in multiple approaches. Firstly, Bionano optical map were aligned to the in silico genome assembly using RefAlinger (v12432.12542rel) tool in the pipeline of Bionano Solve (v3.7). Then the Hi-C interaction matrix of the final assembly was manually checked in Juicebox50 (Supplementary Methods). Secondly, we aligned the NB.T2T assembly with recently published assemblies of N. benthamiana using Syri (v1.6.3)51 and found good collinearities. Thirdly, we mapped HiFi, ONT and NGS reads against the NB.T2T assembly and checked for read coverage for any abnormal signals. Finally, to assess genome completeness, we applied BUSCO (v5.4.3)52 for ortholog detection using solanales_odb10 database (n=5,950). Quality value (QV) was estimated using Merqury (v1.3)22 from HiFi reads. LTR Assembly Index (LAI) was calculated using LTR_retriever (v2.9.0)53 to evaluate the assembly continuity of LTR retrotransposons23.
Repeat sequence annotation
The de novo repeat library for the N. benthamiana was constructed by RepeatModeler (https://github.com/Dfam-consortium/RepeatModeler). Then combining the universal Repbase database (version 20181026), the repetitive elements were annotated using RepeatMasker (v4.1.2)54. The intact LTR elements were identified using the pipeline of LTR_Finder (v1.2)55, LTRharvest (v1.6.2)56, and LTR_retriever (v2.9.0)53. Then LTR sequences from RepeatMasker and intact LTR elements from LTR_retriever were transferred to TEsorter (v1.3)57 to classify the subfamily of LTR-RTs. Among which, the Ty3-Gypsy elements was classified into seven clades as Athila, CRM, Galadriel, Ogre, Reina, Retand, and Tekay. The insertion time of intact LTR retrotransposons were calculated using LTR_retriever53. The TRASH3 pipeline was used to identify satellite repeats. Two major microsatellites were identified in centromeres, including CEN33 (ACGAGTCAGG ACGTGGCAGG ACATGGCCATGGC) and CEN43 (ACGTGTCAGG ACGCGTCAGG ACGCGTCAGG ACATGGCCATGGC).
Gene model annotation
To predict coding-gene models, we applied MAKER (v2.31.11)58 pipeline combining evidence from homology protein, transcript evidence, and ab initio prediction. The proteins used for homology-based prediction were from N. attenuata59, N. benthamiana18,19, S. lycopersicum, S. tuberosum11, and universal Swiss-Prot proteins. The redundant sequences were excluded using cd-hit (v 4.8.1)60. RNA-seq reads were mapped to the assembled genome using HISAT2 (v2.2.1)61, followed by genome-guided transcript assembly by StringTie (v1.13)62. The full-length transcriptome reads were processed using the SMRT Analysis software Isoseq3 (https://github.com/PacificBiosciences/IsoSeq). The SNAP63, GeneMark-ET64 and AUGUSTUS65 models were trained using MAKER258 and BRAKER2 (v2.1.6)66 pipeline as previously reported. Finally, the trained models, protein and transcript evidences were integrated into MAKER2 to predict credible gene structures. Gene models were then manually corrected in IGV-GSAman (https://gitee.com/CJchen/IGV-sRNA) with the support of transcript coverage and previous annotations18,19.
Gene annotation assessment
To provide a valuable resource for plant research community, we compared our genome annotation (NB.T2T) with two recently published versions, NB.PCP18 and NB.MP20. First, OMArk (v2.0.3)67 was performed to assess not only the completeness but also the consistency of the gene repertoire in three versions. The recommended LUCA.h5 database was used to exploit orthology relationships. Second, OrthoFinder (v2.5.4)68 was conducted to identify orthogroups using non-redundant protein sequences from three versions.
Organelle genomes assembly
The mitochondria and chloroplast genomes were assembled with ONT data using NextDenovo (v2.5.0)69 incorporated in the PMAT pipeline70. Bandage software (v0.8.1)71 was employed to visualize the organelle genomes, and remove any linear organelle fragments and nuclear sequences. Then ONT ultra-long reads containing mitochondrial genes were extracted and subsequently aligned to the mitochondrial genome using blastn within Bandage to resolve repeat regions within the graphical mitochondrial genome. Finally, we obtained the circular mitochondria and chloroplast genomes, annotated using GeSeq (v2.0.3)72 online website, and deposited at the China National Center for Bioinformation under accession number C_AA066595 and C_AA066594, respectively. To identify organelle genome insertions within nucleus genomes, blastn searches were performed and the positions were visualized by R package RIdeogram73.
CENH3 ChIP-seq
Previously, based on the amino acid sequence of NbCENH3, two peptides corresponding to the N-terminus (H2N-ARTKHLALRKQSRPPSRPTA-COOH) or full sequence of NbCENH3 was synthesized to produce anti-CENH3 antibodies in rabbit. The ChIP cloning experiment was then conducted on chromatin extract from young leaves using anti-CENH3 antibodies according to a previously reported protocol74. The ChIP library was finally amplified with the VAHTS® Universal DNA Library Prep Kit for Illumina V3 (Vazyme ND607), and sequenced on the Illumina Novaseq 6000 platform to produce 150-bp paired-end reads. Since the two antibodies exhibited consistent results, so only the antibody against the full sequence of NbCENH3 was used in this study with two biological replicates.
Centromere identification in public genomes
To extrapolate a phylogenetic story about plant centromeres, we downloaded eight published T2T assemblies and four draft assemblies. The centromeres of Arabidopsis6, rice5, wheat7, einkorn8, oat9, maize10, potato11, pepper14, and soybean24 were identified by CENH3 ChIP-seq, while the grape75 and carrot76 centromeres were predicted based on distribution of satellite repeats. Additionally, the centromeres of coffee25 and tomato12 were found by searching centromere-specific retrotransposons as CRC (Centromeric retrotransposons in Coffea) and TGR4 (Ty3-Gypsy type retrotransposons) in whole genomes, respectively. As for tobacco, two centromeric retrotransposons colocalized with NtCENH3 were helped to predicate putative centromere regions13. The predicted centromeres of coffee, tomato and tobacco were further confirmed by Hi-C interaction matrixes, as the centromeres always showed obvious inter-chromosome interactions14.
ChIP-seq data processing
Histone modification ChIP-seq data was downloaded from NCBI database under BioProject PRJNA881799. The CENH3 ChIP-seq were conducted as above mentioned. Paired-end reads were preprocessed with fastp (v0.23.2)77 to remove low-quality bases, and then aligned to the NB.T2T assembly using bowtie2 (v2.5.1)78 with default settings. Alignments with mapping quality of < 30 were discarded, and all read duplicates were removed using Samtools(v1.10)79. For each dataset, the bamCompare tool from Deeptools (v3.5.1)80 was used to quantify histone methylation and CENH3 enrichement in bigwig format. The coverage was calculated as the log2 of the ratio (ChIP/control) per 50-bp bin. For profiling of CENH3 occupancy, alignment bedGraphs were also used to calculate log2(ChIP/control) in 10-kb bins. CENH3 domains were identified by comparing the ChIP and input data using MACS2 (v2.2.7.1)81 with following parameters: “--nomodel --extsize 291 --keep-dup auto -broad”.
ChIP-seq Metaplots
The above bamCompare generated bigwig files were used to calculate the enrichment level across centromeric elements as CEN33/43 satellites, Gypsy retrotransposons, Copia retrotransposons, CENH3 domains, and CRM/Galadriel/Ogre/Tekay elements using computeMatrix tool from Deeptools80 in ‘scale-regions’ mode with parameters of “-regionBodyLength 2000 -beforeRegionStartLength 2000 -afterRegionStartLength 2000”. Then metaplots for CENH3, H3K4me3 and H3K9me2 were plotted with plotHeatmap available from Deeptools. The visualization of centromeric and telomeric satellite repeats was accomplished using StainedGlass82.
DNA methylation processing
We used Nanopolish (v0.13.2)83 to identify 5-methylcytosine bases in a CpG context in ONT reads. The raw reads were firstly indexed to the fast5 files, and then aligned to the genome using Minimap284 (-ax map-ont). Then Nanopolish was used to call methylation, and the script of calculate_methylation_frequency.py was used to count the methylation frequency at each CpG sites. We further summarized the methylation levels at 10-kb no-overlapping windows.
Subgenome phasing
We used SubPhaser (v1.2.5)32 to phase and partition the subgenomes of allotetraploid N. benthamiana based on differential k-mers among homoeologous chromosome sets. Briefly, the syntenic analysis was firstly performed by JCVI (v1.1.19)85 and MCScanX86, and genes in syntenic blocks were identified as homeologues. Then SubPhaser searches for the subgenome specific k-mers, and assigns homoeologous chromosomes into two subgenomes. To infer the potential diploid ancestors of N. benthamiana, we applied mapping-based and phylogenetic-based methods. The Illumina short reads of N. benthamiana were mapped to composite reference of five diploid Nicotiana genomes (N. attenuata, N. glauca, N. longiflora, N. sylvestris, N. tomentosiformis) using sppIDer87. The mapping rates against each diploid genome indicated the genetic contributions of diploid species to the allotetraploid. Meanwhile, we also mapped the Illumina short reads of N. sylvestris (NCBI BioProject: PRJEB1826) and N. attenuata (NCBI BioProject: PRJNA316810) to NB.T2T assembly. The coverage depth against two subgenomes revealed their evolutionary relationship in a certain degree.
Phylogenetic analysis
OrthoFinder (v2.5.4)68 was used to identify orthogroups using non-redundant protein sequences from 21 species (Supplementary Table 12). Each subgenome of N. benthamiana and N. tabacum was considered as an independent pseudospecies. In total, 361 single-copy orthogroups were identified to construct the maximum likelihood tree using RAxML (v8.2.12)88. The CodeML and MCMCTree programs in the PAML (v4.9)89 were used to analyze amino acid substitution models and estimate divergence times. IQ-TREE (v2.0.3)90 was applied to construct the phylogenetic tree of Gypsy retrotransposons based on RT (reverse transcriptase) domains. The maximum likelihood tree of CENH3 proteins was also conducted using IQ-TREE90.
Additional references
44. Zhang, M. et al. Preparation of megabase-sized DNA from a variety of organisms using the nuclei method for advanced genomics research. Nat. Protoc. 7, 467-478 (2012).
45. Servant, N. et al. HiC-Pro: an optimized and flexible pipeline for Hi-C data processing. Genome Biol. 16, 259 (2015).
46. Robinson, J. T. et al. Integrative genomics viewer. Nat. Biotechnol. 29, 24-26 (2011).
47. Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754-1760 (2009).
48. Durand, N. C. et al. Juicer provides a one-click system for analyzing loop-resolution Hi-C experiments. Cell Syst. 3, 95-98 (2016).
49. Dudchenko, O. et al. De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science 356, 92-95 (2017).
50. Durand, N.C. et al. Juicebox provides a visualization system for Hi-C contact maps with unlimited zoom. Cell Syst. 3, 99-101 (2016).
51. Goel, M., Sun, H., Jiao, W. B. & Schneeberger, K. SyRI: finding genomic rearrangements and local sequence differences from whole-genome assemblies. Genome Biol. 20, 277 (2019).
52. Manni, M., Berkeley, M. R., Seppey, M., Simao, F. A. & Zdobnov, E. M. BUSCO update: novel and streamlined workflows along with broader and deeper phylogenetic coverage for scoring of eukaryotic, prokaryotic, and viral genomes. Mol. Biol. Evol. 38, 4647-4654 (2021).
53. Ou, S. & Jiang, N. LTR_retriever: A highly accurate and sensitive program for identification of long terminal repeat retrotransposons. Plant Physiol. 176, 1410-1422 (2018).
54. Tarailo-Graovac, M. & Chen, N. Using RepeatMasker to identify repetitive elements in genomic sequences. Curr. Protoc. Bioinformatics 5, 4-10 (2009).
55. Xu, Z. & Wang, H. LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic Acids Res. 35, W265-W268 (2007).
56. Ellinghaus, D., Kurtz, S. & Willhoeft, U. LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons. BMC Bioinformatics 9, 18 (2008).
57. Zhang, R. G. et al. TEsorter: an accurate and fast method to classify LTR-retrotransposons in plant genomes. Hortic. Res. 9, uhac017 (2022).
58. Cantarel, B. L. et al. MAKER: an easy-to-use annotation pipeline designed for emerging model organism genomes. Genome Res. 18, 188-196 (2008).
59. Ray, R. et al. A persistent major mutation in canonical jasmonate signaling is embedded in an herbivory-elicited gene network. Proc. Natl. Acad. Sci. USA 120, e2308500120 (2023).
60. Li, W. & Godzik, A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 1658-1659 (2006).
61. Kim, D., Langmead, B. & Salzberg, S. L. HISAT: a fast spliced aligner with low memory requirements. Nat. Methods 12, 357-360 (2015).
62. Pertea, M. et al. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat. Biotechnol. 33, 290-295 (2015).
63. Korf, I. Gene finding in novel genomes. BMC Bioinformatics 5, 59 (2004).
64. Lomsadze, A., Ter-Hovhannisyan, V., Chernoff, Y. O. & Borodovsky, M. Gene identification in novel eukaryotic genomes by self-training algorithm. Nucleic Acids Res. 33, 6494-6506 (2005).
65. Stanke, M., Tzvetkova, A. & Morgenstern, B. AUGUSTUS at EGASP: using EST, protein and genomic alignments for improved gene prediction in the human genome. Genome Biol. 7, S11 (2006).
66. Brůna, T., Hoff, K. J., Lomsadze, A., Stanke, M. & Borodovsky, M. BRAKER2: automatic eukaryotic genome annotation with GeneMark-EP+ and AUGUSTUS supported by a protein database. NAR Genom. Bioinform. 3, lqaa108 (2021).
67. Nevers, Y. et al. Quality assessment of gene repertoire annotations with OMArk. Nat. Biotechnol. (2024).
68. Emms, D. M. & Kelly, S. OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biol. 20, 238 (2019).
69. Hu, J. et al. NextDenovo: an efficient error correction and accurate assembly tool for noisy long reads. Genome Biol. 25, 107 (2024).
70. Bi, C. et al. PMAT: an efficient plant mitogenome assembly toolkit using low-coverage HiFi sequencing data. Hortic Res. 11, uhae023 (2024).
71. Wick, R. R., Schultz, M. B., Zobel, J. & Holt, K. E. Bandage: interactive visualization of de novo genome assemblies. Bioinformatics 31, 3350-3352 (2015).
72. Tillich, M. et al. GeSeq - versatile and accurate annotation of organelle genomes. Nucleic Acids Res. 45, W6-W11 (2017).
73. Hao, Z. et al. RIdeogram: drawing SVG graphics to visualize and map genome-wide data on the idiograms. PeerJ Comput. Sci. 6, e251 (2020).
74. Nagaki, K., Kashihara, K. & Murata, M. A centromeric DNA sequence colocalized with a centromere-specific histone H3 in tobacco. Chromosoma 118, 249-257 (2009).
75. Shi, X. et al. The complete reference genome for grapevine (Vitis vinifera L.) genetics and breeding. Hortic. Res. 10, uhad061 (2023).
76. Wang, Y.H. et al. Telomere-to-telomere carrot (Daucus carota) genome assembly reveals carotenoid characteristics. Hortic. Res. 10, uhad103 (2023).
77. Chen, S., Zhou, Y., Chen, Y. & Gu, J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34, i884-i890 (2018).
78. Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie2. Nat. Methods 9, 357-359 (2012).
79. Li, H. et al. The sequence alignment/map format and SAMtools. Bioinformatics 25, 2078-2079 (2009).
80. Ramirez, F. et al. deepTools2: a next generation web server for deep-sequencing data analysis. Nucleic Acids Res. 44, W160-W165 (2016).
81. Zhang, Y. et al. Model-based analysis of ChIP-Seq (MACS). Genome Biol. 9, R137 (2008).
82. Vollger, M.R., Kerpedjiev, P., Phillippy, A.M. & Eichler, E.E. StainedGlass: interactive visualization of massive tandem repeat structures with identity heatmaps. Bioinformatics 38, 2049-2051 (2022).
83. Simpson, J. T. et al. Detecting DNA cytosine methylation using nanopore sequencing. Nat. Methods 14, 407-410 (2017).
84. Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094-3100 (2018).
85. Tang, H. et al. Synteny and collinearity in plant genomes. Science 320, 486-488 (2008).
86. Wang, Y. et al. MCScanX: a toolkit for detection and evolutionary analysis of gene synteny and collinearity. Nucleic Acids Res. 40, e49 (2012).
87. Langdon, Q. K., Peris, D., Kyle, B. & Hittinger, C. T. sppIDer: A species identification tool to investigate hybrid genomes with high-throughput sequencing. Mol. Biol. Evol. 35, 2835-2849 (2018).
88. Stamatakis, A. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30, 1312-1313 (2014).
89. Yang, Z. PAML 4: phylogenetic analysis by maximum likelihood. Mol. Biol. Evol. 24, 1586-1591 (2007).
90. Nguyen, L. T., Schmidt, H. A., von Haeseler, A. & Minh, B. Q. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol. Biol. Evol. 32, 268-274 (2015).