Plant material and DNA extraction
A total of 12 individual samples representing 11 Distylium species were sampled from the Plant DNA Bank of China at the Institute of Botany, Chinese Academy of Sciences. All samples were identified based on morphological characters. The details of the plant samples are presented in Table S1. Total genomic DNA was extracted from the leaf tissues of herbarium specimens of this genus following the modified CTAB DNA extraction protocol [22].
Sequence, chloroplast genome assembly, and annotation
The total DNA was constructed using 350-bp insert libraries according to the manufacturer’s instructions, which was then used for sequencing. Paired-end sequencing was performed on an Illumina HiSeq X-ten at Novogene (Tianjin, China), yielding approximately 4 Gb of high-quality 150-bp paired-end reads per sample.
The raw reads obtained from Novogene were filtered using Trimmomatic 0.39 [23] with the following parameters: LEADING = 20, TRAILING = 20, SLIDING WINDOW = 4:15, MIN LEN = 36, and AVG QUAL = 20. High-quality reads were assembled de novo using the SPAdes 3.6.1 program [24]. The chloroplast genome sequence contigs were selected from the initial assembled reads in SPAdes by performing a BLAST search using several related Hamamelidaceae chloroplast genome sequences as references. The chloroplast genome sequence contigs were further assembled using Sequencher 5.4.5. All plastid assemblies were annotated in Plann [25] using D. macrophyllum (GenBank Accession number: MN729500) as the reference, and missing or incorrect genes were checked in Sequin. A circular diagram for the chloroplast genome was generated using OGDRAW [26]. All chloroplast genomes assembled in this study have been deposited in GenBank under Accession numbers MW248109 - MW248120.
Microstructural mutation events
The Perl script microsatellite identification tool (MISA, http://pgrc.ipk-gatersleben.de/misa/misa.html) was used to identify the microsatellite regions of the chloroplast genome with the parameters set to 10 (repeat units ≥ 10) for mononucleotide simple sequence repeats (SSRs), 6 (repeat units ≥ 6) for dinucleotides, 5 (repeat units ≥ 5) for trinucleotides, 4 (repeat units ≥ 4) for tetranucleotides, and 3 (repeat units ≥ 3) for pentanucleotides and hexanucleotides.
The chloroplast genomes sequences were aligned using MAFFT [27] manually examined, and adjusted. Based on the aligned sequence matrix, the indels were manually checked and divided into categories of repeat indels and normal indels, according to Dong et al. [15]. D. dunnianum was used as the reference to determine the size and position of the indel events.
Sequence divergence analysis
Single nucleotide substitutions and the genetic p-distances were calculated using MEGA 7.0 [28] based on the aligned chloroplast genome sequences. To assess sequence divergence and to explore highly variable chloroplast markers, nucleotide diversity (π) was calculated by sliding window analysis using DnaSP v6 [29] with a widow size of 600 bp and a step size of 100 bp.
Nucleotide diversity and the number of haplotypes were used to assess marker variable for all barcodes (hype-variable markers and the universal plant DNA barcodes, such as rbcL, matK, and trnH-psbA). The tree-based method was utilized to calculate discrimination power. A neighbor-joining (NJ) tree was prepared in PAUP using the K2p distance.
Phylogenetic analyses
To elucidate the phylogenetic positions of Distylium within Hamamelidaceae and the interspecific phylogenetic relationships within Distylium, multiple alignments were performed using the whole chloroplast genome of 24 Hamamelidaceae samples representing 11 genera, including Cercidiphyllum japonicum, Daphniphyllum oldhamii, and Liquidambar formosana as outgroups. The Hamamelidaceae chloroplast genomes were aligned using MAFFT, and ambiguous alignment regions were trimmed with Gblocks 0.91b [30]. The maximum-likelihood (ML) analysis was run with RAxML-NG [31] with the best-fit model from ModelFinder [32]. Branch support was assessed by fast bootstrap methodology using non-parametric bootstrapping and 500 ML pseudo-replicates.
Mrbayes v3.2 [33] was used to infer the Bayesian inference (BI) tree. The BI analysis was run for 20 million generations, in which a tree was sampled every 1,000 generations. Two independent Markov Chain Monte Carlo (MCMC) analyses were performed and each chain started with a random tree. The first 25% of the sampled trees was discarded as burn-in, while the remaining trees were constructed in a majority-rule consensus tree to estimate posterior probabilities.
Molecular clock dating
We used BEAST v2.5.1 [34] to estimate the divergence times of Hamamelidaceae using three priors based on the complete plastome sequences. Based on the average value obtained by Xiang et al. [9] in a calibrated analysis, three priors were used: (i) the average age of the most recent common ancestor (TMRCA) of Hamamelidaceae (the root of the tree) was 108 Ma; (ii) the crown age of Hamamelideae/Fothergilleae was 89 Ma; and (iii) the crown age of Mytilarioideae was 58.3 Ma. Each secondary prior was placed under a normal distribution with a standard deviation of 1.
The GTR nucleotide substitution model and the prior tree Yule model were selected with the uncorrelated lognormal distribution relaxed molecular clock model. The MCMC run had a chain length of 400,000,000 generations with sampling every 10,000 generations. The stationary phase was examined through Tracer 1.6 [35] to evaluate convergence and to ensure sufficient and effective sample size for all parameters surpassing 200. A burn-in of 10% generations was discarded, and TreeAnnotator v2.4.7 was used to produce a maximum clade credibility tree.