Plant collection and propagation
Plants used in this study are part of a working collection of wild and medicinal Solanaceous species from the GEME lab at the Universidad Nacional de Colombia, in Bogotá Colombia. Briefly, seeds and cuttings were collected during multiple field trips across Colombia (Table S1). All samples were collected legally under the permit “Permiso Marco de Recolección de Especímenes de Especies Silvestres de la Diversidad Biológica con Fines de Investigación Científica No Comercial” assigned to the Universidad Nacional de Colombia by the Ministerio de Medio Ambiente (Resolución 0255 del 12 de marzo de 2014). Propagules were brought to the greenhouse, disinfected, and grown under common conditions until the development of multiple sets of fully developed leaves.
The objective of this study was to sample species representing the major clades in the family. We thus selected 129 species from 12 different tribes and Ipomoea purpurea from the Convolvulaceae family as an outgroup. (Fig. 1, Table S1). To be able to correlate the metabolome and transcriptome of different samples we used the same samples to extract RNA and metabolites. We collected three mature leaves from all selected plants. The material was snap-freezed in liquid nitrogen and stored at -80°C until extraction.
Metabolite extraction and characterization
We conducted extractions from leaves of 1–3 individuals per species for a subset of 50 different taxa (Table S1). Briefly, 10 mg of lyophilized leaf material was macerated in liquid nitrogen, suspended in 100% methanol, incubated at 70°C for 15 minutes, and centrifuged for 10 minutes at 12,000 rpm. We then added 1:3 chloroform-water, vortexed, and centrifuged again for 10 minutes. The supernatant was transferred to a new tube and dried in a speed-vac.
LC-MS analyses were conducted in Dr. Alisdair Fernie’s lab at the Max Planck Institute for Molecular Plant Physiology. Vacuum-dried polar phases were resuspended in 300 µL of 50% water methanol. Samples were sonicated for 10 min in an ice-cooled sonicator bath and then centrifuged for 15 min at full speed (> 12,000 rpm). The supernatant was transferred to LC tubes for analysis. The samples were run on a UPLC-LC-MS machine. In brief, the UPLC system was equipped with an HSS T3 C18 reverse-phase column (100 × 2.1 mm internal diameter, 1.8 µm particle size; Waters) that was operated at a temperature of 40°C. The mobile phases consisted of 0.1% formic acid in water (solvent A) and 0.1% formic acid in acetonitrile (solvent B). The flow rate of the mobile phase was 400 µL/min, and 2 µL of the sample was loaded per injection. The UPLC instrument was connected to an Exactive Orbitrap focus (Thermo Fisher Scientific) via a heated electrospray source (Thermo Fisher Scientific). The spectra were recorded using full-scan positive and negative ion-detection mode with all ion fragmentations (MS2), covering a mass range from m/z 100 to 1,500. The resolution was set to 70,000, and the maximum scan time was set to 250 ms. The sheath gas was set to a value of 60 while the auxiliary gas was set to 35. The transfer capillary temperature was set to 150°C while the heater temperature was adjusted to 300°C. The spray voltage was fixed at 3 kV, with a capillary voltage and a skimmer voltage of 25 V and 15 V, respectively. MS spectra were recorded from minutes 0 to 19 of the UPLC gradient. Processing of chromatograms, peak detection, and integration was performed using Refiner MS (version 5.3; GeneData) to generate a table of intensity variation across metabolomic features and samples (Table S3).
Identification and analysis of TAs and SGAs
To identify TAs and SGAs using untargeted LC-MS we assembled a database of ratios of mass-charge (m/z), fragmentation patterns, and retention times (RTs) of known TAs and SGAs. We created a tool in R programming language using the Shiny library that allows us to search for metabolites using this database and metabolomic data in Refiner MS format (available at https://github.com/EstevanGN/Metabolomic-search). Interesting clusters were then confirmed by manual inspection of the chromatograms using Xcalibur Data Acquisition and Interpretation Software (Thermo Fisher Scientific). We thus identified 24 metabolic features corresponding to TAs and 36 corresponding to SGAs (Table S2).
The MetaboAnalyst platform (http://www.metaboanalyst.ca) was used to normalyze metabolomic data using raw peak intensities and the log-transformation. For downstream analyses (WGCNA and PCCA) we averaged the transformed peak intensities across replicates from the same species (Table S2). To reduce dimensionality in metabolic data we conducted a PCA using JMP with the variation in TAs and SGAs separately and extracted the species loadings in the first component.
RNA extraction and analysis
For RNA extractions we pooled replicates to obtain a single pool per species. RNA was extracted with 5:1 trizol-chloroform and precipitated in isopropanol and 75% ethanol. Samples were sent to NovoGene for RNAseq genotyping using paired-end 150 bp libraries and an Illumina NovaSeq™ 6000 S4 Sequencing System. We obtained 6 GB of clean sequence data per sample, which provides sufficient sequencing depth for the assembly of de-novo transcriptomes, and the analysis of gene expression. Raw sequences were uploaded in the SRA archive (bioproject PRJNA1070281).
Transcriptome assembly
We assembled transcriptomes for all 129 species using a de-novo pipeline. Firstly, we removed adapters and low-quality sequences using Trimmomatic v0.39 (Bolger et al. 2014) with the following parameters: ILLUMINACLIP:adapters.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:140. We used paired clean reads to create de novo transcriptome assemblies for each species using Trinity v2.12.2 (Haas et al. 2013) with default program parameters. We assembled leaf transcriptomes for all species, and, for a subset of species, we combined root and leaf reads to create combined leaf-root transcriptomes (Table S1).
To reduce sequence redundancy and complexity and to remove non-coding transcripts we first obtained the longest representative sequence for each trinity “gene” using Trinity's get_longest_isoform script (Haas et al. 2013). Then TransDecoder v5.5.0 (https://github.com/TransDecoder/TransDecoder) was used to identify open reading frames and predict regions' CDS sequences. We finally used CDHIT (Li & Godzik, 2006; Fu et al. 2012) to remove short redundant sequences with the parameters -c 0.99 -n 10 -r 0 -T 6. The resulting fasta files of non-redundant CDS sequences were used to detect orthologs and quantify gene expression.
Orthology analysis
The non-redundant CDS sequences obtained from the previous step were used as input for the OrthoFinder tool (Emms & Kelly, 2019). We employed the multiple sequence alignment option, using DIAMOND (Buchfink et al. 2015) for initial sequence comparison, MUSCLE (Edgar, 2004) for the alignments and Fastree (Price et al. 2010) for tree rendering. We used the Orthofinder-generated species tree for the last phase of OG identification.
We used two datasets as input for Orthofinder. (1) Leaf transcriptomes from Solanaceae species, including 129 de novo transcriptomes assembled by us as well as the CDSs extracted from the genomes of 4 species with sequenced and annotated genomes (Atropa belladonna, Datura stramonium, Solanum lycopersicum, and Capsicum annuum. See table S1). This dataset was used for analyses that focus on gene expression variation across species. (2) CDSs extracted from the assembled genomes of 31 Solanaceae species (Table S2) downloaded from the National Genomics Data Center (https://ngdc.cncb.ac.cn/) and Solgenomics (https://solgenomics.net/). This dataset was used to analyze the genomic presence and distribution of SGA and TA OGs across species.
Phylogenetic analyses
We created a phylogeny of Solanaceae species using transcriptomic data. We selected 243 OGs present in all species (Table S10). MAFFT alignments from each OG were used to create individual phylogenetic trees using Raxml (Revell & Harrison, 2008) with GTRGAMMA substitution model, rapid Bootstrap analysis (200 bootstraps), and searching for the best-scoring ML tree in one program run. Phylogenetic trees were then analyzed with ASTRAL pro2 for multi-copy gene tree inference (Zhang & Mirarab, 2022). We used default ASTRAL parameters and selected the best tree.
Gene Expression quantification
OGs were used to compare expression across species. Because we are working with de-novo transcriptomes, the differences in gene expression across OGs and species may reflect either a gene expression variation (expressed/not expressed) or the number of transcripts included in an OG. We used a cumulative expression methodology that takes into consideration differences in the number and length of sequences included in each transcriptome and OG. We first used the Salmon program (Patro et al. 2017) to map the clean reads of each sample to their respective CDSs using the parameters “-l A -p 8 --validateMappings”. We then calculated normalized estimates of cumulative expression for each OG, AOG and TPM10KOG, which were adapted from Munro and collaborators (Munro et al. 2022) using the formulas:
AOG = (Sum of reads per OG * 103)/(total effective length of CDSs in OG)
TPMOG = (A * 106) / (Sum of A across all OGs)
TPM10KOG = (TPM * number of expressed OGs)/104
Ortholog Annotation
We annotated OGs using the CDSs of species with sequenced and annotated genomes included in Orthofinder analyses. We assigned to the OGs the corresponding functional annotations of genes from reference genomes. Specifically, we used the Refseq-RNA codes of genes from reference genomes to retrieve functional annotations in the DAVID server (https://david.ncifcrf.gov/summary.jsp) and assign these annotations to OGs.
To identify SGA and TA genes we created a database of characterized SM genes from reference species. This database includes all biochemically validated enzymes, transporters, and regulatory genes involved in the biosynthesis of TAs and SGAs. We identified OGs containing these genes and assigned their respective functions. OG annotations are provided in Table S4.
We searched for functional enrichment in DEGs and coexpression modules (next sections) by using the “Functional Annotation Chart” tool from the DAVID server (https://david.ncifcrf.gov/home.jsp) using RefSeq codes from genes included in the different sets (Table S7).
Differential expression analyses
We used the EdgeR package (Robinson et al. 2010) to conduct differential expression analyses to identify OGs whose expression differs between the TPCs and SPCs identified through our metabolomic analyses and a literature review. We used as input the matrix of expression counts (AOG) across 129 species. OGs with low expression were filtered and remaining data was normalized using the trimmed means of m values method (TMM). 11,740 OGs remained after filtering. We conducted two comparisons: (1) TA-producing species vs other species and (2) SGA-producing species vs other species. Those OGs showing an FDR-corrected p-value lower than 0.05 were considered significantly differentially expressed in each comparison. Full results are provided in Table S5.
Network analyses
We used the WGCNA package from the R software (Langfelder & Horvath, 2008) to identify co-expression modules related to alkaloid metabolism using expression data (TPM10K) from leaf transcriptomes of 129 species (Tables S5 and S6). WGCNA was run with the following parameters: power = 6, TOMType = "signed", minModuleSize = 20, maxBlockSize = 30000, deepSplit = 2.5, reassignThreshold = 0, mergeCutHeight = 0.25, corType = "pearson". To identify alkaloid-associated co-expression modules we searched for modules containing key TA genes (e.g., TRs, LS, H6H, HDH) and SGA genes (e.g., GAME genes). We used metabolomic data (first two loadings of metabolic PCAs for SGA and TA variation) to calculate Pearson correlations between gene expression and metabolic variation.
Analysis of genomic distribution
To analyze the distribution of SGA and TA across Solanaceae lineages we first identified SGA and TA OGs in the orthology analysis conducted with Orthofinder using CDSs of 32 Solanaceae assembled genomes (Table S1). We then used genomic annotations (gff files) to localize the genes belonging to these OGs and calculated the number of SGA and TA genes across all non-overlapping 1MB genomic windows. Regions with the highest numbers of OGs were investigated as potential BCs (Table S9). Finally, we analyzed the number of copies of known TA and SGA genes and genes preferentially expressed in TPCs or SPCs within genomic intervals containing putative BCs.