Data collection, processing, mapping of reads and expression quantification
We performed extensive literature mining to gather as many potato RNA-Seq datasets as possible. We downloaded 3,227 raw read sequencing files (.sra) from the National Center for Biotechnology Information (NCBI) Sequence Reads Archive (SRA) database and converted them into FASTQ format. We combined reads obtained from the same library in a single FASTQ file for single-end (SE) data or two files for paired-end (PE) data, resulting in 2,636 libraries (85.24% are PE and 14.75% are SE data) from 155 NCBI BioProjects comprising 20 broad tissue categories (Table S2).
We excluded reads containing adapter sequences or reads with average quality of less than 20. We excluded 32 samples that contained less than 100,000 reads or for which less than 50% of reads remained after trimming. The reads from each sample were mapped onto the reference genome, followed by assembling transcripts and then performing quantification of transcript abundance. We used 2.604 samples containing an average of 23,060,781 read pairs per sample with PE data and 36,832,773 reads per sample with SE data for read mapping. Mapped and uniquely mapped reads corresponded to an average of 80.73% and 67.80%, respectively. We excluded 157 samples in which > = 50% of reads failed to map or > = 40% could not map uniquely. Finally, we excluded 106 samples which were made of combinations of multiple tissues, such as callus, plantlet, seedling, whole plant and mixed tissues. In total, we kept 2341 samples from 147 NCBI BioProjects for downstream analyses (Table S3).
Leaf was the most abundant tissue representing 45.4% of the samples, while petiole tissue represented 0.21% (Table S4). We have also found that about 58% (1361 of 2341) of the libraries were unstranded. Finally, we assembled transcripts and estimated transcript abundances in raw read counts and transcripts per million (TPM) at the gene level (Figure S1).
Systematic analysis of thousands of potato RNA-Seq samples
In transcriptomics studies, the clustering of samples is instrumental in identifying broad transcriptional similarities between samples and identifying potential technical artefacts and mislabeled samples. Here we employed hierarchical clustering to identify mislabeled samples. The clustering analysis revealed two major clades comprising samples from aerial and underground tissues. However, interestingly, we found an additional cluster consisting of samples from pollen only (Fig. 1). In addition, we observed that seven samples from underground tissues were clustered with aerial tissues, while 35 aerial tissues clustered with underground tissues. In order to avoid the influence of these potentially mislabelled samples, we excluded these 42 samples from the downstream analyses.
In this study, we classified a gene as expressed if the gene had a minimum TPM threshold of 1 in at least one sample and found that across all samples about 87% of known potato reference genes (33981 of 38977) were expressed. An average of 18589 genes were expressed per sample. The tissue with the highest number of expressed genes was leaf (31427 genes), whereas pollen had the lowest number of expressed genes (12801 genes) (Table S5). We found that 12600 genes were expressed in at least 90% of samples, including 1121 genes in all 2299 samples. About 83% of all genes not expressed in any sample had coding sequences comprising < 300 codons (Figure S2).
Housekeeping and tissue-specific genes
Due to the availability of an extensive collection of RNA-Seq samples covering a wide range of tissues and environmental conditions, we also pursued identifying housekeeping (HK) genes for potatoes. In this study, we identified 281 HK genes (Table S6) using a previous described method (Hoang et al., 2017). We evaluated the expression levels of HK genes in all tissues and found that the genes had very low expression variation (Fig. 2A). Furthermore, we used the tissue-specific index Tau to estimate tissue-specificity and confirm whether the identified HK genes broadly expressed across all tissues. The Tau scores of the HK genes ranged from 0.058–0.282 (Fig. 2B).
We compared the global expression patterns between tissues to identify tissue-specific genes (Figure S3). All 15 tissues were compared pairwisely, resulting in 308 genes with a significantly higher expression in a single tissue compared with all the others (Fig. 2C and Table S7). Interestingly, more than 90% (278 of 308) of these genes had Tau indexes > 0.8 and a median Tau of 0.97005 (Fig. 2B). Given their solid preferential expression in particular tissues, we called these genes tissue-specific (Tau > 0.8). The tissue-specific genes ranged from 11 in roots to 137 in pollen. Interestingly, 18 tissue-specific genes belonged to ten transcription factor (TF) gene families (Table S8). The number of tissue-specific TF genes ranged from one in fruit, root and style to nine in pollen.
Identification of novel transcripts
We compared the genomic coordinates of the transcripts assembled in our study with the reference transcripts (dAg) using Gffcompare (Pertea & Pertea, 2020) and categorized them into 15 classes (Table S9). We found that 99.22% (58274 of 58734) of the transcripts precisely matched the exon-intron splice junctions of known transcripts (class “=”). We also investigated class-J and class-U categories, which account for 17,312 and 30,832 transcripts, respectively. Class-J comprises multi-exon transcripts with at least one known exon junction, while class-U encompasses transcripts located in intergenic regions. While class-J transcripts include new isoforms of known genes, those from class-U identify potentially new genes. We found that approximately 84% (14,476 of 17,312) of the class-J transcripts and about 11% (3,489 of 30,832) of the class-U transcripts contain a complete open reading frame (ORFs) (Table S9). We found that 14,476 class-J transcripts belong to about 30% of reference genes (11,736 out of 39,088). In addition, we found 608 transcription factors belonging to 59 TF families (Table S10) and 94 NLR genes (Table S11) in class-J transcripts. The gene ontology enrichment analysis revealed that the class-J transcripts were enriched with several biological processes (FDR < 0.05). The top five enriched biological processes were “response to abscisic acid” (GO:0009737), “salt stress (GO:0009651)”, “water deprivation” (GO:0009414), “cold response” (GO:0009409), and “positive regulation of transcription, DNA-templated” (GO:0045893) (Table S12). On the other hand, we found 1150 non-transposon genes within 3489 class-U transcripts. Interestingly, we found 108 transcription factors (TF) belonging to 26 families (Table S13) and five NLR genes (Table S14) in the class-U transcripts. However, we did not find significantly enriched gene ontology terms in these transcripts.
Co-expression network construction and detection of co-expression clusters
To determine if our co-expression network has a scale-free architecture (Barabási & Bonabeau, 2003), we calculated the Pearson correlation coefficient (PCC) for each pair of genes with a threshold of 0.5 and determined the number of times a particular gene is co-expressed with other genes at this threshold (node degree). We plotted the resulting power law distribution, which showed a negative correlation between node frequency (the number of genes with a certain number of connections) and node degree (the number of connections per gene). This distribution confirms the scale-free topology of our network (Figure S4). We constructed an HRR-based co-expression network using the above computed gene-gene PCC values using the CoNekT framework (Proost & Mutwil, 2018). The constructed network contained 28,388 nodes representing genes and 4,57,580 edges representing associations between two nodes, such as HRR and PCC (Table S15). Using the heuristic cluster chiseling algorithm (Mutwil et al., 2010), we identified 853 clusters of co-expressed genes with the size of modules ranging from 2 to 285. We found that about 51% of co-expression clusters contained just two genes, while about 32% contained more than ten genes (Table S15). We visually assessed the quality of the identified clusters by inspecting the deviation of expression patterns of individual genes against the average expression pattern of the respective cluster. In this study, we considered genes with a Z score smaller than ± 1 as a tight co-expression in respective clusters. Based on these criteria, we found that an average of 85.39% of the genes across all clusters showed a tight co-expression (Figures S5 & S6). To understand the relationships between the identified clusters and tissues, we plotted a heatmap for the Z scores of the average expression level (TPM) per module at each tissue (Fig. 3), and we found about 54% (461) clusters that showed distinct expression patterns across tissue, i.e., Z score larger than ± 1 in at least three tissues. To understand the function of these clusters, we conducted an enrichment analysis, which revealed that more than 65% of the clusters contained at least one significantly enriched (corrected p-value < 0.05) biological process (Table S15). The identified clusters, thus, effectively grouped the genes that may participate in the same biological pathways and constitute the basis for identifying gene co-expression clusters underlying various agronomic traits.
Co-expression clusters related to anthocyanin biosynthesis
We searched for co-expression modules associated with anthocyanin biosynthesis using the gene ontology (GO) term "anthocyanin-containing compound biosynthetic process (GO:0009718)" in our co-expression network. We found a single co-expression cluster, Cluster_90 (corrected p-value < 8.7e-05), containing genes including the ones that encode the structural enzymes involved in the anthocyanin biosynthesis (Table S16; Table S17), except the primary regulator gene, R2R3 MYB TF, the homolog of PhAN2 (R2R3 MYB TF), present on chromosome 10 (Jung et al., 2009). Nonetheless, in this cluster, we found three new MYB TFs (SOLTUB.AGRIA.G00000008919, SOLTUB.AGRIA.G00000017419, and SOLTUB.AGRIA.G00000019730) mapping to chromosomes 2 and 5. In addition, we also found 19 TFs belonging to bHLH, MADS, B3, C2H2 and AP2 TF families and two WD40 repeat-containing proteins in Cluster_90 (Fig. 4A; Table S18). Moreover, various GO terms such as “flavonoid biosynthetic process” (GO:0009813), “organic substance biosynthetic process” (GO:1901576), “pigment metabolic process” (GO:0042440), “DNA-binding transcription factor activity” (GO:0000981), and “anthocyanin-containing compound metabolic process” (GO:0046283) were significantly enriched (corrected p-value < 0.05) in this cluster (Table S19).
To find the primary regulatory gene of anthocyanin biosynthesis R2R3 MYB TF in our reference genome, we searched the reference protein sequences using the protein sequence of PhAN2 (UniProt ID: A4GRV2) as a query using BLASTP (Altschul et al., 1990). This search resulted in the identification of four homologs of PhAN2 in potatoes, mapping to chromosome 10. Three of them are present in three different clusters, namely Cluster_133, Cluster_78, and Cluster_85. In contrast, the fourth homolog does not have a cluster assignment. Among the three homologs, only one homolog (SOLTUB.AGRIA.G00000035098) is present in a cluster, Cluster_78, in which the GO terms “phenylpropanoid metabolic” (GO:0009698) and the “proanthocyanidin biosynthetic” (GO:0010023) processes were significantly enriched (corrected p-value < 0.05) (Table S20). Hence, we consider this gene as the homolog of PhAN2, which regulates the early biosynthetic genes in our reference genome (dAg). In addition, Cluster_78 also contains seven TFs belonging to MYB, TCP, NAC, SBP, GATA and bHLH TF families (Fig. 4B; Table S18; Table S21).
Co-expression clusters related to tuberization
We searched in our co-expression network for co-expression clusters harbouring genes, StSP6A and IT1, involved in tuberization (Tang et al., 2022). We found two co-expression clusters, Cluster_23 and Cluster_97, containing IT1 and StSP6A, respectively. Cluster_23 contained seventy genes, including IT1 (Table S22; Fig. 5A). This cluster contained eight genes belonging to seven TF gene families: SRS, bZIP, bHLH, MADS-box, TCP, GATA, and AP2/ERF-ERF. These TFs are predominantly expressed in stolons, sprouting tubers or tuber meristem (Table S23). In addition, various GO terms such as “seed trichome elongation” (GO:0090378), “lipid transport” (GO:0006869), “the developmental process involved in reproduction” (GO:0003006), and “cellular process involved in reproduction in a multicellular organism” (GO:0022412) (Table S24) were significantly enriched (corrected p-value < 0.05) in this cluster. Cluster_97 contained 128 genes, including StSP6A (Table S25; Fig. 5B). This cluster contained 12 genes belonging to seven TF gene families, including C3H, TUB, LSD, MADS, C2C2-CO-like, HB-HD-ZIP, and NAC (Table S23). In addition, GO terms for hundreds of biological processes, including “regulation of long-day photoperiodism, flowering” (GO:0048586), “cellular response to light stimulus” (GO:0071482), “regulation of photoperiodism, flowering” (GO:2000028), “cellular response to radiation” (GO:0071478), and “response to red or far-red light” (GO:0009639), were significantly enriched (corrected p-value < 0.05) in this cluster (Table S26).
Co-expression clusters related to defense responses
In this study, we identified 578 genes which belong to different classes of nucleotide-binding (NB) domain and leucine-rich repeat (LRR) (NLR) genes (Table S27) using NLR-Annotator (Steuernagel et al., 2020). A total of 432 out of 578 NLR genes were assigned to 119 co-expression clusters which contain 1–44 NLR genes per cluster. Among the 119 co-expression clusters, 43 were enriched for at least one biological process involved in defense mechanisms (Table S28), such as "response to biotic stimulus (GO:0009607)", "defense response (GO:0006952)", "response to fungus (GO:0009620)", "defense response to fungus (GO:0050832)", "response to bacterium (GO:0009617)", "defense response to bacterium (GO:0042742)", "response to virus (GO:0009615)", and "defense response to virus (GO:0051607)".
We found eight of 14 known NLRs effective against Phytophthora infestans (Rpi genes) (Armstrong et al., 2019) in three co-expression clusters, Cluster_223, Cluster_210, and Cluster_103, while the remaining Rpi genes were either not assigned to any cluster or clusters did not enrich for any of the above-mentioned biological processes (Table S29). Cluster_223 contains 58 genes, of which 32 encode NLRs (Table S30; Fig. 6A). In this cluster, we found four Rpi genes, Rpi-R3b, Rpi-R9a, Rpi-vnt1.1, and Rpi-vnt1.1 A2056, mapping to two homologs in the reference genome, SOLTUB.AGRIA.G00000038927 and SOLTUB.AGRIA.G00000032822. In this cluster, the biological process “defense responses” to fungi, bacteria and viruses were enriched (Table S31). Cluster_103 contains 92 genes, of which 44 encode NLRs (Table S32; Fig. 6B). In this cluster, we found two Rpi genes, Rpi-R8 and Rpi-ber, mapping to two homologs in the reference genome, SOLTUB.AGRIA.G00000032965 and SOLTUB.AGRIA.G00000035214. In this cluster, the biological process “defense responses” to fungi and bacteria were enriched (Table S33). Cluster_210 contains 92 genes, of which 11 encode NLRs (Table S34; Fig. 6C). Similarly, this cluster contained two Rpi genes, Rpi-blb2 and Rpi-blb3, mapping to two homologs in the reference genome, SOLTUB.AGRIA.G00000044086 and SOLTUB.AGRIA.G00000013669. In this cluster, defense responses related to a specific pathogen were not enriched, but “response to biotic stimulus” and “defense response” were enriched (Table S35).
Co-expression cluster related to self-incompatibility
We searched in our co-expression network for clusters harbouring genes involved in self-incompatibility (Dzidzienyo et al., 2016). We found one co-expression cluster, Cluster_30, containing the S-RNase gene SOLTUB.AGRIA.G00000001844 which showed an extreme expression in style samples (mean TPM of 5783.82). This cluster contained 99 more genes and the majority of these genes showed high mean expression in style samples (Table S36; Fig. 7). However, surprisingly, we found no enriched GO terms in this cluster. Furthermore, this cluster contained two genes belonging to two TF gene families: GATA and bHLH.
Data availability through a web server
The data presented above are easily accessible by researchers to explore the expression atlas of 2299 transcriptome samples and the co-expression network interactively via a web server called StCoExpNet. This web server is freely available at http://134.99.224.164/conekt.