Data sources
UCR data acquisition
The human UCR coordinates were obtained from Bejerano et al. (2004). The sequences of the UCRs were extracted from the human genome via coordinates (obtained from GENCODE release 45 to GRCh38 (hg38).p14)). We used bedtools[76] version 2.31.0 with the getfasta algorithm. General command: bedtools getfasta [OPTIONS] -fi < input FASTA> -bed < BED/GFF/VCF>
The genomic annotation data
The human genomic annotation (GFF/GTF formats) was obtained from GENCODE[77] release 45 to the GRCh38(hg38).p14 genome (available at https://www.gencodegenes.org/human/). For mouse genomic annotation, we used the GENCODE release M34 for the GRcm39/mm39 genome (available at https://www.gencodegenes.org/mouse). Finally, the rat genomic annotation was obtained from RefSeq 106 (ENSEMBLE release 111) for the mRatBN7.2 rn7.2 genome (available at https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_015227675.2).
UCR genome remapping
For human, we remapped the 481 UCRs for NCBI37(hg16) genomic coordinates (BED format) via the liftOver tool (https://genome.ucsc.edu/cgi-bin/hgLiftOver) to human GRCh37(hg19), mouse NCBI34(mm6) and rat Baylor 3.1(rn3). The coordinates were converted from GRCh37(hg19) to GRCh38(hg38).p14 for reference to the human genome, NCBI34(mm6) to GRCm39/mm39 for mouse and Baylor 3.1(rn3) to mRatBN7.2 rn7 for rat via the NCBI-remapping tool (https://www.ncbi.nlm.nih.gov/genome/tools/remap).
UCR overlapping functions
The UCR coordinates (BED format) were overlapped with respective genomic annotation releases in GFF/GTF (GENCODE and RefSeq annotations) or BED formats (mirRNA, regulatory, and SNP/SNVs annotations) and extracted via the R (v4.3.2) script built with the foverlaps function from package data.table (v1.14.2) available at https://www.rdocumentation.org/packages/data.table/versions/1.14.2.
Identification of UCRs in the miRNA annotation data
To detect which UCRs overlapped with miRNAs and mirtrons, we obtained these features from miRBase release 22.1 [78], MirGeneDB 2.1 [79], and MirtronDB 1.1 [80].
Neanderthal and ancient human genomes
Neanderthal genomes were obtained from direct genome projects in BAM format: Altai (https://www.eva.mpg.de/genetics/genome-projects/neandertal/) [81], Chagyrskaya 8 [82] (https://www.eva.mpg.de/genetics/genome-projects/chagyrskaya-neandertal/), Mezmaskaya 3 (PRJNA765125), Vindija 19.33 [83] (http://cdna.eva.mpg.de/neandertal/Vindija/bam/Pruefer_etal_2017/Vindija33.19/), and Denisovan 8 [84] (https://www.eva.mpg.de/genetics/genome-projects/denisova/). The ancient humans in the ENA database: Ust'-Ishim (PRJEB6622), Yana (PRJEB29700), AHUR_2064 (PRJEB29074), Sunghir Burial 3 (SIII) (PRJEB22592), Zlatý kůň (ZKU) (PRJEB39040), Sumidouro 5 (PRJEB29074) Scandinavian hunter-gatherers (PRJEB32786), Ötzi (Tyrolean Iceman) (PRJEB56570), and Ayayema A460 (PRJEB29074). The Kalahari genome was obtained from RAW data (FASTq, paired-end sequences) from the work of Schuster et al. 2010 work [85]. More information about the samples is provided in Supplementary Table S6.
Kalahari genome assembly
All genomic FASTAs were obtained in the BAM format via the consensus algorithm (http://www.htslib.org/doc/samtools-consensus.html) from SAMtools [86] (http://www.htslib.org/). The Kalahari genome was assembled chromosome by chromosome and aligned with GRCh38(hg38) as a reference assembly. The BAM format was also converted to FASTA via the aforementioned method.
Human pangenome project data
From the human pangenome project [87] (https://humanpangenome.org/) we selected the 47 genome haplotypes (primary assemblies) data from the NCBI bioproject (PRJNA730822) in genomic FASTA format, which included the following: 04 Southern Han Chinese (SHC), 07 Puerto Rican in Puerto Rico (PUR), 03 Colombian in Medellin, Colombia (CLM), 02 Esan in Nigeria (ESN), 07 African ancestries from Barbados (Caribbean) (ACB), 04 Peruvian in Lima, Peru (PEL), 01 Vietnamese Kinh In Ho Chi Minh City, Vietnam (KHV), 08 Mandinka in Gambia - Western Division (MWG), 04 Mende in Sierra Leone (MSL), 01 Punjabi in Lahore, Pakistan (PLP), 02 Yoruban in Ibadan, Nigeria (YIN), 02 Kinyawa, Kenya (KWK), 02 Ashkenazim Son HG002 (AIS), 01 Han Chinese Son/HG005 (HCS). More information about the samples is available in Supplementary Table S6.
Identification of ‘extra’ UCRs (exUCRs)
For this analysis, we used the Neanderthal and human ancient genomes previously described and the pangenome genome and human genome references GRCh38(hg38).p14 and T2T-CHM13(hs1) in genomic FASTA format (T2T-CHM13(hs1) reference has complete the sequence of the Y chromosome[88] and was included to mapping the UCRs). First, each genome was converted into the BLAST database format via makeblastdb -in genome.fasta -dbtype nucl -out genomedb.DB. We subsequently performed similarity searches via the BLASTn algorithm of the 481 UCRs against the genomes via the following command line: blastn -query UCRs.fasta -db genome.DB -out ucr-genomeDB.txt -evalue 0.000001 -max_target_seqs 200 -outfmt 6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen sstrand qcovs qcovhsp.
We then filtered the results on the basis of size of the aligned region (at least 200 bp), percent identity (at least 75%), and query coverage (at least 80%). Finally, we consider those results located in different coordinates in relation to the reference sequence. That is, a region was considered an “extra UCR” if its coordinates were different from the original UCR in the reference genome.
Literature review of UCRs and UCR overlapped genes
The electronic search was performed up to June 2024 in the PubMed and Google Scholar databases. We have included original studies, reviews, and chapters written in English. In the PubMed database, the terms “uc.x” or “gene name” were searched, and the title and abstracts were screened. In Google Scholar, the terms “ultraconserved region” + “uc.x” were used. Duplicates, theses, preprints, and unrelated texts were excluded from the study.
Cancer Expression Data Sources from TCGA
For gene expression analysis, we obtained RNA-Seq datasets for different cancer types from the TCGA database. We selected cancer datasets containing at least 30 samples per condition (tumor and normal tissue). To obtain the gene read counts and TPM values for each cancer type, we used the TCGAbiolinks package from R [89]. For the differential expression analyses, we selected genes annotated as lncRNAs that contained at least one UCR in some of its isoforms.
Differentialexpression (DE) analysis and visualization of results
For differential expression analysis, we used the Wilcoxon rank-sum test, as described in [90]. For this analysis, we used the gene read counts of the 83 lncRNA-T-UCRs previously selected. As a first step, we filtered genes that had very low counts via the filterByExpr function from the egdeR package. After that, we normalized the gene counts via the trimmed mean of M values (TMM) method. We subsequently calculated each gene’s counts-per-million (CPM) values and used it as input in the wilcox.test function in R to calculate the p-value. Finally, we set a p-value cut-off on the basis of an FDR threshold using the Benjamini & Hochberg method. To visualize the differentially expressed lncRNA genes across the 10 cancer types, we constructed Vulcan plots for each of them via the ggplot2 package in R. In addition, we selected genes that were differentially expressed in at least one type of cancer. Using this list, we constructed heatmaps for each cancer type using the TPM values obtained from TCGA database. The heatmaps were constructed via the function Heatmap of the ComplexHeatmap package from R. To analyze the differentially expressed lncRNA-T-UCRs among the 10 cancer types, we constructed a bar plot via package of ggplot2 from R [91].
Cancer survival analysis from TCGA impact measurement
To analyze the impact of lncRNA-T-UCRs on patients prognosis via TCGA survival data, we employed both the Cox proportional-hazards model and the Kaplan-Meier (KM) method via the survival R package [92]. TCGA survival data were downloaded via XenaBrowser (https://xenabrowser.net/datapages/) and only patients used for differential expression analysis were included. For the Cox model, Bonferroni correction was applied post-hoc for multiple testing adjustments. When a significant result was found with the Cox model, the same lncRNA-T-UCR was further analyzed via the KM model, which groups patients on the basis of median expression value (high and low). Graphical visualizations were generated via the survminer R package (https://cran.r-hub.io/web/packages/survminer/index.html).
Single-cell RNA-sequencing (scRNA-seq) analysis
We analyzed publicly available scRNA-seq data to determine the transcriptional profiles of 83 lncRNA-T-UCRs mapped to UTR and annotated in TCGA. BRCA (GSE148673, GSE161529, and E-MTAB-8107), KIRC (EGAS00001002325, GSE159115, phs002065.v1.p1, nc9bc8dn4m.1), LUAD (GSE131907, GSE123904, HRA000154, and E-MTAB-6149), PRAD (GSE141445), HNSC (nasopharyngeal cancer; GSE150430), and COAD (EGAS00001003779) scRNA-seq data were downloaded from the Curated Cancer Cell Atlas [93] (https://www.weizmann.ac.il/sites/3CA/) database. We selected 162 untreated primary samples belonging to the 10X platform whose count data were available. The selected datasets and samples are described in Supplementary Table S23. THCA and UCEC were excluded from this analysis because of the lack of data availability and our inclusion criteria. The count data were processed via the Seurat (v4.0.6) package in R (v4.2.2) [94]. KIRC, BRCA, and LUAD had more than one available dataset and were integrated via the IntegrateData function [95]. The identities of the cell types were based on the expression of canonical markers. In total, the RNA sequencing data from 596,400 single cells were analyzed. The heatmaps were generated and analyzed via the Morpheus online tool (https://software.broadinstitute.org/morpheus/)
Regulatory element annotation and enrichment sources
To identify possible regulatory evidence in UCRs, we obtained transcription factors (TFs) and enhancers from the human genome (hg38) from the Open Regulatory Annotation database (ORegAnnoDB) version 3.0 [14]. These datasets are also accessible through the UCSC Genome Browser https://genome.ucsc.edu/cgi-bin/hgTrackUi?hgsid=686342163_2it3aVMQVoXWn0wuCjkNOVX39wxy&c=chr1&g=oreganno). The Enhancers dataset is enriched with the VISTA Enhancer [15], Visel et al. (2008) and Snetkova et al. (2021) publications [9, 10].
SNP/SNV data acquisition
SNP identification and annotation were performed via dbSNP version 156 for GRCh38 VCF as a reference [96]. Tabix (HTSlib v1.13) [97] extracted variants in UCRs, resulting in a filtered VCF file. To add variant frequency information, the VCF file was annotated via Annovar (v2020-06-08) [98] with the following formats and versions provided by the tool: ABraOM (v20181204) [99], 1,000 Genomes (v20150824) [100], and gnomAD 4.0 (v20231127) [16]. Variants were filtered to retain those with a frequency equal to or greater than 1% in at least one subpopulation from the analyzed databases (ABraOM; 1,000 Genomes: African, Admixed American, East Asian, European, South Asian; gnomAD: African, Amish, Latino/Admixed American, Ashkenazi Jewish, East Asian, Finnish, Non-Finnish European, Middle Eastern, South Asian, Other, XX, XY). Additionally, to find SNPs in UCRs in ancient humans and pangenome data we aligned all the UCRs with the MUSCLE algorithm [101]. Here, we selected the common SNPs from the Super Enhancer database (SEdb) release v2.0 [18] (available at https://bio.liclab.net/sedb/download.php) to search for SNP motifs, SNP GWASs, and SNP eQTLs.