Phylogenomic analysis and global evolutionary associations of conserved eukaryotic proteins
To identify associations between prokaryotic and eukaryotic protein families, separate hidden Markov model (HMM) databases for prokaryotes and eukaryotes were constructed using a custom, cascaded, sequence-to-profile clustering pipeline, implemented using mmseqs2 28, followed by a multistep data-reduction and multiple sequence alignment (MSA) procedure to generate HMM profiles using hhsuite 2932 (see Methods for details). Initially, a prokaryotic database of 75 million protein sequences was curated from 47,545 complete prokaryotic genomes obtained from the NCBI GenBank in November 2023 and supplemented with proteins extracted from 146 Asgard genome assemblies (Extended Data Figure 1) 6,8. To avoid including genes present only within a narrow subset of species, possibly resulting from horizontal transfer from eukaryotes post-LECA, we reconstructed the “soft-core” pangenome for each of 26 curated prokaryotic taxonomic classes. These pangenomes include only those genes that are present in at least 67% of the families within each class of bacteria and archaea (see Methods) resulting in an initial database of 6.3 million nonredundant sequences. The initial eukaryotic database consisted of protein sequences from 993 species taken from EukProt v3 30, cleaned using mmseqs2 to remove likely prokaryotic contaminants and clustered to 25 million nonredundant sequences.
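The soft-core criterion described above can be illustrated with a minimal sketch. It assumes gene presence has already been tabulated per genome and that each genome carries a family label; the function name, input layout and toy data are hypothetical and are not taken from the pipeline described in Methods.

```python
from collections import defaultdict

def soft_core_genes(presence, family_of, threshold=0.67):
    """Return gene clusters found in at least `threshold` of the families of one class.

    presence:  dict mapping genome id -> set of gene-cluster ids in that genome
    family_of: dict mapping genome id -> taxonomic family label
    (both input layouts are assumptions made for this illustration)
    """
    families = set(family_of.values())
    # For every gene cluster, collect the families in which it occurs at least once.
    families_with_gene = defaultdict(set)
    for genome, genes in presence.items():
        for gene in genes:
            families_with_gene[gene].add(family_of[genome])
    return {
        gene
        for gene, fams in families_with_gene.items()
        if len(fams) / len(families) >= threshold
    }

# Toy class with three families; only cluster "A" occurs in at least 67% of them.
presence = {"g1": {"A", "B"}, "g2": {"A"}, "g3": {"A", "C"}, "g4": {"B"}}
family_of = {"g1": "F1", "g2": "F2", "g3": "F3", "g4": "F1"}
core = soft_core_genes(presence, family_of)
```

Counting families rather than genomes keeps densely sampled families from dominating the fraction, which appears to be the intent of the family-level threshold.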
Both the eukaryotic and prokaryotic databases were clustered, and the clusters were realigned using muscle5 and converted into HMM profiles using HHsuite (see Methods). The resulting eukaryotic HMM dataset was queried against the prokaryotic dataset using hhblits 29 to identify sets of homologous protein sequences. Each eukaryotic cluster and all its significant prokaryotic hits constituted an individual sequence set, hereinafter referred to as a Eukaryotic/Prokaryotic Orthologous Cluster (EPOC). The EPOCs constitute groups of homologous proteins from eukaryotes and prokaryotes (each EPOC contains a unique set of eukaryotic proteins, but some clusters of prokaryotic proteins can be present in multiple EPOCs) that were used for phylogenetic tree construction, annotation, and evolutionary hypothesis testing. The final EPOCs include 5.7 million prokaryotic and 1.9 million eukaryotic sequences, mapping to 90% and 8% of the respective non-redundant datasets.
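The EPOC definition can be sketched as a simple grouping of profile-profile hits. The tuple layout and the E-value cutoff below are assumptions made for illustration only; the actual significance criteria are given in Methods.

```python
from collections import defaultdict

def build_epocs(hits, max_evalue=1e-3):
    """Group significant prokaryotic hits under each eukaryotic cluster.

    hits: iterable of (eukaryotic_cluster, prokaryotic_cluster, evalue) tuples.
    max_evalue is a hypothetical cutoff, not the value used in the study.
    """
    epocs = defaultdict(set)
    for euk, prok, evalue in hits:
        if evalue <= max_evalue:
            epocs[euk].add(prok)
    return dict(epocs)

hits = [
    ("euk1", "prokA", 1e-20),
    ("euk1", "prokB", 1e-8),
    ("euk2", "prokA", 1e-12),  # prokA is shared between two EPOCs
    ("euk2", "prokC", 0.5),    # not significant, dropped
]
epocs = build_epocs(hits)
```

Each eukaryotic cluster keys exactly one EPOC, whereas a prokaryotic cluster such as `prokA` may appear in several, mirroring the asymmetry noted in the text.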
To infer the most likely prokaryotic ancestry of the eukaryotic proteins in each EPOC, rather than relying on the tree topology directly, we employed a probabilistic approach for evolutionary hypothesis testing using constraint trees. Following the construction of an initial master tree, we exhaustively sampled all arrangements of likely sister clades relative to the eukaryotic outgroup and obtained Expected Likelihood Weights (ELW) for the set of possible sister clade models (Extended Data Figure 2) 31. Given that the ELW metric is analogous to model selection confidence, here we take it to be proportional to the probability of a sampled prokaryotic clade being the true sister group of the given eukaryotic clade among a set of competing models. For each EPOC, our analysis dynamically accounts for long-branch outliers and is robust to phylogenetically non-homogeneous clades (see Methods). This analysis is further capable of resolving eukaryotic paraphyly, treating each eukaryotic clade within an EPOC as a single data point for downstream analysis. The resulting data included 14,300 EPOCs annotated using profiles generated from KEGG Orthology Groups (KOGs) 32, each with an MSA generated using muscle5 33, a maximum likelihood tree inferred using IQtree2 34 and associated ELW values for all candidate prokaryotic sister phyla. The analysis of prokaryotic ancestry was performed only for those eukaryotic clades that included more than 5 distinct taxonomic labels, with at least one coming from Amorphea and one from Diaphoretickes, the two expansive eukaryotic clades considered to emerge from either the first or the second bifurcation in the evolution of eukaryotes 35,36. Thus, these clades represent genes likely mapping back to the LECA.
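For orientation, the ELW of a model is conventionally obtained by converting the log-likelihoods of all competing models into normalized weights within each bootstrap replicate and then averaging those weights across replicates. The sketch below assumes that per-replicate log-likelihoods for every constraint tree are already available; the function name and input layout are hypothetical, and the values reported here came from the pipeline described in Methods, not from this code.

```python
import math

def expected_likelihood_weights(loglik_replicates):
    """Average per-replicate likelihood weights over bootstrap replicates.

    loglik_replicates: list of replicates, each a list holding one
    log-likelihood per competing sister-clade model (assumed layout).
    """
    n_models = len(loglik_replicates[0])
    totals = [0.0] * n_models
    for logliks in loglik_replicates:
        # Subtract the maximum before exponentiating to avoid underflow.
        best = max(logliks)
        weights = [math.exp(value - best) for value in logliks]
        norm = sum(weights)
        for i, weight in enumerate(weights):
            totals[i] += weight / norm
    return [total / len(loglik_replicates) for total in totals]

# Two models, one replicate; the second model is three times less likely.
elw = expected_likelihood_weights([[-10.0, -10.0 - math.log(3)]])
```

Because the weights are normalized within each replicate, the ELW values of the competing models sum to one, which is what permits treating them as proportional to probabilities.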
Considering the global average distribution of ELW values across all EPOCs covering 4330 unique KOGs, the single greatest average ELW (aELW), here referred to as an association, was with Asgard archaea. Further associations with Cyanobacteria, Actinomycetota, Betaproteobacteria and Alphaproteobacteria, as well as trace associations with additional bacterial classes, were detected at lower levels. However, the number of included sequences and the topological diversity of the trees varied substantially across the EPOCs, with many trees showing low maximum ELW values. Excluding EPOCs with a maximum ELW < 0.4 yielded a robust core set of 5590 EPOCs stemming from the LECA, covering 2540 KOGs and improving the interpretability of the results. This core set covers a wide range of ubiquitous metabolic functions, information processing pathways, transporters, and permeases, as well as regulatory and housekeeping proteins. By contrast, it does not include metabolic pathways limited to individual eukaryotic supergroups and thus can be considered a rough approximation of the core gene set of the LECA. Limiting the analysis to this core subset of well-assigned eukaryotic families with wide taxonomic coverage notably increased the global Asgard association, which now accounted for 62% of the ELW across more than 6000 unique data points in at least 2500 protein families.
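The averaging and thresholding steps can be sketched as follows. The sketch assumes that each data point is represented as a mapping from candidate sister taxon to ELW and that a taxon absent from a data point contributes zero to the average; both are assumptions for illustration, as are the function names and toy values.

```python
def core_set(epocs, min_max_elw=0.4):
    """Keep data points whose best-supported sister taxon reaches the ELW threshold."""
    return [elw for elw in epocs if max(elw.values()) >= min_max_elw]

def average_elw(epocs):
    """Mean ELW per taxon across data points; absent taxa count as zero (assumption)."""
    totals = {}
    for elw in epocs:
        for taxon, value in elw.items():
            totals[taxon] = totals.get(taxon, 0.0) + value
    return {taxon: total / len(epocs) for taxon, total in totals.items()}

toy = [
    {"Asgard": 0.8, "Alphaproteobacteria": 0.2},
    {"Asgard": 0.3, "Alphaproteobacteria": 0.3, "Cyanobacteria": 0.4},
    {"Asgard": 0.35, "Alphaproteobacteria": 0.35, "Cyanobacteria": 0.3},  # max < 0.4, excluded
]
core = core_set(toy)
aelw = average_elw(core)
```

In this toy example the third data point is discarded because no candidate reaches 0.4, and the Asgard aELW over the remaining two is the mean of 0.8 and 0.3.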
Our approach readily reproduced known evolutionary associations at the global functional level. Averaging ELW scores across EPOCs based on the KEGG ontology showed strong Asgard associations for, among other functional systems and pathways, the ribosome, RNA and DNA polymerases, Ras-like GTPases, ubiquitin-mediated protein degradation, the proteasome, and large parts of the core metabolic network. In contrast, prominent alphaproteobacterial associations included proteins involved in oxidative phosphorylation, glutathione metabolism and Fe-S cluster biogenesis. Together, the Asgard and alphaproteobacterial associations amounted to an aELW of 0.55 + 0.06 = 0.61 across all of Metabolism and to 0.77 across Genetic Information Processing. Thus, we observed a far stronger overall association between Asgards and eukaryotes across diverse biological functions and pathways than previously described 5,6,8, although a consistent association between eukaryotic core metabolism and diverse bacterial phyla is still present.
Broad, dominant Asgard contributions to eukaryogenesis
In accord with the key contribution of Asgard archaea to eukaryogenesis, we observed associations of Asgard proteins with a wide array of cellular functions. In previous work, the strongest Asgard traces have been noted across the information processing systems, with unambiguous associations with DNA replication, core transcription, RNA processing as well as translation and protein trafficking 2,6,8. Here, we consistently observed strong Asgard associations for genome replication and transcription and further detected pronounced Asgard traces for nucleotide excision repair, mismatch repair and homologous recombination. Additional well-known associations, such as ribosomal proteins, were extended to include translation factors, components of the co-translational membrane insertion machinery, protein targeting and aminoacyl-tRNA biosynthesis (Extended Data Figure 3, Extended Data File 1). Thus, all groups of core eukaryotic proteins involved in information processing appear to be almost exclusively of Asgard descent.
We detected additional Asgard associations extending far beyond the information processing systems, including prominent contributions to the machinery involved in nucleocytoplasmic transport as well as downstream protein sorting, glycosylation and targeting. In particular, central components of ER-associated N-linked glycan biosynthesis and transfer, including both cytoplasmic and lumenal monoglycosyltransferases, as well as the core of the oligosaccharyltransferase complex (OSTC), are strongly associated with Asgard (Extended Data Figure 3). Notably, enzymes associated with glycosylation maturation in the Golgi complex did not show strong Asgard or other prokaryotic associations in our analysis, possibly due to extensive diversification of domain architecture in eukaryotes. The Asgard connections of the eukaryotic glycosylation machinery further included the synthesis of GPI-anchors, which post-translationally tether targeted proteins to the membrane 37, here detected as unambiguously Asgard-derived. We also detected an Asgard origin of the 7-subunit (UDP-GlcNAc)-transferring (GPI-GnT)-monoglycosyltransferase complex responsible for initiating GPI-anchor synthesis, of components required for the maturation of the GPI-anchor, and of the transamidase complex and factors responsible for protein transfer onto the mature GPI-anchor (Extended Data Figure 3).
Of major importance to eukaryogenesis is the provenance of the pathways for the biosynthesis of bacterial-type lipids, given that (at least) all binary archaea-bacteria symbiogenesis scenarios require a transition from archaeal to bacterial lipids in the membranes 9. Although strong Asgard associations were observed for large parts of the overall metabolic network, we observed a high degree of mosaicism in the pathways for fatty acid synthesis and decay. The global aELW values favored Asgard origin for these pathways, but there were also notable associations with Actinomycetota. However, most KOGs within these pathways are represented by multiple EPOCs with conflicting assignments, potentially obscuring any consistent signal (Extended Data Figure 3). By contrast, less ambiguity was observed within the adjoining ER-localized pathways for the metabolism of sphingolipids, a broad class of derived plasma membrane lipids in eukaryotes. Previous studies have highlighted possible convergent origins of this pathway in bacteria and eukaryotes 38,39, but here we detected broad associations with Asgard. Of further note is the ER-associated isoprenoid biosynthesis pathway, here also found to be strongly associated with Asgard. In eukaryotes, isoprenoids form the precursor units for sterols, carotenoids and terpenoids, synthesized in the ER lumen via either the mevalonate pathway or the MEP/DOXP pathway. In Archaea, isoprenoids are the precursors for the ether-linked membrane lipids 40. Here we found the mevalonate pathway, from acetyl-CoA to mevalonate and further to farnesyl and geranyl diphosphate, to be strongly Asgard-associated (Extended Data Figure 3), with the key enzymes hydroxymethylglutaryl-CoA synthase (HMGCS), mevalonate kinase (MVK), phosphomevalonate kinase (PMVK) and mevalonate diphosphate decarboxylase (MVD) being clearly Asgard-derived.
In conclusion, we detected Asgard associations across a wide range of cellular functions and metabolic pathways while noting a distinctly weaker Asgard signal for pathways involved in bacterial lipid biosynthesis, suggesting a complex evolutionary history.
Limited and highly specific alphaproteobacterial contributions
In line with the central role of mitochondria in eukaryotic energy metabolism, we primarily observed associations between Alphaproteobacteria and mitochondrially localized metabolic pathways. As expected, apart from the components of the mitochondrial translation system, the most prominent alphaproteobacterial associations were evident for complexes involved in oxidative phosphorylation and the associated ubiquinone synthesis (Extended Data Figure 3). Outside these central energy-transforming functions, we only detected sparse contributions from core alphaproteobacterial genes. One such prominent association was the pathway for iron-sulfur cluster (ISC) biogenesis. As previously reported 41, the ISC assembly machinery is of alphaproteobacterial origin, and in accord with these observations, the 4Fe-4S ISCA platforms as well as IBA57 and the Fe-S cluster binding ferredoxin-1 and 2 were found to be strongly associated with Alphaproteobacteria (Figure 3, Extended Data Figure 3). However, the 2Fe-2S precursor scaffold ISCU showed mosaic associations, with a minor but clearly detectable Asgard contribution. Notably, the cysteine desulfurase NFS1 and the upstream pathways for the biosynthesis of the sulfur-containing amino acids cysteine and methionine were strongly Asgard-associated. ISC biosynthesis is intimately linked to general redox homeostasis and core sulfur metabolism via glutathione, which directly coordinates Fe-S clusters during synthesis and transport 42. In line with this role in Fe-S coordination, although some Asgard association persisted, we observed specific associations of glutathione metabolism with Alphaproteobacteria, including glutathione hydrolase, dehydrogenase and reductase, as well as the family of glutathione transferases, GST, GSTP and GSTK1 (Extended Data Figure 3). Outside the mitochondria, ISC insertion depends on the cytosolic targeting complex CIA, which consists of CIAO1, CIA2B and MMS19 43. For the CIA components CIAO1 and CIA2B, we observed clear association with Asgard, whereas MMS19 was not detected in our data. Taken together, these observations indicate that the contributions of Alphaproteobacteria to the gene set of the LECA are functionally specific and apparently limited in scope, clearly centered around mitochondria-related functions.
Paucity of functionally consistent contributions from other bacteria
Although our analysis greatly expanded the Asgard contributions to eukaryogenesis, while also revealing a limited but prominent and functionally consistent alphaproteobacterial association, contributions from diverse other bacteria were consistently detected. For some biological functions, this diverse bacterial component accounted for the majority of the aELW, and roughly one third of the analyzed KOGs (680 of 2540) and EPOCs (1810 of 5590) were associated neither with the known ancestors of endosymbionts, Alphaproteobacteria and Cyanobacteria, nor with Asgard. However, in sharp contrast to the Asgard associations with information processing, protein glycosylation and trafficking, and other functions discussed above, or the alphaproteobacterial associations with oxidative phosphorylation and sulfur metabolism, EPOCs associated with diverse other bacteria showed few if any coherent functional trends.
Considering the diverse set of bacteria, and all possible KEGG maps and modules, only Alphaproteobacteria were associated with a pathway including more than 20 EPOCs and with a greater aELW than Asgard (Glutathione metabolism, Figure 4). For all other analyzed functional classes of eukaryotic genes, bacterial associations were weaker than the associations with Asgard (Figure 4). The second most individually prominent bacterial contribution was from Myxococcota, of the former deltaproteobacterial clade. Although globally weaker than Asgard associations, Myxococcota showed consistent associations with nicotinate and nucleotide synthesis, including both purine and pyrimidine synthesis, as well as nucleoside sugar metabolism. Myxococcota were unique in this regard, as most bacteria showed diffuse associations across sugar and fatty acid metabolism and/or diverse transporters. The nucleotide-related associations with Myxococcota were primarily limited to phosphatases and phosphoribosyltransferases acting on nucleotide sugars, including 5′ and 3′ nucleotidases, and the respective EPOCs showed little to no competing Asgard association. While noteworthy, these associations were limited to a few unique KOGs, whereas all other associations with Myxococcota remained scattered across various pathways (Extended Data Figure 5).
In addition to investigating metabolic pathways for global associations, we directly examined those individual EPOCs that were highly likely to be derived from diverse bacteria. For this analysis, we considered a stricter subset of the core EPOCs, requiring a eukaryotic outgroup containing at least 15 taxonomic clades and prokaryotic sister taxa with at least 20 sequences and ELW > 0.7. Only 16 of the 127 unique KOGs meeting these criteria were found to be associated with diverse bacterial lineages, mostly Actinomycetota, FCB group and Betaproteobacteria, with minor contributions from Mycoplasmatota and Campylobacteriota (Extended Data Figure 6). The remaining 111 were Asgard-derived. The 16 prominent bacterial KOGs covered a wide range of cellular functions, from MFS transporters to lipases to components of core sugar metabolism and cardiolipin synthesis, once again with no noticeable trends. Taken together, although many associations were individually significant, no pathways appeared significantly enriched in associations with any single bacterial taxon, and we identified virtually no consistent trends among the bacterial associations. Instead, we interpret these diffuse associations as indications of highly specific contributions of limited functional scope from diverse bacteria other than the known endosymbiotic partners.
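The stricter selection criteria amount to three joint thresholds, which can be sketched as a simple filter. The record type, field names and toy values below are hypothetical; only the three thresholds are taken from the text.

```python
from dataclasses import dataclass

@dataclass
class SisterCall:
    """One eukaryotic clade with its best prokaryotic sister taxon (hypothetical record)."""
    kog: str
    sister_taxon: str
    n_eukaryotic_clades: int
    n_sister_sequences: int
    elw: float

def strict_subset(calls, min_euk_clades=15, min_sister_seqs=20, min_elw=0.7):
    """Keep calls with a broad eukaryotic outgroup and a well-populated, well-supported sister."""
    return [
        call
        for call in calls
        if call.n_eukaryotic_clades >= min_euk_clades
        and call.n_sister_sequences >= min_sister_seqs
        and call.elw > min_elw
    ]

calls = [
    SisterCall("K00001", "Asgard", 18, 40, 0.92),
    SisterCall("K00002", "Actinomycetota", 16, 25, 0.75),
    SisterCall("K00003", "Betaproteobacteria", 12, 30, 0.95),  # too few eukaryotic clades
    SisterCall("K00004", "Asgard", 20, 22, 0.70),              # ELW not above 0.7
]
kept = strict_subset(calls)
```

Note that the ELW threshold is strict (greater than 0.7), whereas the two count thresholds are inclusive, following the wording "at least" in the text.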
Relative contributions of evolution pre- and post-LECA
To compare the ancestral stem lengths in phylogenetic trees of core eukaryotic genes of different inferred origins, we employed the methodology originally implemented by Pittis and Gabaldon 26. Briefly, we define the raw stem length as the distance from the LECA node to the shared Last Common Ancestor (LCA) node of Eukarya and its most likely prokaryotic sister phyla. To account for differences in evolutionary rates, we divided this raw stem length by the median of the eukaryotic branch lengths, measured from the LECA to each leaf (Figure 5C). Considering that the astronomical time post-LECA is the same for all genes, if the tempo and mode of pre- and post-LECA evolution were the same, the normalized stem length would be proportional to the time elapsed between the divergence of the gene from its prokaryotic donor and the LECA, that is, it would reflect the timing of acquisition. Using this approach, Pittis and Gabaldon found that proteins of alphaproteobacterial descent had significantly shorter normalized stem lengths than proteins of archaeal descent, consistent with a mitochondria-late scenario for eukaryogenesis 26.
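The normalization can be made concrete with a minimal sketch on a toy tree. It assumes the tree is stored as nested dictionaries in which each node records the length of the branch to its parent, and that the parent of the LECA node is the LCA shared with the prokaryotic sister clade, so that the LECA node's own branch is the raw stem; the data structure and function names are hypothetical.

```python
from statistics import median

def leaf_distances(node, dist=0.0):
    """Yield the path length from the starting node to every leaf below it."""
    children = node.get("children", [])
    if not children:
        yield dist
        return
    for child in children:
        yield from leaf_distances(child, dist + child["length"])

def normalized_stem_length(leca_node):
    """Raw stem length divided by the median LECA-to-leaf distance."""
    raw_stem = leca_node["length"]  # branch from the shared LCA to the LECA (assumption)
    return raw_stem / median(leaf_distances(leca_node))

# Toy eukaryotic clade: LECA-to-leaf distances are 1.0, 3.0 and 4.0 (median 3.0).
leca = {
    "length": 0.1,
    "children": [
        {"length": 1.0},
        {"length": 2.0, "children": [{"length": 1.0}, {"length": 2.0}]},
    ],
}
stem = normalized_stem_length(leca)
```

Using the median rather than the mean of the LECA-to-leaf distances limits the influence of a few fast-evolving eukaryotic lineages on the normalization.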
Across the full set of eukaryotic stem lengths for our dataset (5,850 stems), we observed a wide distribution with a sharp maximum close to 0.05, highly reminiscent of the previous findings 26 (Figure 5B). However, in our analysis, the stems for genes of alphaproteobacterial origin were significantly (p ≈ 9.9 × 10⁻⁶; Figure 5) longer than those of Asgard origin, and longer than those of other major bacterial contributors as well (Figure 5B). Comparison of the stem length distributions across functional classes of genes (Figure 5A) suggested an explanation for these observations. The shortest stems belong to the Genetic Information Processing functional category, belying our expectation that these genes would be the longest-residing genes in the nascent eukaryotic lineage. We suggest that another major determinant of the relative stem length of a gene is the amount of adaptive evolution post-acquisition, which is necessary to adjust the newly acquired gene to the alien intracellular molecular environment. Thus, genes inherited from the Asgard ancestor, in particular those involved in information processing, were pre-adapted to the cellular environment of the evolving protoeukaryotes, whereas genes acquired from radically different bacterial sources had to adapt substantially post-acquisition, increasing the apparent lengths of their pre-LECA stems. A case in point is the set of oxidative phosphorylation components, which were apparently acquired during the mitochondrial symbiogenesis and thus simultaneously and from the same donor. The distribution of their stem lengths (Figure 5A) was as broad as that for proteins of other functional classes (Figure 5A and Extended Data Figure 7), indicating that stem lengths are mostly determined by factors other than the acquisition time. Thus, our stem length analysis failed to provide unequivocal resolution of the temporal order of the prokaryotic contributions to eukaryogenesis. Nevertheless, the results appear to be compatible with the capture of the alphaproteobacterial endosymbiont by a host that was already on the path to eukaryotic-like complexification, along with capture of genes from various bacteria at different stages of eukaryogenesis.