Several historical records describe hypotheses about the colonization and dispersion patterns of C. arabica worldwide (Haarer 1958; Ferreira et al. 2019). Nevertheless, these records lack molecular support that would validate them. There is a gap in knowledge about how these varieties arrived and spread across different territories, particularly in Latin America. In addition, we tested the hypothesis that the current diversity within C. arabica is underestimated, given its wide geographic distribution and cultivation in different environments worldwide. In this study, we used data mining and bioinformatic approaches to collect the DNA sequences of the internal transcribed spacer (ITS) region available to date for C. arabica, including data for localities from Africa, Asia, South East Asia, and Latin America, several of which had never been considered in the historical hypotheses of (i) colonization and (ii) diversity. As a result, owing to the limited availability of sequences, the dispersal and colonization hypothesis could be only partially corroborated, whereas the hypothesis that current diversity is underestimated could be fully verified.
Dispersion patterns and ancestral area for C. arabica
Notably, the molecular hypothesis corroborated the ancestral area of origin for C. arabica currently reported by the historical hypotheses. However, the two differ in some respects regarding the ancestral area and the dispersion routes of C. arabica. (i) The area of ancestral origin previously reported by Haarer (1958) and Ferreira et al. (2019) comprises three countries (Kenya, Ethiopia and South Sudan), whereas the molecular data suggested only two (Ethiopia and South Sudan). This is probably because our dataset contains no sequences from Kenya, partly because few are available in data banks and those few are of poor quality. However, we had no sequences from South Sudan either, and yet the Random Walk algorithm was able to reconstruct South Sudan as part of the ancestral area of origin of C. arabica. (ii) The historical hypothesis does not contemplate simultaneous colonization of other geographical areas from the ancestral area, but only a single dispersal towards Yemen, from where C. arabica reached other areas in an almost non-overlapping chronological sequence of dispersal and colonization events worldwide. This contrasts with the molecular hypothesis, which recovered six independent dispersion events from the ancestral area of origin. This may be due to the smuggling of seeds by humans, which formed the genetic basis of the arabica coffee varieties cultivated worldwide (Montagnon et al. 2022). (iii) Furthermore, in addition to the colonization events reported for Brazil by Ferreira et al. (2019), this study adds another colonization route that arrives in Brazil and shares a common ancestor with Vietnam (Yellow route). (iv) Lastly, according to the historical data, C. arabica was introduced to India by way of Yemen, whereas the molecular data indicate that the Indian material has a direct common ancestor from Chad (Blue route).
This study also contributes new routes and points of relevance. In the map proposed by Ferreira et al. (2019), the direct ancestor of the Réunion Islands material is Yemen; in this study, by contrast, Chad opened the way for the distribution of C. arabica towards Sierra Leone, ending with the colonization of the Réunion Islands (Blue route). Additionally, it was possible to relate previously unreported localities to those already reported (Tanzania, Sierra Leone, Chad, and Vietnam). From the historical data, it was already known that C. arabica was introduced through a single colonization into French Guiana and Surinam, from where it spread to the rest of America. This study attempted to reconstruct the route of that dispersion in space and time. In this way, five independent colonization events were identified that describe the dispersion throughout Latin America (Blue, purple, pink, yellow and red routes). These outcomes provide more information about the ancestral origin and dispersion routes of C. arabica through time and experimentally confirm the existing theories.
The dissemination of C. arabica from Ethiopia to Latin America is linked to the successive reduction of diversity within the two principal populations of wild coffee, Typica and Bourbon (Anthony et al. 2002). From these two narrow genetic bases, some derived crops emerged, all of them with similar agronomic characteristics. In Latin America, 80% of the crops are arabica coffee, based on selection lines within the Typica and Bourbon varieties, which has led to domestication and a stable genetic base (Scalabrin et al. 2020).
Finally, it is important to mention that the dispersal and colonization patterns observed in the DNA may be associated with events from the recent past rather than the distant past described by Haarer (1958) or Ferreira et al. (2019). This would explain the multiple signals of dispersion.
Cryptic diversity within C. arabica
In this study, a large cryptic diversity was found within C. arabica using the SLSD delimitation method. A total of 18 unique delimitations were detected, with several lineages estimated for Latin America. However, this contradicts studies based on the allelic frequencies of different nuclear molecular markers, which have reported low genetic diversity within C. arabica and attributed it to the species' tetraploid origin and its inherent reproductive mechanisms (Lashermes et al. 1997; Lashermes et al. 1999; Combes et al. 2000; Anthony et al. 2002; Bhat et al. 2005; Scalabrin et al. 2020).
Nevertheless, some studies are consistent with the results obtained here, since studies at the interspecific level have been developed in recent years. For instance, Combes et al. (2000) report higher diversity in introgressed varieties than in selection lines. In addition, Montagnon et al. (2021) recently reported the existence of a previously undescribed genetic group from Yemen that differs from the clusters already found for C. arabica in Ethiopia and Yemen. This supports the view that the diversity of the species is not homogeneous, which increases the possibility of finding higher genetic diversity within C. arabica than has been reported so far.
The contradiction between these results might be explained by four principal hypotheses. (i) The implementation of the SLSD method, which has already permitted the estimation of cryptic diversity in several taxa using phylogenetic information (Machado et al. 2018; Carvalho et al. 2019; Ximenes et al. 2021). This differs from other studies, which commonly use algorithms based on allelic frequencies that impose more demanding conditions for detecting population differences, structure or genetic clusters: populations in Hardy-Weinberg equilibrium, sample sizes robust enough for calculating frequencies, and homogeneous sample sizes between populations (Pritchard et al. 2000; Puechmaille 2016). (ii) The use of more variable and informative molecular markers such as the ITS region, which has been reported to allow the analysis of intraspecific variation in some taxa and to determine relationships between closely related congeners and genera (Hao et al. 2015; Walker et al. 2022). (iii) The inclusion of sequences from geographic regions that had never been sampled before, which increases the geographic area covered by the study and the likelihood of finding higher genetic diversity (Hensen and Oberprieler 2005; Hague and Routman 2016). (iv) The implementation of genetic improvement studies by means of new gene introgressions into C. arabica (Lashermes et al. 2010); these introgressed varieties are mixed with traditional varieties (Montagnon et al. 2019; Pruvot-Woehl et al. 2020). Moreover, coffee crops are currently subjected to different cultivation techniques and abiotic changes that drive microevolutionary forces, which over time modulate the genetic diversity and abilities of coffee species, such as humidity, altitude and pest resistance, as named by Ferreira et al. (2020).
This may be contributing to the formation of new, barely studied lineages with better cup profiles and greater resistance to pests and climate change, as is already happening in Colombia with native varieties that have not been genetically verified, such as the variety colloquially known as Caturra Chiroso (Ruiz A.F., 2019; personal communication, Coffee Laboratory Director in Chief, Servicio Nacional de Aprendizaje, Caldas, Colombia).