Web interface and technical specificities
All COCONUT data is stored in MongoDB, a cross-platform, document-oriented NoSQL database. The smallest unit in MongoDB is a document, composed of key-value pairs similar to JSON objects. Documents of the same nature are organized in collections, which are the equivalent of tables in SQL-based databases. MongoDB is particularly well suited to large and complex data, supports multiple index types, including text indexes that enable enhanced text search in text-indexed fields, and contains a wide range of built-in search and analysis functions.
Two major collections are present in the COCONUT database: SourceNaturalProduct, which contains the original NP data collected from the open sources, and UniqueNaturalProduct, the unified and curated collection of NPs. The full version of COCONUT with all the calculated features can be accessed as a MongoDB dump in the Downloads section of the website. Requests for displaying additional crucial features in the web interface and making them searchable through the advanced search interface are welcome via the COCONUT GitHub tracker (see below).
The COCONUT online front-end is developed entirely with React.js [7], a JavaScript library for building responsive and efficient user interfaces. The OpenChemLib library [8] provides the chemical editor for the search functions. The COCONUT back-end, which processes the front-end requests and communicates with the database, is written in Kotlin and Java 11 using the Spring framework. The CDK [9] library is used to process chemical information and formats.
The COCONUT web interface, back-end and database are entirely Dockerised, allowing quick and easy deployment on local servers and in the cloud. All the code, for both front-end and back-end, is available on GitHub (https://github.com/mSorok/NaturalProductsOnline).
Data provenance, model and content
Data present in COCONUT has been retrieved from 53 data sources, complemented by NPs manually collected from the literature, as shown in Table 1. In the current COCONUT release (August 2020), there are 426,916 unique “flat” (with no stereochemistry) natural products, and a total of 746,626 NPs where stereochemistry has been preserved when available.
Every molecule collected from external sources must pass a quality control and registration procedure, in which its structure is checked for size (between 5 and 210 heavy atoms), connectivity (only the biggest connected structure is kept), the presence of pseudo-atoms, the correctness of implicit and explicit hydrogens, the correctness of bonds and the conservation of valences. The Kekulé representation is also assigned to the aromatic systems of each compound. Then, NPs of different provenance are unified based on the identity of their InChI keys without stereochemistry. The unification is performed this way because stereochemistry is not uniformly present across data sources and can be represented differently. When available, the original molecular structure with stereochemistry is preserved and can be visualized for each NP entry.
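The size and connectivity checks described above can be illustrated with a minimal Python sketch. It is not the COCONUT implementation, which relies on the CDK: the molecule is represented here as a plain list of element symbols and a list of bonds, the function names are hypothetical, and "biggest connected structure" is assumed to mean the fragment with the most atoms.

```python
from collections import deque

# Size limits quoted in the text (heavy atoms, i.e. non-hydrogen atoms).
MIN_HEAVY_ATOMS = 5
MAX_HEAVY_ATOMS = 210


def largest_component(atoms, bonds):
    """Return the sorted atom indices of the largest connected fragment.

    atoms: list of element symbols, e.g. ["C", "O", "Na"]
    bonds: list of (i, j) index pairs
    Assumption: "largest" means the fragment with the most atoms.
    """
    neighbours = {i: set() for i in range(len(atoms))}
    for i, j in bonds:
        neighbours[i].add(j)
        neighbours[j].add(i)

    seen = set()
    best = []
    for start in range(len(atoms)):
        if start in seen:
            continue
        # Breadth-first traversal collects one connected fragment.
        component = []
        queue = deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.append(node)
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        if len(component) > len(best):
            best = component
    return sorted(best)


def passes_size_check(atoms, bonds):
    """Keep only the largest fragment, then apply the heavy-atom limits."""
    kept = largest_component(atoms, bonds)
    heavy = sum(1 for i in kept if atoms[i] != "H")
    return MIN_HEAVY_ATOMS <= heavy <= MAX_HEAVY_ATOMS
```

For example, a salt whose largest fragment contains four heavy atoms would be rejected after the counter-ion is stripped, whereas a six-membered carbon ring would pass.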
The authors are well aware that different stereoisomers of a compound can have very different biological activity. The procedure described above was a necessary step to create a unified resource out of distributed databases of varying quality. Further curation will gradually improve stereochemical assignments and linkage to original source articles.
Each unique NP is then assigned a unique identifier, composed of the “CNP” prefix and 7 digits. Then, a range of molecular properties, descriptors and fingerprints (full list in Table 2) are computed using the built-in CDK libraries. As the number of computed properties is large (73 fields in each document corresponding to one unique NP), only a selected fraction of them is displayed on the COCONUT web interface. Finally, a first round of automatic curation of NP metadata is performed, covering in particular the official name, the molecular name synonyms, cross-references to other major chemical databases, correction of the literature references (PubMed identifiers and DOIs) and taxonomy. All original data, unified NPs and the derived and calculated information are stored in MongoDB. The chemical classification of all NPs in COCONUT is performed with ClassyFire [10] and, when available, is shown in the corresponding section of the compound page. Additionally, frameworks and representations that facilitate the analysis of the chemical and therapeutic properties of NPs are computed, such as Murcko frameworks [11], Ertl functional groups [12] and DeepSMILES [13].
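The identifier format (the “CNP” prefix followed by 7 digits) can be illustrated with a short Python sketch. The function names are hypothetical, and the idea that identifiers are derived from an integer counter is an assumption made only for illustration; the text specifies the format, not how numbers are allocated.

```python
import re

# "CNP" followed by exactly seven digits, as described in the text.
COCONUT_ID = re.compile(r"CNP\d{7}")


def format_coconut_id(number):
    """Zero-pad an integer into the CNP + 7 digits format.

    Assumption: identifiers come from an integer counter.
    """
    if not 0 <= number <= 9_999_999:
        raise ValueError("identifier number must fit in 7 digits")
    return f"CNP{number:07d}"


def is_coconut_id(text):
    """Return True if the whole string has the CNP + 7 digits shape."""
    return COCONUT_ID.fullmatch(text) is not None
```

Under these assumptions, `format_coconut_id(42)` yields `CNP0000042`, and `is_coconut_id` rejects strings such as `CNP42` that do not carry all seven digits.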
Last, the annotation level of each NP in COCONUT is computed. It is a 5-star-based system, where 1 star is the lowest annotation quality (no verified common name, no organism annotation, no literature reference and no trusted data source) and 5 stars is the highest quality, with all the intermediate annotation qualities reflected by 2, 3 and 4 stars. A “trusted” data source here is one that has a high curation level for NPs: ChEBI [14], KNApSAcK [4], ChEMBL [15], CMAUP [5], NP Atlas [3] and, of course, the manually picked data. The annotation level represented with stars is visible for each NP on its page.
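The text defines only the two extremes of the star scale. One plausible reading, shown in the Python sketch below, is that each of the four criteria (verified common name, organism annotation, literature reference, trusted data source) adds one star to a baseline of one. This additive rule is an assumption, not a documented COCONUT formula, and the source labels in `TRUSTED_SOURCES` are hypothetical spellings of the trusted sources listed above.

```python
# Hypothetical lowercase labels for the trusted sources named in the text.
TRUSTED_SOURCES = {"chebi", "knapsack", "chembl", "cmaup", "npatlas", "manual"}


def annotation_level(has_verified_name, has_organism, has_literature, sources):
    """Return a star rating from 1 to 5.

    Assumption: one baseline star plus one star per satisfied criterion.
    The text only fixes the meaning of 1 star (no criterion met) and
    5 stars (highest quality).
    """
    has_trusted_source = any(s.lower() in TRUSTED_SOURCES for s in sources)
    criteria = [has_verified_name, has_organism, has_literature, has_trusted_source]
    return 1 + sum(bool(c) for c in criteria)
```

With this rule, an NP with no name, organism, reference or trusted source receives 1 star, and an NP that satisfies all four criteria receives 5 stars, which matches the two end points given in the text.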
Natural product naming
NP common names in COCONUT have been retrieved, when available, from their databases of origin. The remaining NPs were searched by InChI in major chemical databases (PubChem, ChEMBL and ChEBI), and common names and synonyms were retrieved when the compound was present there. An IUPAC name is systematically computed for each NP using ChemAxon’s MolConvert [16], and when no common name could be found for the molecule, the IUPAC name is assigned as the main one. Therefore, all NPs in COCONUT have an assigned molecular name.
Computed molecular features
Figure 2 demonstrates the distributions and relationships of a small selection of computed molecular features within COCONUT. Sugar moieties are one frequent, but not mandatory, feature of NPs. To track their influence on other features, their absence and presence are colour-mapped (no sugar moiety in the molecular structure in blue, and the presence of at least one sugar moiety in orange). The wide molecular weight range is typical for NPs; it is, however, interesting to notice its correlation with the number of oxygen atoms in the molecule, regardless of the presence and absence of sugar. Another interesting correlation to be noted is between the molecular weight and the nitrogen atom number in sugar-free molecules. The NP-likeness score [17] has a typical distribution for an NP set, where most molecules have a positive score.
Counting rings in a molecule can be a complex task, as the outer perimeter of two fused rings can itself be counted as one big ring. With more condensed rings, the number of fused ring perimeters (also known as the set of all rings) can grow steeply. In Fig. 2, only the minimal ring count (the size of the minimal cycle basis) is represented.
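The minimal ring count has a simple closed form: for a molecular graph with V atoms, E bonds and C connected components, the size of a minimal cycle basis is E - V + C. The Python sketch below computes this number from a bond list. It is an illustration only, not the CDK routine used for Fig. 2, and it assumes that each bond is listed once.

```python
def minimal_ring_count(n_atoms, bonds):
    """Size of the minimal cycle basis: E - V + C.

    n_atoms: number of atoms (V)
    bonds:   list of (i, j) index pairs, each bond listed once (E)
    Connected components (C) are counted with a small union-find.
    """
    parent = list(range(n_atoms))

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    components = n_atoms
    for i, j in bonds:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            components -= 1

    return len(bonds) - n_atoms + components
```

For the carbon skeleton of naphthalene (10 atoms, 11 bonds, one component) the result is 11 - 10 + 1 = 2, whereas the set of all rings also contains the ten-membered outer perimeter and therefore has three members.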
Natural product annotation
In addition to their structure and computable structural properties, NPs need to be annotated with at least one literature reference stating where, when and from which organism the NP was isolated. As a consequence, an NP entry should be associated with at least one organism, preferably with an NCBI taxonomy identifier and the geographic location where the organism was collected. Unfortunately, this metadata is often omitted in the public databases and datasets from which COCONUT was assembled. Therefore, only 31.7% (135,352) of NPs in COCONUT are annotated with at least one organism taxon, the geographic location (at the continent level) of the organism collection is known for 15.4% (66,068) of NPs, and 16.6% (70,730) of NPs have at least one literature reference. These numbers combine the original NP annotations retrieved from their sources with our efforts to retrieve more extensive information from major trusted chemical databases: PubChem [18], ChEMBL [15], ChEBI [14], CMAUP [5] and KNApSAcK [4]. Despite our efforts, most of the links between an NP and the original publication of its structure elucidation, its source organism and its geographical location are still missing. A possible solution to fill these gaps is manual curation, but the amount of data in COCONUT makes this approach prohibitive. Another solution is to use unsupervised machine learning and optical recognition approaches to parse modern peer-reviewed literature and books and re-establish the links between NP structures and their provenance.
We analysed the taxonomic classification of known NP producers, together with overlaps in NP production between superkingdoms, for the 31% of the NPs in COCONUT for which the provenance organism is known (Fig. 3). Five taxonomic categories are distinguished: plants, bacteria, fungi, animals and marine. The last one is not a proper monophyletic classification, but rather reflects a group of organisms that are found only in marine and oceanic environments; it can therefore overlap in terms of species and NP content with the other categories, which are taxonomically more stringent. A large part (65%) of these annotated NPs are produced only by plants, and very few (0.5%) are of animal origin. The main overlaps in NP production are between plants and marine organisms (which is unsurprising, as there can be true plants among the marine entities) and, surprisingly, between plants and fungi. The other overlaps between taxonomic kingdoms are not as significant. It should be pointed out that multicellular organisms, such as plants, animals and some fungi, live most of the time in symbiosis with microorganisms, in particular bacteria. NPs isolated from such a multicellular organism may therefore be synthesized and secreted by its symbionts or microbiome, and thus be mistakenly assigned to the wrong organism.
The geographic location of the collection or the natural presence of the NP-producing organism is a piece of information that is even more difficult to obtain. Nowadays, a range of organisms, and in particular plants, can be found in different parts of the planet due to globalisation and their success in human consumption (e.g. garlic, tomatoes, curcuma or ginger). It is, therefore, difficult, if not impossible, to determine their original provenance. Also, the geographical information is often omitted in literature and most NP databases. When available, the geographical provenance is stored in the MongoDB dump of COCONUT, but not displayed on the website.
For NPs where geographical information is available, it appears that most of them are produced by organisms that have been isolated in Asia (Fig. 4). This bias is introduced by the intensive scientific study of traditional Chinese and Indian medicines and by the considerable efforts in isolation and elucidation of NPs from medicinal plants. NPs from the African continent are also well represented in COCONUT (Fig. 4), mainly due to the scientific interest in African traditional medicines and African biodiversity. There is, for now, no data from the biodiversity of the Australian continent, and only very little data for NPs isolated from endemic European organisms. NPs from the Americas mainly come from the exploration of Brazilian and Mexican biodiversity. Only a few NPs are present in more than one continent, mainly in Asia and Africa, and the overlap values are biased by the very different NP set sizes of the different continents.
Searching the database
COCONUT online is intended to be a full-fledged chemical database, with all the functions this implies, in particular chemical search. As chemical search is still uncommon with MongoDB, several approaches have been implemented to run molecular substructure and similarity searches.
Simple search
A simple search can be performed using the header search bar. The query can be a molecule name (e.g. “curcumin”), SMILES, InChI, InChI key, COCONUT id or molecular formula. Name search uses native MongoDB text indexing, allowing flexible search in the “name” and “synonyms” fields. The type of the input string is first identified using regular expressions; the database is then queried against the appropriate fields, and the result, when one exists, is returned to the front-end.
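The type-detection step can be sketched in Python as an ordered list of regular expressions. The patterns, their order and the returned labels are assumptions made for illustration and are not taken from the COCONUT code base. The SMILES and molecular formula patterns are rough heuristics: a short string such as "C2Cl6" is valid in both notations, and here it would be treated as a SMILES.

```python
import re

# Ordered heuristics: the first pattern matching the whole string wins.
# All patterns are illustrative assumptions, not the COCONUT originals.
QUERY_PATTERNS = [
    ("coconut_id", re.compile(r"CNP\d{7}")),
    ("inchi", re.compile(r"InChI=1S?/.+")),
    ("inchikey", re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")),
    # Organic-subset atoms, bracket atoms, bonds, branches, ring closures.
    ("smiles", re.compile(
        r"(?:Cl|Br|[BCNOPSFI]|[bcnops]|\[[^\]]+\]|[=#$:/\\\-+().%0-9@])+")),
    # Element symbols, each optionally followed by a count.
    ("molecular_formula", re.compile(r"(?:[A-Z][a-z]?\d*)+")),
]


def classify_query(text):
    """Guess the type of a search string; fall back to a name search."""
    text = text.strip()
    for label, pattern in QUERY_PATTERNS:
        if pattern.fullmatch(text):
            return label
    return "name"
```

In this sketch the more specific patterns come first, so that a COCONUT identifier or an InChI is never mistaken for a SMILES string, and anything unrecognised falls through to the text-indexed name search.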
Substructure search implementation
Searching for an exact substructure in a MongoDB database of molecules turned out to be surprisingly easy. Each molecule in the database needs to have the fingerprint of choice (COCONUT uses PubChem fingerprints) precomputed and stored as a list of bytes (BinData type in MongoDB). The fingerprint of the query molecule (the substructure) is then computed and matched against the database using the $bitsAllSet query operator [19]. This operator, native to MongoDB, selects the documents in a collection where a BinData field has all the query bits set to “on” (the field can have additional bits set to “on” that are not present in the query). Because this fingerprint screen only yields candidates, the substructure match is then confirmed by Ullmann pattern matching [20] using CDK methods.
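The screening logic can be sketched in plain Python, with fingerprints held as integers used as bit sets. The helper names and the "fingerprint" field name are hypothetical. The sketch covers only the fingerprint screen and the shape of the corresponding MongoDB filter; it performs no actual substructure matching, so every hit would still need the confirmation step described above.

```python
def fingerprint_from_bits(on_bits):
    """Pack a collection of "on" bit positions into one integer."""
    fp = 0
    for bit in on_bits:
        fp |= 1 << bit
    return fp


def passes_screen(molecule_fp, query_fp):
    """True if every bit set in the query is also set in the molecule.

    This mirrors the semantics of MongoDB's $bitsAllSet operator:
    extra bits in the molecule fingerprint are allowed.
    """
    return (molecule_fp & query_fp) == query_fp


def screen(database, query_fp):
    """Return the ids of the candidate molecules, in sorted order."""
    return sorted(mol_id for mol_id, fp in database.items()
                  if passes_screen(fp, query_fp))


def mongo_filter(on_bits, field="fingerprint"):
    """Build the equivalent MongoDB filter document.

    $bitsAllSet accepts a list of bit positions; the field name
    is a hypothetical placeholder.
    """
    return {field: {"$bitsAllSet": sorted(on_bits)}}
```

A molecule with bits {1, 4, 7, 9} passes the screen for a query with bits {1, 7}, while a molecule with bits {1, 4} does not. The screen can return false positives, since different substructures may set the same bits, but it never discards a true match, which is why a graph-matching confirmation is needed only on the reduced candidate set.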
Similarity search implementation
Similarity search with MongoDB was implemented following the excellent ChEMBL blog post tutorial on LSH-based similarity search in MongoDB [21] and adapting it to Java, Kotlin and Spring Data. In this approach, the MongoDB aggregation framework is used to perform an inverted index search against PubChem fingerprints: a separate collection maps each fingerprint bit to the COCONUT identifiers of the molecules that contain the molecular feature encoded by that bit.
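The inverted index idea can be sketched in Python with in-memory dictionaries. This is not the COCONUT implementation: it omits the LSH step of the cited tutorial and the MongoDB aggregation pipeline, and it assumes the Tanimoto coefficient as the similarity measure, which the text does not state. Function names and the default threshold are hypothetical.

```python
from collections import Counter, defaultdict


def build_inverted_index(fingerprints):
    """Map each fingerprint bit to the ids of the molecules that set it.

    fingerprints: dict of molecule id -> set of "on" bit positions
    """
    index = defaultdict(set)
    for mol_id, bits in fingerprints.items():
        for bit in bits:
            index[bit].add(mol_id)
    return index


def similarity_search(query_bits, fingerprints, index, threshold=0.7):
    """Return (id, score) pairs with Tanimoto score >= threshold.

    Tanimoto = shared bits / (query bits + molecule bits - shared bits).
    Only molecules sharing at least one bit with the query are scored.
    """
    query_bits = set(query_bits)
    shared = Counter()
    for bit in query_bits:
        for mol_id in index.get(bit, ()):
            shared[mol_id] += 1

    hits = []
    for mol_id, common in shared.items():
        union = len(query_bits) + len(fingerprints[mol_id]) - common
        score = common / union
        if score >= threshold:
            hits.append((mol_id, score))
    # Best score first; ties are broken by identifier.
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))
```

The index makes it unnecessary to read every fingerprint: molecules sharing no bit with the query are never touched, and the number of shared bits is obtained by counting how often each identifier appears in the posting lists of the query bits.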
Advanced search
The advanced search allows searching for NPs in COCONUT according to a range of parameters, such as molecular formula, molecular descriptor values, number of rings, type of sugar moieties present in them, etc.
Querying COCONUT through the API
An API is also available to query COCONUT programmatically. It relies on Kotlin API functionalities, and its usage, together with some examples, is described in detail in the documentation section of the website (https://coconut.naturalproducts.net/documentation).
Documentation
Complete documentation describing COCONUT, its data and its functionalities is available in the documentation section of the website (https://coconut.naturalproducts.net/documentation).