Database Content
The current release 1.0 of MarineMetagenomeDB contains metadata of 11 449 marine metagenomes. Of these, 9 202 (80.37%) samples originated from SRA, and the remaining 2 247 (19.63%) were obtained from MG-RAST. Illumina was the most frequently used sequencing technology, with 9 655 samples (84.33%), followed by LS454, used in 962 samples (8.40%), and Ion Torrent, used in 216 samples (1.87%) (Figure 2A). The Pacific Ocean, with 3 406 samples (29.75%), and the Atlantic Ocean, with 3 020 samples (26.38%), are the water bodies with the highest number of samples. Other water bodies, such as the Indian Ocean and the Mediterranean Sea, account for 872 (7.62%) and 681 (5.95%) of the samples, respectively (Figure 2B). For samples collected from water bodies within national boundaries, the United States of America contributed the highest number of samples with 891 (7.78%), followed by Israel, Australia, and Brazil with 253 (2.21%), 189 (1.65%), and 187 (1.63%) samples, respectively (Figure 2C). Strikingly, about 50% of the samples carried no information about their associated biome. Among samples with biome information, the term “ocean” occurred most often, with 4 624 occurrences (40.39%). “Estuarine” and “marine benthic” were the second and third most frequent biomes, with 308 (2.69%) and 290 (2.53%) occurrences, respectively (Figure 2D). The co-occurrence network between biomes and water bodies in Figure 3A shows which biomes were sampled in each water body. This network may guide the design of sampling expeditions, as it indicates which parts of our oceans and seas have been explored and which have not.
The most populated MarineMetagenomeDB attribute is ‘MarMDB_geographic_feature’, which is annotated in the metadata of 8 125 samples (70.97%). Other attributes annotated in more than 10% of the samples are ‘MarMDB_biome’, ‘MarMDB_water_type’, ‘MarMDB_sediment’, ‘MarMDB_oceanic_zone’, ‘MarMDB_marine_ecosystem’ and ‘MarMDB_other_material’. Some of the marine terms identified in the metadata co-occurred in the same metagenome. The frequency and co-occurrence of all MarineMetagenomeDB marine attributes are visualized in Figure 3B. The remaining ten categories of MarineMetagenomeDB attributes defined in this work appeared at a lower frequency, the least populated being ‘MarMDB_anthropogenic_phenomenon’ with 73 (0.64%) occurrences and ‘MarMDB_man_made_structures’ with 78 (0.68%) occurrences. Additional file 8 shows the percentage of missing values per attribute in the current MarineMetagenomeDB data.
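As a rough illustration of how such coverage and co-occurrence figures can be derived from a metadata table, the following Python sketch counts, for a handful of made-up samples, how often each attribute is annotated and how often two attributes are annotated in the same sample. The sample records and the treatment of empty strings and “NA” as missing values are assumptions made for illustration; this is not the procedure used to produce Figure 3B.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy records; a real export contains many more attributes.
samples = [
    {"MarMDB_geographic_feature": "coast", "MarMDB_biome": "ocean", "MarMDB_sediment": ""},
    {"MarMDB_geographic_feature": "bay", "MarMDB_biome": "NA", "MarMDB_sediment": "sand"},
    {"MarMDB_geographic_feature": "", "MarMDB_biome": "estuarine", "MarMDB_sediment": ""},
    {"MarMDB_geographic_feature": "reef", "MarMDB_biome": "ocean", "MarMDB_sediment": ""},
]

MISSING = {"", "NA"}  # assumption: how missing values are encoded


def annotated_attributes(sample):
    """Return the sorted names of the attributes that carry a value."""
    return sorted(name for name, value in sample.items() if value not in MISSING)


coverage = Counter()        # attribute -> number of samples annotated
co_occurrence = Counter()   # (attribute, attribute) -> number of shared samples
for sample in samples:
    present = annotated_attributes(sample)
    coverage.update(present)
    # Every unordered pair of attributes annotated in the same sample.
    co_occurrence.update(combinations(present, 2))

for attribute, count in coverage.most_common():
    print(f"{attribute}: {count} ({100 * count / len(samples):.2f}%)")
for (first, second), count in co_occurrence.most_common():
    print(f"{first} + {second}: {count}")
```

Sorting the attribute names before pairing them ensures that each pair is counted under a single key regardless of column order.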
Usage and functionalities of the web app
The MarineMetagenomeDB user interface provides easy access to the database and several functionalities to aid in selecting and downloading samples of interest. It has three main sections: ‘Quick Search’, ‘Advanced Search’ and ‘Interactive Map’. Briefly, the ‘Quick Search’ section holds the full content of the current database and provides options to filter samples by their main characteristics, including Biome, Environmental Material, and Geographic features. The ‘Advanced search’ section contains filters for all attributes in the dataset, allowing users to dynamically search for more specific attributes of interest, such as ‘MarMDB Material’, ‘Collection_date’ and ‘Assembled’, among others. The ‘Interactive Map’ section provides a graphical method of selecting samples by location directly from a world map; it is, however, limited to samples with valid geographical coordinates. Sample identifiers (‘sample_id’, ‘project_id’, ‘library_id’, ‘PubMed ID’ and ‘BioProject ID’) are hyperlinked to the source databases when available. All tabs include features to visualize the distribution of the selected data. Under the ‘Visualize’ button, a pie chart shows the percentage of the complete dataset that is currently selected. An interactive histogram is generated for the available attributes, and a summary table reports the distribution of the selected data for the attribute explored. Figure 4 shows an overview of the MarineMetagenomeDB user interface. Video tutorials on how to use the MarineMetagenomeDB are available at https://www.youtube.com/channel/UCZlcoI8xiWno0mD9V954qRA.
Quick search
The ‘Quick search’ tab provides users with access to the complete content of the MarineMetagenomeDB. Here, the dataset can be filtered by its main attributes. All metagenomes, including those without valid coordinates, are shown. Users can filter entries using 30 available filters or by typing in the search box at the top right of the table. Afterward, the metadata of the selected entries can be downloaded as a comma-separated values (.csv) file. If the user does not apply any filter, the metadata of the entire dataset can be downloaded. The steps necessary to obtain raw sequencing data are described in the section ‘Downloading raw data from selected metagenomes’.
Advanced search
The ‘Advanced search’ tab allows users to generate dynamic filters for all available attributes in the dataset. A checkbox allows users to exclude samples with missing values for the chosen attributes. Clicking the ‘Search and add filters’ button opens a window in which attributes can be searched by name; they are also organized into the following categories: ‘Sample Attributes’, ‘Environmental Material’, ‘Geographic Feature’, ‘Sample Location’, and ‘Sequencing Features’. After filters and their associated values have been selected, the metadata of the selected entries can be downloaded as a comma-separated values (.csv) file.
Interactive map
The interactive map allows users to identify samples from locations of interest on a world map. The map displays only the locations of samples with valid coordinates. We implemented drawing tools (rectangular or polygon shapes) to help users select samples on the map. Note that an individual point on the map may represent more than one sample, since multiple samples can share the same coordinates. After samples have been selected on the map, their metadata are shown in the dataset table below the map. Users can then further limit the entries using the filters present in the ‘Quick search’ tab or by typing in the search box at the top of the table. After filtering, the resulting metadata table can be downloaded as a comma-separated values (.csv) file.
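To make the selection logic concrete, the following Python sketch selects samples that fall inside a drawn shape using a standard ray-casting point-in-polygon test. It is only an illustration: the source does not describe how the web app implements its drawing tools, and the sample identifiers, coordinates and selection shape below are made up. The test treats longitude and latitude as planar coordinates, so shapes that cross the antimeridian are not handled.

```python
from collections import defaultdict


def point_in_polygon(lon, lat, polygon):
    """Ray-casting test; polygon is a list of (lon, lat) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that line.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


# Hypothetical samples: (sample_id, longitude, latitude).
# None marks a sample without valid coordinates.
samples = [
    ("S1", -70.0, 40.0),
    ("S2", -70.0, 40.0),  # same position as S1: one map point, two samples
    ("S3", 10.0, 55.0),
    ("S4", None, None),
]

# Group samples by coordinate, so that one map point can hold several samples.
points = defaultdict(list)
for sample_id, lon, lat in samples:
    if lon is not None and lat is not None:
        points[(lon, lat)].append(sample_id)

# A rectangle drawn on the map, expressed as a four-vertex polygon.
selection = [(-80.0, 30.0), (-60.0, 30.0), (-60.0, 50.0), (-80.0, 50.0)]

selected = [
    sample_id
    for (lon, lat), ids in points.items()
    if point_in_polygon(lon, lat, selection)
    for sample_id in ids
]
print(selected)
```

Grouping by coordinate first means the geometric test runs once per map point, and every sample at a selected point is returned.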
Downloading raw data from selected metagenomes
We developed a simple download procedure to obtain raw data from SRA. Unfortunately, MG-RAST no longer allows public downloads. Our Python scripts enable simple installation of the SRA Toolkit (https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software) and download of the metagenomes listed in the exported comma-separated values (CSV) file using two user-friendly commands. To support less experienced users, a script with a graphical user interface (GUI) is available. Although most users may operate on Linux systems, we provide Windows executables to allow instant execution without installation. The download scripts are compatible with the CSV exports from the TerrestrialMetagenomeDB [18] and the HumanMetagenomeDB [17] and are provided at https://github.com/mdsufz/downloadtool.
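For readers who prefer to call the SRA Toolkit directly, the following Python sketch shows one way to turn an exported CSV file into `prefetch` and `fasterq-dump` command lines. It is not the download tool described above, whose commands are documented in its repository. The assumption that the ‘library_id’ column holds SRA run accessions, the accession prefixes used for filtering, the output directory name and the example accessions are all hypothetical.

```python
import csv
import io
import shlex

RUN_PREFIXES = {"SRR", "ERR", "DRR"}  # assumption: run accessions use these prefixes


def build_sra_commands(csv_text, accession_column="library_id", outdir="raw_data"):
    """Return prefetch and fasterq-dump command lines for each run in the CSV."""
    commands = []
    seen = set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        accession = (row.get(accession_column) or "").strip()
        # Skip blanks, duplicates and identifiers that are not SRA runs
        # (for example MG-RAST identifiers, which cannot be fetched this way).
        if not accession or accession in seen or accession[:3] not in RUN_PREFIXES:
            continue
        seen.add(accession)
        commands.append(["prefetch", accession])
        commands.append(["fasterq-dump", accession, "--outdir", outdir])
    return commands


# Hypothetical export with made-up identifiers.
export = """sample_id,library_id
SAMN00000001,SRR0000001
SAMN00000002,SRR0000002
mgs000001,mgm0000001.3
SAMN00000001,SRR0000001
"""

for command in build_sra_commands(export):
    print(shlex.join(command))
```

The sketch only prints the commands. With the SRA Toolkit installed, each list could be passed to `subprocess.run(command, check=True)` to start the actual download.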
Usage example
Researchers interested in differences between metagenomes of estuarine biomes from different countries may use the MarineMetagenomeDB to find the samples needed to answer this question. On the ‘Quick search’ tab, under ‘Quick filters’, the user can search for “estuarine” under ‘MarMDB Biome’, resulting in a list of 266 samples. The user can then use the ‘More filters’ tab to select countries of interest under the ‘Sample Location Country’ filter. Selecting, for example, ‘United States of America’ and ‘Australia’ decreases the number of samples to 66. At this stage, the user can click ‘Visualize’ to explore the selection. A simple exploration of the metadata shows that the “water type” from which the samples were collected was recorded as “brackish water”, “saline water”, “sea water” or “NA”. Finally, the user can download the selected metadata as a CSV file for further analysis and use our provided tool to download the raw sequence data of the selected samples.
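The same selection can be reproduced offline on a downloaded metadata file. The following Python sketch applies the two filters from this example to a small made-up CSV excerpt and lists the water types present in the selection. The rows are invented, and the column name `sample_location_country` is an assumption; the actual column names should be taken from the header of the exported file.

```python
import csv
import io

# Hypothetical excerpt of an export; real files contain many more columns.
export = """sample_id,MarMDB_biome,sample_location_country,MarMDB_water_type
A1,estuarine,United States of America,brackish water
A2,estuarine,Australia,saline water
A3,estuarine,Brazil,sea water
A4,ocean,Australia,sea water
A5,estuarine,Australia,NA
"""

countries = {"United States of America", "Australia"}

rows = list(csv.DictReader(io.StringIO(export)))

# Step 1: keep samples annotated with the estuarine biome.
estuarine = [row for row in rows if row["MarMDB_biome"] == "estuarine"]

# Step 2: restrict the estuarine samples to the countries of interest.
selected = [row for row in estuarine if row["sample_location_country"] in countries]

# Step 3: summarize the water types recorded for the selection.
water_types = sorted({row["MarMDB_water_type"] for row in selected})

print(len(estuarine), len(selected), water_types)
```

To work on a real export, `io.StringIO(export)` would be replaced by a file opened with `open(path, newline="")`.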
Database update plan
As the number of metagenomic experiments submitted to public repositories (e.g., SRA) is continuously rising, we plan to update the database with newly submitted samples twice a year, in February and September. Moreover, new features may be added or existing features modified at any time if justified. The server that hosts the website will be supported continuously. Requests or suggestions can be submitted to the database administrator through the contact tab of the website.
Suggestions for good practices
The most significant goal of this work was to provide unifying ontologies to facilitate meta-analyses. To this end, we included a guide to help the scientific community better annotate their metadata when submitting novel metagenome samples to public repositories. The suggested ontologies can be found under Point 7 in the ‘Help’ tab of the MarineMetagenomeDB website, under the title ‘What should I do to include my metagenomes in MarineMetagenomeDB?’.