curatedFoodMetagenomicData (cFMD) is a resource that comprises curated metadata, taxonomic profiles, and reconstructed genomes from food (shotgun) metagenomes. The first version of cFMD consists of 2,533 metagenomes associated with 59 datasets: 45 datasets and 583 samples come from publicly available studies, and the remaining 14 datasets and 1,950 samples were produced within the EU H2020 MASTER project (https://www.master-h2020.eu/index.html).
From this GitHub repository you can access the following files (more details are provided in the section "Detailed description of data" below):
- cFMD_datasets: summary of the datasets included in the current release, with a reference to the publication (if available)
- cFMD_metadata: sample-level metadata, together with statistics on the reconstructed MAGs. The table has samples as row indices and metadata fields as column headers. The unique key for querying the database is the combination of dataset_name and sample_id. Food samples were classified according to their composition and production using three levels of detail (category, type and subtype). The fields include:
  - categorization of the samples,
  - accession codes to retrieve public metagenomes,
  - technical information (e.g. DNA extraction kit, sequencer, etc.),
  - basic statistics (number of reads, number of bases, number of MAGs, etc.).
- cFMD_metadata_rules: description of the syntactic rules that define the metadata fields of the "cFMD_metadata" file above
- cFMD_mags: the reconstructed MAGs in FASTA format (hosted externally due to their large size)
- cFMD_mags_list: list of the reconstructed MAGs, with information on:
  - sample origin,
  - assigned taxonomy at the species-level genome bin (SGB) level,
  - known/unknown status of the SGB,
  - basic statistics (number of contigs, N50, completeness, contamination, etc.).
- cFMD_sgbs_prokaryotic: for each prokaryotic food SGB (i.e., having at least one MAG reconstructed from food), information on:
  - taxonomy and known/unknown status of the SGB,
  - level of the assigned taxonomy,
  - SGB statistics (number of included MAGs, number of included reference genomes, etc.).
- cFMD_sgbs_eukaryotic: same as the file "cFMD_sgbs_prokaryotic", but for eukaryotic SGBs.
- cFMD_taxonomic_profiles: taxonomic profiles with samples as row indices and basic metadata as column headers; values are expressed as relative abundances (%).
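Since the metadata table is keyed by dataset_name plus sample_id, a lookup by that pair can be sketched as follows. This is a minimal example and not part of the cFMD tooling: it assumes the table is tab-separated text with a header row (the actual file format is not stated here), the helper `index_by_sample` is hypothetical, and the two rows are invented for illustration.

```python
import csv
import io


def index_by_sample(handle, delimiter="\t"):
    """Index the rows of a table by the (dataset_name, sample_id) pair.

    Assumes a delimited text file with a header row containing the
    columns dataset_name and sample_id.
    """
    index = {}
    for row in csv.DictReader(handle, delimiter=delimiter):
        key = (row["dataset_name"], row["sample_id"])
        if key in index:
            # The pair is described as a unique key, so a repeat is an error.
            raise ValueError(f"duplicate key: {key}")
        index[key] = row
    return index


# Invented two-row table standing in for the real metadata file.
example = io.StringIO(
    "dataset_name\tsample_id\tcategory\n"
    "ExampleA_2020\tS1\tdairy\n"
    "ExampleA_2020\tS2\tdairy\n"
)
samples = index_by_sample(example)
print(samples[("ExampleA_2020", "S1")]["category"])
```

To use it on a real file, pass an open file handle instead of the `io.StringIO` object and adjust `delimiter` if the table is not tab-separated.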
More details on the fields of some of the files listed above:
- cFMD_metadata (unique key = dataset_name + sample_id)
  - dataset_name: name of the dataset. It is formed as i) “first author surname + initial letter of first author name(s) + _ + year of publication” for public datasets; ii) “first author surname + initial letter of first author name(s) + _ + xxxx” for datasets that are not yet public (these also include MASTER partner datasets); iii) “MASTER + WPn + sampling partner + increasing number” for datasets produced within MASTER
  - sample_id: name of the sample
  - macrocategory: highest-level description of the sample type (food, controls, food processing, environment, or animal)
  - category: second-highest-level description of the sample type
  - type: third-highest-level description of the sample type
  - subtype: lowest-level description of the sample type (can be blank if not necessary/available)
  - commercial_name: name of the commercialized product
  - fermented/non-fermented: whether the sample is fermented, used to group samples across and within categories
  - country: country of origin of the sample, as defined by the ISO3 international convention
  - sample_accession: identifier of the sample, if present in public databases
  - run_accession: identifier of the sequencing run, if present in public databases
  - experiment_accession: identifier of the experiment, if present in public databases
  - study_accession: identifier of the study, if present in public databases
  - project_accession: identifier of the project, if present in public databases
  - database_origin: name of the public database from which the reads of the sample were downloaded
  - library_layout: layout of the sequencing library (e.g. paired, single)
  - sequencing_platform: sequencer used to read the DNA bases
  - DNA_extraction_kit: extraction kit used to isolate DNA from the sample
  - collection_date: day (DD/MM/YYYY), month (MM-YYYY), or year (YYYY) of sample collection
  - n_of_bases: # of nucleotides in the reads of the sample after pre-processing
  - n_of_reads: # of reads of the sample after pre-processing
  - min_read_len: minimum read length (in bases) among the reads of the sample
  - median_read_len: median read length (in bases) among the reads of the sample
  - mean_read_len: mean read length (in bases) among the reads of the sample
  - max_read_len: maximum read length (in bases) among the reads of the sample
  - n_contigs: # of contigs with length > 1000 bp assembled from the reads of the sample
  - n_MAGs_MQ_prok: # of prokaryotic MAGs with 50% <= completeness < 90% and contamination < 5% according to CheckM
  - n_MAGs_HQ_prok: # of prokaryotic MAGs with completeness >= 90% and contamination < 5% according to CheckM
  - n_MAGs_MQ_euk: # of eukaryotic MAGs with 50% <= completeness < 90% and contamination < 5% according to BUSCO
  - n_MAGs_HQ_euk: # of eukaryotic MAGs with completeness >= 90% and contamination < 5% according to BUSCO
  - filtered: flags food samples with fewer than 1e08 bases, which were excluded from subsequent analyses
  - curator: name of the curator
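The medium-quality (MQ) and high-quality (HQ) thresholds behind the n_MAGs_* columns can be written as a small function. This is an illustrative sketch of the thresholds as listed above, not the code used to build cFMD; the function name `quality_tier` and the example values are invented, and completeness and contamination are assumed to be percentages as reported by CheckM or BUSCO.

```python
from collections import Counter


def quality_tier(completeness, contamination):
    """Return "HQ", "MQ", or None for one MAG.

    HQ: completeness >= 90 and contamination < 5
    MQ: 50 <= completeness < 90 and contamination < 5
    Anything else falls outside both tiers.
    """
    if contamination >= 5:
        return None
    if completeness >= 90:
        return "HQ"
    if completeness >= 50:
        return "MQ"
    return None


# Invented (completeness, contamination) pairs for the MAGs of one sample.
mags = [(97.2, 0.8), (63.5, 2.1), (91.0, 6.3), (42.0, 0.5)]
tiers = Counter(quality_tier(comp, cont) for comp, cont in mags)
print("HQ:", tiers["HQ"], "MQ:", tiers["MQ"])
```

Counting the tiers per sample, as in the last lines, mirrors how the n_MAGs_MQ_* and n_MAGs_HQ_* columns are defined.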
- cFMD_mags_list (unique key = mag)
  - MAG_id: name of the MAG, formed as “${dataset_name}_${sample_id}_bin.${bin_number}”
  - dataset_id: name of the dataset from which the MAG has been reconstructed
  - sample_id: name of the sample from which the MAG has been reconstructed
  - SGB_id: identification number, in MetaRefSGB, of the SGB to which the MAG has been assigned
  - unknown: can take three values: kSGB (short for knownSGB, i.e. a cluster containing at least one isolate genome), uSGB (unknownSGB, a cluster containing only reconstructed genomes), or ufSGB (unknownfoodSGB, a cluster containing only genomes reconstructed from food samples and hence newly introduced)
  - assigned_taxonomy_level: species if the SGB contains at least one reference genome, otherwise the lowest assignable taxonomic rank
  - superkingdom: superkingdom of the assigned taxonomy
  - phylum: phylum of the assigned taxonomy
  - class: class of the assigned taxonomy
  - family: family of the assigned taxonomy
  - genus: genus of the assigned taxonomy
  - species: species of the assigned taxonomy
  - genome_size: # of nucleotides (including unknowns specified by N's) in the genome (CheckM)
  - n_contigs: # of contigs within the genome, as determined by splitting scaffolds at any position consisting of more than 10 consecutive ambiguous bases (CheckM)
  - N50: N50 statistic as calculated over all contigs (CheckM)
  - completeness: estimated completeness of the genome (percentage), as determined from the presence/absence of marker genes and the expected colocalization of these genes (CheckM)
  - contamination: estimated contamination of the genome (percentage), as determined by the presence of multi-copy marker genes and the expected colocalization of these genes (CheckM)
  - GC_content: percentage of G+C nucleotides with respect to genome length
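The MAG_id pattern can be built, and partly reversed, with plain string operations. The sketch below is hedged: `make_mag_id` and `split_mag_id` are hypothetical helpers, the identifiers are invented, and the bin number is kept as a string because its exact form is not specified here. Because both dataset names and sample names may contain underscores, the split relies on a list of known dataset names rather than on the position of the underscores.

```python
def make_mag_id(dataset_name, sample_id, bin_number):
    """Build an identifier of the form <dataset_name>_<sample_id>_bin.<bin_number>."""
    return f"{dataset_name}_{sample_id}_bin.{bin_number}"


def split_mag_id(mag_id, dataset_names):
    """Split a MAG identifier into (dataset_name, sample_id, bin_number).

    dataset_names is the collection of known dataset names; the longest
    name that prefixes the identifier is taken as the dataset.
    """
    prefix, separator, bin_number = mag_id.rpartition("_bin.")
    if not separator:
        raise ValueError(f"no '_bin.' suffix in {mag_id!r}")
    for name in sorted(dataset_names, key=len, reverse=True):
        if prefix.startswith(name + "_"):
            return name, prefix[len(name) + 1:], bin_number
    raise ValueError(f"no known dataset name matches {mag_id!r}")


# Invented identifiers for illustration.
mag_id = make_mag_id("ExampleA_2020", "S_1", 3)
parts = split_mag_id(mag_id, {"ExampleA_2020", "ExampleA"})
print(mag_id, parts)
```

In practice the dataset_id and sample_id columns of cFMD_mags_list already provide these components, so splitting is only needed when the identifier is all that is available, for example a FASTA file name.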
- cFMD_sgbs_prokaryotic and cFMD_sgbs_eukaryotic (unique key = sgb_id)
  - sgb_id: identification number of the SGB in MetaRefSGB
  - Unknown: can take three values: kSGB (short for knownSGB, i.e. a cluster containing at least one isolate genome), uSGB (unknownSGB, a cluster containing only reconstructed genomes), or ufSGB (unknownfoodSGB, a cluster containing only genomes reconstructed from food samples and hence newly introduced)
  - Level of assigned taxonomy: species if the SGB contains at least one reference genome, otherwise the lowest assignable taxonomic rank
  - Assigned taxonomy: taxonomy assigned to the bin according to the prevalent taxonomy of the reference genomes inside it. Levels are separated by a pipe character “|”
  - superkingdom: superkingdom of the assigned taxonomy
  - phylum: phylum of the assigned taxonomy
  - class: class of the assigned taxonomy
  - family: family of the assigned taxonomy
  - genus: genus of the assigned taxonomy
  - species: species of the assigned taxonomy
  - MAGs: # of reconstructed genomes contained in the SGB
  - isolates: # of reference genomes in the bin
  - MAGs_filtered: # of reconstructed genomes that would be assigned to the SGB but were discarded by MetaRefSGB for being too similar to another included MAG
  - Food: # of MAGs in the bin retrieved from food samples
  - Human: # of MAGs in the bin retrieved from human samples
  - Animal: # of MAGs in the bin retrieved from animal samples
  - Other_categories: # of MAGs in the bin retrieved from samples of various origin (soil, environmental, etc.)
  - NA: # of MAGs in the bin for which metadata about the original samples are not available
  - The number of MAGs for each food category is also reported
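The kSGB/uSGB/ufSGB definitions and the pipe-separated taxonomy string can be illustrated with two small helpers. This is only a sketch under assumptions: `sgb_status` and `split_taxonomy` are hypothetical names, the counts and the taxonomy string are invented, and MAGs of unknown origin (NA) are treated as non-food. The Unknown column in the files remains the authoritative status; the function merely restates the definitions in terms of the count columns.

```python
def sgb_status(isolates, food, human=0, animal=0, other=0, na=0):
    """Restate the SGB status definitions using the per-origin counts.

    kSGB:  at least one isolate (reference) genome in the cluster
    ufSGB: no isolates, and every MAG comes from food samples
    uSGB:  no isolates, with MAGs from non-food or unknown origins
    """
    if isolates > 0:
        return "kSGB"
    if food > 0 and human + animal + other + na == 0:
        return "ufSGB"
    return "uSGB"


def split_taxonomy(assigned_taxonomy):
    """Split a pipe-separated taxonomy string into its levels."""
    return assigned_taxonomy.split("|")


# Invented values for illustration.
print(sgb_status(isolates=2, food=5))
print(sgb_status(isolates=0, food=4))
print(sgb_status(isolates=0, food=4, human=1))
print(split_taxonomy("Bacteria|Firmicutes|Bacilli"))
```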
The data provided here were mainly generated with the following tools:
- Pre-processing of raw reads: validated pipeline available here
- Reconstruction and taxonomic assignment of MAGs: assembly-based pipeline available here
- Taxonomic profiling: MetaPhlAn4-based pipeline, with full tutorial available here
- Strain-level profiling: StrainPhlAn-based pipeline, with full tutorial available here
Further information and requests should be directed to Niccolò Carlino ([email protected]), Nicola Segata ([email protected]), and Edoardo Pasolli ([email protected]).
Carlino et al., "Analysis of 2,500 food metagenomes reveals unexplored microbial diversity and links with the human microbiome", under review.