Collins, R.A., Trauzzi, G., Maltby, K.M., Gibson, T.I., Ratcliffe, F.C., Hallam, J., Rainbird, S., Maclaine, J., Henderson, P.A., Sims, D.W., Mariani, S. & Genner, M.J. (2021). Meta-Fish-Lib: A generalised, dynamic DNA reference library pipeline for metabarcoding of fishes. Journal of Fish Biology. https://doi.org/10.1111/jfb.14852.
This repository hosts a comprehensive multi-locus mitochondrial DNA reference library dataset for fish species of the United Kingdom (UK), derived from the NCBI GenBank and Barcode of Life BOLD databases. The dataset includes both freshwater and marine species, and can be employed in a variety of applications. e.g. DNA barcoding of human food products using full COI barcodes, to metabarcoding of gut or environmental samples using fragments of 12S. The library will be updated with each new GenBank release. Both common and rare UK fish species are included. A species coverage report for all primer sets can be found at assets/reports-tables.md. This UK reference library is curated and ready-to-use, but the code provided here can easily generate a new reference library for a different location (see FAQ).
In addition to providing quality controlled and curated fish references, this reference library has several unique features that make it useful to the wider DNA barcoding and DNA metabarcoding communities:
- Flexible - the library is not limited to any particular metabarcode locus or primer set. I have included the most popular ones (Table 1), but new ones can be added as required.
- Comprehensive - seaching by single gene names can often miss critical results due to poorly annotated records, but using hidden Markov models it is simple to extract homologous DNA fragments from large dumps of sequence data.
- Exhaustive - searching by species names can exclude potential hits because of changes in taxonomy, but here we search for all species synonyms, and then subsequently validate those names to provide a taxonomically up-to-date reference library.
- Reliable - sequence data on GenBank are frequently misannotated with incorrect species names, but we have created a list of dubious quality accessions that are automatically excluded when the reference library is loaded each time. We use phylogenetic quality control methods to assist in screening each new GenBank version and update this list accordingly.
- Dynamic - it's easy to update to each new GenBank release (see code below), and the versioning of this repository reflects the GenBank release on which it was made.
- Quick - the final reference library can be downloaded onto your computer in just a few seconds with only two packages loaded and seven lines of R code (below). Generating this reference library from scratch (below) takes a couple of hours, with the phylogenetic quality control steps completing overnight.
- Customisable - by forking or cloning the repository, custom modifications can be made, e.g. excluding particular species, making taxonomic changes, or using a completely different list of species.
- Self contained - to recreate the reference libraries, all code and R package versions are found within in this self contained project, courtesy of renv. This means less risk of clashing installations or broken code when packages and R versions upgrade.
- Citable - DOIs are issued with each new GenBank release.
This README outlines the contents of the repository and a brief description of the workflow involved in creating/updating a metabarcoding reference library, as well instructions to simply access the current data immediately. If an error is apparent, raise a ticket in Issues or submit a pull request.
The work is part of the NERC funded SeaDNA Project, and should be cited using version appropriate DOIs that are found in Releases, or the Collins et al. (2021) Journal of Fish Biology article (https://doi.org/10.1111/jfb.14852) describing the software.
If you require simply the final reference library file for immediate use, it can be downloaded directly using the R code below, and converted into FASTA and CSV formats for any of the available primer sets in Table 1.
### START A FRESH R SESSION ###
# load packages (install if required)
library("tidyverse")
library("ape")
# load REMOTE references and scripts (requires internet connection)
source("https://raw.githubusercontent.com/genner-lab/meta-fish-lib/main/scripts/references-load-remote.R")
source("https://raw.githubusercontent.com/genner-lab/meta-fish-lib/main/scripts/references-clean.R")
source("https://raw.githubusercontent.com/legalLab/protocols-scripts/master/scripts/tab2fas.R")
# choose a metabarcode fragment (primer set) from the following:
print(c("coi.lerayxt","coi.ward","12s.miya","12s.riaz","12s.valentini","12s.taberlet","16s.berry","cytb.minamoto"))
# change 'frag' argument as appropriate:
reflib.sub <- subset_references(df=reflib.orig, frag="12s.miya")
# convert to fasta file
# uses the standard database id field ('dbid') as a label
# 'dbid' is the GenBank GI number, or the BOLD processID number
# custom labels can be created with 'mutate()' and 'paste()' using other fields and changing 'namecol' argument - see FAQ
reflib.fas <- tab2fas(df=reflib.sub, seqcol="nucleotides", namecol="dbid")
# write out fasta file
ape::write.FASTA(reflib.fas, file="references.fasta")
# write out corresponding csv table
# this requires readr v1.4 or higher; change to 'path="references.csv"' for older versions, or use write.csv() instead
readr::write_csv(reflib.sub, file="references.csv")
Particular attention should be paid to cleaning steps in scripts/references-clean.R
; sequences flagged as unreliable (using phylogenetic quality control) are listed in assets/exclusions.csv
and excluded, while sequences flagged by NCBI as "unverified" are also removed. Taxonomic changes are also made, automatically via validating names against FishBase, and also custom changes (Cottus perifretum relabelled as Cottus cottus, Atherina presbyter relabelled as Atherina boyeri, and Pungitius laevis relabelled as Pungitius pungitius. Where are changes are made, both the original GenBank names and the validated FishBase names are provided (see Table 2).
Table 1: Available primer sets
Study | Official name | Nickname | Locus |
---|---|---|---|
Miya et al. (2015) | MiFish U/E | 12s.miya | 12S |
Taberlet et al. (2018) | Tele02 | 12s.taberlet | 12S |
Valentini et al. (2016) | L1848/H1913 | 12s.valentini | 12S |
Riaz et al. (2011) | 12S-V5 | 12s.riaz | 12S |
Wangensteen et al. (2018) | Leray-XT | coi.leray | COI |
Ward et al. (2005) | FishF1/R1 | coi.ward | COI |
Berry et al. (2017) | Fish16sF/D | 16s.berry | 16S |
Kitano et al. (2007) | L2513/H2714 | 16s.kitano | 16S |
Minamoto et al. (2012) | L14912-CYB | cytb.minamoto | cytb |
You don't need to run this code below if you just want a copy of the reference library (run code above). This code below is if you want to update it yourself or want to modify and make a new library. I will endeavour to keep this repository up-to-date with GenBank, but if hasn't been updated, email me.
System requirements: R, git, hmmer, mafft and raxml-ng need to be installed on your system, and available on your $PATH. With the exception of raxml-ng, the programs can be installed from Ubuntu repositories using sudo apt install
. In case of difficulties, check the developer's website and update to newer versions if required. Unfortunately, these scripts are optimised for a Unix system, and I'm unable to offer any Windows support here (Windows is now able to run Ubuntu Linux ).
You will also require an API key from NCBI in order to access GenBank data at a decent rate. See info here for how to get a key: ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/
### admin - clone the repository and create temporary directories ###
git clone https://github.com/genner-lab/meta-fish-lib.git meta-fish-lib
cd meta-fish-lib
mkdir -p reports temp/fasta-temp
### create NCBI API key (this is ignored by git) ###
# substitute the 'my-ncbi-key' part for your actual key obtained from NCBI
echo 'ncbi.key <- "my-ncbi-key"' > assets/ncbi-key.R
### install required R packages using renv (only need to run this once) ###
Rscript -e "renv::restore()"
### check the genbank versions remotely (github), locally (your machine), and on genbank itself ###
# you can update if the remote or local versions are behind genbank
scripts/check-genbank.R
### replace the species table with your custom list ###
# change to file location on your machine
# this will overwrite the current table
cp ~/path/to/my-species-table.csv assets/species-table.csv
### search GenBank ###
# argument "-q" [2000] is the max search batch query length (in characters)
# the maximum search query length allowed by NCBI is 2500
# bigger queries will be faster, but more prone to server errors
# argument "-t" [4] is the number of processing threads to run in parallel
# more threads are faster, but more prone to errors by overloading the server with requests
# most laptops and desktops are hyperthreaded with two virtual CPUs (threads) for each physical CPU (cores)
# run "lscpu | grep -E '^Thread|^Core|^Socket|^CPU\('" to obtain info on available cores/threads
# make sure not to request more threads than are present on your system
# argument "-e" [true] is to run an exhaustive ("true") or simple search ("false")
# the simple search just uses the terms "mitochondrion,mitochondrial"
# the exhaustive search in addition uses "COI,CO1,cox1,cytb,cytochrome,subunit,COB,CYB,12S,16S,rRNA,ribosomal"
# the simple search will pick up 99% of mtDNA sequences, but may miss older sequences that were not well annotated
# the simple search is faster, less prone to error, and may help if you need to search for a lot of species
scripts/sequences-download.R -q 2000 -t 4 -e true
### assemble the reference library with hidden Markov models and obtain metadata ###
# argument "-t" [4] is the number of processing threads to run in parallel
# do not request more threads than metabarcode markers
# do not request more threads than are present on your system
# argument "-m" [all] is the metabarcode marker, here choosing "all" eight available metabarcodes
# if you don't require all metabarcodes, it's strongly recommended to specify only the one(s) you want
# to choose a specific metabarcode marker(s), use the codes in Table 1 and separate with a comma and no space
# e.g. "-m 12s.miya,coi.ward"
# note this script overwrites the local master reference library 'assets/reference-library-master.csv.gz'
scripts/references-assemble.R -t 4 -m all
### phylogenetic quality control (QC) ###
# argument "-t" [4] is the number of processing threads to run in parallel
# note that the threads argument doesn't influence the performance operation of raxml-ng or mafft
# these applications automatically determine the optimum number of threads for your system
# the argument only applies to preparing/plotting
# do not request more threads than metabarcode markers
# do not request more threads than are present on your system
# argument "-v" [false] is verbosity (information printed to screen) from the alignment and phylogenetic steps
# a value of "true" prints program info, a value of "false" prevents printing of info
# seeing the output on screen is useful if you run into problems and need to debug, otherwise it's not required
scripts/qc.R -t 4 -v false
# now manually review the phylogenetic tree PDFs output into 'reports/qc_GBVERSION_MONTH-YEAR'
# if found, add suspect accessions manually to 'assets/exclusions.csv'
### compile species coverage reports ###
make -f scripts/Makefile
### update GitHub repository (updates remote version - only works if you have forked the repository rather than cloned) ###
# change x to actual version
cp reports/reports-tables.md assets/reports-tables.md
git add assets/reference-library-master.csv.gz assets/exclusions.csv assets/reports-tables.md
git commit -m "updated master to genbank version x"
git push
git tag -a vX -m "GenBank vX"
git push --tags
# now make a release using this tag on GitHub
# if Zenodo has been set up correctly, a DOI should become available immediately
### confirm changes are made ###
scripts/check-genbank.R
### if all worked, you can clean up to save disk space ###
# be sure that you want to do this!
rm -r temp
### your reference library is now located at 'assets/reference-library-master.csv.gz' ###
# if you need it in fasta format, please use the R code below
# see FAQ for changing the label format
### TO FORMAT LOCAL REFERENCE LIBRARY AS FASTA ###
# load packages in new R session
library("here")
library("tidyverse")
library("ape")
# load LOCAL references and clean blacklisted sequences
source("https://raw.githubusercontent.com/genner-lab/meta-fish-lib/main/scripts/references-load-local.R")
source("https://raw.githubusercontent.com/genner-lab/meta-fish-lib/main/scripts/references-clean.R")
source("https://raw.githubusercontent.com/legalLab/protocols-scripts/master/scripts/tab2fas.R")
# subset marker - change 'frag' argument as appropriate:
reflib.sub <- subset_references(df=reflib.orig, frag="12s.miya")
# convert to fasta file and write out fasta file
reflib.fas <- tab2fas(df=reflib.sub, seqcol="nucleotides", namecol="dbid")
ape::write.FASTA(reflib.fas, file="references.fasta")
- How do I cite the reference library? - Zenodo DOIs for each version are in see Releases. An important note: the reference library and code presented here supercedes a previous iteration hosted at github.com/boopsboops/reference-libraries. The new version here starts at v241, but I have archived only the final reference library file (
assets/reference-library-master.csv.gz
) for the previous versions here also. Therefore, while the library files are here, the old code used to generate these libraries prior to v241 are not archived together with that library version here. You can also cite the Collins et al. (2021) Journal of Fish Biology article (https://doi.org/10.1111/jfb.14852) describing the software. - Can I make a reference library for fishes of my country/region? - Yes, very easily. Just change the list of species in
assets/species-table.csv
. You can provide this list yourself, but make sure the format of the table is the same. If not interested in synonyms, you use the same species name for 'speciesName' and 'validName' and set 'status' set to "accepted name". The 'commonSpecies' field can be all set to TRUE if that is not of interest either, and the other information can be obtained from FishBase ('fbSpecCode' is the FishBase species code). Alternatively, follow the tutorial here to generate an annotated species/synonyms list using the rfishbase package from scratch. - What if I don't know which species I need? - This is a common problem in diverse tropical regions where there is often poorly resolved taxonomy and lots of undescribed species. Here you will want to search for genera instead of species. Copy this format below for the
assets/species-table.csv
table:
speciesName | status | fbSpecCode | validName | class | order | family | genus | commonName | commonSpecies |
---|---|---|---|---|---|---|---|---|---|
Anguilla | accepted name | NA | Anguilla | Actinopterygii | Anguilliformes | Anguillidae | Anguilla | NA | NA |
- Can I make a reference library for a taxonomic group that isn't fishes? - At the moment I use the rfishbase package to generate higher taxonomic ranks and validate scientific names etc, because this is the best source of data for fishes. However, more general solutions could easily be employed using the taxize package or taxadb package, with minimal changes to the code. One thing to bear in mind is that the hidden Markov models were designed on fishes, and while these would probably work well for other vertebrates, they would need to be recreated for other groups (see below).
- How do I get synonyms? - I used the
synonyms()
function in the rfishbase package, but the taxize package or taxadb package would achieve similar results for other groups. - How do I add a new metabarcode? - First I downloaded the fish mitochondrial genomes and annotations from Prof. Masaki Miya's MitoFish website at mitofish.aori.u-tokyo.ac.jp/, and extracted the genes of interest and aligned them with mafft. Then I searched for the primer sequences and cut out the fragments of interest using Geneious and exported as fasta. Then I ran the hmmer function
hmmbuild
to create the hidden Markov models. Unfortunately, I did not include the code to perform these steps as it is not really general enough to be useful (requires manual actions and checking). Please contact me if you need specific help with these. - What if I want more than fishes? - Indeed, for many metabarcoding applications you would want to identify 'off-target' reads, so a wider reference library is required as a supplement to the one presented here. I use the NCBI RefSeq mitochondrial DNA database (ftp://ftp.ncbi.nlm.nih.gov/refseq/release/mitochondrion), which should have a sufficiently broad coverage to roughly classify most eukaryote mtDNA.
- How does the exclusions blacklist file work? - The file
assets/exclusions.csv
is permanently available on this GitHub repository, and contains all the accessions that have been flagged by me as potentially erroneous, as part of the work on the UK fish reference library. New records are added manually each time a new GenBank version becomes available and the quality control steps are performed. When thescripts/references-clean.R
script is run, the exclusion file is called and these blacklisted accessions are removed from the library. The user does not need to regenerate or interact with this exclusions file if they are simply wanting to use the UK reference library as provided. If the user wishes to create their own custom reference library then they have the option of tailoring the contents of this exclusions file to their own requirements by keeping, deleting, or adding accessions to it. - Is the reference library guaranteed error free? - LOL, no! I have tried to curate a reliable reference library as best as I can. However, the phylogenetic quality control step is tedious and subjective and takes a lot of effort. Here, phylogenetic QC trees for each primer set need to be manually checked. To help with this tips are coloured by monophyly and haplotype sharing to visually assist identifying dubious accessions. This is a much easier task for loci where taxa are well differentiated and large numbers of sequences exist (such as for COI). It is not easy for ribosomal fragments with fewer informative nucleotides and fewer sequences. The choice of which accessions to blacklist in
assets/exclusions.csv
has been entirely at my discretion thus far. However, I hope I have caught the majority of the most egregious examples. As a rule of thumb, an accession is blacklisted if: (a) individual(s) of species x are identical to or nested within a cluster of sequences of species y, but with other individuals of species x forming an independent cluster; and (b) the putatively spurious sequences coming from a single study, while the putatively correct sequences of species x and y coming from multiple studies. It is important to note that this is far from foolproof, and many species will naturally be non-monophyletic and/or share haplotypes with other species. I tried to be conservative, and not remove too many sequences if there was doubt, and especially so for taxa that I am not familar with. Mistakes certainly remain, so I recommend running the QC step to check yourself (or email me for the trees). NCBI blast is also useful for checking for matches against species not in the reference library. - The search step takes too long, hangs, or errors! - The GenBank search step relies on NCBI servers, and if they are overloaded then the searches can fail. I suggest: (i) reducing the number of threads ("-t" option in the
scripts/sequences-download.R
step) to lower the frequency of requests; (ii) running searches at USA off-peak times; (iii) disabling the exhaustive search option (use "-e false" instead of "-e true" in thescripts/sequences-download.R
step), which searches for fewer search terms, will take less time and consume less RAM and disk space, but will give you 99% of the sequences; (iv) making each concatenated search string length shorter with the "-q" option; and (v) requesting only the metabarcodes that you are interested in (use the "-m" option in thescripts/references-assemble.R
step). - The phylogenetic quality control steps takes too long! - Making ML trees for many taxa can take a very long time. Here, the largest one (Ward COI) is over 9,000 haplotypes, and runs overnight. If your dataset is too big, I suggest: (i) skipping this step if you aren't sure you need it; (ii) requesting only the metabarcodes that you are interested in (use the "-m" option in the
scripts/references-assemble.R
step); or (iii) maybe break up the species input list into smaller chunks and merge the tables later. - Why not use sativa for automated quality control? - Good question. Software such as sativa is available to automate the process, and while I may investigate this option in the future, for the meantime I think it is always a good idea to eyeball and become familiar with your data and develop an informed judgement.
- Why are the sequence labels in the
references.fasta
file just numbers? - When you download the reference library as shown above, thereferences.fasta
file will use the 'dbid' column which is the database identification numbers. For NCBI these are 'GI' numbers (GenInfo Identifiers); these are equivalent to NCBI accession numbers, and will resolve accordingly on NCBI services; for BOLD, these are the 'processid' numbers. As there are many possible formats required for various taxonomy assignment software, I am unable to know which ones you will require, and have therefore chosen a sensible default. To make your own custom labels, just use the dplyrmutate()
andpaste()
functions to join columns in the table to make a new label column. Below I make a label of format 'dbid_family_genus_species'. Table 2 explains the fields in the reference library table.
reflib.label <- reflib.sub %>%
mutate(label=paste(dbid,family,str_replace_all(sciNameValid," ","_"),sep="_"))
reflib.fas <- tab2fas(df=reflib.label,seqcol="nucleotides", namecol="label")
Table 2: Key to reference library table fields
Field (column name) | Description |
---|---|
source | source of record (GenBank or BOLD) |
dbid | GenBank or BOLD database ID (GI, processid) |
gbAccession | GenBank accession |
sciNameValid | FishBase validated scientific name |
subphylum | taxonomic subphylum |
class | FishBase taxonomic class |
order | FishBase taxonomic order |
family | FishBase taxonomic family |
genus | FishBase taxonomic genus |
sciNameOrig | original scientific name from GenBank/BOLD |
fbSpecCode | FishBase species code |
country | sample voucher collection country |
catalogNumber | sample voucher catalogue number |
institutionCode | sample voucher institution code |
decimalLatitude | sample voucher collection latitude (decimal degrees) |
decimalLongitude | sample voucher collection longitude (decimal degrees) |
publishedAs | publication title |
publishedIn | publication journal |
publishedBy | publication lead author |
date | date of sequence publication |
notesGenBank | title of GenBank record |
genbankVersion | version of GenBank used to generate this reference library |
searchDate | date of reference library search |
length | number of nucleotides in full record |
nucleotides | nucleotides for full record |
nucleotidesFrag.GENE.FRAGMENT.noprimers | nucleotides for gene fragment primer subset |
lengthFrag.GENE.FRAGMENT.noprimers | number nucleotides in gene fragment primer subset |
assets/
- Required file and reference library.hmms/
- Hidden Markov models (HMMs) of gene markers of interest.exclusions.csv
- unreliable accessions to be excludedlogo.svg
- project logoreference-library-master.csv.gz
- master copy of the reference libraryreports-tables.md
- species coverage reportsspecies-list-synonyms.md
- tutorial and R code to generate species lists and synonymsspecies-table.csv
- list of species to search forspecies-table-testing.csv
- a test list of goby species
renv/
- Settings for the R environment.reports/
- Location of QC reports. Temporary directory that is not committed to the repository, but needs to be created locally to run the scripts. Ignored by git.scripts/
- R and shell scripts.check-genbank.R
- script to get genbank versionsload-libs.R
- script to load all required packages and functionsMakefile
- makefile to generate the species coverage reportsqc.R
- quality control a reference libraryreferences-assemble.R
- extract and annotate reference libraries from ncbi/bold dumpsreferences-clean.R
- quality filter reference libraryreferences-load-local.R
- load reference library locallyreferences-load-remote.R
- load reference library remotelyreports-tables.Rmd
- knitr file to prepare species coverage reportssequences-download.R
- pulls all mitochondrial DNA from NCBI and BOLD for a list of species
temp/
- Temporary file directory that is not committed to the repository, but needs to be created locally to run the scripts. Ignored by git.