oschwengers / referenceseeker Goto Github PK

View Code? Open in Web Editor NEW

88.0 6.0 5.0 20.15 MB

Rapid determination of appropriate reference genomes.

Home Page: https://doi.org/10.21105/joss.01994

License: GNU General Public License v3.0

Nextflow 10.33% Shell 6.76% Python 77.26% TeX 5.64%

bioinformatics reference-genomes mash refseq wgs microbiology ani

referenceseeker's Introduction

ReferenceSeeker: rapid determination of appropriate reference genomes

Description
Input & Output
Installation
- BioConda
- GitHub
Usage
Examples
Databases
- RefSeq
- Custom
Dependencies
Citation

Description

ReferenceSeeker determines closely related reference genomes following a scalable hierarchical approach combining an fast kmer profile-based database lookup of candidate reference genomes and subsequent computation of specific average nucleotide identity (ANI) values for the rapid determination of suitable reference genomes.

ReferenceSeeker computes kmer-based genome distances between a query genome and potential reference genome candidates via Mash (Ondov et al. 2016). For resulting candidates ReferenceSeeker subsequently computes (bidirectional) ANI values picking genomes meeting community standard thresholds by default (ANI >= 95 % & conserved DNA >= 69 %) (Goris, Konstantinos et al. 2007) ranked by the product of ANI and conserved DNA values to take into account both genome coverage and identity.

Custom databases can be built with local genomes. For further convenience, we provide pre-built databases with sequences from RefSeq (https://www.ncbi.nlm.nih.gov/refseq), GTDB and PLSDB copmrising the following taxa:

bacteria
archaea
fungi
protozoa
viruses

as well as plasmids.

The reasoning for subsequent calculations of both ANI and conserved DNA values is that Mash distance values correlate well with ANI values for closely related genomes, however the same is not true for conserved DNA values. A kmer fingerprint-based comparison alone cannot distinguish if a kmer is missing due to a SNP, for instance or a lack of the kmer-comprising subsequence. As DNA conservation (next to DNA identity) is very important for many kinds of analyses, e.g. reference based SNP detections, ranking potential reference genomes based on a mash distance alone is often not sufficient in order to select the most appropriate reference genomes. If desired, ANI and conserved DNA values can be computed bidirectionally.

Input & Output

Input

Path to a taxon database and a draft or finished genome in (zipped) fasta format:

$ referenceseeker ~/bacteria GCF_000013425.1.fna

Output

Tab separated lines to STDOUT comprising the following columns:

Unidirectionally (query -> references):

RefSeq Assembly ID
Mash Distance
ANI
Conserved DNA
NCBI Taxonomy ID
Assembly Status
Organism

#ID    Mash Distance    ANI    Con. DNA    Taxonomy ID    Assembly Status    Organism
GCF_000013425.1    0.00000    100.00    100.00    93061    complete    Staphylococcus aureus subsp. aureus NCTC 8325
GCF_001900185.1    0.00002    100.00    99.89     46170    complete    Staphylococcus aureus subsp. aureus HG001
GCF_900475245.1    0.00004    100.00    99.57     93061    complete    Staphylococcus aureus subsp. aureus NCTC 8325 NCTC8325
GCF_001018725.2    0.00016    100.00    99.28     1280     complete    Staphylococcus aureus FDAARGOS_10
GCF_003595465.1    0.00185    99.86     96.81     1280     complete    Staphylococcus aureus USA300-SUR6
GCF_003595385.1    0.00180    99.87     96.80     1280     complete    Staphylococcus aureus USA300-SUR2
GCF_003595365.1    0.00180    99.87     96.80     1280     complete    Staphylococcus aureus USA300-SUR1
GCF_001956815.1    0.00180    99.87     96.80     46170    complete    Staphylococcus aureus subsp. aureus USA300_SUR1
...

Bidirectionally (query -> references [QR] & references -> query [RQ]):

RefSeq Assembly ID
Mash Distance
QR ANI
QR Conserved DNA
RQ ANI
RQ Conserved DNA
NCBI Taxonomy ID
Assembly Status
Organism

#ID    Mash Distance    QR ANI    QR Con. DNA    RQ ANI    RQ Con. DNA    Taxonomy ID    Assembly Status    Organism
GCF_000013425.1    0.00000    100.00    100.00    100.00    100.00    93061    complete    Staphylococcus aureus subsp. aureus NCTC 8325
GCF_001900185.1    0.00002    100.00    99.89     100.00    99.89     46170    complete    Staphylococcus aureus subsp. aureus HG001
GCF_900475245.1    0.00004    100.00    99.57     99.99     99.67     93061    complete    Staphylococcus aureus subsp. aureus NCTC 8325 NCTC8325
GCF_001018725.2    0.00016    100.00    99.28     99.95     98.88     1280     complete    Staphylococcus aureus FDAARGOS_10
GCF_001018915.2    0.00056    99.99     96.35     99.98     99.55     1280     complete    Staphylococcus aureus NRS133
GCF_001019415.2    0.00081    99.99     94.47     99.98     99.36     1280     complete    Staphylococcus aureus NRS146
GCF_001018735.2    0.00096    100.00    94.76     99.98     98.58     1280     complete    Staphylococcus aureus NRS137
GCF_003354885.1    0.00103    99.93     96.63     99.93     96.66     1280     complete    Staphylococcus aureus 164
...

Installation

ReferenceSeeker can be installed via Conda and Git(Hub). In either case, a taxon database must be downloaded which we provide for download at Zenodo: For more information have a look at Databases.

BioConda

The preferred way to install and run ReferenceSeeker is Conda using the Bioconda channel:

$ conda install -c bioconda referenceseeker
$ referenceseeker --help

GitHub

Alternatively, you can use this raw GitHub repository:

install necessary Python dependencies (if necessary)
clone the latest version of the repository
install necessary 3rd party executables (Mash, MUMmer4)

$ pip3 install --user biopython xopen
$ git clone https://github.com/oschwengers/referenceseeker.git
$ # install Mash & MUMmer
$ ./referenceseeker/bin/referenceseeker --help

Test

To test your installation we prepared a tiny mock database comprising 4 Salmonella spp genomes and a query assembly (SRA: SRR498276) in the tests directory:

$ git clone https://github.com/oschwengers/referenceseeker.git

  # GitHub installation
$ ./referenceseeker/bin/referenceseeker referenceseeker/test/db referenceseeker/test/data/Salmonella_enterica_CFSAN000189.fasta

  # BioConda installation
$ referenceseeker referenceseeker/test/db referenceseeker/test/data/Salmonella_enterica_CFSAN000189.fasta

Expected output:

#ID    Mash Distance    ANI    Con. DNA    Taxonomy ID    Assembly Status    Organism
GCF_000439415.1    0.00003    100.00    99.55    1173427    complete    Salmonella enterica subsp. enterica serovar Bareilly str. CFSAN000189
GCF_900205275.1    0.01522    98.61     83.13    90370      complete    Salmonella enterica subsp. enterica serovar Typhi

Usage

Usage:

usage: referenceseeker [--crg CRG] [--ani ANI] [--conserved-dna CONSERVED_DNA]
                       [--unfiltered] [--bidirectional] [--help] [--version]
                       [--verbose] [--threads THREADS]
                       <database> <genome>

Rapid determination of appropriate reference genomes.

positional arguments:
  <database>            ReferenceSeeker database path
  <genome>              target draft genome in fasta format

Filter options / thresholds:
  These options control the filtering and alignment workflow.

  --crg CRG, -r CRG     Max number of candidate reference genomes to pass kmer
                        prefilter (default = 100)
  --ani ANI, -a ANI     ANI threshold (default = 0.95)
  --conserved-dna CONSERVED_DNA, -c CONSERVED_DNA
                        Conserved DNA threshold (default = 0.69)
  --unfiltered, -u      Set kmer prefilter to extremely conservative values
                        and skip species level ANI cutoffs (ANI >= 0.95 and
                        conserved DNA >= 0.69
  --bidirectional, -b   Compute bidirectional ANI/conserved DNA values
                        (default = False)

Runtime & auxiliary options:
  --help, -h            Show this help message and exit
  --version, -V         show program's version number and exit
  --verbose, -v         Print verbose information
  --threads THREADS, -t THREADS
                        Number of used threads (default = number of available
                        CPU cores)

Examples

Installation:

$ conda install -c bioconda referenceseeker
$ wget https://zenodo.org/record/4415843/files/bacteria-refseq.tar.gz
$ tar -xzf bacteria-refseq.tar.gz
$ rm bacteria-refseq.tar.gz

Simple:

$ # referenceseeker <REFERENCE_SEEKER_DB> <GENOME>
$ referenceseeker bacteria-refseq/ genome.fasta

Expert: verbose output and increased output of candidate reference genomes using a defined number of threads:

$ # referenceseeker --crg 500 --verbose --threads 8 <REFERENCE_SEEKER_DB> <GENOME>
$ referenceseeker --crg 500 --verbose --threads 8 bacteria-refseq/ genome.fasta

Databases

ReferenceSeeker depends on databases comprising taxonomic genome informations as well as kmer hash profiles for each entry.

Pre-built

We provide pre-built databases based on public genome data hosted at Zenodo: :

RefSeq

release: 221 (2024-01-09)

Taxon	URL	# Genomes	Size
bacteria	https://zenodo.org/record/4415843/files/bacteria-refseq.tar.gz	50,226	59.6 Gb
archaea	https://zenodo.org/record/4415843/files/archaea-refseq.tar.gz	905	897 Mb
fungi	https://zenodo.org/record/4415843/files/fungi-refseq.tar.gz	557	5.9 Gb
protozoa	https://zenodo.org/record/4415843/files/protozoa-refseq.tar.gz	90	1.1 Gb
viruses	https://zenodo.org/record/4415843/files/viral-refseq.tar.gz	14,012	1 Mb

GTDB

release: v214 (2024-01-11)

Taxon	URL	# Genomes	Size
bacteria	n.a. due to storage quota resitrctions	80,789	82 Gb
archaea	https://zenodo.org/record/4415843/files/archaea-gtdb.tar.gz	4,416	2.8 Gb

Plasmids

In addition to the genome based databases, we provide the following plasmid databases based on RefSeq and PLSDB:

DB	URL	# Plasmids	Size
RefSeq	https://zenodo.org/record/4415843/files/plasmids-refseq.tar.gz	81,674	2.6 Gb
PLSDB	https://zenodo.org/record/4415843/files/plasmids-plsdb.tar.gz	59,882	2.3 Gb

Custom database

If above mentiond RefSeq based databases do not contain sufficiently-close related genomes or are just too large, ReferenceSeeker provides auxiliary commands in order to either create databases from scratch or to expand existing ones. Therefore, a second executable referenceseeker_db accepts init and import subcommands:

Usage:

usage: referenceseeker_db [--help] [--version] {init,import} ...

Rapid determination of appropriate reference genomes.

positional arguments:
  {init,import}  sub-command help
    init         Initialize a new database
    import       Add a new genome to database

Runtime & auxiliary options:
  --help, -h     Show this help message and exit
  --version, -V  show program's version number and exit

If a new database should be created, use referenceseeker_db init:

usage: referenceseeker_db init [-h] [--output OUTPUT] --db DB

optional arguments:
  -h, --help            show this help message and exit
  --output OUTPUT, -o OUTPUT
                        output directory (default = current working directory)
  --db DB, -d DB        Name of the new ReferenceSeeker database

This new database or an existing one can be used to import genomes in Fasta, GenBank or EMBL format:

usage: referenceseeker_db import [-h] --db DB --genome GENOME [--id ID]
                                 [--taxonomy TAXONOMY]
                                 [--status {complete,chromosome,scaffold,contig}]
                                 [--organism ORGANISM]

optional arguments:
  -h, --help            show this help message and exit
  --db DB, -d DB        ReferenceSeeker database path
  --genome GENOME, -g GENOME
                        Genome path [Fasta, GenBank, EMBL]
  --id ID, -i ID        Unique genome identifier (default sequence id of first
                        record)
  --taxonomy TAXONOMY, -t TAXONOMY
                        Taxonomy ID (default = 12908 [unclassified sequences])
  --status {complete,chromosome,scaffold,contig}, -s {complete,chromosome,scaffold,contig}
                        Assembly level (default = contig)
  --organism ORGANISM, -o ORGANISM
                        Organism name (default = "NA")

Example:

If ReferenceSeeker is properly installed, clone this repository and change into its parent directoriy.

$ git clone https://github.com/oschwengers/referenceseeker.git
$ cd referenceseeker
$ referenceseeker_db init --db test-db --output ./
$ referenceseeker_db import --db ./test-db --genome test/db/GCF_000439415.1.fna.gz --id GCF_000439415.1 --taxonomy 28901 --status complete --organism "Salmonella enterica subsp. enterica serovar Bareilly str. CFSAN000189"
$ referenceseeker_db import --db ./test-db --genome test/db/GCF_002211925.1.fna.gz --id GCF_002211925.1 --organism "Salmonella bongori str. SA19983605"
$ referenceseeker -v ./test-db ./test/data/Salmonella_enterica_CFSAN000189.fasta

Dependencies

ReferenceSeeker needs the following dependencies:

Python (3.8, 3.9), Biopython (>=1.78), xopen(>=1.1.0)
Mash (2.3) https://github.com/marbl/Mash
MUMmer (4.0.0-beta2) https://github.com/gmarcais/mummer

ReferenceSeeker has been tested against aforementioned versions.

Citation

Schwengers et al., (2020). ReferenceSeeker: rapid determination of appropriate reference genomes. Journal of Open Source Software, 5(46), 1994, https://doi.org/10.21105/joss.01994

referenceseeker's People

Contributors

Stargazers

Watchers

Forkers

ag-computational-bio standage jckeller motroy

referenceseeker's Issues

Feature request: ability to include private data

Consider two hypothetical scenarios.

a food-borne illness lab with numerous well-annotated Salmonella genomes beyond what's available in RefSeq
an infectious disease lab with numerous well-annotated Yersinia pestis strains not publicly available

It would be helpful to scientists in labs like these to be able to include private data along with RefSeq genomes, either during the build process or by adding them later.

Database update

Hi,

The pre-built referenceseeker databases have not been updated since a while, causing some new species not to be detected. Would you please consider an update? I assume one can do this with the referenceseeker_db init but if you detail this in the README for the RefSeq Bacteria for example, that would be great

Thank you!

GTDB last update

Hi Oliver,

Thank you for this nice tool. Have you found a way to share the latest GTDB update? I'll be highlyinterested to get the last version.

Thanks,

no conda install for Apple Silicon

Install on M1 macbook:

mamba create -c bioconda -n refseek referenceseeker

Error:

Could not solve for environment specs
The following packages are incompatible
└─ referenceseeker   is uninstallable because there are no viable options
   ├─ referenceseeker [1.2.0|1.3|...|1.6.4] would require
   │  └─ mummer4 >=4.0.0beta2 , which does not exist (perhaps a missing channel);
   └─ referenceseeker [1.7.1|1.7.3|1.8.0] would require
      └─ xopen >=1.1.0 , which does not exist (perhaps a missing channel).

My .condarc:

channels:
  - conda-forge
  - bioconda
  - pytorch
  - defaults
  - qiime2

Mamba info:

active environment : arc-challenge
    active env location : /Users/nick/miniconda3/envs/arc-challenge
            shell level : 2
       user config file : /Users/nick/.condarc
 populated config files : /Users/nick/.condarc
          conda version : 23.3.1
    conda-build version : not installed
         python version : 3.10.11.final.0
       virtual packages : __archspec=1=arm64
                          __osx=13.4=0
                          __unix=0=0
       base environment : /Users/nick/miniconda3  (writable)
      conda av data dir : /Users/nick/miniconda3/etc/conda
  conda av metadata url : None
           channel URLs : https://conda.anaconda.org/conda-forge/osx-arm64
                          https://conda.anaconda.org/conda-forge/noarch
                          https://conda.anaconda.org/bioconda/osx-arm64
                          https://conda.anaconda.org/bioconda/noarch
                          https://conda.anaconda.org/pytorch/osx-arm64
                          https://conda.anaconda.org/pytorch/noarch
                          https://repo.anaconda.com/pkgs/main/osx-arm64
                          https://repo.anaconda.com/pkgs/main/noarch
                          https://repo.anaconda.com/pkgs/r/osx-arm64
                          https://repo.anaconda.com/pkgs/r/noarch
                          https://conda.anaconda.org/qiime2/osx-arm64
                          https://conda.anaconda.org/qiime2/noarch
          package cache : /Users/nick/miniconda3/pkgs
                          /Users/nick/.conda/pkgs
       envs directories : /Users/nick/miniconda3/envs
                          /Users/nick/.conda/envs
               platform : osx-arm64
             user-agent : conda/23.3.1 requests/2.28.1 CPython/3.10.11 Darwin/22.5.0 OSX/13.4
                UID:GID : 501:20
             netrc file : None
           offline mode : False

It appears that you need to add arm-arch support for mummer4 & xopen.

[BUG] IndexError: list index out of range

When running referenceseeker 1.6.3 (installed with conda) with a custom database I received an "IndexError: list index out of range".

Traceback (most recent call last):
File "/home/user/miniconda3/envs/ref_seeker/bin/referenceseeker", line 10, in
sys.exit(main())
File "/home/user/miniconda3/envs/ref_seeker/lib/python3.9/site-packages/referenceseeker/referenceseeker.py", line 96, in main
ref_genomes = util.read_reference_genomes(config)
File "/home/user/miniconda3/envs/ref_seeker/lib/python3.9/site-packages/referenceseeker/util.py", line 24, in read_reference_genomes
'name': cols[3]

I imported my genomes into the database without using the optional --organism flag and received this error. After checking the custom db.tsv and the util.py, it looks like I might have needed something in the name column other than the default "". Re-ran the database import with the --organism flag and it appears to be working now.

ReferenceSeeker and Metagenome-Assembled Genomes

This is neither a feature request nor a bug, although it's closer to the former.

Thank you for developing such a useful tool. If I understood correctly, ReferenceSeeker can be used with MAGs. If so, do you have approximate guidelines as to what would be appropriate in terms of, e.g., contamination and completeness? Thanks very much.

Reduce DB sizes

Use compressed DB reference genomes to reduce overall DB size.

Did you want help to create bioconda package

Hello @oschwengers,

I currently use referenceseeker and I plan to create a referenceseeker bioconda package. For this, I create the medusa bioconda package.

I see in your recent commit you plan to create a bioconda package for referenceseeker did you want some help?

[BUG] I can't make a custom db with referenceseeker_db init

Describe the bug

I am trying to use this software that seems to do exactly what I want which is to compare my 500 MAGs against a custom database of 18000 genomes. I tried to build a database with a few of these genomes (only 10 of them). I tried different ways and renamed the fasta headers and/or filenames. To discard any problem with my data now I went back to the example data.
Maybe I am confused in the way this works or for some reason I couldn't make referenceseeker_db init work.

Thank you very much for your help

what exactly happened
I picked the genomes inside the test/db folder, uncompressed them and put only these 3 in a new folder called 'DATA'

ls DATA
GCF_000439415.1.fna GCF_002211925.1.fna GCF_900205275.1.fna

Now I try to build the database and it leaves empty db.msh and db.tsv files.

what exact command was executed: just copy-paste the command line
referenceseeker_db init --output ./DATA/ --db db
what installation of ReferenceSeeker did you use: BioConda, GitHub, Pip
BioConda
what database was used: public, custom, which taxon
custom
which version of ReferenceSeeker was used
referenceseeker 1.8.0

[BUG] Updating database script errors

Hello, while I was trying to build the database for bacteria, I noticed a couple of bugs in the scripts to automate building the database:

In build-db.nf, $REFERENCE_SEEKER_HOME needs to be escaped since it's a shell variable: \$REFERENCE_SEEKER_HOME. Also, the script assumes share is available in this directory, which is not detailed in the instructions.

In build-db.sh, nextflow run is missing. This also assumes that $REFERENCE_SEEKER_HOME contains the build-db.nf script, which isn't mentioned in instructions.

It worked after these tweaks, but it would be great if the documentation was more precise on how to use these!

Test suite or instructions

Great package, thanks for making this available!

Surely you have tested ReferenceSeeker extensively to make sure it performs as expected on numerous inputs. However, unless I'm mistaken, the documentation doesn't include any instructions for running an automated test suite or commands to execute to ensure the software is installed correctly.

Bundling entire bacterial/viral/fungal/etc. components of RefSeq with the software to support an automated test suite isn't feasible—it would require an incredible amount of storage for an otherwise lightweight package. However, one possible approach would be to construct a smaller mock database with a small number of entries that could be more easily bundled with ReferenceSeeker.

At the very least, beginners would benefit from 1 or 2 test examples that very clearly demonstrate:

how to download the input data
how to run the test
what the expected output should be

Continuous integration

Following up on #4, if ReferenceSeeker had an automated test suite, it would also be possible to go a step further and set up a continuous integration environment using a service like GitHub Actions.

While software is under active development, CI services can be configured to run a test suite and time updates are made to the code (such as from a push to master or a pull request from the community). The repo can also be configured so that any pull request submitted by the core developer(s) or the community cannot be merged until the tests pass.
If/when software is in maintenance, CI services can be configured to run a test suite on a schedule, such as once per week or once per month. If future updates to package dependencies break ReferenceSeeker, CI would ensure a notification is issued to the authors early so that the problem can be addressed quickly. This would make it possible for the authors to provide long-term support for the software with minimal effort.

Pre-built databases for other RefSeq sections

For example, I'm interested in analyzing 1KP transcriptome data. It would be really nice to find the nearest reference genome for each plant transcriptome using this tool.

It doesn't appear that all of the RefSeq sections are represented in the down. Is this something that can be done? What about for all of the genomes in RefSeq? On a related not, is there a straightforward way to enter multiple genomes in a single fasta file at once?

ERROR: database file (db.tsv) not readable! on custom database

When running the following command with a custom database I get the error:

referenceseeker --threads 5 $genome $database
ERROR: database file (db.tsv) not readable!

The db.tsv is populated as expected and readable by me:

(referenceseeker) [robmur@g-12-l0002 termitomyces]$ ll db.tsv
-rw-rw---- 1 robmur ku_00014     1157 Apr 28 09:49 db.tsv


ANC-I   12908   contig  NA
ANC-H   12908   contig  NA
D10_T13_scaffolds       12908   contig  NA
IC0026  12908   contig  NA
D11_T132_scaffolds      12908   contig  NA
IC0020  12908   contig  NA
IC0033  12908   contig  NA
IC0038  12908   contig  NA
PseudaA 12908   contig  NA
IC0010  12908   contig  NA
IC0027  12908   contig  NA
IC0024  12908   contig  NA
IC0034  12908   contig  NA
IC0035-1        12908   contig  NA
IC0001  12908   contig  NA
IC0035-2        12908   contig  NA
D8_T73sscA_scaffolds    12908   contig  NA
D44_scaffolds   12908   contig  NA
GCA_003313055   12908   contig  NA
D9_T69sscA_scaffolds    12908   contig  NA
GCA_001263195   12908   contig  NA
GCA_001972325   12908   contig  NA
D28_Mi166-008_scaffolds 12908   contig  NA
Micro   12908   contig  NA
D27_T159_scaffolds      12908   contig  NA
T112    12908   contig  NA
T153    12908   contig  NA
GCA_003316525   12908   contig  NA
GCA_003313785   12908   contig  NA
D45_scaffolds   12908   contig  NA
D58_scaffolds   12908   contig  NA
GCA_003313075   12908   contig  NA
D46_scaffolds   12908   contig  NA
D59_scaffolds   12908   contig  NA
D13_T114_scaffolds      12908   contig  NA
D52_scaffolds   12908   contig  NA
D42_scaffolds   12908   contig  NA
D36_scaffolds   12908   contig  NA
D43_scaffolds   12908   contig  NA
D12_T123_scaffolds      12908   contig  NA
D53_scaffolds   12908   contig  NA

Version:
BioConda
referenceseeker 1.7.3

Debugging output with "base dir" and "share dir"

Just started noticing ReferenceSeeker printing what looks like debugging output. Appears to be coming from this line. I assume this was not intended to be included in the release. If so, it should be printed to stderr so it does not clutter the output.

Add internal checks during custom db creation

Inspired by Yair Motro:

Add an internal 'fool proof check' or a simple warning to check the sequences that will be used for the custom db.

Provide prebuilt GTDB databases

Add prebuilt DBs for GTDB:

bacteria
archaea

<NA/> is being added to the start of file path

Describe the bug
NA/ is being added to the start of file path resulting it it not being added to database

Input:
referenceseeker_db import --db /home/people/robmur/ku_00014/data/termitomyces/referenceSeeker_db/termitomyces --genome /home/people/robmur/ku_00014/data/termitomyces/sequences/D53_scaffolds.fasta

Output:
ERROR: could not import genome (NA//home/people/robmur/ku_00014/data/termitomyces/sequences/D53_scaffolds.fasta) into database (/home/people/robmur/ku_00014/data/termitomyces/referenceSeeker_db/termitomyces)!

Proof file exists:

(base) [robmur@g-12-l0002 phylogeny]$ ll /home/people/robmur/ku_00014/data/termitomyces/sequences/D53_scaffolds.fasta
-rw-rw-r-- 1 robmur ku_00014 196777164 Apr 21 11:00 /home/people/robmur/ku_00014/data/termitomyces/sequences/D53_scaffolds.fasta

The organism name is being added to the front of the --genome file path. It seems this is only added in the error report but not during the actual import so I am none the wiser as to what the actual issue is.

The sequence does seems to be getting added to the database under the name of the first record in the fasta file. Looks like it is in the mash or tsv creation steps then maybe?

Finding a suitable reference for a set of genomes

Hello, thanks for this great tool.
Just a question:
I wonder how to select the appropriate reference for a set of (diverse) genomes.
When I run the referenceseeker in this case, it gives different reference for each genome.

Add a sliding window parameter for the DNA fragment creation

Add a parameter to allow the adjustment of the sliding window size used in the DNA fragment creation process.

Empty matrix

Hello,

I'm trying to run referenceseeker and it seems to run fine without any error messages but the output is just an empty matrix like this (ive removed the full path names):

ReferenceSeeker v1.4
Options, parameters and arguments:
use bundled binaries: False
db path: home/fungi
genome path: final.contigs.fa
tmp path:
unfiltered: False
bidirectional: False
ANI: 0.95
conserved DNA: 0.69
# CRG: 100
# threads: 56

Estimate genome distances...
screened 0 potential reference genome(s)

Compute ANIs...

#ID Mash Distance ANI Con. DNA Taxonomy ID Assembly Status Organism

Any ideas what could be the problem? It seems to be screening 0 genomes. I have looked manually at the database folder and the sequences are all there. Here's my input code:

referenceseeker home/fungi final.contigs.fa --verbose

Thanks for your help.

Best wishes,
Theo

Allow user to set conserved_dna and ani threshold

Hello,

I have a set of Nanopore and Pacbio bacteria reads. I perform assemblies with wtdbg2 or redbean I run referenceseeker mash found some potential genome but I think the ANI threshold was to high for my uncorrected assemblies. It could be useful if user can set these two thresholds.

In fact I want the most nearest available reference genome not necessary the real one.

For example I with dataset ERR2531096 assembled with wtdbg2:

wtdbg2 -t 16 -g 2.1m -x sq -i ERR2531096.fasta -fo ERR2531096_wtdbg2
wtpoa-cns -t 16 -i ERR2531096_wtdbg2.ctg.lay.gz -fo ERR2531096_wtdbg2.fasta

In run referenceseeker against prebuild bacteria database in verbose mode I get this output:

ReferenceSeeker v1.2.0
Options, parameters and arguments:
	use bundled binaries: False
	db path: /scratch/bioinf/users/pmarijon/yacrd-and-fpa-upstream-tools-for-lr-genome-assembly/referenceseeker/bacteria
	genome path: /scratch/bioinf/users/pmarijon/yacrd-and-fpa-upstream-tools-for-lr-genome-assembly/assembly/ERR2531096_pb.raw.wtdbg2.fasta
	tmp path: /tmp/tmpmm8f0tg0
	unfiltered: False
	# CRG: 10
	# threads: 16

Estimate genome distances...
	screened 63 potential reference genome(s)
	reduce to best 10 hits...

Compute ANIs...

#ID	ANI	Con. DNA	Mash Distance	Taxonomy ID	Assembly Status	Organism

Empty results for a custom built DB

Hi!
I created a custom bacteria database and imported some fasta files in it, which seemed to work:
referenceseeker_db init --db DB
referenceseeker_db import --db DB --genome "$file" --status complete --organism "$organism" -t $taxid
(got a Successfully imported genome message for all)

but running referenceseeker for files (including ones contained in the DB) came back empty:
referenceseeker -v DB e_coli.fna
returned:
#ID Mash Distance ANI Con. DNA Taxonomy ID Assembly Status Organism

eliminating ani and conv. thresholds returned mash distance, but ani and conv.dna returned 0:
referenceseeker -v -a 0 -c 0 DB e_coli.fna
#ID Mash Distance ANI Con. DNA Taxonomy ID Assembly Status Organism
NC_002695.2 0.01899 0.00 0.00 83334 complete Escerichia coli
(changing crg value gave the same result)

I can't figure out why ani and conv.DNA values returns 0 for fasta files I know are identical/similar to reference genomes.

I am using a BioConda installation, version 1.6

(I attached a fasta file from the DB I used and the db files)
referenceseeker_issue.zip

Thanks

[Question] Using ReferenceSeeker to cluster draft fungal genomes

I am wanting to use ReferenceSeeker to cluster some draft reference genomes of a poorly defined and highly divergent genus into putative species.

To do this I have generated a database will all known genomes of this genus and then run all said genomes through referenceseeker.

Given the are draft genomes would it be best to take a subset of only homologous region. For example using BUSCO to identify common genes and then only using these genes in ReferenceSeeker?

Or would using a cutoff of Con. DNA effectively do a similar thing? My worry would be perhaps the closest genomes has is a bad draft genome so it lost through the Con. DNA cut off and using the above mentioned homologous region only approach would remedy this?

Secondly, given they are draft genomes, would this alter my interpretation of the output?

I want to build a network and I was planning on using Mash distance as the edges to this network where Con. DNA is above a certain threshold. I am unsure how ANI could be incorporated into this.

Lastly, I see many of the genomes are not returning themselves as the top match in the database. Would this be because I am using --unfiltered?

Does this seem valid?

Seemingly successful ReferenceSeeker run with no results.

My reference seeker runs are seeming running fine as I get no error reporting but I also get no results printed to STDOUT apart from the headers:

#ID Mash Distance ANI Con. DNA Taxonomy ID Assembly Status Organism

Input:

referenceseeker --threads 5 $database $genome