faylward / viralrecall

Detection of NCLDV signatures in 'omic data

Python 100.00%

ncldv viruses genomics viral-signatures hmmer3 viral-regions nucleocytoviricota

viralrecall's Introduction

ViralRecall

ViralRecall is a flexible command-line tool for detecting signatures of giant viruses (NCLDV) in genomic data. Version 2 has been updated to focus more on NCLDV compared to version 1, but the original options are still available.

Dependencies

ViralRecall is written in Python 3.5.6 and requires biopython, matplotlib, numpy, and pandas. ViralRecall uses Prodigal and HMMER3 for protein prediction and HMM searches, respectively. Please ensure these tools are installed in your PATH before using.

A requirements.yml file is provided in this repo to avoid some of the issues caused by outdated pandas versions. The requirements.yml file specifies the conda environment dependencies so you don't have to install each one separately. After cloning the repository, please follow these steps:

cd viralrecall

and

conda env create -f requirements.yml

If you wish to proceed without the requirements.yml file, simply create a conda environment by typing conda create -n viralrecall. In that case you might have to install some dependencies yourself. On a Debian- or Ubuntu-based system you should be able to install these tools with:

sudo apt install prodigal

and

sudo apt install hmmer

or if you don't have sudo privileges, you can try with conda:

conda install prodigal -c bioconda

and

conda install hmmer -c bioconda
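
Before running anything, it can help to confirm that the external tools are actually visible on your PATH. The following is a minimal sketch using only the Python standard library. The executable names "prodigal" and "hmmsearch" are assumptions about what is installed on your system, and the helper name missing_tools is hypothetical.

```python
import shutil


def missing_tools(names):
    """Return the executable names from `names` that are not found on PATH."""
    return [name for name in names if shutil.which(name) is None]


# Assumed executable names; adjust them to match your installation.
not_found = missing_tools(["prodigal", "hmmsearch"])
if not_found:
    print("Not found on PATH:", ", ".join(not_found))
else:
    print("All listed tools were found on PATH.")
```

An empty result only shows that the executables can be located, not that their versions are suitable.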

Installation and Database Download

Please ensure you are using a Python version newer than 3.5.2 and have the appropriate Python modules installed. If this is an issue, please create a Python environment using conda (see https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html).

To start:

git clone https://github.com/faylward/viralrecall

cd viralrecall

ViralRecall was tested on Ubuntu 16.04 and should work on most Unix-based systems. To see the help menu use:

python viralrecall.py -h

ViralRecall can be run using either of two viral HMM databases: 1) GVOGs, a custom set of Giant Virus Orthologous Groups that are fairly specific to Nucleo-Cytoplasmic Large DNA Viruses (NCLDV), or 2) VOGDB, which contains a wide collection of viral orthologous groups and is useful for broad characterization of viral signatures.

In addition to either the GVOG or VOGDB searches, ViralRecall matches proteins against the Pfam database (Pfam v. 32), which is a broad-specificity database that detects many protein families that are common in the genomes of cellular organisms.

The database files are available for download from Zenodo. To download and unpack, navigate to the folder that contains the viralrecall.py script and type:

wget -O hmm.tar.gz "https://zenodo.org/records/12666277/files/hmm.tar.gz?download=1"

and then

tar -xvzf hmm.tar.gz

This should create a hmm/ directory with the appropriate HMM files, including the gvog.hmm database and the vogdb.hmm database (downloaded from the vogdb.org website on 12/14/2020). This directory should be located in the same directory as the acc/ directory and the viralrecall.py script.

Note: The GVOG database was updated in May 2021, and it is recommended to use this latest version. The new database is substantially smaller, among other things, and this helps with runtime. But if you'd like to access the original GVOG database you can do so here: https://zenodo.org/record/4710691/files/hmm.tar.gz?download=1

After you have downloaded and unpacked the database files you should be able to run viralrecall.py. To see the help menu you can run:

python viralrecall.py -h

Basic Usage

To test if ViralRecall will run properly, type:

python viralrecall.py -i examples/arm29B.fna -p test_outdir -t 2 -f

Results should be located in the test_outdir folder. The output folder will contain:

*.faa: The proteins predicted from the input file using Prodigal

*.fasta: The DNA coding sequence of the predicted proteins from Prodigal

*.full_annot.tsv: A full annotation table of the predicted ORFs. This includes descriptions of the GVOG and Pfam annotations, so it can be useful if you want to look at certain annotations in more depth.

*.vregion.annot.tsv: An annotation of only the regions with some viral signatures

*summary.tsv: Summary statistics for the predicted viral regions (or contig-level stats if the -c flag was used). This also includes the NCLDV marker output (marker hit: bit score)

*.pfamout: Raw output of the Pfam HMMER3 search

*.vogout: Raw output of the GVOG or VOGDB HMMER3 search

*.markerout: Raw output of the NCLDV marker gene HMMER3 search

Additionally, for each viral region ViralRecall will write .faa and .fna files containing the protein and nucleotide sequences of that region. Please be sure to use only .fna files as input.

Options

There are several parameters you can change in viralrecall depending on your preferences and the data you're analyzing. The important parameters that will influence the results are:

-db, --database This is the database used for viral detection. The default is the GVOG database (which can also be specified with "GVOG"). "VOG" can be specified for the vogdb database, and "marker" to search against only 10 NCLDV marker genes. GVOGs are more useful for NCLDV-specific searches. The "marker" option is much faster and may be useful for quickly screening large datasets.

-s, --minscore This is the mean score that a genomic region needs to have in order to pass the filter and be reported as a viral region. The score is calculated from the HMMER3 scores, with higher scores indicating more and better matches to the GVOG database, and lower scores indicating more and better matches to the Pfam database. The default is 10.

-w, --window Size of the sliding window to use for calculating moving averages. A smaller window may help predict short viral regions, but may split large viral regions into several pieces.

-m, --minsize Minimum size, in kilobases, of the viral regions to report.

-g, --minvog Minimum number of hits against the GVOG database that must be recorded in a region in order for it to be reported (larger values == higher confidence).

-c, --contiglevel If this option is used, mean ViralRecall scores will be provided for the input contigs rather than viral regions. This is useful for screening contigs for viral signatures.

-r, --redo If you have already run ViralRecall and you want to re-run it with different parameters, you can use the -r flag to avoid re-running Prodigal and HMMER, which are the most time-consuming steps.

-b, --batch Use this flag if the input is a folder of .fna files to search, rather than a single .fna file.
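
The sliding-window idea behind the -w option can be illustrated with a small moving-average sketch. This is plain Python under stated assumptions: a trailing window over a list of per-ORF scores is assumed, and the function name rolling_mean is hypothetical. The script's own windowing (alignment, edge handling) is not described in this README and may differ.

```python
def rolling_mean(scores, window):
    """Trailing moving average over `scores`.

    Positions that have fewer than `window` preceding values are averaged
    over the values available so far (an assumption made for this sketch).
    """
    if window < 1:
        raise ValueError("window must be >= 1")
    means = []
    running = 0.0
    for i, value in enumerate(scores):
        running += value
        if i >= window:
            # Drop the value that just slid out of the window.
            running -= scores[i - window]
        means.append(running / min(i + 1, window))
    return means


# A short run of high scores stands out more with a small window.
example_scores = [0, 0, 20, 20, 0, 0]
print(rolling_mean(example_scores, 2))
print(rolling_mean(example_scores, 4))
```

With the smaller window the averaged signal rises and falls more sharply, which is why short regions are easier to pick up but long regions are more likely to be split.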

For example, if we wanted to recover regions of a eukaryotic contig with signatures of NCLDV, we could use the following command:

python viralrecall.py -i examples/arm29B.fna -p test_outdir -s 15 -m 30 -g 10

Here we are asking for only regions that have a mean score >= 15, are at least 30 kilobases long, and have at least 10 GVOG hits.
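
The three thresholds in this command combine as a simple logical AND. The sketch below shows that logic on hypothetical region records; the Region tuple, its field names, and the function passes_filters are made up for illustration and do not mirror the script's internal data structures.

```python
from collections import namedtuple

# Hypothetical record for one candidate region.
Region = namedtuple("Region", ["name", "mean_score", "length_kb", "gvog_hits"])


def passes_filters(region, minscore, minsize, minvog):
    """True if the region meets all three thresholds (-s, -m, -g)."""
    return (
        region.mean_score >= minscore
        and region.length_kb >= minsize
        and region.gvog_hits >= minvog
    )


candidates = [
    Region("region_1", 22.4, 45.0, 18),
    Region("region_2", 16.0, 12.5, 11),  # too short for -m 30
    Region("region_3", 9.8, 60.0, 25),   # mean score below -s 15
]
kept = [r.name for r in candidates if passes_filters(r, 15, 30, 10)]
print(kept)
```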

If we want to quickly re-do the above analysis with different parameters, but without re-doing gene predictions and HMMER3 searches, we can use the -r flag:

python viralrecall.py -i examples/arm29B.fna -p test_outdir -s 15 -m 15 -g 15 -w 20 -r

Maybe we want to re-do the analysis using a different e-value. The default is 1e-10, which is fairly stringent, so we can relax it a bit:

python viralrecall.py -i examples/arm29B.fna -p test_outdir -s 15 -m 15 -g 15 -w 20 -r -e 1e-5

Once the HMMER searches have finished, you can easily re-calculate the results with the -r flag.

Batch mode

If you have many sequences you wish to test you can put them all in a folder and use batch mode. Here the input (-i) should point to a folder with .fna files in it. Basic usage is:

python viralrecall.py -i examples/testfolder -p folderout -b

All of the output files should have their own folder in the folderout directory. You can also use the -b flag with the -r flag for quick re-calculations.
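
If you would rather launch one run per file yourself (for example, to submit each as a separate cluster job) instead of using -b, the command lines can be generated with a few lines of Python. This is only a sketch: build_commands is a hypothetical helper, the per-file output naming is an assumption, and only the -i and -p flags described above are used.

```python
import os


def build_commands(filenames, input_dir, outdir, script="viralrecall.py"):
    """Build one viralrecall command (as an argument list) per .fna file."""
    commands = []
    for filename in sorted(filenames):
        if not filename.endswith(".fna"):
            continue  # only .fna files are valid input
        prefix = filename[: -len(".fna")]
        commands.append([
            "python", script,
            "-i", os.path.join(input_dir, filename),
            "-p", os.path.join(outdir, prefix),
        ])
    return commands


# Example with a hypothetical folder listing; use os.listdir(...) in practice.
for cmd in build_commands(["b.fna", "a.fna", "notes.txt"], "examples/testfolder", "folderout"):
    print(" ".join(cmd))
```

Each argument list can be passed to subprocess.run from the directory that contains viralrecall.py.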

Miscellaneous

Prodigal is intended to predict genes on prokaryotic genomes, and it will therefore throw an error if used on very long eukaryotic contigs (> 32 Mbp in length). I've re-compiled a binary of Prodigal that will run on longer contigs, and this is available in the bin/ folder of this GitHub repo. This is NOT the default version of prodigal that is used. If you wish to use this binary, you will need to make sure it is located in your PATH (and takes precedence over any other version of prodigal you may have installed).

Some users have noticed errors or warnings involving Pandas, which uses slightly different syntax in different Python versions. These can usually be resolved by changing the Python version used to 3.5.4.

References

ViralRecall: A Flexible Command-Line Tool for the Detection of Giant Virus Signatures in Omic Data, FO Aylward and M Moniruzzaman, Viruses, 2021; 13(2):150. https://doi.org/10.3390/v13020150.

For questions or comments feel free to email Frank Aylward at faylward at vt dot edu

This tool requires Prodigal and HMMER3. Their citations are:

Hyatt et al. “Prodigal: prokaryotic gene recognition and translation initiation site identification”. BMC bioinformatics, 2010.

Eddy, "A new generation of homology search tools based on probabilistic inference". Genome Informatics, 2009.


viralrecall's Issues

How to get the score of an entire bin, like the results in your paper

I have some bins that I want to screen with viralrecall.
The ideal output would be a CSV (or TSV) with one column for the MAG and one column for the scores.
How can I get the score of an entire bin, like the results in your paper?

Also, I don't understand the "-c" parameter;
it also returns many replicons.
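
One way to get a single table across many bins is to combine the per-run summary tables after the fact. The sketch below makes no assumptions about the column names in *summary.tsv; it only assumes every summary file shares the same header row. The function combine_tsv and the "source" column are hypothetical. It does not compute a single bin-level score, since the README does not describe how one would be derived.

```python
import csv
import io


def combine_tsv(named_texts):
    """Combine several TSV tables into one list of rows.

    `named_texts` is a list of (label, tsv_text) pairs. The header of the
    first non-empty table is kept, and each data row is prefixed with its label.
    """
    combined = []
    for label, text in named_texts:
        rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
        if not rows:
            continue
        if not combined:
            combined.append(["source"] + rows[0])
        for row in rows[1:]:
            combined.append([label] + row)
    return combined


# Tiny made-up tables standing in for two summary files.
demo = combine_tsv([
    ("bin.1", "col_a\tcol_b\nx\t1\n"),
    ("bin.2", "col_a\tcol_b\ny\t2\n"),
])
for row in demo:
    print("\t".join(row))
```

In practice the text for each entry would come from reading each bin's summary file, with the bin name used as the label.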

Running with "general" db requires vog.annotations.tsv which is absent from hmm/

Hi, there. Thank you for developing this wonderful tool.

I was trying to get Viral Recall up and running using the general VOG database, but ran into the error:

Traceback (most recent call last):
  File "viralrecall.py", line 733, in <module>
    status = main()
  File "viralrecall.py", line 728, in main
    run_program(input, project, database, window, phagesize, minscore, minhit, evalue, cpus, plotflag, redo, flanking, batch, summary_file, contiglevel)
  File "viralrecall.py", line 383, in run_program
    vdesc = get_annot(database)
  File "viralrecall.py", line 41, in get_annot
    input = open("hmm/vog.annotations.tsv", "r")
FileNotFoundError: [Errno 2] No such file or directory: 'hmm/vog.annotations.tsv'

As I understand, this should be present in the hmm/ folder from Zenodo as per the README, but it isn't there. Can you direct me where to find it, or help find an alternative?

Thank you for your help,

Regards,

Luke
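
A quick pre-flight check can show which database files are missing before a long run starts. This is a minimal sketch: the file list below is illustrative, taken from the names mentioned in the README and in the traceback above, and may not be the complete set the script needs. The helper missing_files is hypothetical.

```python
import os


def missing_files(base_dir, required):
    """Return the names in `required` that are not regular files under `base_dir`."""
    return [
        name for name in required
        if not os.path.isfile(os.path.join(base_dir, name))
    ]


# Illustrative list only; extend it for the database you plan to use.
expected = ["gvog.hmm", "vogdb.hmm", "vog.annotations.tsv"]
absent = missing_files("hmm", expected)
if absent:
    print("Missing from hmm/:", ", ".join(absent))
```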

missing the DNA fasta output file

Thank you for creating this tool! Helps me a lot.

I noticed I was missing this output file when I ran ViralRecall:
*.fasta: The DNA coding sequence of the predicted proteins from Prodigal

Running, for instance,
python viralrecall.py -i bin.10.fa -p bin.10 -t 8 -c

would only give me
bin.10.faa
bin.10.full_annot.tsv
bin.10.markerout
bin.10.pfamout
bin.10.summary.tsv
bin.10.vogout

Assigning taxonomy on NCLDVs? The missing link.

I have obtained hundreds of promising NCLDV genomes from the Lake Cadagno metagenomics dataset. However, I am struggling in assigning taxonomy to them. There are missing names in the "spreadsheet of annotated genomes" which are present in the GVDB databases.
https://faylward.github.io/GVDB/

Please suggest some steps/script for assigning taxonomy on prospective NCLDV genomes. Thank you.

/usr/bin/env: ‘python\r’: No such file or directory

Hi,

I want to point out a simple issue that I resolved. I am on a Linux machine running Ubuntu, and the Python script appears to have Windows-style line endings (a "\r" before each "\n"), so the shebang line is read as "python\r" and the script fails to start. I resolved the issue by replacing the characters with sed.

sed -i 's/\r/\n/' viralrecall.py
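
Note that the sed command above turns each "\r" into an extra newline, so the script ends up with additional blank lines. A common alternative is to drop the carriage returns instead, for example with sed -i 's/\r$//' viralrecall.py. The same conversion can be sketched in Python; crlf_to_lf and convert_file are hypothetical helper names.

```python
def crlf_to_lf(data):
    """Convert Windows (CRLF) line endings in a bytes object to Unix (LF)."""
    return data.replace(b"\r\n", b"\n")


def convert_file(path):
    """Rewrite `path` in place with LF line endings."""
    with open(path, "rb") as handle:
        original = handle.read()
    with open(path, "wb") as handle:
        handle.write(crlf_to_lf(original))


print(crlf_to_lf(b"#!/usr/bin/env python\r\nprint('ok')\r\n"))
```

Calling convert_file("viralrecall.py") would apply the conversion to the script itself.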

Issue running test data

Hello,

I tried running the example data with the command python viralrecall.py -i examples/arm29B.fna -p test_outdir -t 2 -f in the ViralRecall conda environment and was returned this error message after it ran to completion:

Traceback (most recent call last):
  File "viralrecall.py", line 733, in <module>
    status = main()
  File "viralrecall.py", line 728, in main
    run_program(input, project, database, window, phagesize, minscore, minhit, evalue, cpus, plotflag, redo, flanking, batch, summary_file, contiglevel)
  File "viralrecall.py", line 655, in run_program
    plt.ylim(minval, numpy.nanmax(df2["rolling"]))
  File "/media/imperator/bross/miniconda3/envs/ViralRecall/lib/python3.5/site-packages/matplotlib/pyplot.py", line 1478, in ylim
    ret = ax.set_ylim(*args, **kwargs)
  File "/media/imperator/bross/miniconda3/envs/ViralRecall/lib/python3.5/site-packages/matplotlib/axes/_base.py", line 3470, in set_ylim
    if bottom == top:
  File "/media/imperator/bross/miniconda3/envs/ViralRecall/lib/python3.5/site-packages/pandas/core/generic.py", line 1576, in __nonzero__
    .format(self.__class__.__name__))

Is this an issue regarding the installed versions of matplotlib (3.0.0) and pandas (0.23.4) in the environment?

HMMs database

Hello，

Will the HMM databases (cellular organisms and GVOG) be updated?
If I want to build my own database, how do I configure it to be compatible with viralrecall?
How can I skip Prodigal and use a .faa file directly as input? (The genomes of some cellular organisms are too large.)

Thanks for your time and consideration !

stuck at cumsum2 on metagenome

Hi
I tried to run your tool on a metagenome (~80 k sequences) with the marker option

python viralrecall.py -i MG.fna -p MG -db marker -t 20 -c

but it apparently gets stuck for a very long time at the cumsum2 function.
As this function is not useful in the case of a metagenome, I was wondering if there is a way to treat the contigs independently of one another and just find signatures of viruses?
Thanks
Greg

AttributeError: 'DataFrame' object has no attribute 'ix'

Hello，
I am getting an error when trying to run viralrecall using the NCLDV database. The error is :

Traceback (most recent call last):
File "viralrecall.py", line 523, in <module>
status = main()
File "viralrecall.py", line 518, in main
run_program(input, project, database, window, phagesize, minscore, minvog, evalue, cpus, plotflag, redo, flanking, batch, summary_file)
File "viralrecall.py", line 335, in run_program
subset = df3.ix[reg]
File "/public/work/jianfei/soft/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py", line 5274, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'ix'

I found that the pandas "ix" indexer is deprecated. My pandas version is 1.0.5. Any idea how to solve this problem?

Thank you,

JianFei
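
For context, the .ix indexer was deprecated in pandas 0.20 and removed in pandas 1.0, which matches the version reported here. The usual replacements are .loc for label-based selection and .iloc for position-based selection. The sketch below uses a made-up DataFrame. Whether the script's df3.ix[reg] was label-based or positional is an assumption that would need to be checked against the code.

```python
import pandas as pd

# Made-up data standing in for the script's df3.
df3 = pd.DataFrame(
    {"score": [1.0, 2.0, 3.0]},
    index=["orf_1", "orf_2", "orf_3"],
)
reg = ["orf_1", "orf_3"]

subset = df3.loc[reg]      # label-based selection (replaces df3.ix[reg] for labels)
first_two = df3.iloc[:2]   # position-based selection

print(subset)
print(first_two)
```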

Pandas treating booleans as ambiguous

Hello,

I recently tried running a local install of viralrecall. However, I got the following error message:

Traceback (most recent call last):
  File "viralrecall.py", line 733, in <module>
    status = main()
  File "viralrecall.py", line 728, in main
    run_program(input, project, database, window, phagesize, minscore, minhit, evalue, cpus, plotflag, redo, flanking, batch, summary_file, contiglevel)
  File "viralrecall.py", line 655, in run_program
    plt.ylim(minval, numpy.nanmax(df2["rolling"]))
  File "/scratch2/software/anaconda/envs/viralrecall-preq/lib/python3.5/site-packages/matplotlib/pyplot.py", line 1478, in ylim
    ret = ax.set_ylim(*args, **kwargs)
  File "/scratch2/software/anaconda/envs/viralrecall-preq/lib/python3.5/site-packages/matplotlib/axes/_base.py", line 3470, in set_ylim
    if bottom == top:
  File "/scratch2/software/anaconda/envs/viralrecall-preq/lib/python3.5/site-packages/pandas/core/generic.py", line 1576, in __nonzero__
    .format(self.__class__.__name__))
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

My understanding is that this is due to pandas treating booleans as ambiguous. I was wondering if there is a simple way to resolve this issue?

Thank you!
Cédric
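
The traceback suggests that one of the values passed to plt.ylim was a pandas Series rather than a single number, so the comparison bottom == top produced a Series with no single truth value. One possible workaround, not verified against the script itself, is to coerce the limits to plain floats before plotting. The helper scalar_nanmax is hypothetical.

```python
import numpy as np
import pandas as pd


def scalar_nanmax(values):
    """Return the maximum of `values` as a plain float, ignoring NaN."""
    return float(np.nanmax(np.asarray(values, dtype=float)))


rolling = pd.Series([0.5, np.nan, 3.5])
top = scalar_nanmax(rolling)
print(top)

# Reproduce the error class from the traceback: the truth value of an
# element-wise comparison between Series is ambiguous.
try:
    bool(rolling == rolling)
    ambiguous = False
except ValueError:
    ambiguous = True
print(ambiguous)
```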

Facing problem in running Viralrecall

After installing into the Conda environment, I am facing the following problem:

Traceback (most recent call last):
File "viralrecall.py", line 733, in <module>
status = main()
File "viralrecall.py", line 728, in main
run_program(input, project, database, window, phagesize, minscore, minhit, evalue, cpus, plotflag, redo, flanking, batch, summary_file, contiglevel)
File "viralrecall.py", line 401, in run_program
df = pandas.concat([df, s1], axis=1, sort=True)
TypeError: concat() got an unexpected keyword argument 'sort'
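
The sort keyword of pandas.concat was introduced in pandas 0.23.0, so this error usually indicates an older pandas in the active environment. A small version check, sketched below with hypothetical helper names, can confirm this before running. Pass it the value of pandas.__version__.

```python
def version_tuple(version):
    """Parse the leading 'major.minor' of a version string into a tuple of ints."""
    parts = []
    for piece in version.split(".")[:2]:
        digits = "".join(ch for ch in piece if ch.isdigit())
        parts.append(int(digits) if digits else 0)
    return tuple(parts)


def supports_concat_sort(pandas_version):
    """True if this pandas version accepts concat(..., sort=...)."""
    return version_tuple(pandas_version) >= (0, 23)


for candidate in ["0.22.0", "0.23.4", "1.0.5"]:
    print(candidate, supports_concat_sort(candidate))
```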

NCLDV taxonomy

Hi，
Thanks for your amazing tool!
I have successfully run this tool. But how could I get the taxonomy of NCLDV seqs? It's not found around the result files.
Look forward to your favourable reply.

download from the Virginia Tech library system is not working

This problem is not related to the code, but the database is not available:
https://data.lib.vt.edu/downloads/6h440s637

Appropriate thresholds for taxonomic annotation of viruses.

Hi,
I would like to know the appropriate threshold (min score and minvog) that I should set to mine the signal of giant viruses from many viral contigs. Thank you!

Something wrong about python-----TypeError: concat() got an unexpected keyword argument 'sort'

My python version:
Python 3.5.6
The command that has been run:
python viralrecall.py -i examples/arm29B.fna -p test_outdir -t 2 -f
The output file did not generate the seven files in the README, only the following four files:
test_outdir.faa
test_outdir.markerout
test_outdir.pfamout
test_outdir.vogout
Error is as follows:
Traceback (most recent call last):
File "viralrecall.py", line 733, in <module>
status = main()
File "viralrecall.py", line 728, in main
run_program(input, project, database, window, phagesize, minscore, minhit, evalue, cpus, plotflag, redo, flanking, batch, summary_file, contiglevel)
File "viralrecall.py", line 401, in run_program
df = pandas.concat([df, s1], axis=1, sort=True)
TypeError: concat() got an unexpected keyword argument 'sort'

NCLDV database

Hello,

I am getting an error when trying to run viralrecall using the NCLDV database. The general database seems to be OK. The error is:

Error: File existence/permissions problem in trying to open HMM file hmm/pfam.for_ncvog.hmm.
HMM file hmm/pfam.for_ncvog.hmm not found (nor an .h3m binary of it)

I have hmm/NCVOG.hmm. Any idea what could be causing this problem?

Thank you,
Sarah

Issues with Apple M1 Pro chip?

Hi there,
For context, I'm new to python and shell coding in general.
I installed the required packages and directories alongside ViralRecall. As suggested, I tested to see if everything was working using python viralrecall.py -i examples/arm29B.fna -p test_outdir -t 2 -f .
Sadly, I get the following messages:
"Intel MKL FATAL ERROR: This system does not meet the minimum requirements for use of the Intel(R) Math Kernel Library.
The processor must support the Intel(R) Supplemental Streaming SIMD Extensions 3 (Intel(R) SSSE3) instructions.
The processor must support the Intel(R) Streaming SIMD Extensions 4.2 (Intel(R) SSE4.2) instructions.
The processor must support the Intel(R) Advanced Vector Extensions (Intel(R) AVX) instructions."

I suspect these messages may be due to the fact that I have a new gen MacBook Pro with an Apple M1 Pro chip. Rosetta is installed and working.

Has anyone else had similar issues? Is there perhaps an easy solution that I am not aware of yet?

viralrecall.py:400: DeprecationWarning: The default dtype for empty Series will be 'object' instead of 'float64' in a future version.

When I use -db "marker" -c

 bacphagenetwork@watson:/mnt/raid7/Dachuang/Achuan/viralrecall$ python viralrecall.py -i examples/arm29B.fna -p test_outdir -t 2 -db "marker" -c
viralrecall.py:400: DeprecationWarning: The default dtype for empty Series will be 'object' instead of 'float64' in a future version. Specify a dtype explicitly to silence this warning.
  s1 = pandas.DataFrame(pandas.Series(i, name = names[index]))

Run from any directory

It's not an issue, but I just wanted to share an experience of making viralrecall run from any directory.
The solution is straightforward and consists of having one additional wrapper script that converts relative paths to the input and the project into absolute ones and launches viralrecall.py from its location, e.g. /opt/viralrecall/:

$ cat /opt/viralrecall/viralrecall
#!/bin/bash

SCRIPT_DIR="$( cd "$( dirname "$( realpath "${BASH_SOURCE[0]}" )" )" &> /dev/null && pwd )"

args=()
while test $# -gt 0; do
	arg=$1
	args+=("$arg")
	case "$arg" in
		-i|--input|-p|--project)
			args+=("$(realpath -ms "$2")")
			shift
		;;
	esac
	shift
done

cd "$SCRIPT_DIR"
python viralrecall.py "${args[@]}"

The wrapper can be soft-linked to a location on the PATH, and then viralrecall can be run intuitively as:

cd path/to/my/inputs/
viralrecall -i input.fasta -p project

One small problem is that viralrecall.py writes two files in the directory where it resides - err.txt and out.txt. To prevent this from happening in directories that are not supposed to have write permissions for regular users (and running viralrecall on different inputs compromises these files either way), they can be redirected preemptively to /dev/null:

$ readlink -f /opt/viralrecall/{out,err}.txt
/dev/null
/dev/null

although of course a more elegant solution would be to modify viralrecall.py to make it write those files relative to project.
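
The path-rewriting step of the wrapper can also be sketched in Python, which may be a starting point if the logic is ever moved into the script itself. This is an illustration only: absolutize_args is a hypothetical function, it handles just the -i/--input and -p/--project flags named in the wrapper, and like realpath -ms it does not resolve symlinks or require the paths to exist.

```python
import os

PATH_FLAGS = {"-i", "--input", "-p", "--project"}


def absolutize_args(argv, cwd=None):
    """Return a copy of `argv` with the values of path flags made absolute."""
    cwd = cwd or os.getcwd()
    result = []
    expect_path = False
    for arg in argv:
        if expect_path:
            # os.path.join keeps `arg` unchanged if it is already absolute.
            result.append(os.path.normpath(os.path.join(cwd, arg)))
            expect_path = False
        else:
            result.append(arg)
            expect_path = arg in PATH_FLAGS
    return result


print(absolutize_args(["-i", "input.fasta", "-p", "project", "-t", "2"], cwd="/data/run1"))
```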

How to be sure at recovering NCLDV sequences

Hello!

I am trying to recover NCLDV sequences from metagenomes and metaviromes. I was using virsorter2 with --include-groups "NCLDV", and I also want to use ViralRecall to make sure that the contigs are actually from NCLDV. I am not sure whether filtering the contigs by score > 0 is enough (bearing in mind that I have also used virsorter2), or whether I should be more stringent and require the presence of at least 1 of the 10 marker genes plus score > 0.

Thank you very much!

How to detect viruses specific to a host genome (an algae)?

Hello, sorry to open this as an issue but it is more of a query. Is it possible to use ViralRecall to get viruses of the specific algal genome (chlorella) from the metagenomics dataset? Thank you for your suggestions.

Unknown issue

Hello,
I've been trying to run ViralRecall, both using the test provided and with my own data, but it always gives me the following error:

Traceback (most recent call last):
File "/opt/viralrecall/viralrecall.py", line 733, in <module>
status = main()
File "/opt/viralrecall/viralrecall.py", line 725, in main
run_program(newinput, newproject, database, window, phagesize, minscore, minhit, evalue, cpus, plotflag, redo, flanking, batch, summary_file, contiglevel)
File "/opt/viralrecall/viralrecall.py", line 475, in run_program
summary = summary.append(data)
File "/opt/Anaconda3/envs/viralrecall/lib/python3.10/site-packages/pandas/core/generic.py", line 5989, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'append'. Did you mean: '_append'?

Any idea what the issue might be?
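
For context, DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so this error appears when the script runs against a recent pandas. The usual replacement is pandas.concat. The sketch below uses made-up tables; the column names are hypothetical and do not reflect the script's real summary columns.

```python
import pandas as pd

# Made-up stand-ins for the script's `summary` and `data` objects.
summary = pd.DataFrame({"region": ["r1"], "score": [12.0]})
data = pd.DataFrame({"region": ["r2"], "score": [8.5]})

# Equivalent of the removed call `summary = summary.append(data)`.
summary = pd.concat([summary, data])
print(summary)
```

Alternatively, installing an older pandas release (below 2.0) in the environment avoids the error without editing the script.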

various output files lacking when running --db general

Hi team,

When trying to run viralrecall on some genomes with both --db GVOG and --db general, I noticed that several output files are lacking for the latter. Notably, it lacks the "*.full_annot.tsv", which is what I am most interested in. I am possibly also interested in viral regions that can be detected based on the generic VOG database, with default settings, but no regions are reported, although I'm not sure what this is caused by. For example, the '.summary.tsv' file is also lacking. Below you find an overview of what is produced with --db general. Note that the hmmscan outputs look okay, as well as the predicted protein sequences.

I hope you can help me finding out what might go wrong here.

Best,
Jolien

*.faa
*.markerout
*.pdf (I was using the '-f' flag)
*.pfamout
*.vogout

.vregion_annot.tsv (no regions reported)

.markerout files

Hello,

This is not an issue but more of a request that if the markerout files can be made into .csv it would make a lot of downstream analysis and counting a lot easier :)

I've been trying to write a Python script that does this task (*.markerout -> .csv), and it's proving to be above my skills.

Thanks for your time and consideration !
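
A conversion along these lines can be sketched with the standard library. The sketch assumes the .markerout files are in HMMER3's tabular --tblout format (comment lines starting with "#", 18 whitespace-separated fields, then a free-text description). The README only calls them raw HMMER3 output, so please check a file before relying on this. The function and column names are hypothetical.

```python
import csv
import io

TBLOUT_COLUMNS = [
    "target_name", "target_accession", "query_name", "query_accession",
    "full_evalue", "full_score", "full_bias",
    "dom_evalue", "dom_score", "dom_bias",
    "exp", "reg", "clu", "ov", "env", "dom", "rep", "inc",
    "description",
]


def tblout_to_rows(lines):
    """Parse tblout-style lines into lists of 19 string fields."""
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        # Split the 18 fixed fields; the remainder is the description.
        fields = line.split(None, 18)
        fields += [""] * (len(TBLOUT_COLUMNS) - len(fields))
        rows.append(fields)
    return rows


def rows_to_csv(rows):
    """Render parsed rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(TBLOUT_COLUMNS)
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up example line in the assumed format.
sample = [
    "# target name  accession  query name ...",
    "orf_1 - markerA - 1e-50 170.2 0.1 2e-50 169.9 0.1 1.0 1 0 0 1 1 1 1 some protein description",
]
print(rows_to_csv(tblout_to_rows(sample)))
```

To convert a real file, pass an open file handle to tblout_to_rows and write the returned CSV text to a new file.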

Project License

I've been using your tool and find it very useful. However, I couldn't find any information about the license. Could you please clarify under what license this project is distributed?

Thank you
