medvir / virmet
Set of tools for viral metagenomics.
Do the grouping only based on accession ID. No need to group based on 'stitle', 'ssciname', 'species', 'tax_id'. In some cases, ssciname is nan, and as a result the corresponding reads are excluded from the grouping and the final results.
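A minimal sketch of the proposed change, counting hits per accession only so that rows with ssciname equal to nan are still kept (the function name and input shape are illustrative, not VirMet's actual code; sseqid is the standard BLAST tabular field for the subject accession):

```python
from collections import Counter

def group_by_accession(hits):
    """Count reads per accession ID only, so hits whose ssciname is nan
    are still counted. `hits` is an iterable of dicts with at least an
    'sseqid' key (the subject accession)."""
    counts = Counter()
    for hit in hits:
        counts[hit["sseqid"]] += 1  # no dependence on stitle/ssciname/tax_id
    return counts

hits = [
    {"sseqid": "NC_001357.1", "ssciname": "nan"},  # would be dropped today
    {"sseqid": "NC_001357.1", "ssciname": "nan"},
    {"sseqid": "NC_004162.2", "ssciname": "Chikungunya virus"},
]
print(group_by_accession(hits))  # NC_001357.1 keeps both reads
```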
root@116ebf9b765d:/home/ubuntu# virmet fetch --viral n
Traceback (most recent call last):
File "/opt/miniconda/envs/test-virmet/bin/virmet", line 9, in
load_entry_point('VirMet==1.0rc1', 'console_scripts', 'virmet')()
File "/opt/miniconda/envs/test-virmet/lib/python3.5/site-packages/VirMet-1.0rc1-py3.5.egg/virmet/main.py", line 110, in main
args.func(args)
File "/opt/miniconda/envs/test-virmet/lib/python3.5/site-packages/VirMet-1.0rc1-py3.5.egg/virmet/main.py", line 39, in fetch_db
fetch.main(args)
File "/opt/miniconda/envs/test-virmet/lib/python3.5/site-packages/VirMet-1.0rc1-py3.5.egg/virmet/fetch.py", line 31, in main
assert gids_1 == gids_2
AssertionError
covplot reports the number of reads from samtools idxstats, which counts the number of alignments, not the number of reads. For genomes with repetitive regions, this leads to overestimating the read counts.
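The difference can be illustrated with a small parser for idxstats output (a sketch of the distinction only, not VirMet's actual counting code):

```python
def mapped_from_idxstats(idxstats_text):
    """Sum the third column of `samtools idxstats` output: this counts
    *alignments*, so a read hitting a repetitive region twice counts twice."""
    total = 0
    for line in idxstats_text.strip().splitlines():
        ref, _length, mapped, _unmapped = line.split("\t")
        if ref != "*":  # '*' collects unplaced reads
            total += int(mapped)
    return total

def count_unique_reads(read_names):
    """Counting distinct read names (e.g. parsed from `samtools view`)
    avoids the overestimate."""
    return len(set(read_names))

idx = "NC_001357.1\t7904\t3\t0\n*\t0\t0\t2"
print(mapped_from_idxstats(idx))               # 3 alignments ...
print(count_unique_reads(["r1", "r1", "r2"]))  # ... but only 2 reads
```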
Downloading takes time; a trivial parallelisation would be useful.
To re-analyse a sample (because of a stopped pipeline or a new version), the output folder of that sample currently has to be deleted manually first. The pipeline should skip samples that already have an output folder.
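A sketch of the skip logic (function and argument names are illustrative, not VirMet's actual API):

```python
import os
import tempfile

def samples_to_run(samples, out_root):
    """Skip samples whose output folder already exists; re-analysing a
    sample then only requires deleting its folder."""
    todo = []
    for sample in samples:
        if os.path.isdir(os.path.join(out_root, sample)):
            continue  # output folder present: assume already analysed
        todo.append(sample)
    return todo

# demo: S7 already has an output folder, so only S16 remains
root = tempfile.mkdtemp()
os.mkdir(os.path.join(root, "S7"))
print(samples_to_run(["S7", "S16"], root))  # ['S16']
```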
When running a single file, VirMet tries to create an output folder with the whole path as its name instead of only the run/file name.
mihuber@timavo:~$ virmet wolfpack --file /data/MiSeq/MiSeqOutput/170426_M01274_0185_000000000-B4FP5/Data/Intensities/BaseCalls/1000251161-UR-10-1_S9_L001_R1_001.fastq.gz
Traceback (most recent call last):
File "/opt/miniconda/bin/virmet", line 11, in <module>
load_entry_point('VirMet==1.1', 'console_scripts', 'virmet')()
File "/opt/miniconda/lib/python3.5/site-packages/virmet/__main__.py", line 120, in main
args.func(args)
File "/opt/miniconda/lib/python3.5/site-packages/virmet/__main__.py", line 25, in wolfpack_run
od = wolfpack.main(args)
File "/opt/miniconda/lib/python3.5/site-packages/virmet/wolfpack.py", line 385, in main
os.chdir(out_dir)
FileNotFoundError: [Errno 2] No such file or directory: 'virmet_output_/data/MiSeq/MiSeqOutput/170426_M01274_0185_000000000-B4FP5/Data/Intensities/BaseCalls/1000251161-UR-10-1_S9_L001_R1_001'
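A sketch of a fix: derive the folder name from the file's basename only (the exact naming and extension handling are assumptions, not VirMet's actual code):

```python
import os

def output_dir_for(file_arg):
    """Name the output folder after the file's basename only, not the
    whole path, avoiding 'virmet_output_/data/...'-style names."""
    base = os.path.basename(file_arg)
    for ext in (".fastq.gz", ".fastq"):
        if base.endswith(ext):
            base = base[: -len(ext)]
            break
    return "virmet_output_" + base

print(output_dir_for("/some/run/1000251161-UR-10-1_S9_L001_R1_001.fastq.gz"))
# virmet_output_1000251161-UR-10-1_S9_L001_R1_001
```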
As written in their blog, NCBI phased out GI numbers. VirMet has to adapt, certainly by changing how fetch and update work.
The command line is already copied.
Get rid of: good.fastq, bad.fastq, good_*_.fastq.
Compress: unique.tsv, hq_decont_reads.fasta.
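The compression step could be sketched with the standard library's gzip module (the helper name is illustrative):

```python
import gzip
import os
import shutil
import tempfile

def compress_and_remove(path):
    """gzip a file in place (path -> path.gz) and delete the original;
    intended for unique.tsv and hq_decont_reads.fasta after a run."""
    with open(path, "rb") as src, gzip.open(path + ".gz", "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.remove(path)

# demo on a throwaway file
workdir = tempfile.mkdtemp()
tsv = os.path.join(workdir, "unique.tsv")
with open(tsv, "w") as f:
    f.write("read1\tNC_001357.1\n")
compress_and_remove(tsv)  # leaves only unique.tsv.gz
```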
Because viruses with a segmented genome share the same ssciname in orgs_list.tsv, it's only possible to make a coverage plot of one of the segments. If one could (alternatively) use stitle for the --organism argument, it would be possible to distinguish between the different segments.
Line 53 in 97351c2
This bug was detected because far fewer reads (23 Pepino mosaic virus reads) were listed in orgs_list.tsv than the number of reads aligned to Pepino mosaic virus with virmet covplot (874).
Sample:
/analysis/VirMetResults/virmet_output_181116_M01274_0246_000000000-C6W4V/1000420494-ST-RNA_S9
sseqid not written in orgs_list.tsv:
sseqid written in orgs_list.tsv (example where it works):
NCBI phased GI numbers out in September 2016. The pipeline must adapt to this by retrieving and annotating sequences and taxonomies by accession numbers rather than by GIs.
Include in VirMet the generation of a TidyData.csv table.
Run, Sample, Name, Quality, Category, Species, Reads, Percent (of what?)
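A minimal sketch of writing such a table with the columns listed above; the example values, and the meaning of 'Percent' (here assumed to be the fraction of the sample's reads), are made-up assumptions:

```python
import csv
import io

FIELDS = ["Run", "Sample", "Name", "Quality", "Category",
          "Species", "Reads", "Percent"]

def write_tidy(rows, handle):
    # one observation per row, fixed header: easy to concatenate across runs
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)

buf = io.StringIO()
write_tidy([{"Run": "181116_M01274_0246", "Sample": "1000420494-ST-RNA_S9",
             "Name": "Human adenovirus 7", "Quality": "passing_filter",
             "Category": "viral", "Species": "Human mastadenovirus B",
             "Reads": 874, "Percent": 0.12}], buf)
print(buf.getvalue())
```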
Touch a marker file when a step is done, so that on a rerun the step can be skipped.
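A minimal sketch of the marker-file idea (names are illustrative):

```python
import os
import tempfile

def step(name, func, workdir):
    """Run `func` only if its marker file is absent; touch the marker on
    success so a rerun skips the step."""
    marker = os.path.join(workdir, name + ".done")
    if os.path.exists(marker):
        return False  # marker present: skip
    func()
    open(marker, "w").close()  # touch
    return True

wd = tempfile.mkdtemp()
ran_first = step("quality_filter", lambda: None, wd)  # runs, touches marker
ran_again = step("quality_filter", lambda: None, wd)  # skipped
print(ran_first, ran_again)  # True False
```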
Hi,
I have modified some of the scripts in the Docker container to use the right FTP addresses to fetch the genomes. I have seen that the scripts on GitHub also need to be modified.
Comparing your original scripts (here are fetch_ori.py, common_ori.py, wolfpack_ori.py) and the ones I'm using at the moment, here are the differences:
diff /opt/miniconda/envs/test-virmet/lib/python3.5/site-packages/VirMet-1.0rc1-py3.5.egg/virmet/fetch_ori.py /opt/miniconda/envs/test-virmet/lib/python3.5/site-packages/VirMet-1.0rc1-py3.5.egg/virmet/fetch.py
27c27
< cml = 'cut -f 1,2 viral_seqs_info.tsv > viral_gi_taxid.dmp'
---
> cml = 'cut -f 1,2 viral_seqs_info.tsv | awk \'{print $2"\t"$1}\' > viral_gi_taxid.dmp'
29,31c29,31
< gids_1 = set(get_gids('viral_database.fasta'))
< gids_2 = set([l.split()[0] for l in open('viral_gi_taxid.dmp')])
< assert gids_1 == gids_2
---
> ##gids_1 = set(get_gids('viral_database.fasta'))
> ##gids_2 = set([l.split()[0] for l in open('viral_gi_taxid.dmp')])
> ##assert gids_1 == gids_2
132c132
< fasta_url = 'ftp://ftp.ncbi.nlm.nih.gov/genomes/Bos_taurus/Assembled_chromosomes/seq/bt_ref_Bos_taurus_UMD_3.1.1_%s.fa.gz' % chrom
---
> fasta_url = 'ftp://ftp.ncbi.nlm.nih.gov/genomes/Bos_taurus/ARCHIVE/BUILD.6.1/Assembled_chromosomes/seq/bt_ref_Bos_taurus_UMD_3.1_%s.fa.gz' % chrom
136c136
< gff3_url = 'ftp://ftp.ncbi.nlm.nih.gov/genomes/Bos_taurus/GFF/ref_Bos_taurus_UMD_3.1.1_top_level.gff3.gz'
---
> gff3_url = 'ftp://ftp.ncbi.nlm.nih.gov/genomes/Bos_taurus/ARCHIVE/BUILD.6.1/GFF/ref_Bos_taurus_UMD_3.1_top_level.gff3.gz'
diff /opt/miniconda/envs/test-virmet/lib/python3.5/site-packages/VirMet-1.0rc1-py3.5.egg/virmet/wolfpack_ori.py /opt/miniconda/envs/test-virmet/lib/python3.5/site-packages/VirMet-1.0rc1-py3.5.egg/virmet/wolfpack.py
22c22
< 'bt_ref': '/data/virmet_databases/bovine/fasta/bt_ref_Bos_taurus_UMD_3.1.1.fasta.gz'
---
> 'bt_ref': '/data/virmet_databases/bovine/fasta/ref_Bos_taurus_UMD_3.1_top_level.gff3.gz'
diff /opt/miniconda/envs/test-virmet/lib/python3.5/site-packages/VirMet-1.0rc1-py3.5.egg/virmet/common_ori.py /opt/miniconda/envs/test-virmet/lib/python3.5/site-packages/VirMet-1.0rc1-py3.5.egg/virmet/common.py
128c128
< x = gb['ftp_path'].apply(lambda col: col + '/' + col.split('/')[5] + '_genomic.fna.gz')
---
> x = gb['ftp_path'].apply(lambda col: col + '/' + col.split('/')[9] + '_genomic.fna.gz')
I saw that this line is correct in the scripts on GitHub.
Best regards,
Maia
When I tried to build the database with the subcommand fetch, I got the following error:
Traceback (most recent call last):
File "./virmet", line 11, in
load_entry_point('VirMet==1.1.1', 'console_scripts', 'virmet')()
File "/home/sonia/anaconda3/lib/python3.6/site-packages/virmet/main.py", line 120, in main
args.func(args)
File "/home/sonia/anaconda3/lib/python3.6/site-packages/virmet/main.py", line 44, in fetch_db
fetch.main(args)
File "/home/sonia/anaconda3/lib/python3.6/site-packages/virmet/fetch.py", line 21, in main
cml_search = viral_query('n')
File "/home/sonia/anaconda3/lib/python3.6/site-packages/virmet/common.py", line 134, in viral_query
os.mkdir(target_dir)
FileNotFoundError: [Errno 2] No such file or directory: '/data/virmet_databases/viral_nuccore'
Could you tell me what the mistake is in building the database, please?
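The traceback shows os.mkdir failing because the parent directory /data/virmet_databases does not exist. A one-line fix is os.makedirs, which also creates the intermediate directories (a sketch, demonstrated under a throwaway root):

```python
import os
import tempfile

def ensure_dir(target_dir):
    """os.mkdir raises FileNotFoundError when the parent directory is
    missing; os.makedirs creates the intermediate directories too."""
    os.makedirs(target_dir, exist_ok=True)

root = tempfile.mkdtemp()
ensure_dir(os.path.join(root, "virmet_databases", "viral_nuccore"))
```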
A warning is given by pandas: sort_values should replace order, as the latter will be removed in a future release.
For large genomes, sometimes the coverage plots (created with virmet covplot) are not displayed well. One example is:
/analyses/VirMetResults/virmet_output_171110_M02081_0000_000000000-BF24W/BE-spike-SSIV-1-1_S6/Human_herpesvirus_5/Human_herpesvirus_5_coverage.pdf
Looking at the depth.txt file, it looks like there is an even distribution of reads across a large part of the genome, but in the .pdf nothing is visible.
Disable length trimming based on quality for Nanopore reads. Nanopore quality scores do not follow PHRED scores.
@2b89169b-74c9-4653-8877-c3f21ec7674d runid=2564fad3a9906330fcc0c6c90ac0c8128b89d9ff read=213 ch=124 start_time=2017-05-24T11:01:53Z
TTGTTCGTTCAGTTGGGTGTTTTATGGTTTCGTTTTTCGTGCGCCGCTTCAACAGATGAAGATGTGACATCCATTATTAATAAAGTAGACAGGCATGCTGTGGTCAAAATGGCAGTTTGTGGTAATATAGCACCATATAGCAAGAATTCTAACAAAGTTTTAACTGTAATTTAACTAATCATTATCAAATCTCTGTAGCTGCTGAAGAGAAAAGAAAAGTGGTACACTTCTGCAAATGAGACAAATCCTGTTCATCGCCATCGACAGCATGGTTGAAAAAACTCTTTGATGGTTGTTGCATTTGGAGTGCTCTGTAGTTGACATTAAACCAGCCGTCAAAAGAATTGGCCTATTAAGCCAATGGCTTGATGTTGACTAAGAGTGTAGGAGCAGC
+
$$$')7',+*$&%(+2/3/2=,)''.+)*.,,0:830,+612033).95:8*1239BDCA2461=;<A?=B:268184?7>?@928+(63+.126>[email protected]=:4*5134;F5/3,+'5*-.926:3:B;C<9..238./53;;11)249*.1DEA1?2/),(--74-++'128;1E0-.27E3.++)'(,17876355<.*,'-880'/46.,6.7570/0B1D?8202--,*-0&01*)?1..EFFBE@3254+/+,+34666=8:0/;EFFGEBD5>=/++1+35@,=A35.68++2)(+'),)213.369;>4*/+<?>=856?7>B?9C7.(.*5/245?//7:=8544.,.23/2*+(.*@.+*2(0/&*+/*))+0-,-*)+
Hi,
virmet wolfpack --run /home/Exp01
Traceback (most recent call last):
File "/home/miniconda2/envs/virmet1/bin/virmet", line 11, in <module>
load_entry_point('VirMet==1.1.1', 'console_scripts', 'virmet')()
File "/home/miniconda2/envs/virmet1/lib/python3.5/site-packages/virmet/__main__.py", line 120, in main
args.func(args)
File "/home/miniconda2/envs/virmet1/lib/python3.5/site-packages/virmet/__main__.py", line 25, in wolfpack_run
od = wolfpack.main(args)
File "/home/miniconda2/envs/virmet1/lib/python3.5/site-packages/virmet/wolfpack.py", line 410, in main
sd = hunter(fq)
File "/home/miniconda2/envs/virmet1/lib/python3.5/site-packages/virmet/wolfpack.py", line 126, in hunter
with open('prinseq.log') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'prinseq.log'
virmet.log
INFO 2017/12/01 14:22:40 __main__.py: main() 116: /home/miniconda2/envs/virmet1/bin/virmet wolfpack --run /home/Exp01
INFO 2017/12/01 14:22:40 wolfpack.py: main() 391: samples to run: S16 S7
INFO 2017/12/01 14:22:40 wolfpack.py: main() 409: running hunter on /home/Exp01/A-16_S16_L001_R1_001.fastq.gz
DEBUG 2017/12/01 14:22:40 wolfpack.py: hunter() 47: hunter will run on 16 processors
INFO 2017/12/01 14:22:40 common.py: run_child() 56: Running instance of gunzip
DEBUG 2017/12/01 14:22:45 common.py: run_child() 62: Completed
DEBUG 2017/12/01 14:22:45 wolfpack.py: hunter() 77: trimming with seqtk
INFO 2017/12/01 14:22:45 common.py: run_child() 56: Running instance of seqtk
DEBUG 2017/12/01 14:22:50 common.py: run_child() 62: Completed
INFO 2017/12/01 14:22:50 common.py: run_child() 56: Running instance of wc
DEBUG 2017/12/01 14:22:50 common.py: run_child() 62: Completed
INFO 2017/12/01 14:22:50 common.py: run_child() 56: Running instance of split
DEBUG 2017/12/01 14:22:52 common.py: run_child() 62: Completed
DEBUG 2017/12/01 14:22:52 wolfpack.py: hunter() 101: filtering with prinseq
INFO 2017/12/01 14:22:52 common.py: run_child() 56: Running instance of /usr/bin/seq
ERROR 2017/12/01 14:22:52 common.py: run_child() 64: Execution of /usr/bin/seq -f %03g 0 15 | xargs -P 16 -I {} prinseq -fastq splitted{}.fastq -lc_method entropy -lc_threshold 70 -log prinseq{}.log -min_qual_mean 20 -out_good ./good{} -out_bad ./bad{} > ./prinseq.err 2>&1 failed with returncode 127:
ERROR 2017/12/01 14:22:52 common.py: run_child() 65: /usr/bin/seq -f %03g 0 15 | xargs -P 16 -I {} prinseq -fastq splitted{}.fastq -lc_method entropy -lc_threshold 70 -log prinseq{}.log -min_qual_mean 20 -out_good ./good{} -out_bad ./bad{} > ./prinseq.err 2>&1
DEBUG 2017/12/01 14:22:52 wolfpack.py: hunter() 108: cleaning up
INFO 2017/12/01 14:22:52 common.py: run_child() 56: Running instance of rm
DEBUG 2017/12/01 14:22:53 common.py: run_child() 62: Completed
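Returncode 127 from the shell means "command not found", so here either /usr/bin/seq or prinseq is missing from the environment. A pre-flight check could fail early with a clearer message (a sketch; the tool list and function name are illustrative):

```python
import shutil

def missing_tools(tools):
    """Return the external tools not found on PATH, so the pipeline can
    abort with a clear message instead of a returncode-127 deep inside
    hunter()."""
    return [t for t in tools if shutil.which(t) is None]

print(missing_tools(["seq", "xargs", "prinseq"]))
```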
Something broke the website, probably something with the sidebar; the top of each page is now cut off.
AttributeError: 'Series' object has no attribute 'order'
is thrown when no read is above the coverage/identity threshold. orgs_list must be written anyway.
E.g., when update returns zero new sequences, exit and print a message to screen.
I'm not sure if this is the only line of code which causes this issue:
Line 504 in d6fd569
To reproduce the issue, try running virmet wolfpack --run /data/MiSeq/MiSeqOutput/200110_M01274_0316_000000000-CW5HM/ on timavo.
Add in the documentation what files should be deleted to rerun parts of the analysis.
After a wolfpack run, user should be able to select a viral organism, realign viral reads to that and produce a coverage plot.
Already a point in #4
As @sschmutz pointed out, the viral genomes page at NCBI lists, as of today, 7477 genomes (9533 sequences). While many of these are already included in the database, there are some notable exceptions like Influenza B. The viral query search in common.py returns 64165 sequences.
The query search given by NCBI to list all viral genomes is
"Viruses"[Organism] AND srcdb_refseq[PROP] NOT wgs[PROP] NOT "cellular organisms"[Organism] NOT AC_000001[PACC] : AC_999999[PACC]
The query search below should return both our old sequences and RefSeq genomes, and lists 67920 sequences
esearch -db nuccore -query "Viruses [orgn]
AND (\"complete genome\" [Title] OR srcdb_refseq[prop])
NOT txid131567 [orgn] NOT wgs[prop]
NOT AC_000001[PACC] : AC_999999[PACC]"
This seems to return a more complete database.
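For comparison, the broader query above can be kept as a single string constant so it is easy to diff against the NCBI RefSeq-only suggestion (an illustrative helper, not VirMet's actual code, which builds its query in common.py):

```python
def viral_query():
    """The broader query quoted above: complete genomes or RefSeq records,
    excluding cellular organisms (txid131567), WGS, and the AC_ range."""
    return ('Viruses [orgn] AND ("complete genome" [Title] OR '
            'srcdb_refseq[prop]) NOT txid131567 [orgn] NOT wgs[prop] '
            'NOT AC_000001[PACC] : AC_999999[PACC]')

print(viral_query())
```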
It sometimes happens (e.g. with negative controls) that there are sequences, but no viral ones. If virmet tidytable is then used, the following error is returned and the run stops there:
FileNotFoundError: File b'NTC-DNA_S3/stats.tsv' does not exist
This example is from the analysis virmet_output_180720_M01274_0227_000000000-BY9JM
@ozagordi
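A sketch of a tolerant reader that returns an empty result for such samples instead of letting the FileNotFoundError abort tidytable (the tab-separated format and function name are assumptions):

```python
import os
import tempfile

def read_stats(sample_dir):
    """Return parsed stats rows, or an empty list when a sample (e.g. a
    negative control) has no viral hits and stats.tsv was never written."""
    path = os.path.join(sample_dir, "stats.tsv")
    if not os.path.exists(path):
        return []  # no viral reads: contribute nothing to the tidy table
    with open(path) as f:
        return [line.rstrip("\n").split("\t") for line in f]

empty_sample = tempfile.mkdtemp()
print(read_stats(empty_sample))  # [] instead of FileNotFoundError
```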
It would be good if one could easily retrieve the version of VirMet and the last update date of the virus reference database (e.g. with a --version subcommand).
This would be important for citation in publications and for follow-up of changes in virus taxonomy.
This issue suggests two options to enhance the VirMet output.
Include a measurement of coverage for each entry in orgs_list.tsv; that way one could distinguish false from true hits faster and more reliably.
One example is on timavo: /analyses/VirMetResults/virmet_output_171127_M01274_0200_000000000-BHDG5/1000378040-AR-1_S1. If the coverage plots of the first three hits are compared (Moraxella phage, BeAn 58058 virus, Streptococcus phage), it's likely that the first two are false while the Streptococcus phage hits are true.
Add the species level to each entry in orgs_list.tsv so it can be summarized. Sometimes there are a lot of different entries coming from the same virus. We don't always know if it's actually the same virus (keeping the original information is important), but it would help to make the report more readable.
One example is on timavo: /Volumes/analyses/VirMetResults/virmet_output_171110_M02081_0000_000000000-BF24W/BE-spike-SSIV-1-1_S6.
sscinames: Human adenovirus 7d2 (staxids: 326124)
sscinames: Human adenovirus 7 (staxids: 10519)
...
they don't share the same staxids but if you look at the classification (in that case with the R package taxize), you see that they share the same species. Same with other adenovirus hits.
> library(taxize)
> classification("Human adenovirus 7d2", db="ncbi")
Retrieving data for taxon 'Human adenovirus 7d2'
$`Human adenovirus 7d2`
name rank id
1 Viruses superkingdom 10239
2 dsDNA viruses, no RNA stage no rank 35237
3 Adenoviridae family 10508
4 Mastadenovirus genus 10509
5 Human mastadenovirus B species 108098
6 Human adenovirus B1 no rank 565302
7 Human adenovirus 7 no rank 10519
8 Human adenovirus 7d2 no rank 326124
attr(,"class")
[1] "classification"
attr(,"db")
[1] "ncbi"
> classification("Human adenovirus 7", db="ncbi")
Retrieving data for taxon 'Human adenovirus 7'
$`Human adenovirus 7`
name rank id
1 Viruses superkingdom 10239
2 dsDNA viruses, no RNA stage no rank 35237
3 Adenoviridae family 10508
4 Mastadenovirus genus 10509
5 Human mastadenovirus B species 108098
6 Human adenovirus B1 no rank 565302
7 Human adenovirus 7 no rank 10519
attr(,"class")
[1] "classification"
attr(,"db")
[1] "ncbi"
We were wondering why covplot aligns all viral reads to the sequence with the highest number of reads mapped, and not just the reads of the selected organism. Depending on the alignment thresholds in virmet and covplot, reads from different organisms could be combined in this new alignment.
When there is no match, the pipeline reports '1'.
FutureWarning: read_table is deprecated, use read_csv instead.
We should probably replace this in tidytable.py to avoid future issues.
VirMet/src/virmet/tidytable.py
Line 27 in 97351c2
VirMet/src/virmet/tidytable.py
Line 34 in 97351c2
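The deprecated call can be swapped mechanically; a minimal sketch, assuming a tab-separated file as the .tsv outputs suggest (the sample data is made up):

```python
import io

import pandas as pd

# read_table is deprecated; read_csv with sep="\t" is the drop-in replacement
tsv = io.StringIO("organism\treads\nHuman adenovirus 7\t874\n")
df = pd.read_csv(tsv, sep="\t")  # was: pd.read_table(tsv)
print(df.loc[0, "reads"])  # 874
```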
virmet covplot did run and generated a depth.txt file, but no plot, because of a missing ggplot2 package?
Error message: Error in library(ggplot2) : there is no package called ‘ggplot2’
Conda base environment was used.
Example here: /analyses/VirMetResults/virmet_output_191010_M01274_0292_000000000-CNPVH/1000458456-BE-DNA_S3/Torque/
...but it works with covplot.py (virmet covplot --outdir ./ --organism "...").
I don't know if this is something which needs to be fixed (since it works with the absolute path).
This is the output I get:
stschmu@timavo:/analyses/VirMetResults/virmet_output_171024_M01274_0196_000000000-BF7JF$ virmet tidytable --outdir ./
Traceback (most recent call last):
File "/opt/miniconda/bin/virmet", line 11, in <module>
load_entry_point('VirMet==1.1.1', 'console_scripts', 'virmet')()
File "/opt/miniconda/lib/python3.5/site-packages/virmet/__main__.py", line 120, in main
args.func(args)
File "/opt/miniconda/lib/python3.5/site-packages/virmet/__main__.py", line 16, in tidytable_run
tidytable.main(args)
File "/opt/miniconda/lib/python3.5/site-packages/virmet/tidytable.py", line 15, in main
run = outdir.rstrip('/').split('/')[-1].split('virmet_output_')[1]
IndexError: list index out of range
Possibly write tests.
When the subcommand covplot is used, organism_coverage.pdf and depth.txt are produced as output. If you run the subcommand more than once (e.g. for another organism), the organism_coverage.pdf is unique, but the depth.txt file is overwritten. A suggestion would be to rename depth.txt to organism_depth.txt.
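A minimal sketch of the suggested rename; the sanitisation of the organism name is an assumption, not VirMet's actual behaviour:

```python
import re

def depth_filename(organism):
    """Derive an organism-specific name so repeated covplot runs do not
    overwrite each other's depth.txt."""
    safe = re.sub(r"\W+", "_", organism).strip("_")
    return safe + "_depth.txt"

print(depth_filename("Human herpesvirus 5"))  # Human_herpesvirus_5_depth.txt
```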
Add the Gini index to the coverage plot as an estimate of coverage evenness. The Gini index could be used to discard false positive hits: “Gini index < 0.8 with at least 75% of the genome covered”, as in Burnham et al., Nat Commun, 2018.
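Both quantities in that criterion can be computed from the per-position depths in depth.txt; a self-contained sketch (function names are illustrative):

```python
def gini(depths):
    """Gini index of per-position depth: ~0 for even coverage, values near
    1 when reads pile up on few positions (a hallmark of false hits)."""
    xs = sorted(depths)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # standard formula on sorted values
    cum = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * cum) / (n * total) - (n + 1) / n

def breadth(depths, min_depth=1):
    """Fraction of genome positions covered at least min_depth-fold."""
    return sum(d >= min_depth for d in depths) / len(depths)

even = [10] * 100
skewed = [0] * 90 + [100] * 10
print(gini(even), breadth(even))      # even coverage: Gini 0, breadth 1.0
print(gini(skewed), breadth(skewed))  # piled-up coverage: Gini 0.9, breadth 0.1
```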
Download from ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
If the directory passed as an argument ends with a slash, it fails:
mihuber@timavo$ virmet tidytable --outdir /analyses/VirMetResults_KidneyPairs/virmet_output_170407_M01274_0180_000000000-B4LLY/
Traceback (most recent call last):
File "/opt/miniconda/bin/virmet", line 11, in <module>
load_entry_point('VirMet==1.1', 'console_scripts', 'virmet')()
File "/opt/miniconda/lib/python3.5/site-packages/virmet/__main__.py", line 120, in main
args.func(args)
File "/opt/miniconda/lib/python3.5/site-packages/virmet/__main__.py", line 16, in tidytable_run
tidytable.main(args)
File "/opt/miniconda/lib/python3.5/site-packages/virmet/tidytable.py", line 15, in main
run = outdir.rstrip().split('/')[-1].split('virmet_output_')[1]
IndexError: list index out of range
This works:
mihuber@timavo$ virmet tidytable --outdir /analyses/VirMetResults_KidneyPairs/virmet_output_170407_M01274_0180_000000000-B4LLY
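A sketch of a run-name extraction that tolerates a trailing slash, using os.path.normpath before taking the basename (an illustrative helper, not tidytable.py's actual code):

```python
import os

def run_name(outdir):
    """Extract the run name whether or not outdir ends with a slash;
    normpath drops the trailing separator before basename is taken."""
    base = os.path.basename(os.path.normpath(outdir))
    prefix = "virmet_output_"
    if not base.startswith(prefix):
        raise ValueError("not a virmet output directory: %s" % outdir)
    return base[len(prefix):]

# both spellings give the same run name
print(run_name("/analyses/virmet_output_170407/"))  # 170407
print(run_name("/analyses/virmet_output_170407"))   # 170407
```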
tax_orgs correctly excludes low-coverage and low-identity hits; summarise doesn't.