Giter VIP home page Giter VIP logo

virmet's People

Contributors

maryamzaheri avatar ozagordi avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar

virmet's Issues

AssertionError

root@116ebf9b765d:/home/ubuntu# virmet fetch --viral n
Traceback (most recent call last):
File "/opt/miniconda/envs/test-virmet/bin/virmet", line 9, in
load_entry_point('VirMet==1.0rc1', 'console_scripts', 'virmet')()
File "/opt/miniconda/envs/test-virmet/lib/python3.5/site-packages/VirMet-1.0rc1-py3.5.egg/virmet/main.py", line 110, in main
args.func(args)
File "/opt/miniconda/envs/test-virmet/lib/python3.5/site-packages/VirMet-1.0rc1-py3.5.egg/virmet/main.py", line 39, in fetch_db
fetch.main(args)
File "/opt/miniconda/envs/test-virmet/lib/python3.5/site-packages/VirMet-1.0rc1-py3.5.egg/virmet/fetch.py", line 31, in main
assert gids_1 == gids_2
AssertionError

Reporting

  • NA in species list
  • turn on/off reporting of human endogenous retroviruses
  • mark usual suspects (HIV-2, choristoneura, uncultured marine virus) as contamination?

Read counts in covplot is overestimated

covplot reports number of reads from samtools idxstats that counts the number of alignments, not the number of reads. For genomes with repetitive regions, this leads to overestimate the read counts.

Logging improvements

  • log more often which sample is being analysed
  • use different logging levels
  • move logging file to output directory

Re-analysis of samples

To re-analyse a sample (because of stoppen pipeline, new version), the output folder of that sample should first be deleted manually. The pipeline should skip samples that already have an output folder.

FileNotFoundError when running single files

When running a single file, VirMet is trying to create an output folder with the whole path as name instead only the run/file name.

mihuber@timavo:~$ virmet wolfpack --file /data/MiSeq/MiSeqOutput/170426_M01274_0185_000000000-B4FP5/Data/Intensities/BaseCalls/1000251161-UR-10-1_S9_L001_R1_001.fastq.gz
Traceback (most recent call last):
  File "/opt/miniconda/bin/virmet", line 11, in <module>
    load_entry_point('VirMet==1.1', 'console_scripts', 'virmet')()
  File "/opt/miniconda/lib/python3.5/site-packages/virmet/__main__.py", line 120, in main
    args.func(args)
  File "/opt/miniconda/lib/python3.5/site-packages/virmet/__main__.py", line 25, in wolfpack_run
    od = wolfpack.main(args)
  File "/opt/miniconda/lib/python3.5/site-packages/virmet/wolfpack.py", line 385, in main
    os.chdir(out_dir)
FileNotFoundError: [Errno 2] No such file or directory: 'virmet_output_/data/MiSeq/MiSeqOutput/170426_M01274_0185_000000000-B4FP5/Data/Intensities/BaseCalls/1000251161-UR-10-1_S9_L001_R1_001'

`virmet covplot` for viruses with a segmented genome

Because viruses with a segmented genome share the same ssciname in orgs_list.tsv it's only possible to make a coverage plot of one of the segments.

If one could (alternatively) use stitle for the --organism argument, it would be possible to distinguish between the different segments.

criterion = orgs_list.loc[:, 'ssciname'].str.startswith(org_name).fillna(False)

if ssciname in unique.tsv is N/A, no entry is written in orgs_list.tsv

This bug was detected because there were much less (23 Pepino mosaic virus reads) listed in orgs_list.tsv compared to the amount of reads aligned to Pepino mosaic virus with virmet covplot(874).

Sample:

  • /analysis/VirMetResults/virmet_output_181116_M01274_0246_000000000-C6W4V/1000420494-ST-RNA_S9

sseqid not written in orgs_list.tsv:

  • MF422611.1 (Pepino mosaic virus isolate CH_bpo160, complete genome)
  • MF422613.1 (Pepino mosaic virus isolate CH_bpo162, complete genome)

sseqid written in orgs_list.tsv (example where it works):

  • gb|FJ212288.1| (Pepino mosaic virus, complete genome)

Adapt to phase out of genbank identifiers

NCBI phased GI numbers out in September 2016. The pipeline must adapt to this by retrieving and annotating sequences and taxonomies by accession numbers rather than by GIs.

Output

  • tidy data format: table with columns run, sample_id, sample_name, quality, organism_category, species, reads, percent instead of separate stats.csv and orgs_list.csv
  • coverage plot

TidyData Table

Include in VirMet the generation of a TidyData.csv Table.
Run, Sample, Name, Quality, Category, Species, Reads, Percent (of what?)

Locking strategy

Touch a file when a step is done so that it can be rerun skipping it.

docker scripts point to wrong ftp addresses for fetching genomes

Hi,
I have modified some of the scripts in the docker container to use to right ftp addresses to fetch for genomes. I have seen that also the scripts in github need to be modified.

Comparing your original scripts (here are fetch_ori.py, common_ori.py, wolfpack_ori.py) and the ones I'm using at the moment, here are the differences:

diff /opt/miniconda/envs/test-virmet/lib/python3.5/site-packages/VirMet-1.0rc1-py3.5.egg/virmet/fetch_ori.py /opt/miniconda/envs/test-virmet/lib/python3.5/site-packages/VirMet-1.0rc1-py3.5.egg/virmet/fetch.py
27c27
<         cml = 'cut -f 1,2 viral_seqs_info.tsv > viral_gi_taxid.dmp'
---
>         cml = 'cut -f 1,2 viral_seqs_info.tsv | awk \'{print $2"\t"$1}\' > viral_gi_taxid.dmp'
29,31c29,31
<         gids_1 = set(get_gids('viral_database.fasta'))
<         gids_2 = set([l.split()[0] for l in open('viral_gi_taxid.dmp')])
<         assert gids_1 == gids_2
---
>         ##gids_1 = set(get_gids('viral_database.fasta'))
>         ##gids_2 = set([l.split()[0] for l in open('viral_gi_taxid.dmp')])
>         ##assert gids_1 == gids_2
132c132
<             fasta_url = 'ftp://ftp.ncbi.nlm.nih.gov/genomes/Bos_taurus/Assembled_chromosomes/seq/bt_ref_Bos_taurus_UMD_3.1.1_%s.fa.gz' % chrom
---
>             fasta_url = 'ftp://ftp.ncbi.nlm.nih.gov/genomes/Bos_taurus/ARCHIVE/BUILD.6.1/Assembled_chromosomes/seq/bt_ref_Bos_taurus_UMD_3.1_%s.fa.gz' % chrom
136c136
<         gff3_url = 'ftp://ftp.ncbi.nlm.nih.gov/genomes/Bos_taurus/GFF/ref_Bos_taurus_UMD_3.1.1_top_level.gff3.gz'
---
>         gff3_url = 'ftp://ftp.ncbi.nlm.nih.gov/genomes/Bos_taurus/ARCHIVE/BUILD.6.1/GFF/ref_Bos_taurus_UMD_3.1_top_level.gff3.gz'


diff /opt/miniconda/envs/test-virmet/lib/python3.5/site-packages/VirMet-1.0rc1-py3.5.egg/virmet/wolfpack_ori.py /opt/miniconda/envs/test-virmet/lib/python3.5/site-packages/VirMet-1.0rc1-py3.5.egg/virmet/wolfpack.py
22c22
<     'bt_ref': '/data/virmet_databases/bovine/fasta/bt_ref_Bos_taurus_UMD_3.1.1.fasta.gz'
---
>     'bt_ref': '/data/virmet_databases/bovine/fasta/ref_Bos_taurus_UMD_3.1_top_level.gff3.gz'


diff /opt/miniconda/envs/test-virmet/lib/python3.5/site-packages/VirMet-1.0rc1-py3.5.egg/virmet/common_ori.py /opt/miniconda/envs/test-virmet/lib/python3.5/site-packages/VirMet-1.0rc1-py3.5.egg/virmet/common.py
128c128
<     x = gb['ftp_path'].apply(lambda col: col + '/' + col.split('/')[5] + '_genomic.fna.gz')
---
>     x = gb['ftp_path'].apply(lambda col: col + '/' + col.split('/')[9] + '_genomic.fna.gz') ```
I saw that this line is correct in the scripts in github

Best regards,
Maia

Build database virus

I try to build the database with the subcommand fetch, I obtained the next problem:

Traceback (most recent call last):
File "./virmet", line 11, in
load_entry_point('VirMet==1.1.1', 'console_scripts', 'virmet')()
File "/home/sonia/anaconda3/lib/python3.6/site-packages/virmet/main.py", line 120, in main
args.func(args)
File "/home/sonia/anaconda3/lib/python3.6/site-packages/virmet/main.py", line 44, in fetch_db
fetch.main(args)
File "/home/sonia/anaconda3/lib/python3.6/site-packages/virmet/fetch.py", line 21, in main
cml_search = viral_query('n')
File "/home/sonia/anaconda3/lib/python3.6/site-packages/virmet/common.py", line 134, in viral_query
os.mkdir(target_dir)
FileNotFoundError: [Errno 2] No such file or directory: '/data/virmet_databases/viral_nuccore'

could you tell me which the mistake in the build of database? please

*coverage.pdf of large genomes not displayed properly

For large genomes sometimes the coverage plots (created with virmet covplot) are not displayed well.
One example is:
/analyses/VirMetResults/virmet_output_171110_M02081_0000_000000000-BF24W/BE-spike-SSIV-1-1_S6/Human_herpesvirus_5/Human_herpesvirus_5_coverage.pdf

Looking at the depth.txt file it looks like there is an even distribution of reads across a large part of the genome but in the .pdf nothing is visible.

Quality filtering of NanoPore reads

Disable length trimming based on quality for NanoPore reads. NanoPore quality scores do not follow PHRED scores.

@2b89169b-74c9-4653-8877-c3f21ec7674d runid=2564fad3a9906330fcc0c6c90ac0c8128b89d9ff read=213 ch=124 start_time=2017-05-24T11:01:53Z
TTGTTCGTTCAGTTGGGTGTTTTATGGTTTCGTTTTTCGTGCGCCGCTTCAACAGATGAAGATGTGACATCCATTATTAATAAAGTAGACAGGCATGCTGTGGTCAAAATGGCAGTTTGTGGTAATATAGCACCATATAGCAAGAATTCTAACAAAGTTTTAACTGTAATTTAACTAATCATTATCAAATCTCTGTAGCTGCTGAAGAGAAAAGAAAAGTGGTACACTTCTGCAAATGAGACAAATCCTGTTCATCGCCATCGACAGCATGGTTGAAAAAACTCTTTGATGGTTGTTGCATTTGGAGTGCTCTGTAGTTGACATTAAACCAGCCGTCAAAAGAATTGGCCTATTAAGCCAATGGCTTGATGTTGACTAAGAGTGTAGGAGCAGC
+
$$$')7',+*$&%(+2/3/2=,)''.+)*.,,0:830,+612033).95:8*1239BDCA2461=;<A?=B:268184?7>?@928+(63+.126>[email protected]=:4*5134;F5/3,+'5*-.926:3:B;C<9..238./53;;11)249*.1DEA1?2/),(--74-++'128;1E0-.27E3.++)'(,17876355<.*,'-880'/46.,6.7570/0B1D?8202--,*-0&01*)?1..EFFBE@3254+/+,+34666=8:0/;EFFGEBD5>=/++1+35@,=A35.68++2)(+'),)213.369;>4*/+<?>=856?7>B?9C7.(.*5/245?//7:=8544.,.23/2*+(.*@.+*2(0/&*+/*))+0-,-*)+

wolfpack

Hi,

virmet wolfpack --run /home/Exp01
Traceback (most recent call last):
  File "/home/miniconda2/envs/virmet1/bin/virmet", line 11, in <module>
    load_entry_point('VirMet==1.1.1', 'console_scripts', 'virmet')()
  File "/home/miniconda2/envs/virmet1/lib/python3.5/site-packages/virmet/__main__.py", line 120, in main
    args.func(args)
  File "/home/miniconda2/envs/virmet1/lib/python3.5/site-packages/virmet/__main__.py", line 25, in wolfpack_run
    od = wolfpack.main(args)
  File "/home/miniconda2/envs/virmet1/lib/python3.5/site-packages/virmet/wolfpack.py", line 410, in main
    sd = hunter(fq)
  File "/home/miniconda2/envs/virmet1/lib/python3.5/site-packages/virmet/wolfpack.py", line 126, in hunter 
    with open('prinseq.log') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'prinseq.log'

virmet.log
INFO 2017/12/01 14:22:40 __main__.py: main() 116:       /home/miniconda2/envs/virmet1/bin/virmet wolfpack --run /home/Exp01
INFO 2017/12/01 14:22:40 wolfpack.py: main() 391:       samples to run: S16 S7
INFO 2017/12/01 14:22:40 wolfpack.py: main() 409:       running hunter on /home/Exp01/A-16_S16_L001_R1_001.fastq.gz
DEBUG 2017/12/01 14:22:40 wolfpack.py: hunter() 47:     hunter will run on 16 processors
INFO 2017/12/01 14:22:40 common.py: run_child() 56:     Running instance of gunzip
DEBUG 2017/12/01 14:22:45 common.py: run_child() 62:    Completed
DEBUG 2017/12/01 14:22:45 wolfpack.py: hunter() 77:     trimming with seqtk
INFO 2017/12/01 14:22:45 common.py: run_child() 56:     Running instance of seqtk
DEBUG 2017/12/01 14:22:50 common.py: run_child() 62:    Completed
INFO 2017/12/01 14:22:50 common.py: run_child() 56:     Running instance of wc
DEBUG 2017/12/01 14:22:50 common.py: run_child() 62:    Completed
INFO 2017/12/01 14:22:50 common.py: run_child() 56:     Running instance of split
DEBUG 2017/12/01 14:22:52 common.py: run_child() 62:    Completed
DEBUG 2017/12/01 14:22:52 wolfpack.py: hunter() 101:    filtering with prinseq
INFO 2017/12/01 14:22:52 common.py: run_child() 56:     Running instance of /usr/bin/seq
ERROR 2017/12/01 14:22:52 common.py: run_child() 64:    Execution of /usr/bin/seq -f %03g 0 15 | xargs -P 16 -I {} prinseq             -fastq splitted{}.fastq -lc_method entropy -lc_threshold 70             -log prinseq{}.log -min_qual_mean 20             -out_good ./good{} -out_bad ./bad{} > ./prinseq.err 2>&1 failed with returncode 127:
ERROR 2017/12/01 14:22:52 common.py: run_child() 65:    /usr/bin/seq -f %03g 0 15 | xargs -P 16 -I {} prinseq             -fastq splitted{}.fastq -lc_method entropy -lc_threshold 70             -log prinseq{}.log -min_qual_mean 20             -out_good ./good{} -out_bad ./bad{} > ./prinseq.err 2>&1
DEBUG 2017/12/01 14:22:52 wolfpack.py: hunter() 108:    cleaning up
INFO 2017/12/01 14:22:52 common.py: run_child() 56:     Running instance of rm
DEBUG 2017/12/01 14:22:53 common.py: run_child() 62:    Completed

Coverage plot

After a wolfpack run, user should be able to select a viral organism, realign viral reads to that and produce a coverage plot.
Already a point in #4

Update search query to include all viral genomes

As @sschmutz pointed out, the viral genomes page at NCBI lists, as of today, 7477 genomes (9533 sequences). While many of these are already included in the database, there are some notable exceptions like Influenza B. The viral query search in common.py returns 64165 sequences,

The query search given by NCBI to list all viral genomes is

"Viruses"[Organism] AND srcdb_refseq[PROP] NOT wgs[PROP] NOT "cellular organisms"[Organism] NOT AC_000001[PACC] : AC_999999[PACC]

The query search below should return both our old sequences and RefSeq genomes, and lists 67920 sequences

esearch -db nuccore -query "Viruses [orgn]
AND (\"complete genome\" [Title] OR srcdb_refseq[prop])
NOT txid131567 [orgn] NOT wgs[prop]
NOT AC_000001[PACC] : AC_999999[PACC]"

This seems to return a more complete database.

Input

  • run entire run (i.e. all fast files in a given directory) or just a single file
  • not restricted to only Illumina file name (Name_SX_L001_R....), i.e. input can also be a file like "file1.fastq" from any other source

Version of VirMet and virus reference database

It would be good if one could easily retrieve the version of MinVar and the last update date of the virus reference database (e.g. with a --version subcommand).

This would be important for

citation in publications
follow-up of changes in virus taxonomy

add measured variable of coverage and species information

This issue suggests two options to enhance the VirMet output.

  1. Include a measurement of coverage for each entry in orgs_list.tsv, that way one could faster/better distinguish false from true hits.
    One example is on timavo /analyses/VirMetResults/virmet_output_171127_M01274_0200_000000000-BHDG5/1000378040-AR-1_S1. If the coverage plots of the first three hits are compared (Moraxella phage, BeAn 58058 virus, Streptococcus phage) it's likely that the first two are false while the Streptococcus phage hits are true.

  2. Add species level to each entry in orgs_list.tsv so it can be summarized. Sometimes there are a lot of different entries coming from the same virus. We don't always know if it's actually the same virus (keeping the original information is important) but it would help to make the report more readable.
    One example is on timavo /Volumes/analyses/VirMetResults/virmet_output_171110_M02081_0000_000000000-BF24W/BE-spike-SSIV-1-1_S6.
    sscinames: Human adenovirus 7d2 (staxids: 326124)
    sscinames: Human adenovirus 7 (staxids: 10519)
    ...
    they don't share the same staxids but if you look at the classification (in that case with the R package taxize), you see that they share the same species. Same with other adenovirus hits.

> library(taxize)
> classification("Human adenovirus 7d2", db="ncbi")

Retrieving data for taxon 'Human adenovirus 7d2'

$`Human adenovirus 7d2`
                         name         rank     id
1                     Viruses superkingdom  10239
2 dsDNA viruses, no RNA stage      no rank  35237
3                Adenoviridae       family  10508
4              Mastadenovirus        genus  10509
5      Human mastadenovirus B      species 108098
6         Human adenovirus B1      no rank 565302
7          Human adenovirus 7      no rank  10519
8        Human adenovirus 7d2      no rank 326124

attr(,"class")
[1] "classification"
attr(,"db")
[1] "ncbi"
> classification("Human adenovirus 7", db="ncbi")

Retrieving data for taxon 'Human adenovirus 7'

$`Human adenovirus 7`
                         name         rank     id
1                     Viruses superkingdom  10239
2 dsDNA viruses, no RNA stage      no rank  35237
3                Adenoviridae       family  10508
4              Mastadenovirus        genus  10509
5      Human mastadenovirus B      species 108098
6         Human adenovirus B1      no rank 565302
7          Human adenovirus 7      no rank  10519

attr(,"class")
[1] "classification"
attr(,"db")
[1] "ncbi"

Virus reads used for alignment covplot

We were wondering, why covplot aligns all viral reads to the sequence with the highest number of reads mapped and not just the reads of the selected organism. Depending on the alignment thresholds in virmet and covplot, reads from different organism could be combined in this new alignment.

virmet covplot not working because of missing ggplot package

Virmet covplot did run and generated a depth.txt file, but no plot because of missing ggplot package?
Error message: Error in library(ggplot2) : there is no package called ‘ggplot2’
Conda base environment was used.
Example here: /analyses/VirMetResults/virmet_output_191010_M01274_0292_000000000-CNPVH/1000458456-BE-DNA_S3/Torque/

./ or ../ as --outdir argument in tidytable.py leads to an error

...but it works with covplot.py (virmet covplot --outdir ./ --organism "...").
I don't know if this is something which needs to be fixed (since it works with the absolute path).
This is the output I get:

stschmu@timavo:/analyses/VirMetResults/virmet_output_171024_M01274_0196_000000000-BF7JF$ virmet tidytable --outdir ./
Traceback (most recent call last):
  File "/opt/miniconda/bin/virmet", line 11, in <module>
    load_entry_point('VirMet==1.1.1', 'console_scripts', 'virmet')()
  File "/opt/miniconda/lib/python3.5/site-packages/virmet/__main__.py", line 120, in main
    args.func(args)
  File "/opt/miniconda/lib/python3.5/site-packages/virmet/__main__.py", line 16, in tidytable_run
    tidytable.main(args)
  File "/opt/miniconda/lib/python3.5/site-packages/virmet/tidytable.py", line 15, in main
    run = outdir.rstrip('/').split('/')[-1].split('virmet_output_')[1]
IndexError: list index out of range

depth.txt is overwritten if you run covplot more than once

When the subcommand covplot is used, organism_coverage.pdf and depth.txt results as output.
If you run the subcommand more than once (e.g. for another organism), the organism_coverage.pdf is unique, but the depth.txt file is overwritten.
A suggestion would be to rename depth.txt to organism_depth.txt.

Gini index

Add Gini index to coverage plot as an estimate for coverage equality. Gini index could be used to discard false positive hits. “GINI index < 0.8 with at least 75% of the genome covered” as in Burham et al, Nat Comm, 2018.

Bug in tidytable

If the directory passed as an argument ends with slash, it fails

mihuber@timavo$ virmet tidytable --outdir /analyses/VirMetResults_KidneyPairs/virmet_output_170407_M01274_0180_000000000-B4LLY/
Traceback (most recent call last):
  File "/opt/miniconda/bin/virmet", line 11, in <module>
    load_entry_point('VirMet==1.1', 'console_scripts', 'virmet')()
  File "/opt/miniconda/lib/python3.5/site-packages/virmet/__main__.py", line 120, in main
    args.func(args)
  File "/opt/miniconda/lib/python3.5/site-packages/virmet/__main__.py", line 16, in tidytable_run
    tidytable.main(args)
  File "/opt/miniconda/lib/python3.5/site-packages/virmet/tidytable.py", line 15, in main
    run = outdir.rstrip().split('/')[-1].split('virmet_output_')[1]
IndexError: list index out of range

This works

mihuber@timavo$ virmet tidytable --outdir /analyses/VirMetResults_KidneyPairs/virmet_output_170407_M01274_0180_000000000-B4LLY```

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.