bede / hostile
Accurate host read removal
License: MIT License
Do you plan to create a reference genome with viruses masked?
Hi,
I've noticed that Hostile provides three options for reference genomes. I'm wondering which reference I should select in order to remove human genes from my microbiome samples before conducting metagenomics classification.
Could you please explain the advantages of including HLA, argos985, and mycob140 in the host removal process, particularly in the context of clinical samples with CNS infections?
Currently Bowtie2 is the default backend and is tried first regardless of the input read type, with Minimap2 as the fallback. As Bowtie2 is poorly suited to long reads and long read performance was evaluated with Minimap2 in the paper, Bowtie2 should probably not be the default for long reads.
How I think it should work:
- bowtie2 backend by default
- --aligner minimap2 --> minimap2 with sr preset
- map-ont preset
- --aligner bowtie2 --> bowtie2 unpaired mode
I don't think there is a major need to be able to customise away from the map-ont preset for other long read technologies, given that we are simply throwing reads out, but this may need to be revisited in future.
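The proposed selection logic could be sketched roughly as follows. This is only an illustration: `choose_aligner`, its parameters, and the returned `(aligner, mode)` tuple are hypothetical names rather than Hostile's API, and it assumes the bowtie2 default and the sr preset apply to short reads while the map-ont preset and bowtie2 unpaired mode apply to long reads.

```python
from typing import Optional, Tuple


def choose_aligner(
    long_reads: bool, requested: Optional[str] = None
) -> Tuple[str, Optional[str]]:
    """Return (aligner, mode) for a read type and an optional --aligner value.

    Hypothetical helper for illustration; not part of Hostile.
    """
    if requested not in (None, "bowtie2", "minimap2"):
        raise ValueError(f"Unknown aligner: {requested}")
    if long_reads:
        # Assumed: long reads use minimap2 with map-ont unless bowtie2 is forced
        if requested == "bowtie2":
            return ("bowtie2", "unpaired")
        return ("minimap2", "map-ont")
    # Assumed: short reads use bowtie2 unless minimap2 is forced
    if requested == "minimap2":
        return ("minimap2", "sr")
    return ("bowtie2", None)
```

With this shape, the read type picks the default and `--aligner` only overrides it, so Bowtie2 is never tried first for long reads unless explicitly requested.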
Samtools considers an empty SAM/BAM file invalid, irritatingly. Workaround is to create an empty but valid gzip file, e.g.:
import gzip
with gzip.open('empty.fastq.gz', 'wb') as f:
    pass  # nothing written: the result is an empty but valid gzip stream
Running multiple instances of Hostile on the same FASTQs in the same directory corrupts decontamination statistics since they will write to the same count files. Could fix by putting these inside a tempfile.TemporaryDirectory
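A minimal sketch of the suggested fix, assuming the count files only need to live long enough to be parsed. The function name and arguments are hypothetical, and the counts are written directly here as stand-ins for what samtools would write in the real pipeline.

```python
import tempfile
from pathlib import Path


def count_reads_isolated(fastq_name: str, n_in: int, n_out: int) -> dict:
    """Write and parse count files inside a private temporary directory.

    Hypothetical illustration; not Hostile's implementation.
    """
    with tempfile.TemporaryDirectory() as tmp:
        in_path = Path(tmp) / f"{fastq_name}.reads_in.txt"
        out_path = Path(tmp) / f"{fastq_name}.reads_out.txt"
        # Stand-in for the counts samtools would write during dehosting
        in_path.write_text(f"{n_in}\n")
        out_path.write_text(f"{n_out}\n")
        # Parse before the directory (and both files) is removed on exit
        return {
            "reads_in": int(in_path.read_text()),
            "reads_out": int(out_path.read_text()),
        }
```

Because each call gets its own directory, two instances processing the same FASTQ names never write to the same count file.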
CM.
Hi @bede
Hostile looks pretty awesome! You've pretty much got everything already in place to submit to Bioconda.
Are you OK if I do this? Otherwise if you are planning to, I'll hold off.
Cheers!
Robert
For some datasets, Bowtie2's CPU utilisation decreases with increasing numbers of threads. Workaround is not to use more than 8-16 threads. Raised upstream BenLangmead/bowtie2#437
For unpaired short reads
% hostile dehost --fastq1 tests/data/h37rv_10.r1.fastq.gz --fastq2 tests/data/h37rv_10.r2.fastq.gz --out-dir test
INFO: Using Bowtie2
INFO: Using cached human index (/Users/bede/Library/Application Support/hostile/human-bowtie2)
Dehosting: 0%| | 0/1 [00:00<?, ?it/s]Exception occurred during executing command bowtie2 -x '/Users/bede/Library/Application Support/hostile/human-bowtie2' -1 'tests/data/h37rv_10.r1.fastq.gz' -2 'tests/data/h37rv_10.r2.fastq.gz' -k 1 --mm -p 10| tee >(samtools view -F 256 -c - > test/h37rv_10.r1.reads_in.txt)| samtools view --threads 5 -f 12 - | tee >(samtools view -F 256 -c - > test/h37rv_10.r1.reads_out.txt) | awk 'BEGIN{FS=OFS="\t"} {$1=int((NR+1)/2)" "; print $0}' | samtools fastq --threads 5 -c 6 -N -1 'test/h37rv_10.r1.dehosted_1.fastq.gz' -2 'test/h37rv_10.r2.dehosted_2.fastq.gz': Command '['/bin/bash', '-c', 'bowtie2 -x \'/Users/bede/Library/Application Support/hostile/human-bowtie2\' -1 \'tests/data/h37rv_10.r1.fastq.gz\' -2 \'tests/data/h37rv_10.r2.fastq.gz\' -k 1 --mm -p 10| tee >(samtools view -F 256 -c - > test/h37rv_10.r1.reads_in.txt)| samtools view --threads 5 -f 12 - | tee >(samtools view -F 256 -c - > test/h37rv_10.r1.reads_out.txt) | awk \'BEGIN{FS=OFS="\\t"} {$1=int((NR+1)/2)" "; print $0}\' | samtools fastq --threads 5 -c 6 -N -1 \'test/h37rv_10.r1.dehosted_1.fastq.gz\' -2 \'test/h37rv_10.r2.dehosted_2.fastq.gz\'']' returned non-zero exit status 1.
Dehosting: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 11.09it/s]
Traceback (most recent call last):
  File "/Users/bede/miniconda3/envs/hostile/bin/hostile", line 8, in <module>
    sys.exit(main())
  File "/Users/bede/Research/Git/hostile/src/hostile/cli.py", line 68, in main
    defopt.run(
  File "/Users/bede/miniconda3/envs/hostile/lib/python3.10/site-packages/defopt.py", line 356, in run
    return call()
  File "/Users/bede/Research/Git/hostile/src/hostile/cli.py", line 28, in dehost
    stats = lib.dehost_paired_fastqs(
  File "/Users/bede/Research/Git/hostile/src/hostile/lib.py", line 140, in dehost_paired_fastqs
    stats = gather_stats(fastqs, out_dir=out_dir)
  File "/Users/bede/Research/Git/hostile/src/hostile/lib.py", line 89, in gather_stats
    n_reads_in = util.parse_count_file(n_reads_in_path)
  File "/Users/bede/Research/Git/hostile/src/hostile/util.py", line 55, in parse_count_file
    with open(path, "r") as fh:
FileNotFoundError: [Errno 2] No such file or directory: 'test/h37rv_10.r1.reads_in.txt'
Currently a default ref/index is downloaded even if a custom index is specified. Thanks for raising @pvanheus
Notes
.mmi, or .fasta with or without compression
Should probably complain about output files already existing unless a --force flag is given
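One way the existing-output check might look, as a sketch only: `check_outputs` and its `force` parameter are hypothetical names, and the error message is illustrative rather than Hostile's actual output.

```python
from pathlib import Path
from typing import Iterable


def check_outputs(paths: Iterable[Path], force: bool = False) -> None:
    """Raise if any output path already exists and force is not set.

    Hypothetical helper for illustration; not part of Hostile.
    """
    existing = [str(p) for p in paths if Path(p).exists()]
    if existing and not force:
        raise FileExistsError(
            "Output file(s) already exist: "
            + ", ".join(existing)
            + ". Use --force to overwrite."
        )
```

Running the check before any alignment starts means nothing is overwritten or half-written when the user forgets the flag.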
Hi! I was just wondering if this would be suitable for removing human contaminants from metatranscriptomics (RNA-seq)? And, if so, would you recommend using the default human-t2t or the human-t2t + argos985?
Thank you!
Refactoring around Task and Batch classes could simplify things, and make it practically possible to use temporary directories. Initial GIL paranoia wrt parallelisation led to slightly unwieldy current implementation.
Would probably have to be for single reads only (nanopore)
gen_clean_cmd() and gen_paired_clean_cmd() both mutate Aligner.cmd and Aligner.paired_cmd, and since Aligner is instantiated once, templating fails after processing the first fastq / pair of fastqs, leading to corruption of the output of subsequent samples when using the Python API. Templating should not mutate the Aligner instance. Needs tests for single and paired reads.
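A sketch of non-mutating templating, assuming the command is stored as a string with named placeholders. The class below is a simplified stand-in, not Hostile's Aligner: the template text and method signature are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Aligner:
    # Hypothetical template; the real command is considerably longer
    cmd: str = "aligner -x {index} -U {fastq}"

    def gen_clean_cmd(self, index: str, fastq: str) -> str:
        # Build and return a new string; self.cmd keeps its placeholders
        return self.cmd.format(index=index, fastq=fastq)


aligner = Aligner()
first = aligner.gen_clean_cmd("human", "a.fastq.gz")
second = aligner.gen_clean_cmd("human", "b.fastq.gz")
```

Marking the dataclass as frozen makes accidental reassignment of `cmd` raise an error, so a single instance can safely be reused across samples.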
e.g. Removed x/y reads (z%)
Needs a test case
Currently a custom database can be specified using --index. However, this needs to be already available on a local filesystem. If no index is specified and the default (human-t2t-hla) is not cached locally, it is downloaded. It would be useful if applications depending on Hostile could override the default database such that a custom database could be automatically downloaded on first run if not already present.
Another way to do this would be to implement a database / fetch subcommand
Masking instructions are buried in the supplement. Masking between arbitrary refs should be automated and integrated into the tool as a subcommand
Currently if a genome/index download is abandoned, Hostile may think it's present and correct, leading to errors. Could download to a temp location and move into $XDG_DATA_DIR, or download and rename, etc.
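The download-then-move idea could be sketched like this. `fetch_atomically` and the `write_payload` callback are hypothetical names; the callback stands in for whatever actually streams the download to disk.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable


def fetch_atomically(dest: Path, write_payload: Callable[[Path], None]) -> Path:
    """Write to a temporary file next to dest, then rename it into place.

    Hypothetical helper for illustration; not part of Hostile.
    """
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Same directory as dest, so the final rename stays on one filesystem
    fd, tmp_name = tempfile.mkstemp(dir=dest.parent, suffix=".partial")
    os.close(fd)
    tmp_path = Path(tmp_name)
    try:
        write_payload(tmp_path)  # e.g. stream the download to tmp_path
        os.replace(tmp_path, dest)  # dest appears only once the write finished
    except BaseException:
        tmp_path.unlink(missing_ok=True)  # discard the partial file
        raise
    return dest
```

If the download is interrupted, the destination path never exists, so a later run cannot mistake a partial file for a complete index.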
Two calls to tee in stream?