hostile's People

Contributors

bede

hostile's Issues

Regarding the choice of Reference genomes

Hi,
I've noticed that Hostile provides three options for reference genomes. I'm wondering which reference I should select in order to remove human genes from my microbiome samples before conducting metagenomics classification.

Could you please explain the advantages of including HLA, argos985, and mycob140 in the host removal process, particularly in the context of clinical samples with CNS infections?

Automatically choose most appropriate alignment backend for the input read type

Currently Bowtie2 is the default backend and is tried first regardless of the input read type, with Minimap2 as the fallback. As Bowtie2 is poorly suited to long reads and long read performance was evaluated with Minimap2 in the paper, Bowtie2 should probably not be the default for long reads.

How I think it should work:

  • Paired input --> Bowtie2 backend by default
  • Paired input with --aligner minimap2 --> Minimap2 with sr preset
  • Unpaired input --> Minimap2 with map-ont preset
  • Unpaired input with --aligner bowtie2 --> Bowtie2 unpaired mode

I don't think there is a major need to customise away from the map-ont preset for other long-read technologies, given that we are simply throwing reads out, but this may need to be revisited in future.
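The selection rules listed above could be sketched roughly as follows. This is only an illustration of the proposed behaviour, not Hostile's actual code: the function name `choose_aligner` and its return shape are hypothetical, and only the backend names and the `sr` / `map-ont` presets come from the issue text.

```python
from typing import Optional, Tuple


def choose_aligner(paired: bool, aligner: Optional[str] = None) -> Tuple[str, Optional[str]]:
    """Return (backend, minimap2_preset); the preset is None for Bowtie2.

    Hypothetical helper illustrating the proposal, not part of Hostile.
    """
    if aligner not in (None, "bowtie2", "minimap2"):
        raise ValueError(f"Unsupported aligner: {aligner!r}")

    # With no explicit choice: paired input -> Bowtie2, unpaired input -> Minimap2
    backend = aligner or ("bowtie2" if paired else "minimap2")

    if backend == "bowtie2":
        # Bowtie2 has no preset; paired vs unpaired mode follows from the input
        return backend, None

    # Minimap2: short-read preset for paired input, map-ont otherwise
    return backend, ("sr" if paired else "map-ont")
```

Keeping the decision in one small pure function would make the four cases easy to cover with unit tests.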

Adding to Bioconda

Hi @bede

Hostile looks pretty awesome! You've pretty much got everything already in place to submit to Bioconda.

Are you OK if I do this? Otherwise if you are planning to, I'll hold off.

Cheers!
Robert

out_dir / --out-dir broken

% hostile dehost --fastq1 tests/data/h37rv_10.r1.fastq.gz --fastq2 tests/data/h37rv_10.r2.fastq.gz --out-dir test
INFO: Using Bowtie2
INFO: Using cached human index (/Users/bede/Library/Application Support/hostile/human-bowtie2)
Dehosting:   0%|                                                                                                                | 0/1 [00:00<?, ?it/s]Exception occurred during executing command bowtie2 -x '/Users/bede/Library/Application Support/hostile/human-bowtie2' -1 'tests/data/h37rv_10.r1.fastq.gz' -2 'tests/data/h37rv_10.r2.fastq.gz' -k 1 --mm -p 10| tee >(samtools view -F 256 -c - > test/h37rv_10.r1.reads_in.txt)| samtools view --threads 5 -f 12 - | tee >(samtools view -F 256 -c - > test/h37rv_10.r1.reads_out.txt) | awk 'BEGIN{FS=OFS="\t"} {$1=int((NR+1)/2)" "; print $0}' | samtools fastq --threads 5 -c 6 -N -1 'test/h37rv_10.r1.dehosted_1.fastq.gz' -2 'test/h37rv_10.r2.dehosted_2.fastq.gz': Command '['/bin/bash', '-c', 'bowtie2 -x \'/Users/bede/Library/Application Support/hostile/human-bowtie2\' -1 \'tests/data/h37rv_10.r1.fastq.gz\' -2 \'tests/data/h37rv_10.r2.fastq.gz\' -k 1 --mm -p 10| tee >(samtools view -F 256 -c - > test/h37rv_10.r1.reads_in.txt)| samtools view --threads 5 -f 12 - | tee >(samtools view -F 256 -c - > test/h37rv_10.r1.reads_out.txt) | awk \'BEGIN{FS=OFS="\\t"} {$1=int((NR+1)/2)" "; print $0}\' | samtools fastq --threads 5 -c 6 -N -1 \'test/h37rv_10.r1.dehosted_1.fastq.gz\' -2 \'test/h37rv_10.r2.dehosted_2.fastq.gz\'']' returned non-zero exit status 1.
Dehosting: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 11.09it/s]
Traceback (most recent call last):
  File "/Users/bede/miniconda3/envs/hostile/bin/hostile", line 8, in <module>
    sys.exit(main())
  File "/Users/bede/Research/Git/hostile/src/hostile/cli.py", line 68, in main
    defopt.run(
  File "/Users/bede/miniconda3/envs/hostile/lib/python3.10/site-packages/defopt.py", line 356, in run
    return call()
  File "/Users/bede/Research/Git/hostile/src/hostile/cli.py", line 28, in dehost
    stats = lib.dehost_paired_fastqs(
  File "/Users/bede/Research/Git/hostile/src/hostile/lib.py", line 140, in dehost_paired_fastqs
    stats = gather_stats(fastqs, out_dir=out_dir)
  File "/Users/bede/Research/Git/hostile/src/hostile/lib.py", line 89, in gather_stats
    n_reads_in = util.parse_count_file(n_reads_in_path)
  File "/Users/bede/Research/Git/hostile/src/hostile/util.py", line 55, in parse_count_file
    with open(path, "r") as fh:
FileNotFoundError: [Errno 2] No such file or directory: 'test/h37rv_10.r1.reads_in.txt'

Support --custom-index

Notes

  • Minimap2 (mm2) accepts either a prebuilt .mmi index or a FASTA file, with or without compression
  • Bowtie2 (bt2) requires a prebuilt index (takes forever to build) split across numerous files, specified as a path without an extension
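A minimal sketch of how a custom index path might be validated under the two conventions noted above. The function name `check_custom_index` is hypothetical, and the list of accepted FASTA extensions is an assumption made for illustration (Minimap2 itself does not check extensions). The Bowtie2 suffixes are the standard small-index file names; large indexes use `.bt2l` instead and are not handled here.

```python
from pathlib import Path

# Standard file suffixes of a small Bowtie2 index (large indexes use .bt2l)
BT2_SUFFIXES = (".1.bt2", ".2.bt2", ".3.bt2", ".4.bt2", ".rev.1.bt2", ".rev.2.bt2")
# Assumed set of extensions for Minimap2 input; illustrative only
MM2_SUFFIXES = (".mmi", ".fa", ".fasta", ".fa.gz", ".fasta.gz")


def check_custom_index(index: str, aligner: str) -> Path:
    """Return the index path if it looks usable by the aligner, else raise."""
    path = Path(index)
    if aligner == "minimap2":
        # Minimap2 takes a single file: a .mmi index or a (compressed) FASTA
        if not path.name.endswith(MM2_SUFFIXES):
            raise ValueError(f"Unexpected Minimap2 index extension: {path.name}")
        if not path.is_file():
            raise FileNotFoundError(path)
        return path
    if aligner == "bowtie2":
        # Bowtie2 takes a prefix without an extension; every part must exist
        missing = [s for s in BT2_SUFFIXES if not Path(f"{path}{s}").is_file()]
        if missing:
            raise FileNotFoundError(f"Missing Bowtie2 index files: {missing}")
        return path
    raise ValueError(f"Unsupported aligner: {aligner!r}")
```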

Suitable for metatranscriptomics?

Hi! I was just wondering if this would be suitable for removing human contaminants from metatranscriptomics (RNA-seq)? And, if so, would you recommend using the default human-t2t or the human-t2t + argos985?

Thank you!

Refactor with more OOP

Refactoring around Task and Batch classes could simplify things and make it practical to use temporary directories. Initial GIL paranoia regarding parallelisation led to the slightly unwieldy current implementation.

Corrupted output if decontaminating more than one sample using Python API

gen_clean_cmd() and gen_paired_clean_cmd() both mutate Aligner.cmd and Aligner.paired_cmd. Since Aligner is instantiated only once, templating fails after the first fastq (or pair of fastqs) has been processed, which corrupts the output of subsequent samples when using the Python API. Templating should not mutate the Aligner instance. Needs tests for single and paired reads.
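One way to avoid this class of bug is to keep the command template immutable and return a freshly rendered string on each call. The class below is a simplified, hypothetical stand-in and not Hostile's real Aligner: the field names, the method signature, and the placeholder names are assumptions made for illustration.

```python
import shlex
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)  # frozen: accidental assignment to self.cmd raises an error
class Aligner:
    name: str
    cmd: str  # template containing {index} and {fastq} placeholders

    def gen_clean_cmd(self, fastq: Path, index: Path) -> str:
        # str.format returns a new string, so self.cmd remains a reusable template
        return self.cmd.format(
            index=shlex.quote(str(index)),
            fastq=shlex.quote(str(fastq)),
        )


aligner = Aligner(name="minimap2", cmd="minimap2 -ax map-ont {index} {fastq}")
first = aligner.gen_clean_cmd(Path("sample1.fastq.gz"), Path("human.mmi"))
second = aligner.gen_clean_cmd(Path("sample2.fastq.gz"), Path("human.mmi"))
```

A regression test along these lines would render commands for two samples from the same instance and check that the second command refers only to the second sample.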

Allow non-default databases in cloud bucket to be downloaded on first run

Currently a custom database can be specified using --index. However, it must already be available on a local filesystem. If no index is specified and the default (human-t2t-hla) is not cached locally, the default is downloaded. It would be useful if applications depending on Hostile could override the default database so that a custom database is downloaded automatically on first run if not already present.

Another way to do this would be to implement a database / fetch subcommand.

Automatic masking

Masking instructions are buried in the supplement. Masking between arbitrary references should be automated and integrated into the tool as a subcommand.

Discard partially downloaded indexes

Currently, if a genome/index download is abandoned, Hostile may think the index is present and correct, leading to errors. It could download to a temporary location and then move the file into $XDG_DATA_DIR, or download and then rename, etc.
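The download-then-rename idea could look roughly like this. It is a sketch under stated assumptions rather than Hostile's implementation: the function name `download_atomically` and the `.part` suffix are hypothetical, and the standard library's `urllib` is used only to keep the example self-contained.

```python
import os
import shutil
import tempfile
import urllib.request
from pathlib import Path


def download_atomically(url: str, dest: Path) -> Path:
    """Download url to dest so that dest only ever exists in complete form."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Create the temporary file in the destination directory so that the final
    # rename stays on one filesystem
    fd, tmp_name = tempfile.mkstemp(dir=dest.parent, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as out, urllib.request.urlopen(url) as resp:
            shutil.copyfileobj(resp, out)
        # os.replace swaps the finished file into place in a single step
        os.replace(tmp_name, dest)
    except BaseException:
        # Interrupted or failed download: remove the partial file and re-raise
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return dest
```

With this pattern, a cache check that only tests whether the destination exists would no longer mistake an abandoned download for a complete index.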
