Giter VIP home page Giter VIP logo

harpy's People

Contributors

pdimens avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar

Forkers

gmkov

harpy's Issues

Molecule lengths inferred from BX tags with BWA

Hi Pavel,

I tested the new conda version of Harpy installed with mamba, and it works great!
I just had to install the R package "circlize" myself in my Harpy environment because apparently it didn't come with the conda installation.

I'm now working with the BWA mapping pipeline instead of EMA following your suggestion. The alignment works well, as do the various sumstats output by Harpy align, with the exception of the reportGencov.Rmd script, which I had to modify in two places to match the format of the sample.cov.gz file produced with the BWA option (I replaced 'Barcodes' with 'alignments' and modified the circos.genomicTrack() function).

I was wondering if you were planning to integrate in the BWA pipeline the molecule lengths inferred from BX tags (and also possibly binned as a histogram) as you did with the EMA option. We are currently looking into how to extract this information directly from the sorted bam, I'll let you know if we find a solution on our side.

Best
Pierre-Alexandre

remove ema-h

Since ema on bioconda has been updated to 0.7.0, remove the bundled ema-h and likewise remove that part from the build script and add ema as a dep in the meta.yaml file necessary for conda-build.

Also, change references to ema-h to ema in the snakemake workflow

validate samplename

In steps with an input vcf, cross-reference the sample names in the vcf with the samplenames available in the input directory.

The goal is to catch sample name mismatching early, or to run without error and print some kind of log info saying "sample blah was in the directory but not listed in the vcf"

-x for module `qc` creates error

Describe the bug

When using -x (--extra-params) for the QC module, the workflow results in an error about config files not existing.

Harpy Version

0.8.0 and under

File that triggers the error (if applicable)

No response

Harpy error log

╭─ Harpy qc ──────────────────────────────────────────────────────────────╮
│ Samples: 2                                                              │
│ Output Directory: QC/                                                   │
╰─────────────────────────── Running Workflow ────────────────────────────╯
Traceback (most recent call last):

  File "/home/runner/micromamba/envs/harpy/lib/python3.12/site-packages/snakemake/cli.py", line 1940, in args_to_api
    config_settings=ConfigSettings(
                    ^^^^^^^^^^^^^^^

  File "<string>", line 6, in __init__

  File "/home/runner/micromamba/envs/harpy/lib/python3.12/site-packages/snakemake/settings.py", line 304, in __post_init__
    self.overwrite_config = self._get_overwrite_config()
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/home/runner/micromamba/envs/harpy/lib/python3.12/site-packages/snakemake/settings.py", line 312, in _get_overwrite_config
    update_config(overwrite_config, load_configfile(f))
                                    ^^^^^^^^^^^^^^^^^^

  File "/home/runner/micromamba/envs/harpy/lib/python3.12/site-packages/snakemake/common/configfile.py", line 38, in load_configfile
    config = _load_configfile(configpath)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/home/runner/micromamba/envs/harpy/lib/python3.12/site-packages/snakemake/common/configfile.py", line 12, in _load_configfile
    obj = open(configpath_or_obj, encoding="utf-8")
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

FileNotFoundError: [Errno 2] No such file or directory: 'extra: --trim_poly_g\n'


### Before submitting

- [X] I have read the [relevant documentation](https://pdimens.github.io/harpy/).
- [X] I am using the latest release of Harpy.

mark duplicates with samtools

Using samtools markdup will allow marking duplicates in a barcode-aware fashion. The process follows:

samtools collate -o namecollate.bam example.bam

samtools fixmate -m namecollate.bam fixmate.bam

samtools sort -o positionsort.bam fixmate.bam

samtools markdup positionsort.bam markdup.bam

add minimap2 aligner

Description of feature

For reads >100bp, minimap2 is comparable to bwa in accuracy and much, much faster. Add it as an aligner option.

containerize

With the release of Snakemake 8.x, this would be a good time to reevaluate how Harpy is distributed. Instead of having one environment with all the dependencies (~1gb disk space), it may be a lot easier to strip the harpy conda package down to the essentials and containerize the rest and let snakemake use that container (or multiple) to do execute the workflows.

The essentials would likely just be:

  • snakemake
  • rich-click

It would then make sense to create a container for each analytical module so that the deps can be managed independent of the entire workflow

output prefix option

Description of feature

Provide a option like -o and --output-prefix to specify an output folder, otherwise default to harpy

ema-h preproc -p generating empty ema-bin-xxx files

Hello,

Thanks for this fantastic pipeline and the detailed notes that come with it, very useful!
I went through the quality trimming and read mapping modules, the snakemake rules are executed correctly, but a problem happens when running ema-h.

I checked the ema-h commands by hand one by one out of the Harpy pipeline. I found no problem with the ema-h count -p command, which outputs a ~50Mo ema-ncnt file. The ema-count Summary generated by Harpy also looks good, e.g.:

"Sample","ReadsIgnored","BarcodeValid","BarcodeTotal","PercentValid","PercentIgnored"
"Sample_1","302610","22730346","22730346","100","1.33"

But then the ema-h preproc -p command generates a list of empty ema-bin-xxx output files.

Here is the detailed command line I ran out of Harpy:
seqfu interleave -1 /my_path_to_/Trimming/Sample_1.R1.fq.gz -2 /my_path_to_/Trimming/Sample_1.R2.fq.gz | ema-h preproc -p -n 500 -t 2 -o ./ /my_path_to_/Alignments/ema/count/Sample_1.ema-ncnt

And the output:

:: Bucketing 1 inputs into 500 files with 2 threads
:: Loading known counts file /my_path_to_/Alignments/ema/count/Sample_1.ema-ncnt ...
:: Loading known counts ... done in 4.8 s
:: File assignment ... done in 2.0 s
:: [==================================================] 0 out of 22,730,346 (100.00%)
Writing barcodes ... done in 49.5 s

Looks like although the data contain 22,730,346 valid barcodes, none of them has been processed/written by ema-h preproc -p.

Do you have any idea what might be going on here?
Thanks!
Pierre-Alexandre

harpy impute error IsADirectoryError:

Describe the bug

Thank you for this great resource. I realise it still under development, so unsure about whether to expect modules to run smoothly.

After some initial teething issues with sample names in my own vcf file created with bcftools mpileup outside harpy, I ran into a cryptic conda issue. So I went a step back, and used the harpy snp mpileup module to obtain a bcf, which finished with an error (i think just unable to compile the bcf html report) but the files looked ok within SNP/mpileup/. So then I tried imputation with harpy impute and using this file as input.

my command is:

harpy impute --threads 10 --parameters stitch.params \
--vcf SNP/mpileup/variants.raw.bcf \
bams 2> log.impute

my stitch param file is:

model   usebx   bxlimit k       s       ngen
diploid TRUE    50000   30      1       500
diploid FALSE   50000   30      1       500
diploid TRUE    50000   40      1       500
diploid FALSE   50000   40      1       500

This produces the same cryptic conda error as before, which I paste below.

I also ran the harpy preflight bam bams and whole snakemake build a DAG correctly and seems to create/activate conda environments fine, and runs correctly (after I manually installed install.packages("flexdashboard"), this was the error inititally). So the preflight checks with the bam files were ok.

Harpy Version

0.9.1

File that triggers the error (if applicable)

No response

Harpy error log

(harpy) [mgm49@login-n-1 harpy]$ harpy impute --threads 10 --parameters stitch.params \
--vcf SNP/mpileup/variants.raw.bcf \
bams 2> log.impute

(harpy) [mgm49@login-n-1 harpy]$ less log.impute 
Traceback (most recent call last):
  File "/home/mgm49/miniconda3/envs/harpy/bin/harpy", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/mgm49/miniconda3/envs/harpy/lib/python3.12/site-packages/harpy/__main__.py", line 383, in main
    cli()
  File "/home/mgm49/miniconda3/envs/harpy/lib/python3.12/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgm49/miniconda3/envs/harpy/lib/python3.12/site-packages/rich_click/rich_command.py", line 126, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/home/mgm49/miniconda3/envs/harpy/lib/python3.12/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgm49/miniconda3/envs/harpy/lib/python3.12/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgm49/miniconda3/envs/harpy/lib/python3.12/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgm49/miniconda3/envs/harpy/lib/python3.12/site-packages/harpy/impute.py", line 66, in impute
    fetch_file(f"{i}.Rmd", f"{workflowdir}/report/")
  File "/home/mgm49/miniconda3/envs/harpy/lib/python3.12/site-packages/harpy/helperfunctions.py", line 124, in fetch_file
    shutil.copy2(result, destination)
  File "/home/mgm49/miniconda3/envs/harpy/lib/python3.12/shutil.py", line 475, in copy2
    copyfile(src, dst, follow_symlinks=follow_symlinks)
  File "/home/mgm49/miniconda3/envs/harpy/lib/python3.12/shutil.py", line 260, in copyfile
    with open(src, 'rb') as fsrc:
         ^^^^^^^^^^^^^^^
IsADirectoryError: [Errno 21] Is a directory: '/rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy/Impute'

Before submitting

Add bam indexing to snp module

Description of feature

Currently, harpy snp expects the input bam files to be indexed. Create a rule to index bam files so as to avoid hiccups. #70

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.