dnanexus / indextools

IndexTools is a toolkit for extremely fast NGS analysis based on index files.

License: MIT License

Makefile 2.60% Python 96.78% Dockerfile 0.61%

indextools's Introduction

DNAnexus

DNAnexus Apps and Scripts

applets

binning_step0: BioBin Pipeline
biobin_pipeline
binning_step1: BioBin Pipeline
biobin_pipeline
binning_step2: BioBin Pipeline
biobin_pipeline
binning_step3: BioBin Pipeline
biobin_pipeline
impute2_group_join: Impute2_group_join
This app can be used to merge multiple imputed impute2 files
plato_biobin: PLATO BioBin Regression Analysis
PLATO_BioBin
vcf_batch: VCF Batch effect tester
vcf_batch

apps

association_result_annotation: Annotate GWAS, PheWAS Associations
association_result_annotation
biobin:
This app runs the latest development build of the rare variant binning tool BioBin.
generate_phenotype_matrix: Generate Phenotype Matrix
generate_phenotype_matrix
genotype_case_control: Generate Case/Control by Genotype
This app reports the number of cases and controls for each genotype
impute2: imputation
This will perform imputation using Impute2
impute2_to_plink: Impute2 To PLINK
Convert Impute2 file to PLINK files
plato_single_variant: PLATO - Single Variant Analysis
This app lets you run single-variant association testing against a single phenotype (GWAS) or multiple phenotypes (PheWAS)
rl_sleeper_app: sleeper
This app provides some useful tools for working with data in DNAnexus. It is designed to be run on the command line with "dx run --ssh RL_Sleeper_App" in the project containing the data you want to explore (use "dx select" to switch projects as needed).
shapeit2: SHAPEIT2
This app performs phasing using SHAPEIT2
strand_align: Strand Align
Strand Align prior to phasing
vcf_annotation_formatter:
Extracts and reformats VCF annotations (CLINVAR, dbNSFP, SIFT, SNPEff)
QC_apps subfolder:
drop_marker_sample: Drop Markers and/or Samples (PLINK)
- drop_marker_sample
drop_relateds: Relatedness Filter (IBD)
- drop_relateds
extract_marker_sample: Drop Markers and/or Samples (PLINK)
- extract_marker_sample
maf_filter: Marker MAF Rate Filter (PLINK)
- maf_filter
marker_call_filter: Marker Call Rate Filter (PLINK)
- marker_call_filter
missing_summary: Missingness Summary (PLINK)
- Returns missingness rate by sample
pca: Principal Component Analysis using SMARTPCA
- pca
sample_call_filter: Sample Call Rate Filter (PLINK)
- sample_call_filter

scripts

cat_vcf.py *
download_intervals.py *
download_part.py *
estimate_size.py *
interval_pad.py
- Reads a BED file from standard input, pads the intervals, sorts them, and then outputs intervals guaranteed to be non-overlapping
update_applet.sh *
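
The pad, sort, and merge behavior described for interval_pad.py can be sketched as follows. This is a minimal illustration, not the script's actual code: the function names `parse_bed` and `pad_and_merge` are hypothetical, contigs are sorted lexicographically, and abutting intervals are merged along with overlapping ones.

```python
def parse_bed(lines):
    """Yield (chrom, start, end) from BED lines, skipping blanks and headers."""
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        yield chrom, int(start), int(end)


def pad_and_merge(intervals, pad):
    """Pad each interval by `pad` bases on both sides, sort, and merge overlaps."""
    # Clamp the padded start at 0 so coordinates never go negative.
    padded = sorted(
        (chrom, max(0, start - pad), end + pad) for chrom, start, end in intervals
    )
    merged = []
    for chrom, start, end in padded:
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlaps or abuts the previous interval: extend it.
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged
```

To use it as a filter, one would pass `sys.stdin` to `parse_bed`, feed the result to `pad_and_merge`, and print each tuple tab-separated. Note that this sketch does not clamp padded ends to contig lengths, since it has no access to contig sizes.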

sequencing

bcftools_view:
- Calls "bcftools view". Still in experimental stages.
calc_ibd:
- Calculates a pairwise IBD estimate from either VCF or PLINK files using PLINK 1.9.
call_bqsr: Base Quality Score Recalibration
- Call GATK BaseRecalibrator and return the tables for use in HaplotypeCaller
call_genotypes:
- Obsolete, do not use; use geno_p instead. Calls GATK GenotypeGVCFs.
call_hc:
- Call GATK HaplotypeCaller and return gVCF files
call_vqsr:
- Calls GATK VariantRecalibrator and returns the files needed to apply the recalibration
cat_variants: combine_variants
- Combines non-overlapping VCF files with the same subjects. A reimplementation of GATK CatVariants (GATK CatVariants available upon request)
combine_variants: combine_variants
- Calls GATK CombineVariants to merge VCF files
gen_ancestry:
- Determine Ancestry from PCA. Uses an eigenvector file and training dataset listing known ancestries. Runs QDA to determine posterior ancestries for all samples, even those in the training set.
gen_related_todrop:
- Uses a PLINK IBD file to determine the minimal set of samples to drop in order to generate an unrelated sample set. Uses a minimum vertex cut algorithm of the related samples to get
geno_p:
- Calls GATK GenotypeGVCFs in parallel by chromosome
merge_gvcfs:
- Calls GATK CombineGVCFs
plink_merge:
- Merge PLINK bed/bim/fam files using PLINK 1.9
select_variants: VCF QC
- Calls GATK SelectVariants
variant_annotator: VCF QC
- Calls GATK VariantAnnotator
vcf_annotate: Annotate VCF File
- Use a variety of tools to annotate a sites-only VCF.
vcf_concordance: VCF Concordance
- Generate concordance metrics from VCF file(s) using GATK GenotypeConcordance. Not recommended for large files.
vcf_gen_lof:
- Subset a VCF from vcf_annotate based on the given annotations to get a sites-only VCF of loss-of-function variants.
vcf_pca:
- Uses PLINK 1.9 and eigenstrat 6.0 to calculate principal components from VCF or PLINK bed/bim/fam files.
vcf_qc:
- Calls GATK ApplyRecalibration and GATK VariantFiltration to apply filters to VCF files.
vcf_query:
- Calls "bcftools query" to extract annotations from the VCF file. Used in the stripping of files for MEGAbase
vcf_sitesonly: VCF QC
- Generates a sites-only file from full VCF files.
vcf_slice: Slice VCF File(s)
- Return a small section of a VCF file (similar to tabix). For large output, many small regions, or subsetting samples, use subset_vcf instead.
vcf_summary: VCF Summary Statistics
- Generate summary statistics for a VCF file (by sample and by variant)
vcf_to_plink:
- Uses PLINK 1.9 to convert VCF files to PLINK bed/bim/fam files

indextools's People

rcooper47 jaydosunmu jdidion

indextools's Issues

Add unit tests for utils.py

Test all functions/classes in utils.py

Build Docker image that contains IndexTools

Currently, the Docker image just sets up a development environment. Create a Docker image that actually contains IndexTools, and add documentation.

Add ability to partition around N's

Reimplement ScatterIntervalsByNs or just provide bed files for each common reference genome.

Use RefGet to fetch contig information

When the primary file is not available or does not contain contig information, use RefGet to fetch the contig names and sizes, rather than requiring a .genome file. If the primary file is supplied and it specifies the ID or hash of the reference genome, use that to look up the metadata; otherwise, require the genome ID or hash to be passed as a command line option.

RefGet spec: http://samtools.github.io/hts-specs/refget.html

Server list:

https://www.ebi.ac.uk/ena/cram/

Python client: https://github.com/ga4gh/refget-client

Replace InterLap with a library

Options:

https://github.com/brentp/quicksect
https://github.com/lh3/cgranges (has no equivalent to closest, but that is not currently being used in IndexTools)

These other options are less interesting

https://biocore-ntnu.github.io/pyranges/ - optimized for many-vs-many queries
https://github.com/hunt-genes/ncls/ - optimized for many-vs-many queries (this library is a dependency of pyranges)
https://github.com/dcjones/coitrees - no python binding

Add unit tests for regions.py

Adding this issue to start work on regions.py

IndexTools broken on python 3.7

There appears to be a difference between 3.6 and 3.7 with the attributes available on types:

root@39d3db5b72a7:/# which indextools 
/usr/local/bin/indextools
root@39d3db5b72a7:/# indextools
Traceback (most recent call last):
  File "/usr/local/bin/indextools", line 6, in <module>
    from indextools.console import indextools
  File "/usr/local/lib/python3.7/site-packages/indextools/console/__init__.py", line 37, in <module>
    ac.conversion(decorated=parse_region)
  File "/usr/local/lib/python3.7/site-packages/autoclick/types/__init__.py", line 68, in conversion
    return decorator(decorated)
  File "/usr/local/lib/python3.7/site-packages/autoclick/types/__init__.py", line 63, in decorator
    click_type = ParamTypeAdapter(_dest_type.__name__, target)
  File "/usr/local/lib/python3.7/typing.py", line 702, in __getattr__
    raise AttributeError(attr)
AttributeError: __name__

Set up travis CI

Add travis config file to run tests on commit.

InterLap and Quicksect give different results for find nearest operation

See the benchmarking branch. I create InterLap and Quicksect interval trees with the same intervals, and query with the same set of query intervals. They return the same results for the find operation, but not for nearest_after. There is probably a bug in the InterLap code.

Add option for using LPT algorithm to compute interval groups
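
For context, LPT (Longest Processing Time first) is a greedy scheduling heuristic: sort items by decreasing size, then repeatedly place the next item into the currently least-loaded group. A minimal sketch, assuming each interval has a numeric volume; the function name `lpt_groups` and the input shape are hypothetical and not part of IndexTools:

```python
import heapq


def lpt_groups(volumes, n_groups):
    """Split items into n_groups with roughly balanced total volume.

    volumes: dict mapping an item identifier to its numeric volume.
    Returns a list of n_groups lists of item identifiers.
    """
    # Min-heap of (current load, group index); the lightest group is on top.
    heap = [(0, i) for i in range(n_groups)]
    heapq.heapify(heap)
    groups = [[] for _ in range(n_groups)]
    # Largest items first, each assigned to the least-loaded group so far.
    for name, vol in sorted(volumes.items(), key=lambda kv: kv[1], reverse=True):
        load, idx = heapq.heappop(heap)
        groups[idx].append(name)
        heapq.heappush(heap, (load + vol, idx))
    return groups
```

LPT ignores genomic adjacency, so the resulting groups are generally not contiguous; whether that is acceptable depends on how the groups are consumed downstream.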

Switch to using scm-based plugin for version management

Currently blocked by this open issue in poetry: python-poetry/poetry#693 (comment)

Add invert option to partition

Provide an invert option so that, e.g., given a blacklist BED file, partitions can be generated for the non-blacklist regions.
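
Inverting a set of regions amounts to taking their complement within each contig. A minimal sketch, assuming 0-based half-open BED coordinates and a known mapping of contig names to lengths; `invert_regions` is a hypothetical name, not an existing IndexTools function:

```python
def invert_regions(regions, contig_sizes):
    """Return the parts of each contig NOT covered by `regions`.

    regions: iterable of (contig, start, end), 0-based half-open.
    contig_sizes: dict mapping contig name to its length.
    """
    by_contig = {}
    for contig, start, end in regions:
        by_contig.setdefault(contig, []).append((start, end))

    inverted = []
    for contig, size in contig_sizes.items():
        cursor = 0  # first position not yet known to be covered
        for start, end in sorted(by_contig.get(contig, [])):
            if start > cursor:
                # Gap between the covered prefix and this region.
                inverted.append((contig, cursor, start))
            cursor = max(cursor, end)
        if cursor < size:
            # Uncovered tail of the contig.
            inverted.append((contig, cursor, size))
    return inverted
```

Contigs that have no blacklist entries come back as a single full-length region, so they remain available for partitioning.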

Aggregate data across multiple samples

It may be useful to generate a single BAM file for parallelization across many samples, rather than one per sample. To do that, we can simply sum the volumes across samples for each interval, and then split on the aggregate data.
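
The summation step could look like the sketch below. It assumes every sample reports volumes keyed by the same interval tuples (identical boundaries across samples); the name `aggregate_volumes` and the dict-per-sample input are hypothetical.

```python
from collections import Counter


def aggregate_volumes(per_sample):
    """Sum per-interval volumes across samples.

    per_sample: iterable of dicts, one per sample, mapping
    (contig, start, end) to that sample's volume for the interval.
    """
    total = Counter()
    for volumes in per_sample:
        # Counter.update adds values for matching keys rather than replacing them.
        total.update(volumes)
    return dict(total)
```

The aggregated dict can then be split the same way as single-sample volumes. If samples were indexed with different interval boundaries, they would first need to be projected onto a common set of intervals, which this sketch does not attempt.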

Add integration tests for partition command

To start, this should just be a single test that runs the partition command on an index file and ensures it produces a BED file with the expected number of partitions.

In the long-term, I'd like this to grow into a parameterized test run across a bunch of index files from different sources.

Replace pysam with a pure-python library

pysam cannot be compiled on Windows, whereas bamnostic and pybam are pure Python and platform-agnostic.

Add option to partition to specify problematic regions

Some regions are more challenging for variant callers, so splits in those regions should be smaller than elsewhere.

Add unit tests for index.py

Motivation

Adding tests allows developers to refactor or add functionality and make sure the module still works correctly. They're lightweight but extremely helpful.

Issue

index.py needs unit test coverage

Merge multiple test_data.json files in directory hierarchy

Support the following use case:

tests
|_test_data.json
|_module1
  |_test_module1.py
  |_test_data.json

Merge data from the two test_data.json files, with the one at lower level of nesting overriding the higher one if any keys collide.
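
One way to express this merge, as a sketch only: collect the directories from the test root down to the test module's directory, then apply each test_data.json in order from shallowest to deepest so that deeper files win on key collisions. The function name `load_merged_test_data` is hypothetical, and the merge is shallow (top-level keys only).

```python
import json
from pathlib import Path


def load_merged_test_data(test_dir, root_dir, filename="test_data.json"):
    """Merge `filename` from root_dir down to test_dir; deeper files override."""
    test_dir = Path(test_dir).resolve()
    root_dir = Path(root_dir).resolve()

    # Walk upward from test_dir until root_dir is reached.
    chain = [test_dir]
    while chain[-1] != root_dir:
        parent = chain[-1].parent
        if parent == chain[-1]:
            raise ValueError(f"{test_dir} is not inside {root_dir}")
        chain.append(parent)

    merged = {}
    # Apply the shallowest directory first so deeper ones override it.
    for directory in reversed(chain):
        path = directory / filename
        if path.is_file():
            with path.open() as handle:
                merged.update(json.load(handle))
    return merged
```

For the layout above, calling it with `tests/module1` as `test_dir` and `tests` as `root_dir` would return the keys of both files, with values from `module1/test_data.json` taking precedence. Nested dictionaries are replaced wholesale rather than merged recursively.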
