pgxcentre / genipe

Genome-wide imputation pipeline

License: Other

Python 97.51% TeX 1.82% Makefile 0.04% R 0.31% Shell 0.32%

genipe's Introduction

genipe - A Python module to perform genome-wide imputation analysis

The genipe module (standing for GENome-wide Imputation PipelinE) includes a script (named genipe-launcher) that automatically runs a genome-wide imputation pipeline using Plink, shapeit and impute2.

If you use genipe in any published work, please cite the paper describing the tool:

Lemieux Perreault LP, Legault MA, Asselin G, Dubé MP: genipe: an automated genome-wide imputation pipeline with automatic reporting and statistical tools. Bioinformatics 2016, 32 (23): 3661-3663 (DOI:10.1093/bioinformatics/btw487).

Documentation

Full documentation is available at http://pgxcentre.github.io/genipe/.

Installation

Version 1.5.0 of genipe should work with the most recent versions of its dependencies.

We recommend installing the package in a Python 3.7 (or later) virtual environment. There are two ways to install: pip or conda.

# Using pip
pip install genipe

# Using conda
conda install genipe -c http://statgen.org/wp-content/uploads/Softwares/genipe

The installation process should install all required dependencies to run the main imputation pipeline. Optional dependencies can also be installed manually in order to perform statistical analysis and data management (see below).

The complete installation procedure is available in the documentation.

Dependencies

The tool requires a standard Python 3.4 (or later) installation with the following modules:

numpy version 1.11.3 or later
Jinja2 version 2.9 or later
pandas version 0.19.2 or later
setuptools version 12.0.5 or later

The tool requires the binaries for Plink, shapeit and impute2.

Optional dependencies

In order to perform data management and statistical analysis (linear, logistic and Cox's regressions), genipe requires the following Python modules:

Matplotlib
scipy
patsy
statsmodels
lifelines
Biopython
pyfaidx
drmaa
pyplink

Note that statsmodels version 0.6 (specifically its MixedLM analysis) is not compatible with numpy version 1.12 or later.

Finally, the tool requires a LaTeX installation to compile the automatically generated report in PDF format.

Testing

Basic testing was implemented for the script to merge impute2 files resulting from different segments of the same chromosome along with some utility functions of the main package. To test the package (once installed), launch Python and execute the following command:

>>> import genipe
>>> genipe.test()

Some functionalities are difficult to test, since they mostly rely on external tools to perform the analysis (i.e. Plink, shapeit and impute2), and it is assumed that these were properly tested by their authors.

We will try to add further tests in the future.

Basic usage

The following options are available when launching a genome-wide imputation analysis.

$ genipe-launcher --help
usage: genipe-launcher [-h] [-v] [--debug] [--thread THREAD] --bfile PREFIX
                       [--reference FILE] [--chrom CHROM [CHROM ...]]
                       [--output-dir DIR] [--bgzip] [--use-drmaa]
                       [--drmaa-config FILE] [--preamble FILE]
                       [--shapeit-bin BINARY] [--shapeit-thread INT]
                       [--shapeit-extra OPTIONS] [--plink-bin BINARY]
                       [--hap-template TEMPLATE] [--legend-template TEMPLATE]
                       [--map-template TEMPLATE] --sample-file FILE
                       [--hap-nonPAR FILE] [--hap-PAR1 FILE] [--hap-PAR2 FILE]
                       [--legend-nonPAR FILE] [--legend-PAR1 FILE]
                       [--legend-PAR2 FILE] [--map-nonPAR FILE]
                       [--map-PAR1 FILE] [--map-PAR2 FILE]
                       [--impute2-bin BINARY] [--segment-length BP]
                       [--filtering-rules RULE [RULE ...]]
                       [--impute2-extra OPTIONS] [--probability FLOAT]
                       [--completion FLOAT] [--info FLOAT]
                       [--report-number NB] [--report-title TITLE]
                       [--report-author AUTHOR]
                       [--report-background BACKGROUND]

Execute the genome-wide imputation pipeline. This script is part of the
'genipe' package, version 1.4.2.

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  --debug               set the logging level to debug
  --thread THREAD       number of threads [1]

Input Options:
  --bfile PREFIX        The prefix of the binary pedfiles (input data).
  --reference FILE      The human reference to perform an initial strand check
                        (useful for genotyped markers not in the IMPUTE2
                        reference files) (optional).

Output Options:
  --chrom CHROM [CHROM ...]
                        The chromosomes to process. It is possible to write
                        'autosomes' to process all the autosomes (from
                        chromosome 1 to 22, inclusively).
  --output-dir DIR      The name of the output directory. [genipe]
  --bgzip               Use bgzip to compress the impute2 files.

HPC Options:
  --use-drmaa           Launch tasks using DRMAA.
  --drmaa-config FILE   The configuration file for tasks (use this option when
                        launching tasks using DRMAA). This file should
                        describe the walltime and the number of
                        nodes/processors to use for each task.
  --preamble FILE       This option should be used when using DRMAA on a HPC
                        to load required module and set environment variables.
                        The content of the file will be added between the
                        'shebang' line and the tool command.

SHAPEIT Options:
  --shapeit-bin BINARY  The SHAPEIT binary if it's not in the path.
  --shapeit-thread INT  The number of thread for phasing. [1]
  --shapeit-extra OPTIONS
                        SHAPEIT extra parameters. Put extra parameters between
                        single or normal quotes (e.g. --shapeit-extra '--
                        states 100 --window 2').

Plink Options:
  --plink-bin BINARY    The Plink binary if it's not in the path.

IMPUTE2 Autosomal Reference:
  --hap-template TEMPLATE
                        The template for IMPUTE2's haplotype files (replace
                        the chromosome number by '{chrom}', e.g.
                        '1000GP_Phase3_chr{chrom}.hap.gz').
  --legend-template TEMPLATE
                        The template for IMPUTE2's legend files (replace the
                        chromosome number by '{chrom}', e.g.
                        '1000GP_Phase3_chr{chrom}.legend.gz').
  --map-template TEMPLATE
                        The template for IMPUTE2's map files (replace the
                        chromosome number by '{chrom}', e.g.
                        'genetic_map_chr{chrom}_combined_b37.txt').
  --sample-file FILE    The name of IMPUTE2's sample file.

IMPUTE2 Chromosome X Reference:
  --hap-nonPAR FILE     The IMPUTE2's haplotype file for the non-
                        pseudoautosomal region of chromosome 23.
  --hap-PAR1 FILE       The IMPUTE2's haplotype file for the first
                        pseudoautosomal region of chromosome 23.
  --hap-PAR2 FILE       The IMPUTE2's haplotype file for the second
                        pseudoautosomal region of chromosome 23.
  --legend-nonPAR FILE  The IMPUTE2's legend file for the non-pseudoautosomal
                        region of chromosome 23.
  --legend-PAR1 FILE    The IMPUTE2's legend file for the first
                        pseudoautosomal region of chromosome 23.
  --legend-PAR2 FILE    The IMPUTE2's legend file for the second
                        pseudoautosomal region of chromosome 23.
  --map-nonPAR FILE     The IMPUTE2's map file for the non-pseudoautosomal
                        region of chromosome 23.
  --map-PAR1 FILE       The IMPUTE2's map file for the first pseudoautosomal
                        region of chromosome 23.
  --map-PAR2 FILE       The IMPUTE2's map file for the second pseudoautosomal
                        region of chromosome 23.

IMPUTE2 Options:
  --impute2-bin BINARY  The IMPUTE2 binary if it's not in the path.
  --segment-length BP   The length of a single segment for imputation. [5e+06]
  --filtering-rules RULE [RULE ...]
                        IMPUTE2 filtering rules (optional).
  --impute2-extra OPTIONS
                        IMPUTE2 extra parameters. Put the extra parameters
                        between single or normal quotes (e.g. --impute2-extra
                        '-buffer 250 -Ne 20000').

IMPUTE2 Merger Options:
  --probability FLOAT   The probability threshold for no calls. [<0.9]
  --completion FLOAT    The completion rate threshold for site exclusion.
                        [<0.98]
  --info FLOAT          The measure of the observed statistical information
                        associated with the allele frequency estimate
                        threshold for site exclusion. [<0.00]

Automatic Report Options:
  --report-number NB    The report number. [genipe automatic report]
  --report-title TITLE  The report title. [genipe: Automatic genome-wide
                        imputation]
  --report-author AUTHOR
                        The report author. [Automatically generated by genipe]
  --report-background BACKGROUND
                        The report background section (can either be a string
                        or a file containing the background. [General
                        background]
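
As an illustration of how the options above fit together, the following sketch assembles a genipe-launcher command line in Python. It only uses flags listed in the help text; the function name build_launcher_command and every path passed to it are hypothetical and not part of genipe.

```python
def build_launcher_command(bfile, sample_file, hap_template, legend_template,
                           map_template, chromosomes=("autosomes",), threads=1):
    """Assemble a genipe-launcher argument list (illustrative helper only)."""
    # Every flag below appears in the help text; the values are supplied by
    # the caller and are assumptions of this sketch.
    return [
        "genipe-launcher",
        "--bfile", bfile,
        "--chrom", *[str(chrom) for chrom in chromosomes],
        "--thread", str(threads),
        "--hap-template", hap_template,
        "--legend-template", legend_template,
        "--map-template", map_template,
        "--sample-file", sample_file,
    ]


command = build_launcher_command(
    bfile="data/study",                      # hypothetical Plink prefix
    sample_file="ref/reference.sample",      # hypothetical sample file
    hap_template="ref/1000GP_Phase3_chr{chrom}.hap.gz",
    legend_template="ref/1000GP_Phase3_chr{chrom}.legend.gz",
    map_template="ref/genetic_map_chr{chrom}_combined_b37.txt",
    chromosomes=[1, 22],
    threads=4,
)
```

The resulting list can be handed to subprocess.run(command, check=True) once the paths point to real files.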

Real life example

The documentation provides a quick and easy tutorial for the complete pipeline. Refer to the tutorials page (http://pgxcentre.github.io/genipe/tutorials.html).

Automatic report

The pipeline provides a report containing relevant imputation statistics. This report is located in the report directory. It can be compiled into a PDF file using the following make command:

$ make && make clean
...

This report uses LaTeX, and the *.tex file can be modified to add project-specific information.

Statistical analysis

Once the genome-wide imputation analysis is performed and the quality metrics provided by the automatic report have been reviewed, it is possible to perform different statistical analyses (e.g. linear or logistic regression, or survival analysis using Cox's proportional hazard model) using the provided script named imputed-stats.

$ imputed-stats --help
usage: imputed-stats [-h] [-v] {cox,linear,logistic,mixedlm,skat} ...

Performs statistical analysis on imputed data (either SKAT analysis, or
linear, logistic or survival regression). This script is part of the 'genipe'
package, version 1.4.2.

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit

Statistical Analysis Type:
  The type of statistical analysis to be performed on the imputed data.

  {cox,linear,logistic,mixedlm,skat}
    cox                 Cox's proportional hazard model (survival regression).
    linear              Linear regression (ordinary least squares).
    logistic            Logistic regression (GLM with binomial distribution).
    mixedlm             Linear mixed effect model (random intercept).
    skat                SKAT analysis.

See the tutorials for more information: http://pgxcentre.github.io/genipe/tutorials.html

About

This project was initiated at the Beaulieu-Saucier Pharmacogenomics Centre of the Montreal Heart Institute. The aim was to speed up (and automate) the imputation process for the whole genome.

genipe's Issues

Add a check for the drmaa module

It would be better to add a test for the drmaa module instead of failing with an ImportError. The main pipeline could tell the user that the drmaa module is missing when checking for arguments.
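
A minimal sketch of such a check, assuming it would run while the arguments are validated. The helper name check_optional_module is hypothetical; the message format mirrors the "missing optional module" error that genipe already prints for pyplink.

```python
import importlib.util


def check_optional_module(name, required):
    """Return an error message if `name` is required but cannot be imported."""
    if not required:
        # The feature is not requested (e.g. --use-drmaa is absent).
        return None
    if importlib.util.find_spec(name) is None:
        # Report the problem up front instead of failing later with ImportError.
        return "missing optional module: {}".format(name)
    return None
```

For example, check_optional_module("drmaa", args.use_drmaa) could be called from the argument checker, with any returned message passed to parser.error.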

Add filtering options for imputation into automatic report

It is important to add any filtering options for imputation into the automatic report (even though it is in the run log).

For example, the option --filtering-rules 'ALL.maf<0.01' should automatically add a sentence in the automatic report showing that "reference sites were excluded if their ALL.maf value was <0.01".
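
One possible way to derive that sentence from a rule is sketched below. The function describe_filtering_rule and its regular expression are assumptions made for illustration; only simple "name operator value" rules are handled, and anything else is rejected.

```python
import re

# Matches simple rules such as 'ALL.maf<0.01' (name, comparison, number).
_RULE = re.compile(r"^\s*([\w.]+)\s*(<=|>=|==|<|>)\s*([-+.\deE]+)\s*$")


def describe_filtering_rule(rule):
    """Turn a simple filtering rule into a sentence for the report."""
    match = _RULE.match(rule)
    if match is None:
        raise ValueError("unsupported filtering rule: {!r}".format(rule))
    name, operator, value = match.groups()
    return (
        "reference sites were excluded if their {} value was {}{}"
        .format(name, operator, value)
    )
```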

Error when using a high number of threads for shapeit

I installed genipe on a CentOS system with 20 physical cores and 40 logical cores. I tested genipe as described in the documentation on http://pgxcentre.github.io/genipe/installation.html and executed genipe_tutorial. I added the option --shapeit-thread to the generated script and was able to execute it. But after I increased the value of that parameter, I got an error without an informative message about the reason.

In detail I did the following:

Installed genipe and tested the installation
Executed the following commands

cd
wget http://statgen.org/wp-content/uploads/Softwares/genipe/supp_files/hg19.tar.bz2
wget https://mathgen.stats.ox.ac.uk/impute/1000GP_Phase3.tgz

mkdir $HOME/genipe_tutorial

mkdir $HOME/genipe_tutorial/hg19
cd $HOME/genipe_tutorial/hg19
tar -jxf $HOME/hg19.tar.bz2

cd $HOME/genipe_tutorial
tar -zxf $HOME/1000GP_Phase3.tgz
touch 1000GP_Phase3/genipe_tut_done

cd
source genipe_pyvenv/bin/activate
genipe-tutorial
deactivate

In genipe_tutorial/execute.sh I did the following changes:

Replaced --chrom autosomes by --chrom 1 (to impute only the SNPs on the first chromosome for a test)
After the line with --thread I added a line with --shapeit-thread 20 \

Then I started the imputation:

source genipe_pyvenv/bin/activate
genipe_tutorial/execute.sh
deactivate

The imputation was successful.

Then I changed the value of --shapeit-thread in genipe_tutorial/execute.sh from 20 to 40, removed the generated directory and started the imputation again:

rm -r genipe_tutorial/genipe/
source genipe_pyvenv/bin/activate
genipe_tutorial/execute.sh

I got the following messages:

[... INFO] Phasing markers
[... ERROR] Task 'SHAPEIT phase chr1': did not finish...
[... ERROR] the following task did not work: ['SHAPEIT phase chr1']
usage: genipe-launcher [-h] [-v] [--debug] [--thread THREAD] --bfile PREFIX
                       [--reference FILE] [--chrom CHROM [CHROM ...]]
                       [--output-dir DIR] [--bgzip] [--use-drmaa]
                       [--drmaa-config FILE] [--preamble FILE]
                       [--shapeit-bin BINARY] [--shapeit-thread INT]
                       [--shapeit-extra OPTIONS] [--plink-bin BINARY]
                       [--hap-template TEMPLATE] [--legend-template TEMPLATE]
                       [--map-template TEMPLATE] --sample-file FILE
                       [--hap-nonPAR FILE] [--hap-PAR1 FILE] [--hap-PAR2 FILE]
                       [--legend-nonPAR FILE] [--legend-PAR1 FILE]
                       [--legend-PAR2 FILE] [--map-nonPAR FILE]
                       [--map-PAR1 FILE] [--map-PAR2 FILE]
                       [--impute2-bin BINARY] [--segment-length BP]
                       [--filtering-rules RULE [RULE ...]]
                       [--impute2-extra OPTIONS] [--probability FLOAT]
                       [--completion FLOAT] [--info FLOAT]
                       [--report-number NB] [--report-title TITLE]
                       [--report-author AUTHOR]
                       [--report-background BACKGROUND]
genipe-launcher: error: the following task did not work: ['SHAPEIT phase chr1']

Some flaws of the documentation

I found the following flaws of the documentation:

The web pages of the four programs linked from http://pgxcentre.github.io/genipe/parameters.html give no program name (e.g. genipe-launcher for "Main pipeline" and impute2-extractor for "Impute2 Extractor") and no description of what each program does.
As far as I can see, impute2-merger is automatically called by genipe-launcher, but impute2-extractor is not (e.g. when I execute genipe_tutorial/execute.sh). What is the function of impute2-merger? Where are the parameters (e.g. for probability, completion rate and info value) defined for its call by genipe-launcher? This information is missing from the documentation (as far as I can see).
On the page http://pgxcentre.github.io/genipe/tutorials/tutorial_extract.html only the output formats impute2, dosage and calls are described, but not bed (which is only mentioned in the description of the option --format).

Add a way to skip finding exclusions if it was already performed

It would be useful to add a way to skip finding exclusions if the task was already performed.

Add chromosome X imputation in the pipeline

Now that chromosome X reference (1000G phase 3) is available from IMPUTE2 website, we should add chromosome X (PAR and nonPAR) imputation in the pipeline.

Module statsmodels, [... WARNING] interaction term is categorical

After I installed genipe (on a fully updated Fedora 25 system) as described in the documentation and corrected one error (see below), I executed the test and got an error message.

In detail I did the following as root:

dnf install python3 python3-devel python3-numpy python3-jinja2 python3-pandas python3-setuptools

And then I did the following as a user:

pyvenv $HOME/genipe_pyvenv
source $HOME/genipe_pyvenv/bin/activate
pip install genipe
pip install pyfaidx
pip install statsmodels
deactivate

Because of another error message (see ticket #41) I then replaced the character 'd' in line 247 of the file genipe/tools/imputed_stats.py by the character 'f'.

Then I did:

source $HOME/genipe_pyvenv/bin/activate
python
import genipe
genipe.test()

After that I got the following warning (and an error message which is described in ticket #42):

[... WARNING] interaction term is categorical: the last category will be used in the results
[... WARNING] when using interaction, mixedlm optimization cannot be performed, analysis will be slow

The output of the command source $HOME/genipe_pyvenv/bin/activate; python --version; deactivate is: Python 3.5.4

Optimize MixedLM analysis

One drawback of the MixedLM (statistical) analysis is that it takes a lot of time to complete. There are ways to improve the computation time drastically.

We can first approximate the results, and then compute the real MixedLM results only for loci with a p-value lower than a user-defined threshold.
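
The two-stage idea can be sketched as follows. This is not the genipe implementation: approximate_p and exact_p are placeholder callables standing in for a fast approximation and for the full MixedLM fit, and the default threshold is arbitrary.

```python
def two_stage_analysis(loci, approximate_p, exact_p, threshold=1e-4):
    """Run the expensive fit only for loci that pass a cheap screen."""
    results = {}
    for locus in loci:
        p_value = approximate_p(locus)   # fast, approximate p-value
        exact = False
        if p_value < threshold:
            p_value = exact_p(locus)     # slow, full model fit
            exact = True
        # Keep track of which p-values come from the full model.
        results[locus] = (p_value, exact)
    return results
```

Recording whether each p-value is exact lets the output distinguish screened loci from fully analysed ones.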

Perform initial strand check using reference genome

It would be useful to perform an initial strand check using the reference genome (fasta format) and pyfaidx (for fast random access). This is useful since shapeit cannot check sites which are absent from the imputation reference (e.g. 1000G project, phase 3).

This step could be performed during the initial split by chromosome, while removing ambiguous genotypes sites.

SKAT - Improve error message when a SNP set is empty

The current error message is not very intuitive (it's the R index error).

Too many client events (DRMAA)

At the imputation step, when launching a DRMAA analysis on SGE with multiple threads, the jobs might fail with the following error: code 2: cannot register event client. Only 99 event clients are allowed in the system.

It might be due to the fact that an event client listener is created for each launched task, instead of only one for each task group (i.e. impute2).

We should initialize only one DRMAA event client listener, if possible and pass it to pool (so that only one is created).

Pandas deprecation warning for sort

We will need to use the sort_values function instead of the sort one, since the latter will be deprecated in the future.

.../test_pyvenv/lib/python3.4/site-packages/lifelines/fitters/coxph_fitter.py:285: FutureWarning: sort(columns=....) is deprecated, use sort_values(by=.....)

We will also need to increase the pandas version in the dependencies, so that the scripts don't fail because the sort_values function is not implemented. The sort_values function is new in version 0.17.0.
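
Until the minimum pandas version is raised, a small compatibility shim could bridge both APIs. The helper below is a hypothetical sketch, not code from genipe; it relies only on DataFrame.sort_values(by=...) and on the older DataFrame.sort(columns=...) named in the warning above.

```python
def sort_frame(frame, column):
    """Sort a DataFrame by `column` with whichever API the frame offers."""
    if hasattr(frame, "sort_values"):
        # pandas 0.17.0 and later
        return frame.sort_values(by=column)
    # Older pandas releases only provide the deprecated `sort` method.
    return frame.sort(columns=column)
```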

There is no Q value for SKAT-O

SKAT-O does not output a Q value, so we should not expect one.

Windows compatibility for paths in the tests module

I think we will run into path compatibility problem when using:

        data_filename = resource_filename(
            __name__,
            "data/regression_sim.txt.bz2",
        )

in the tests module. We should rewrite those bits to:

        data_filename = resource_filename(
            __name__,
            os.path.join("data", "regression_sim.txt.bz2"),
        )

I am not sure if we were going for Windows compatibility in the first place...

Sex problem ;-)

When sex is missing for a sample, imputation of chromosome X does not work.

Add preamble to temporary script (DRMAA)

When using DRMAA, it is wrong to copy the entire environment variables to the job. We should use a "preamble" file containing what to load before executing the job (e.g. module load ..., source python_env ... etc.).
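
A sketch of how the temporary script could be assembled, with the preamble placed between the shebang line and the tool command. The function name build_job_script and the default shebang are assumptions of this example.

```python
import shlex


def build_job_script(command, preamble=None, shebang="#!/usr/bin/env bash"):
    """Return the text of a job script: shebang, optional preamble, command."""
    lines = [shebang]
    if preamble:
        # Content of the preamble file (e.g. 'module load ...' lines).
        lines.append(preamble.rstrip("\n"))
    # Quote each argument so that the shell sees it unchanged.
    lines.append(" ".join(shlex.quote(part) for part in command))
    return "\n".join(lines) + "\n"
```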

Define an option for the directory with the IMPUTE2 reference files

It seems to me that the paths of the IMPUTE2 reference files, in particular those many for chromosome X, have to be specified individually by options of genipe-launcher (e.g. by --hap-nonPAR and --legend-nonPAR). But all these files are contained in the tar files 1000GP_Phase3.tgz and 1000GP_Phase3_chrX.tgz and are extracted into the directory 1000GP_Phase3 by default.

Therefore I want to suggest that there should be a option --impute2-ref-dir (or something similar) to specify the directory with all IMPUTE2 reference files (for the autosomes and for the allosomes).
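
For the autosomes, such an option could simply expand to the existing template options. The sketch below assumes the file names given as examples in the genipe-launcher help text; expand_reference_dir is a hypothetical helper, and chromosome X files are left out because their names are not stated here.

```python
import os

# File name templates taken from the examples in the help text.
AUTOSOMAL_TEMPLATES = {
    "--hap-template": "1000GP_Phase3_chr{chrom}.hap.gz",
    "--legend-template": "1000GP_Phase3_chr{chrom}.legend.gz",
    "--map-template": "genetic_map_chr{chrom}_combined_b37.txt",
}


def expand_reference_dir(ref_dir):
    """Map a reference directory to the autosomal template options."""
    return {
        option: os.path.join(ref_dir, name)
        for option, name in AUTOSOMAL_TEMPLATES.items()
    }
```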

ERROR missing optional module: pyplink

I installed and tested genipe as described in the documentation on http://pgxcentre.github.io/genipe/installation.html (without the optional modules lifelines and drmaa) and executed genipe_tutorial. I modified the generated script slightly to only impute chromosome 1 and run it.

Then I executed the following command (to convert the imputed variants to plink files):

impute2-extractor --format bed --info 0.9 --impute2 genipe_tutorial/genipe/chr1/final_impute2/chr1.imputed.impute2

After that I got the following error message:

[... ERROR] missing optional module: pyplink
usage: impute2-extractor [-h] [-v] [--debug] --impute2 FILE [--index]
                         [--out PREFIX] [--format FORMAT [FORMAT ...]]
                         [--long] [--prob FLOAT] [--extract FILE]
                         [--genomic CHR:START-END] [--maf FLOAT]
                         [--rate FLOAT] [--info FLOAT]
impute2-extractor: error: missing optional module: pyplink

The resolution was easy. I executed: pip install pyplink
Please add that command to the list of commands for optional dependencies in the documentation.

Chromosome 25 (PAR1 and PAR2) should be skipped if no markers available

If there is no marker left in one of the two pseudo-autosomal regions, this region should be skipped for downstream analysis (just before splitting chromosomes).
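
A minimal sketch of that check, assuming the number of remaining markers per region is already known. The function name and the region labels are illustrative only.

```python
def split_regions_by_markers(marker_counts):
    """Separate regions that still have markers from those that are empty."""
    kept = [region for region, count in marker_counts.items() if count > 0]
    skipped = [region for region, count in marker_counts.items() if count == 0]
    return kept, skipped
```

Only the regions in the first list would then be passed on to the chromosome splitting step.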

More options to change Shapeit parameters

It would be useful to be able to change more parameters (such as --states or --window) instead of having them hard-coded.

No default values need to be hard-coded, since the options can simply be omitted when the user does not set them.
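
One way to do this, sketched under the assumption that the extra parameters arrive as a single quoted string (as with --shapeit-extra in the help text). The helper build_shapeit_command is hypothetical.

```python
import shlex


def build_shapeit_command(base_command, extra=None):
    """Append user-supplied extra options, or nothing when none are given."""
    command = list(base_command)
    if extra:
        # shlex.split honours quotes, e.g. '--states 100 --window 2'.
        command.extend(shlex.split(extra))
    return command
```

When extra is empty the command is left untouched, so the tool's own defaults apply.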

Subset of chromosomes

It would be useful to perform the analysis on only a subset of chromosomes (e.g. --chr 1 5 22).
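
A sketch of how such an option could be parsed with argparse. It uses the --chrom spelling and the 'autosomes' keyword from the help text; the default value and the function name parse_chromosomes are assumptions of this example.

```python
import argparse

AUTOSOMES = list(range(1, 23))  # chromosomes 1 to 22, inclusive


def parse_chromosomes(argv):
    """Parse --chrom values, expanding the 'autosomes' keyword."""
    parser = argparse.ArgumentParser(prog="chrom-sketch")
    parser.add_argument("--chrom", nargs="+", default=["autosomes"],
                        metavar="CHROM")
    args = parser.parse_args(argv)

    chromosomes = []
    for value in args.chrom:
        if value == "autosomes":
            chromosomes.extend(AUTOSOMES)
        else:
            chromosomes.append(int(value))
    # Remove duplicates and return the chromosomes in ascending order.
    return sorted(set(chromosomes))
```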

Custom report templates

Users should be able to use custom templates for all of the sections of the automatic report (instead of using the ones provided by the package).

Error with module lifelines, name 'dmatrices' is not defined

After I installed genipe (on a fully updated Fedora 25 system) as described in the documentation, I executed the test and got an error message.

In detail I did the following as root:

dnf install python3 python3-devel python3-numpy python3-jinja2 python3-pandas python3-setuptools

And then I did the following as a user:

pyvenv $HOME/genipe_pyvenv
source $HOME/genipe_pyvenv/bin/activate
pip install genipe
pip install pyfaidx
pip install lifelines
python
import genipe
genipe.test()

After that I got 8 errors with the message NameError: name 'dmatrices' is not defined, e.g.:

ERROR: test_fit_cox_interaction_snp1 (genipe.tests.test_imputed_stats.TestImputedStatsCox)
Tests the 'fit_cox' function with interaction, first SNP.

Traceback (most recent call last):
File "/home/user/genipe_pyvenv/lib64/python3.5/site-packages/genipe/tests/test_imputed_stats.py", line 866, in test_fit_cox_interaction_snp1
formula=formula,
File "/home/user/genipe_pyvenv/lib64/python3.5/site-packages/genipe/tools/imputed_stats.py", line 1128, in fit_cox
y, X = dmatrices(formula, data=data, return_type="dataframe")
NameError: name 'dmatrices' is not defined

The output of the command source $HOME/genipe_pyvenv/bin/activate; python --version; deactivate is: Python 3.5.4

The output of the command source $HOME/genipe_pyvenv/bin/activate; pip list; deactivate is:

cycler (0.10.0)
genipe (1.3.3)
Jinja2 (2.10)
lifelines (0.12.0)
MarkupSafe (1.0)
numpy (1.13.3)
pandas (0.20.0)
pip (8.1.2)
pyfaidx (0.5.1)
pyparsing (2.2.0)
python-dateutil (2.6.1)
pytz (2017.3)
scipy (1.0.0)
setuptools (25.1.1)
six (1.11.0)

Get chromosome length from legend file

It would be better to get chromosome length from the legend file instead of Ensembl.
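
A sketch of reading the last position from a legend file. It assumes a whitespace-separated file whose header contains a column named "position"; the function name is hypothetical, and the largest position is only an approximation of the chromosome length.

```python
def chromosome_length_from_legend(lines):
    """Return the largest position found in an iterable of legend lines."""
    lines = iter(lines)
    header = next(lines).split()
    position_index = header.index("position")  # assumed column name
    return max(
        (int(line.split()[position_index]) for line in lines if line.strip()),
        default=0,  # legend with a header but no sites
    )
```

For a compressed legend, the lines can come from gzip.open(path, "rt").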

Statistical analysis on imputed genotypes (dosage)

Add a statistical analysis suite (using statsmodels and lifelines) to process imputed genotype data (dosage). At least the following analyses should be added to the package:

linear regression
logistic regression
Cox's regression (survival)

CoxPH and categorical variables

Categorical variables are not properly handled in the CoxPH analysis (when a variable has more than two levels). This needs to be addressed.

Add execution time to automatic report

Add a table to the report appendix summarizing the execution time of each task. This will help estimate walltimes for future projects.

More options to change Impute2 parameters

It would be useful to be able to change more parameters (such as -k_hap, -buffer or -Ne) instead of having them hard-coded.

No default values need to be hard-coded, since the options can simply be omitted when the user does not set them.

options to use bfile instead of impute2 for SKAT?

I need to run SKAT directly on Plink binary files (bfile), without imputation. Is there an option for this?
Thanks

Statistical computation fails when no males

When computing statistics using imputed-stats on chromosome 23, one step excludes males with heterozygous calls. If there are no males, the script fails with the following error:

Traceback (most recent call last):
  File "/opt/Python-3.4.3/lib/python3.4/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/opt/Python-3.4.3/lib/python3.4/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File ".../genipe/tools/imputed_stats.py", line 901, in process_impute2_site
    dosage_columns[1]
  File ".../genipe/tools/imputed_stats.py", line 996, in samples_with_hetero_calls
    return data[data.idxmax(axis=1) == hetero_c].index
  File ".../test_pyvenv/lib/python3.4/site-packages/pandas/core/ops.py", line 726, in wrapper
    res = na_op(values, other)
  File ".../test_pyvenv/lib/python3.4/site-packages/pandas/core/ops.py", line 682, in na_op
    raise TypeError("invalid type comparison")
TypeError: invalid type comparison

This is due to the fact that we have an empty pandas.DataFrame.

Walltime should be read from configuration file

The walltimes (and the node configurations) should be read from a configuration file, instead of being hard-coded in the package.
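
A sketch of what reading such a file could look like with the standard configparser module. The section names, the walltime and nb_proc keys and the fallback values are all hypothetical; they do not describe the format genipe actually uses.

```python
import configparser


def read_task_config(text):
    """Parse per-task walltime and processor settings from INI text."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    config = {}
    for task in parser.sections():
        config[task] = {
            # Fall back to modest values when a key is missing.
            "walltime": parser.get(task, "walltime", fallback="01:00:00"),
            "nb_proc": parser.getint(task, "nb_proc", fallback=1),
        }
    return config


example = """
[shapeit]
walltime = 24:00:00
nb_proc = 4

[impute2]
walltime = 12:00:00
"""
task_config = read_task_config(example)
```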

Add the "BED" format for extraction

It would be useful to add the BED format (binary plink file) as an output type for impute2-extractor. It makes it easier to load the data for downstream analysis.

Bug in multiprocessing on Mac OS

When I run the tests, the test_full_fit_linear_multiprocess test hangs. This could be caused by a previously described bug caused by OpenBLAS and Python multiprocessing (numpy/numpy#654).

The error report indicates the following:

Exception Type:        EXC_BAD_ACCESS (SIGSEGV)
Exception Codes:       KERN_INVALID_ADDRESS at 0x0000000000000110

and the start of the dispatch queue looks like:

Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
0   libdispatch.dylib               0x00007fff8aab2c13 dispatch_group_async + 533
1   libBLAS.dylib                   0x00007fff915fd228 APL_dgemm + 1100
2   libBLAS.dylib                   0x00007fff916347aa cblas_dgemm + 1420
3   libBLAS.dylib                   0x00007fff9152d93d DGEMM + 254
4   libLAPACK.dylib                 0x00007fff9082b5bd DGESDD + 15747

Report R-squared for linear analysis

It would be important to add the r² value for linear regression.
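
For reference, the coefficient of determination is R² = 1 - SS_res / SS_tot. The plain-Python sketch below only illustrates the formula and is not genipe code; fitted statsmodels OLS results expose the same quantity as the rsquared attribute.

```python
def r_squared(observed, fitted):
    """Compute R^2 = 1 - SS_res / SS_tot for observed and fitted values."""
    mean = sum(observed) / len(observed)
    ss_tot = sum((y - mean) ** 2 for y in observed)
    ss_res = sum((y - f) ** 2 for y, f in zip(observed, fitted))
    return 1 - ss_res / ss_tot
```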

MixedLM - Failed test

Two failed tests for MixedLM due to approximation. The accuracy used is too stringent (10 decimals). Reducing to 9 decimals should make the tests pass.

======================================================================
FAIL: test_fit_mixedlm (genipe.tests.test_imputed_stats.TestImputedStatsMixedLM)
Tests the 'fit_mixedlm' function.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/lemieuxl/Softwares/Python-3.4.3_virtualenv/lib/python3.4/site-packages/genipe/tests/test_imputed_stats.py", line 2192, in test_fit_mixedlm
    self.assertAlmostEqual(expected_coef, observed_coef, places=10)
AssertionError: 0.12265168980987724 != 0.12265168988301459 within 10 places

======================================================================
FAIL: test_fit_mixedlm_use_ml (genipe.tests.test_imputed_stats.TestImputedStatsMixedLM)
Tests the 'fit_mixedlm' function (using ML instead of REML).
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/lemieuxl/Softwares/Python-3.4.3_virtualenv/lib/python3.4/site-packages/genipe/tests/test_imputed_stats.py", line 2322, in test_fit_mixedlm_use_ml
    self.assertAlmostEqual(expected_coef, observed_coef, places=10)
AssertionError: 0.12265168980987724 != 0.12265168988294683 within 10 places
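
The failures above follow from how assertAlmostEqual works with the places argument: it rounds the absolute difference to that many decimals and compares the result to zero. The sketch below reproduces that rule with the values from the first failing test; almost_equal is an illustrative helper, not part of unittest.

```python
expected = 0.12265168980987724
observed = 0.12265168988301459


def almost_equal(first, second, places):
    """Same rule as unittest's assertAlmostEqual with `places`."""
    return round(abs(first - second), places) == 0


# The difference is about 7.3e-11: it rounds to 1e-10 at 10 places
# (not zero), but to 0.0 at 9 places.
passes_at_10 = almost_equal(expected, observed, 10)
passes_at_9 = almost_equal(expected, observed, 9)
```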

Survival Tutorial: There is a TTE when `event=0`

Perhaps it is me who misunderstood, but in the tutorial on survival analysis, we can see this line:

NA06994 569.4273004275149 0 48 20.2946857226 1

Where 569.4273004275149 is the time to event and 0 means that the event was not observed.

If I understand correctly, there shouldn't be a TTE if the event was unobserved and this line would be incorrect.

Hence my proposed fix is to correct this in the tutorial and to add a check in genipe to warn the user about such ambiguities.

genipe-tutorial fails when broken link

Hi,
when running genipe tutorial it failed when it tried to download PLINK from http://pngu.mgh.harvard.edu/~purcell/plink/dist/plink-1.07-x86_64.zip because the link is broken. Is there any way to skip this or fix it or add a custom link (e.g. https://www.cog-genomics.org/static/bin/plink/plink1_linux_x86_64.zip)?
Thanks,
Klev

Add the "lgen" format for extraction

It would be useful to add the lgen format as an output type for impute2-extractor. It makes it easier to load the data into SAS.

ERROR: Reference and Main panels are not well aligned

I installed and tested genipe as described in the documentation on http://pgxcentre.github.io/genipe/installation.html and executed genipe_tutorial. I modified the generated script execute.sh slightly to only impute chromosome 1. After I ran the script, I found the following messages in the files genipe_tutorial/genipe/chr1/chr1.alignments.log and genipe_tutorial/genipe/chr1/chr1.to_exclude.alignments.log:

ERROR: Reference and Main panels are not well aligned:

#Missing sites in reference panel = 443

#Misaligned sites between panels = 56

#Multiple alignments between panels = 0

In detail I did the following:

Installed genipe and tested the installation
Executed the following commands

cd
wget http://statgen.org/wp-content/uploads/Softwares/genipe/supp_files/hg19.tar.bz2
wget https://mathgen.stats.ox.ac.uk/impute/1000GP_Phase3.tgz

mkdir $HOME/genipe_tutorial

mkdir $HOME/genipe_tutorial/hg19
cd $HOME/genipe_tutorial/hg19
tar -jxf $HOME/hg19.tar.bz2

cd $HOME/genipe_tutorial
tar -zxf $HOME/1000GP_Phase3.tgz
touch 1000GP_Phase3/genipe_tut_done

cd
source genipe_pyvenv/bin/activate
genipe-tutorial
deactivate

In genipe_tutorial/execute.sh I did the following change:

Replaced --chrom autosomes by --chrom 1 (to impute only the SNPs on the first chromosome for a test)

To speed up the test you can also do the following:

After the line with --thread add a line with --shapeit-thread 4 \ (if your system has 4 cores)

Then I ran the imputation and printed the files named above:

source genipe_pyvenv/bin/activate
genipe_tutorial/execute.sh
deactivate

cat genipe_tutorial/genipe/chr1/chr1.alignments.log
cat genipe_tutorial/genipe/chr1/chr1.to_exclude.alignments.log

The error messages are at the end of these files.

This error was already reported in issue #48 but that was mainly about another problem (which is resolved now).

Deprecation warning for barh

Matplotlib version 2.0 has a deprecation warning for barh.

/home/lemieuxl/genipe_pyvenv/lib64/python3.6/site-packages/genipe/pipeline/cli.py:2516: MatplotlibDeprecationWarning: The *bottom* kwarg to `barh` is deprecated use *y* instead. Support for *bottom* will be removed in Matplotlib 3.0
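
As the warning text says, the fix is to pass the bar positions through `y` rather than `bottom`. A minimal sketch of the newer call follows; the positions and widths are made-up values, and this is not the actual code at cli.py line 2516.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, no display needed
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
positions = [0, 1, 2]   # illustrative bar positions
widths = [10, 25, 5]    # illustrative bar lengths

# The old spelling, ax.barh(bottom=positions, width=widths), triggers the
# deprecation warning quoted above; `y` is the replacement keyword.
bars = ax.barh(y=positions, width=widths, height=0.8)
```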

Multiple phenotypes

It would be nice to have an option to run a batch of phenotypes in the imputed-stats tool. In the current implementation, the user needs to call the script one time per phenotype, but this is inefficient because the parsing of the input files is repeated between iterations.

Having a batch mode where only the outcome vector is recomputed would be ideal. Maybe this could be included in a future release.
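
The idea can be sketched as follows: parse the genotype input once, then loop over the outcome vectors. The names `run_batch`, `parse_genotypes` and `fit` are hypothetical placeholders for this illustration and do not correspond to functions in imputed-stats.

```python
def run_batch(parse_genotypes, phenotypes, fit):
    """Parse the genotype input once, then fit every phenotype against it.

    phenotypes: dict mapping a phenotype name to its outcome vector.
    fit: callable taking (genotypes, outcome) and returning a result.
    """
    genotypes = parse_genotypes()  # expensive step, done a single time
    return {
        name: fit(genotypes, outcome)
        for name, outcome in phenotypes.items()
    }
```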

Bug when testing only one chromosome

There is a bug when we test only one chromosome. After the final ShapeIt, I got the following message:
phased sample files are different...

Update documentation

The documentation needs to be updated for the next release.

Unknown format code 'd' for object of type 'float'

After I installed genipe (on a fully updated Fedora 25 system) as described in the documentation, I executed the test and got an error message.

In detail I did the following as root:

dnf install python3 python3-devel python3-numpy python3-jinja2 python3-pandas python3-setuptools

And then I did the following as a user:

pyvenv $HOME/genipe_pyvenv
source $HOME/genipe_pyvenv/bin/activate
pip install genipe
pip install pyfaidx
python
import genipe
genipe.test()

After that I got the following error message:

ERROR: test_read_phenotype (genipe.tests.test_imputed_stats.TestImputedStats)
Tests the 'read_phenotype' function.

Traceback (most recent call last):
File "/home/user/genipe_pyvenv/lib64/python3.5/site-packages/genipe/tests/test_imputed_stats.py", line 248, in test_read_phenotype
filename, args,
File "/home/user/genipe_pyvenv/lib64/python3.5/site-packages/genipe/tools/imputed_stats.py", line 251, in read_phenotype
(sex_counts.index == 2))].sum()
ValueError: Unknown format code 'd' for object of type 'float'

I got one more error message, which is described in another ticket.

It seems to me that the character 'd' in line 247 of the file genipe/tools/imputed_stats.py has to be replaced by the character 'f'.

The output of the command source $HOME/genipe_pyvenv/bin/activate; python --version; deactivate is: Python 3.5.4
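
The error itself is easy to reproduce outside genipe: the `d` format code only accepts integers, so a count that arrives as a float (for instance from a pandas sum) is rejected. The snippet below is a standalone sketch with a made-up value. The exact format string used in imputed_stats.py is not shown in this report, so `{:,d}` here is an assumption.

```python
count = 1234.0  # e.g. a sum that came back as a float

try:
    "{:,d}".format(count)
except ValueError as exc:
    message = str(exc)

# Two possible fixes: convert to int first, or format as a float with
# zero decimals. A bare 'f' would print six decimals instead.
as_int = "{:,d}".format(int(count))
as_float = "{:,.0f}".format(count)
```

Both fixes render the value as `1,234`, whereas simply swapping `d` for `f` would render it as `1,234.000000`.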

Error in file genipe/pipeline/cli.py

After I installed genipe (on a fully updated Fedora 25 system) as described in the documentation and after correcting one error (see below), I executed the test and got an error message.

In detail I did the following as root:

dnf install python3 python3-devel python3-numpy python3-jinja2 python3-pandas python3-setuptools

And then I did the following as a user:

pyvenv $HOME/genipe_pyvenv
source $HOME/genipe_pyvenv/bin/activate
pip install genipe
pip install pyfaidx
deactivate

Because of another error message (see ticket #41) I then replaced the character 'd' in line 247 of the file genipe/tools/imputed_stats.py by the character 'f'.

Then I did:

source $HOME/genipe_pyvenv/bin/activate
python
import genipe
genipe.test()

After that I got the following error message:

ERROR: test_gather_maf_stats (genipe.tests.test_main_pipeline.TestMainPipeline)
Tests the 'gather_maf_stats' function.

Traceback (most recent call last):
File "/home/user/genipe_pyvenv/lib64/python3.5/site-packages/genipe/tests/test_main_pipeline.py", line 588, in test_gather_maf_stats
o_dir=self.output_dir.name,
File "/home/user/genipe_pyvenv/lib64/python3.5/site-packages/genipe/pipeline/cli.py", line 2480, in gather_maf_stats
raise GenipeError("something went wrong")
genipe.error.GenipeError: something went wrong

The output of the command source $HOME/genipe_pyvenv/bin/activate; python --version; deactivate is: Python 3.5.4

Existing directory 1000GP_Phase3 is not detected

After installing and testing genipe as described in the documentation on http://pgxcentre.github.io/genipe/installation.html I created the directory genipe_tutorial and extracted the file 1000GP_Phase3.tgz therein as described on http://pgxcentre.github.io/genipe/tutorials/tutorial_genipe.html#genipe-tut-more-details

But the genipe-tutorial script didn't detect that the directory was already there and downloaded the huge file 1000GP_Phase3.tgz again. Even if I copy the file 1000GP_Phase3.tgz itself into the directory genipe_tutorial/, it is not detected and is downloaded again.

In detail I did the following:

Installed genipe and tested the installation
Executed the following commands

cd
wget http://statgen.org/wp-content/uploads/Softwares/genipe/supp_files/hg19.tar.bz2
wget https://mathgen.stats.ox.ac.uk/impute/1000GP_Phase3.tgz

mkdir $HOME/genipe_tutorial

mkdir $HOME/genipe_tutorial/hg19
cd $HOME/genipe_tutorial/hg19
tar -jxf $HOME/hg19.tar.bz2

cd $HOME/genipe_tutorial
cp ../1000GP_Phase3.tgz .
tar -zxf 1000GP_Phase3.tgz

cd
source $HOME/genipe_pyvenv/bin/activate
genipe-tutorial

Answer the question for preparing $HOME/genipe_tutorial with "y"
Get the message "Downloading IMPUTE2's reference files"

Is there a workaround to avoid the downloading?

Optimize the dosage computation

I optimized some aspects of the dosage and MAF computation in gepyto. Maybe we could use it in this project (see this commit).
🐢 ➡️ 🐎

Add a Makefile to generate report

It would be useful to add a Makefile to generate the report (for users not used to using pdflatex).

An example would be (taken from here):

PROJECT=report
TEX=pdflatex
BIBTEX=bibtex
BUILDTEX=$(TEX) $(PROJECT).tex

.PHONY: all clean clean-all

all:
	$(BUILDTEX)
	$(BIBTEX) $(PROJECT)
	$(BUILDTEX)
	$(BUILDTEX)

clean-all:
	rm -f *.aux *.bbl *.blg *.log *.out *.toc *.pdf

clean:
	rm -f *.aux *.bbl *.blg *.log *.out *.toc

Exception raised when start or end time is None

When start or end time is None (see below), an exception is raised. This should never happen, and getting execution time for such a task should just warn the user.

This happens when launching with the --bgzip flag while bgzip isn't in the $PATH, then removing the --bgzip flag and rerunning genipe-launcher. An entry for bgzip is created for each chromosome (with an end time of None, since the task never ended), but it is not used afterwards.
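
A guard along the following lines would turn the exception into a warning. This is only a sketch of the requested behaviour: the function name `execution_time` and its signature are assumptions and do not reflect how genipe stores or reads task times.

```python
import logging
from datetime import datetime

logger = logging.getLogger(__name__)


def execution_time(task_name, start, end):
    """Return the elapsed seconds, or None (with a warning) if incomplete."""
    if start is None or end is None:
        # A task that never started or never finished has no duration.
        logger.warning(
            "%s: no execution time (task did not complete)", task_name
        )
        return None
    return (end - start).total_seconds()
```

With this guard, a leftover bgzip entry whose end time is None would be reported and skipped rather than aborting the run.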

Add basic burden test

This is mostly to look at the direction of effect or for simple burden analyses.
