genoml / genoml Goto Github PK

Core pipeline of GenoML

License: Apache License 2.0

Python 10.00% R 55.39% Jupyter Notebook 34.60%

genoml's Introduction

GenoML-core

GenoML is an Automated Machine Learning (AutoML) package for genomics. This is the core package of GenoML. This repo is under development; please report any issues (bugs, performance, documentation) on the GenoML issues page.

Here are some quick "get started" examples; please check out the additional options and details in the Usage and CLI docs. In general, use Linux or macOS with Python 3.6 or later for best results.

Install

Run:

pip install genoml

Train the ML model

You can use the IPDGC (International Parkinson's Disease Genomics Consortium) test data. This data is a simulation of the genetic and clinical data used for Parkinson's diagnosis in previous publications. You can find it at IPDGC example data.

Download and unzip data:

wget https://github.com/ipdgc/GenoML-Brief-Intro/raw/master/exampleData.zip
unzip exampleData.zip

To train, run:

genoml-train --geno-prefix=./exampleData/training --pheno-file=./exampleData/training.pheno --model-dir=./exampleModel

The final tuned model and performance metrics are stored in the --model-dir directory.
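If training is driven from a script rather than a shell, the command above can be assembled with Python's standard library. This is only a minimal sketch: the helper names `build_train_command` and `run_training` are hypothetical (they are not part of GenoML), and only the three flags shown above are used.

```python
import subprocess


def build_train_command(geno_prefix, pheno_file, model_dir):
    """Return the genoml-train argument list using the quick-start flags only."""
    return [
        "genoml-train",
        f"--geno-prefix={geno_prefix}",
        f"--pheno-file={pheno_file}",
        f"--model-dir={model_dir}",
    ]


def run_training(geno_prefix, pheno_file, model_dir):
    """Launch genoml-train; raises CalledProcessError on a non-zero exit code."""
    cmd = build_train_command(geno_prefix, pheno_file, model_dir)
    return subprocess.run(cmd, check=True)
```

Calling `run_training("./exampleData/training", "./exampleData/training.pheno", "./exampleModel")` would be equivalent to the shell command above, assuming `genoml-train` is on the PATH.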

Using the trained ML model for inference

genoml-inference --model-dir=./exampleModel --valid-dir=./exampleData --valid-geno-prefix=./exampleData/validation --valid-pheno-file=./exampleData/validation.pheno

Validation results and model performance metrics are stored in the --valid-dir directory.

For debugging purposes, include the -v or -vvv flags at the end of a command.
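The inference command takes four flags, and the debugging flag goes at the end. A small sketch of building that argument list in Python follows; the function name `build_inference_command` and its `verbose_flag` parameter are hypothetical, while the command-line flags themselves are the ones shown above.

```python
def build_inference_command(model_dir, valid_dir, valid_geno_prefix,
                            valid_pheno_file, verbose_flag=None):
    """Return the genoml-inference argument list, optionally with -v or -vvv."""
    if verbose_flag not in (None, "-v", "-vvv"):
        raise ValueError("verbose_flag must be None, '-v', or '-vvv'")
    cmd = [
        "genoml-inference",
        f"--model-dir={model_dir}",
        f"--valid-dir={valid_dir}",  # output directory for validation results
        f"--valid-geno-prefix={valid_geno_prefix}",
        f"--valid-pheno-file={valid_pheno_file}",
    ]
    if verbose_flag:
        cmd.append(verbose_flag)  # debugging flags go at the end of the command
    return cmd
```

Making all four paths required arguments means a missing --valid-dir is caught in Python before the command is ever launched.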

Report issues

Please report any issues or suggestions on the GenoML issues page.

genoml's People

Contributors

Stargazers

Watchers

Forkers

darwinbandoy dsaffo

genoml's Issues

Tab-delimited inputs only for additional, cov and pheno files

Please make sure that this is a documentation issue.

System information

GenoML version: 1
Doc Link: Usage and CLI

Describe the documentation issue

Only tab delimited data can be used

We welcome contributions by users. Will you be able to fix the doc Issue?

Mike is specifying this in the docs (tab-delimited rather than whitespace-delimited).
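Until the docs are updated, a whitespace-delimited file can be converted before it is passed to GenoML. The following is a rough sketch using only the standard library; the function names are hypothetical, and it assumes that no field contains embedded whitespace.

```python
import csv
import io


def to_tab_delimited(text):
    """Convert whitespace-delimited text to tab-delimited text.

    Assumes no field contains embedded whitespace.
    """
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for line in text.splitlines():
        fields = line.split()  # split on any run of spaces or tabs
        if fields:  # skip blank lines
            writer.writerow(fields)
    return out.getvalue()


def is_tab_delimited(text):
    """True if every non-blank line has the same number (>1) of tab-separated fields."""
    counts = {len(line.split("\t")) for line in text.splitlines() if line.strip()}
    return len(counts) == 1 and counts.pop() > 1
```

`is_tab_delimited` is a quick sanity check to run on pheno, cov, or additional-feature files before training.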

Broken Link on Homepage

This link in the footer aka sitemap is broken.

<a href="/docs/overview.html">Getting Started</a>

Training the ML model: [Failed]

System information

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux
GenoML installed from (source or binary): installed through pip
Python version: 3.7.4

Describe the current behavior
I've tried running the model on three separate datasets (including the provided sample dataset), and all three runs failed with this error:
The main failure points are simply hardware related at this
phase of work. Is your data way too big for your computer?
Also, the code implemented here occasionally bugs if you have
too many zero variance predictors in the dataset, but you
probably already removed those before starting your analyses,
right?

All dependencies have been installed successfully.
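The error message above names zero-variance predictors as one possible cause. One way to rule that out is to scan the predictor matrix before training. This is an illustrative sketch only, not GenoML's own preprocessing; it assumes the predictors are held as a list of rows (one row per sample), and the function names are hypothetical.

```python
def zero_variance_columns(rows):
    """Return indices of columns whose value is identical in every row."""
    if not rows:
        return []
    n_cols = len(rows[0])
    return [j for j in range(n_cols) if len({row[j] for row in rows}) == 1]


def drop_columns(rows, drop):
    """Return a copy of rows without the column indices listed in drop."""
    drop = set(drop)
    return [[v for j, v in enumerate(row) if j not in drop] for row in rows]
```

If `zero_variance_columns` returns an empty list for the data that fails, the hardware explanation in the message becomes the more likely one.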

Simple note

Ran a test from the command line on a non-NIH computer today. The genoml install required twisted 18.7.0, which in turn requires PyHamcrest>=1.9.0.
Not a problem: it ran once that was installed.

Support for Python 3.5, the f-string issue

It turned out the person was running it with Python 3.5; after updating to 3.6 it works. The cause is an f-string issue: f-strings require Python 3.6 or later.
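A start-up guard can turn this kind of failure into a clear message. The sketch below is hypothetical and not part of GenoML. It avoids f-strings on purpose so the check itself still parses on Python 3.5.

```python
import sys

MIN_PYTHON = (3, 6)  # f-strings were introduced in Python 3.6 (PEP 498)


def python_is_supported(version_info=None):
    """Return True if the interpreter is at least MIN_PYTHON."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= MIN_PYTHON


def require_supported_python():
    """Exit with a readable message instead of a SyntaxError on old interpreters."""
    if not python_is_supported():
        raise SystemExit("Python %d.%d or later is required" % MIN_PYTHON)
```

For the guard to help, it has to run in a module that contains no f-strings itself, before any module that does is imported.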

VIF filtering

Please make sure that this is a feature request.

System information

GenoML version (you are using): >1.04
Are you willing to contribute it (Yes/No): Yes

Describe the feature and the current behavior/state.
VIF (variance inflation factor) prefiltering of predictors on the fly, with the threshold set to NA, 5, or 10.

Will this change the current api? How?
Yes, faster prefiltering for the Python-only build.

Who will benefit with this feature?
Everyone once Mike gets time to write it.

Any Other info.
;-/
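To make the request concrete, here is one way VIF prefiltering could look in pure Python. It is a sketch under stated assumptions, not the planned GenoML implementation. It uses the standard identity that the VIF of predictor j is the j-th diagonal entry of the inverse correlation matrix, drops the worst predictor one at a time, and treats `threshold=None` as the "NA" setting. It also assumes zero-variance and perfectly collinear predictors have already been removed. All function names are hypothetical.

```python
def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equal-length columns."""
    n = len(columns[0])
    centered = []
    for col in columns:
        mean = sum(col) / n
        centered.append([v - mean for v in col])
    norms = [sum(v * v for v in c) ** 0.5 for c in centered]
    p = len(columns)
    return [
        [sum(a * b for a, b in zip(centered[i], centered[j])) / (norms[i] * norms[j])
         for j in range(p)]
        for i in range(p)
    ]


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix: predictors are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif(columns):
    """VIF of each predictor: the diagonal of the inverse correlation matrix."""
    inv = invert(correlation_matrix(columns))
    return [inv[j][j] for j in range(len(columns))]


def vif_filter(columns, threshold):
    """Drop the highest-VIF predictor repeatedly until all VIFs <= threshold.

    threshold=None (the "NA" setting) disables filtering. Returns kept indices.
    """
    kept = list(range(len(columns)))
    if threshold is None:
        return kept
    while len(kept) > 1:
        scores = vif([columns[i] for i in kept])
        worst = max(range(len(kept)), key=lambda k: scores[k])
        if scores[worst] <= threshold:
            break
        del kept[worst]
    return kept
```

A production version would more likely use NumPy for the linear algebra; the pure-Python form is used here only to keep the sketch dependency-free.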

--version gives fail message, minor issue but annoying

Please make sure that this is a bug.

System information

OS Platform and Distribution (e.g., Linux Ubuntu 16.04):

Mac

GenoML installed from (source or binary):

pip

GenoML version:

1.0.3

Python version:

3.7

Describe the current behavior

--version gives fail message, minor issue but annoying

Describe the expected behavior

Code to reproduce the issue
Provide a reproducible test case that is the bare minimum necessary to generate the problem.

Other info / logs
Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

Error: Stack Overflow

System information

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Biowulf (Linux)
GenoML installed from (source or binary):
GenoML version: 1.0.3
Python version:

When running genoml on Biowulf with a large number of SNPs, training the model fails with the error:
Error: protect(): protection stack overflow

OSError: R fail

This is running with 20,622 variants after pruning and 2,461 samples. I filtered the samples to 500 and roughly the same number of variants, and this error still occurred.

Finally, I ran the same files but this time with 500 samples and ~7,000 variants and it completed as expected.
We somehow need to get around an R memory failure.

GCTA software not being recognized by genoML 1.0.3

Please make sure that this is a bug.

System information

Linux Ubuntu 16.04.10
GenoML installed both from source and using the pip install option
GenoML version 1.0.3
Python version 3.7.3

Describe the current behavior
GenoML does not recognize the GCTA software when run with the provided training files, even though GCTA is already installed and works properly on its own.

Describe the expected behavior
GenoML should either recognize the already installed GCTA software or download the latest version of it and install it again.

Code to reproduce the issue
genoml-train --geno-prefix=./example_data/discrete/training --pheno-file=./example_data/discrete/training.pheno --model-dir=./modelTest

Other info / logs
The problem is that the URL for the GCTA software included in GenoML v1.0.3 is no longer available. When I ran the code, check_dependencies.py looked for gcta_1.91.7beta.zip (https://cnsgenomics.com/software/gcta/gcta_1.91.7beta.zip), but that URL no longer works. I replaced the required version with the current one (gcta_1.92.4beta.zip; https://cnsgenomics.com/software/gcta/bin/gcta_1.92.4beta.zip), and the GCTA check now works.

Thanks,

Yatros
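A dependency check could guard against a dead download link by trying a list of candidate URLs in order. The sketch below is not how check_dependencies.py works; it is a hypothetical illustration using only the standard library and the two URLs quoted in the report above, either of which may have changed since.

```python
import urllib.error
import urllib.request

# URLs taken from the report above; either may have moved since it was written.
GCTA_URLS = [
    "https://cnsgenomics.com/software/gcta/gcta_1.91.7beta.zip",
    "https://cnsgenomics.com/software/gcta/bin/gcta_1.92.4beta.zip",
]


def url_is_reachable(url, timeout=10):
    """Return True if a HEAD request to url succeeds with a 2xx status."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        return False


def first_reachable(urls, probe=url_is_reachable):
    """Return the first URL for which probe(url) is true, or None."""
    for url in urls:
        if probe(url):
            return url
    return None
```

The `probe` parameter exists so the selection logic can be exercised without network access.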

Biowulf install optimization

Please make sure that this is a feature request.

System information

GenoML version (you are using):
Are you willing to contribute it (Yes/No):

Describe the feature and the current behavior/state.

It runs on Biowulf, but here is commentary from Susan Chacko:

genoml is installed and the webpage ready (https://hpc.nih.gov/apps/genoml.html).
My one concern is that the packages PRSice, GCTA, and Plink are installed for each user, which is redundant on a multi-user system where the applications and dependencies should be centrally installed.
I put those binaries into the genoml conda bin directory, and modified check_dependencies.py to look there first, but it seems that the other Python scripts also look in the user's $HOME/.genoml area for these executables.

It would be a really good feature if the genoml install process put these executables into the genoml bin directory, if all the scripts looked in the genoml bin directory before $HOME/.genoml/, or if it searched the regular $PATH.
Let's look into this in the full Python version.
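The search order suggested above could be sketched with `shutil.which`. The function name `find_executable` and its parameters are hypothetical, and the sketch assumes the executables sit directly inside each directory, which may not match the actual layout of $HOME/.genoml.

```python
import os
import shutil


def find_executable(name, genoml_bin_dir, home_dir=None):
    """Look in the genoml bin directory, then $HOME/.genoml, then the regular $PATH.

    Returns the full path of the first match, or None if nothing is found.
    """
    if home_dir is None:
        home_dir = os.path.expanduser("~")
    search_dirs = [genoml_bin_dir, os.path.join(home_dir, ".genoml")]
    for directory in search_dirs:
        found = shutil.which(name, path=directory)
        if found:
            return found
    return shutil.which(name)  # fall back to the regular $PATH
```

With this order, a centrally installed PRSice, GCTA, or Plink would take precedence over per-user copies, and per-user copies would still work on single-user machines.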

Will this change the current api? How?

only at install

Who will benefit with this feature?

biowulf users

Any Other info.

Option to skip tuning

The only complexity is whether an untuned trained model is able to do inference.
If it is, the implementation is straightforward. Otherwise, we may need to distinguish between inference-able and non-inference-able models. @mikeDTI, @ffaghri1, thoughts?

Add to the "Using the trained ML model for inference" section

The inference section in the quick start documentation is missing "--valid-dir=valid_dir", which specifies the output directory. Running the command straight from the documentation only prints the genoml-inference usage message. We might want to add the flag to avoid confusing new users.
