
Code for the in-dev PrediXcan Project

License: MIT License


to-delete-predixcan's Introduction

Deprecation notice

This repository contains the original reference implementation of the PrediXcan method. It is now considered deprecated and exists only for reference purposes.

Active development now takes place in the MetaXcan repository, which also provides a tutorial for the new version.

PrediXcan

PrediXcan is a gene-based association test that prioritizes genes that are likely to be causal for the phenotype.

Do you have only summary results? Try MetaXcan, a new extension of PrediXcan that uses only summary statistics. No individual level data necessary.

Mailing List

Please join this Google Group for news on releases, features, etc. For support and feature requests, you can use this repository's issue tracker.

Reference

Gamazon ER†, Wheeler HE†, Shah KP†, Mozaffari SV, Aquino-Michaels K, Carroll RJ, Eyler AE, Denny JC, Nicolae DL, Cox NJ, Im HK. (2015) A gene-based association method for mapping traits using reference transcriptome data. Nat Genet. doi:10.1038/ng.3367. (Link to paper, Link to Preprint on BioRxiv)

†:equal contribution

*:correspondence haky at uchicago dot edu
Alvaro Barbeira, Kaanan P Shah, Jason M Torres, Heather E Wheeler, Eric S Torstenson, Todd Edwards, Tzintzuni Garcia, Graeme I Bell, Dan Nicolae, Nancy J Cox, Hae Kyung Im. (2016) MetaXcan: Summary Statistics Based Gene-Level Association Method Infers Accurate PrediXcan Results link to preprint
Heather E Wheeler, Kaanan P Shah, Jonathon Brenner, Tzintzuni Garcia, Keston Aquino-Michaels, GTEx Consortium, Nancy J Cox, Dan L Nicolae, Hae Kyung Im. (2016) Survey of the Heritability and Sparsity of Gene Expression Traits Across Human Tissues. link to preprint

Software

Python version

Download software from this link

PredictDB

PredictDB hosts genetic prediction models of transcriptome levels to be used with PrediXcan. See our wiki for a report of a recent update of the prediction models.

Gene2Pheno database of results

G2Pdb, Gene to Phenotype database, hosts the results of PrediXcan applied to a variety of phenotypes. Link to prototype.

Genetic Architecture of Gene Expression Traits

Heather E Wheeler, Kaanan P Shah, Jonathon Brenner, Tzintzuni Garcia, Keston Aquino-Michaels, GTEx Consortium, Nancy J Cox, Dan L Nicolae, and Hae Kyung Im (2016) Survey of the Heritability and Sparsity of Gene Expression Traits Across Human Tissues Link to Preprint; correspondence hwheeler at luc dot edu and haky at uchicago dot edu
Database of heritability estimates link older link or older link

Acknowledgements

GTEx data

Data downloaded from dbGaP link

DGN RNA-seq data

Data downloaded from NIMH Repository and Genomics Resource

Battle, A., Mostafavi, S., Zhu, X., Potash, J.B., Weissman, M.M., McCormick, C., Haudenschild, C.D., Beckman, K.B., Shi, J., Mei, R., et al. (2014). Characterizing the genetic basis of transcriptome diversity through RNA-sequencing of 922 individuals. Genome Research 24, 14–24.

to-delete-predixcan's Issues

AttributeError: TranscriptionMatrix instance has no attribute 'gene_list'

Hi all,

I'm new to PrediXcan and trying to run this with a test on gtex_v7_Adipose_Subcutaneous_imputed_europeans_tw_0.5_signif.db.
However, I got an error saying:

2018-04-18 16:13:44.757285 Preloading weights...
Traceback (most recent call last):
  File "/exeh_3/yinly/05_PrediXcan/PrediXcan/Software/PrediXcan.py", line 230, in <module>
    main()
  File "/exeh_3/yinly/05_PrediXcan/PrediXcan/Software/PrediXcan.py", line 212, in main
    transcription_matrix.save(PRED_EXP_FILE)
  File "/exeh_3/yinly/05_PrediXcan/PrediXcan/Software/PrediXcan.py", line 117, in save
    outfile.write('FID\t' + 'IID\t' + '\t'.join(self.gene_list) + '\n') # Nb. this lists the names of rows, not of columns
AttributeError: TranscriptionMatrix instance has no attribute 'gene_list'

Here is my code:
TISSUES=("TW_Adipose_Subcutaneous")

for T in "${TISSUES[@]}"
do
    /exeh_4/ccarlos/bin/anaconda2/bin/python2.7 /exeh_3/yinly/05_PrediXcan/PrediXcan/Software/PrediXcan.py --predict \
        --weights /exeh_3/yinly/05_PrediXcan/PrediXcan/GTEx-V7_HapMap-2017-11-29/gtex_v7_Adipose_Subcutaneous_imputed_europeans_tw_0.5_signif.db \
        --dosages /exeh_3/yinly/05_PrediXcan/Dosage/ \
        --samples /exeh_3/yinly/05_PrediXcan/Dosage/samples.txt \
        --output_prefix /exeh_3/yinly/05_PrediXcan/Result/gtex_v7_Adipose_Subcutaneous_imputed_europeans
done

It worked well on the provided example data, but when I run it on my own data I get the error above. The path to the genotype dosages is correct. I also checked the file format, and it matches the provided sample. A few lines are pasted below (first 5 columns only).
chr21 rs3965725 14597009 A C
chr21 rs2847439 14600130 A G
chr21 rs2775537 14601415 G A
chr21 rs2801233 14603577 C T
chr21 rs2266579 14604361 C T
chr21 rs2261012 14607040 T A
chr21 rs2801317 14608578 T A
chr21 rs2801314 14613701 A G
chr21 rs2801301 14627705 G A
chr21 rs2261645 14630588 A G

Please let me know if more information is required. Thanks a lot.
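A quick diagnostic that may help narrow this down: the sketch below counts how many rsIDs in one gzipped dosage file also appear in the weights database. It assumes the model database has a table named `weights` with an `rsid` column and that the dosage file is whitespace-delimited with the rsID in the second column; the function name `count_rsid_overlap` is hypothetical and is not part of PrediXcan. A match count of zero would be consistent with the matrix never being populated, although the traceback alone does not confirm that cause.

```python
import gzip
import sqlite3


def count_rsid_overlap(weights_db, dosage_gz):
    """Return (rows_in_dosage_file, rows_whose_rsid_is_in_the_model)."""
    conn = sqlite3.connect(weights_db)
    try:
        # Assumed schema: table `weights` with an `rsid` column.
        cursor = conn.execute("SELECT DISTINCT rsid FROM weights")
        model_rsids = {row[0] for row in cursor}
    finally:
        conn.close()

    n_total = 0
    n_match = 0
    with gzip.open(dosage_gz, "rt") as handle:
        for line in handle:
            fields = line.split()
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            n_total += 1
            if fields[1] in model_rsids:  # second column holds the rsID
                n_match += 1
    return n_total, n_match
```

Calling `count_rsid_overlap("model.db", "chr21.txt.gz")` on each chromosome file (both file names are placeholders) gives a rough picture of whether the dosage files and the model share any variants.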

Covariance file flag

Hello, I am interested in running S-PrediXcan in diverse populations. The DB folder at PredictDB contains covariance files alongside the .db files. Which flag or option should I use to include them?
Thank you.

Not possible to download DGN Whole Blood LASSO SQLite db file

Hello,
When trying to download the DGN Whole Blood LASSO SQLite db file, the following error occurs:
AccessDeniedAccessDenied8E6090A9FEF0A18664M1u8fxfIu5kXhdQtR14J8/J2N7rXgEHAI6r+lxQr860nGCZ8H807uxa7I8boxe

There is no problem with the other files.

AttributeError: TranscriptionMatrix instance has no attribute 'gene_list'

Hi all,

I'm trying to run PrediXcan (master branch, commit 8bca774) with 3615 TCGA samples on both TW_Lung_0.5.db and TW_Breast_Mammary_Tissue_0.5.db.
However, I got an error saying

  File "/mnt/SCRATCH/PrediXcan/Software/PrediXcan.py", line 214, in <module>
    main()
  File "/mnt/SCRATCH/PrediXcan/Software/PrediXcan.py", line 197, in main
    transcription_matrix.save(PRED_EXP_FILE)
  File "/mnt/SCRATCH/PrediXcan/Software/PrediXcan.py", line 117, in save
    outfile.write('FID\t' + 'IID\t' + '\t'.join(self.gene_list) + '\n') # Nb. this lists the names of rows, not of columns
AttributeError: TranscriptionMatrix instance has no attribute 'gene_list'

I've tried running PrediXcan with and without the --genelist flag, but neither worked. The exact command I used is:

/path/to/PrediXcan.py --predict --dosages /path/to/dosage/ --dosages_prefix chr --samples tcga.impute2.fam --weights /path/to/TW_Breast_Mammary_Tissue_0.5.db --output_dir tcga_3615_brca_gtex_model (--genelist /path/to/gene_breast.txt) #The gene_breast.txt is actually #1 column from TW_Breast_Mammary_Tissue_0.5.txt

Please let me know if more information is required. Thank you so much!

Does the genotype build make a difference on prediction??

Hello, I would like to know whether the genotype build affects gene expression prediction. My genotype data are on hg18, and I am using GTEx v7 as the reference in PrediXcan. Do I need to convert to hg19, and then to hg38, before running the prediction? Would that improve prediction accuracy?
I have already done prediction using hg18 genotypes without any errors. Please advise.

PrediXcan fails with MemoryError when a very large number of individuals is provided

If the number of individuals is around 150,000, PrediXcan fails with a MemoryError exception.
I've used the DGN model.

./PrediXcan.py --predict --assoc --linear \
      --weights ./models/DGN-HapMap-2015/DGN-WB_0.5.db \
      --dosages /dir --samples myfile.fam \
      --pheno pheno.txt --pheno_name myphenotype \
      --output_prefix test

    2017-04-06 11:24:08.266724 Preloading weights...
    2017-04-06 11:24:10.281721 Processing chr1.txt.gz
    Traceback (most recent call last):
      File "./PrediXcan.py", line 230, in <module>
        main()
      File "./PrediXcan.py", line 211, in main
        transcription_matrix.update(gene, weight, ref_allele, allele, dosage_row)
      File "./PrediXcan.py", line 101, in update
        self.D = np.zeros((len(self.gene_list), len(dosage_row))) # Genes x Cases
    MemoryError
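The failing line allocates a dense genes-by-individuals array with `np.zeros`, which defaults to 8-byte floats, so its size grows linearly with the number of individuals. The back-of-the-envelope sketch below estimates that size and how many individuals would fit in a given memory budget. The gene count of 10,000 is a hypothetical round number, not the actual size of the DGN model, and both function names are illustrative only.

```python
def dense_matrix_bytes(n_genes, n_samples, bytes_per_value=8):
    """Bytes needed for a dense genes x samples array of 8-byte floats."""
    return n_genes * n_samples * bytes_per_value


def samples_per_chunk(n_genes, budget_bytes, bytes_per_value=8):
    """Largest number of samples whose matrix fits within budget_bytes."""
    return max(1, budget_bytes // (n_genes * bytes_per_value))


n_genes = 10_000      # hypothetical gene count, for illustration only
n_samples = 150_000   # number of individuals reported in this issue

print(dense_matrix_bytes(n_genes, n_samples) / 1e9, "GB")
print(samples_per_chunk(n_genes, 2 * 10**9), "samples fit in a 2 GB budget")
```

Under these assumptions the matrix alone needs about 12 GB, which suggests splitting the individuals into batches (with matching sample files) as one possible workaround.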

Prediction example tar file is not available

The link to the example file (Prediction example tar file) is not working

Unit of the predicted values

Hello,
I wanted to know the unit of the values in the predicted gene expression output file, as the value ranges from negative to positive. I have tried to look up this information in the original paper, but couldn't find it.
Are the values on a log scale, or relative to some housekeeping gene? I ask because GTEx reports expression in TPM.
If I am missing something, please direct me to the right publication.
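For context on why predicted values can be negative: PrediXcan-style prediction is a weighted sum of allele dosages, with weights taken from the model database, so a gene with negative weights can yield a negative prediction. The sketch below is a minimal illustration with made-up rsIDs and weights; it omits the allele-matching step and is not the project's code. The scale of the result would follow whatever normalisation was applied to the training expression data, which this page does not state.

```python
def predict_expression(weights, dosages):
    """Weighted sum of dosages for one gene and one individual.

    weights: dict mapping rsID -> model weight for the gene
    dosages: dict mapping rsID -> allele dosage, a value between 0 and 2
    SNPs that are missing from `dosages` are skipped.
    """
    return sum(w * dosages[rsid] for rsid, w in weights.items() if rsid in dosages)


# Hypothetical weights for one gene and dosages for one individual.
gene_weights = {"rs1": 0.25, "rs2": -0.5, "rs3": 0.125}
person_dosages = {"rs1": 1.0, "rs2": 2.0}  # rs3 was not genotyped

print(predict_expression(gene_weights, person_dosages))  # 0.25*1 + (-0.5)*2 = -0.75
```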

Error in smooth.spline during Filtering on significance

Hello,

I got an error while running post_process.py. The error is as follows:
$ ./post_process.py
Generating ../../data/output/dbs/Test_Test_alpha0.5_window1e6.db...
Creating weights table...
Creating extra table...
Creating construction table...
Creating meta_data table...
Commiting...
Done.
Filtering Test on significance.
Error in smooth.spline(lambda, pi0, df = smooth.df) :
missing or infinite values in inputs are not allowed
Calls: package_db -> qvalue -> pi0est -> smooth.spline
In addition: Warning messages:
1: In min(p) : no non-missing arguments to min; returning Inf
2: In max(p) : no non-missing arguments to max; returning -Inf
3: In min(p) : no non-missing arguments to min; returning Inf
4: In max(p) : no non-missing arguments to max; returning -Inf
Execution halted

It looks like the weights table in the unfiltered SQLite database has blank values for p.value and for the last four other columns as well.

Thanks,
Phuwanat
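To confirm which columns are empty before the q-value step, a small check like the one below can count missing entries per column. It reads the column names from the database itself rather than assuming them; the function name `count_missing` is hypothetical, and treating `''` and `'NA'` as missing is an assumption about how blanks were written.

```python
import sqlite3


def count_missing(db_path, table):
    """Return (row_count, {column: number_of_missing_values}) for one table."""
    conn = sqlite3.connect(db_path)
    try:
        # PRAGMA table_info yields one row per column; index 1 is the column name.
        info = conn.execute('PRAGMA table_info("{}")'.format(table)).fetchall()
        columns = [row[1] for row in info]
        total = conn.execute('SELECT COUNT(*) FROM "{}"'.format(table)).fetchone()[0]
        missing = {}
        for col in columns:
            query = (
                'SELECT COUNT(*) FROM "{t}" '
                'WHERE "{c}" IS NULL OR "{c}" = \'\' OR "{c}" = \'NA\''
            ).format(t=table, c=col)
            missing[col] = conn.execute(query).fetchone()[0]
    finally:
        conn.close()
    return total, missing
```

If every row is missing its p-value, the `min(p)` and `max(p)` warnings in the log above would be the expected symptom, since there is nothing left to pass to the q-value estimation.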

Link to HOWTO-beta.md isn't working

Hello Authors,

I am trying to access the link to HOWTO-beta.md under the software section of the repository. But I am getting 404 (i.e. page not found) error.

Can someone please help me?

Thanks.

Best,
Tushar

Population difference

Hi, a quick question: Can I use the trained .db files to predict expression level in the Chinese population? I noticed all your files are named "...imputed_europeans".

error when training prediction model using the example data

Hi dear PrediXcan group,

I ran into an "Unknown option" error when trying to train a custom prediction model using the example input and script (train_models.py). The recommended versions of Python and R were already in the default environment.

[rchen7@hii example] python train_models.py
qsub -v study=gEUVADIS,expr_RDS=../../data/intermediate/expression_phenotypes/geuvadis.expr.RDS,geno=../../data/intermediate/genotypes/geuvadis.snps.chr1.txt,gene_annot=../../data/intermediate/annotations/gene_annotation/gencode.v12.genes.parsed.RDS,snp_annot=../../data/intermediate/annotations/snp_annotation/geuvadis.annot.chr1.RDS,n_k_folds=10,alpha=0.5,out_dir=../../data/intermediate/model_by_chr/,chrom=1,snpset=HapMap,window=1e6 -N gEUVADIS_model_chr1 -d /hii/work/r/rchen7/project/predixcan/PredictDBPipeline/joblogs/example train_model_by_chr.pbs
Unknown option: d
Usage:
qsub [-a start_time] [-A account] [-e err_path] [-I] [-l resource_list]
[-m mail_options] [-M user_list] [-N job_name] [-o out_path] [-p
priority] [-q destination] [-v variable_list] [-V] [-W
additional_attributes] [-h] [script]

qsub -v study=gEUVADIS,expr_RDS=../../data/intermediate/expression_phenotypes/geuvadis.expr.RDS,geno=../../data/intermediate/genotypes/geuvadis.snps.chr2.txt,gene_annot=../../data/intermediate/annotations/gene_annotation/gencode.v12.genes.parsed.RDS,snp_annot=../../data/intermediate/annotations/snp_annotation/geuvadis.annot.chr2.RDS,n_k_folds=10,alpha=0.5,out_dir=../../data/intermediate/model_by_chr/,chrom=2,snpset=HapMap,window=1e6 -N gEUVADIS_model_chr2 -d /hii/work/r/rchen7/project/predixcan/PredictDBPipeline/joblogs/example train_model_by_chr.pbs
Unknown option: d
Usage:
qsub [-a start_time] [-A account] [-e err_path] [-I] [-l resource_list]
[-m mail_options] [-M user_list] [-N job_name] [-o out_path] [-p
priority] [-q destination] [-v variable_list] [-V] [-W
additional_attributes] [-h] [script]

qsub -v study=gEUVADIS,expr_RDS=../../data/intermediate/expression_phenotypes/geuvadis.expr.RDS,geno=../../data/intermediate/genotypes/geuvadis.snps.chr3.txt,gene_annot=../../data/intermediate/annotations/gene_annotation/gencode.v12.genes.parsed.RDS,snp_annot=../../data/intermediate/annotations/snp_annotation/geuvadis.annot.chr3.RDS,n_k_folds=10,alpha=0.5,out_dir=../../data/intermediate/model_by_chr/,chrom=3,snpset=HapMap,window=1e6 -N gEUVADIS_model_chr3 -d /hii/work/r/rchen7/project/predixcan/PredictDBPipeline/joblogs/example train_model_by_chr.pbs
Unknown option: d
Usage:
qsub [-a start_time] [-A account] [-e err_path] [-I] [-l resource_list]
[-m mail_options] [-M user_list] [-N job_name] [-o out_path] [-p
priority] [-q destination] [-v variable_list] [-V] [-W
additional_attributes] [-h] [script]
.
.
.
.

qsub -v study=gEUVADIS,expr_RDS=../../data/intermediate/expression_phenotypes/geuvadis.expr.RDS,geno=../../data/intermediate/genotypes/geuvadis.snps.chr22.txt,gene_annot=../../data/intermediate/annotations/gene_annotation/gencode.v12.genes.parsed.RDS,snp_annot=../../data/intermediate/annotations/snp_annotation/geuvadis.annot.chr22.RDS,n_k_folds=10,alpha=0.5,out_dir=../../data/intermediate/model_by_chr/,chrom=22,snpset=HapMap,window=1e6 -N gEUVADIS_model_chr22 -d /hii/work/r/rchen7/project/predixcan/PredictDBPipeline/joblogs/example train_model_by_chr.pbs
Unknown option: d
Usage:
qsub [-a start_time] [-A account] [-e err_path] [-I] [-l resource_list]
[-m mail_options] [-M user_list] [-N job_name] [-o out_path] [-p
priority] [-q destination] [-v variable_list] [-V] [-W
additional_attributes] [-h] [script]

Any idea why this happened? I would really appreciate your advice.

Thanks

Ruoxi

The default test in Rscript

correction in documentation

The dosage file format section reads as, "Columns are snpid rsid position allele1 allele2 MAF id1 ..... idn.", where "snpid" is actually the chromosome number (column 1).
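The corrected column order can be made concrete with a small parser. This is a sketch that assumes whitespace-delimited lines laid out as chromosome, rsID, position, allele1, allele2, MAF, followed by one dosage per individual; `DosageRow` and `parse_dosage_line` are hypothetical names, and the MAF and dosage values in the example are invented.

```python
from collections import namedtuple

DosageRow = namedtuple(
    "DosageRow", "chromosome rsid position allele1 allele2 maf dosages"
)


def parse_dosage_line(line):
    """Split one dosage line into its six fixed columns plus per-individual dosages."""
    fields = line.split()
    if len(fields) < 7:
        raise ValueError("expected 6 fixed columns and at least one dosage value")
    chromosome, rsid, position, allele1, allele2, maf = fields[:6]
    dosages = [float(value) for value in fields[6:]]
    return DosageRow(chromosome, rsid, int(position), allele1, allele2, float(maf), dosages)


row = parse_dosage_line("chr21 rs3965725 14597009 A C 0.25 0 1 2")
print(row.chromosome, row.rsid, len(row.dosages))
```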

What is the parameter setting when using the elastic net in prediction

Hi,
Sorry to bother you. What parameter settings are used for the elastic net in prediction? Are they just the defaults of the glmnet package? For instance, how is lambda chosen for real data, and what settings did you use in your Nature Genetics paper?

Thanks a lot.
Bests,
Zhongshang

Breakdown of PrediXcan Framework

Hello!

I've been looking over PrediXcan to understand its general functions. I have a few clarifying questions about the Framework image presented in the corresponding paper, "A gene-based association method for mapping traits using reference transcriptome data."

Under Genetic Variation, there are columns and rows that represent rsids and ids for each individual. What do 0, 1, and 2 specifically represent in this scenario? Also, why do the rsids appear to alternate between 1 and 2 instead of continuing the count?
Under Observed Transcriptome, there are columns and rows that represent genes and ids for each individual. What do the numbers in the table represent? Any significance between small/big numbers?

If these questions aren't ideal to be presented here, can you direct me to someone that can further explain?

PrediXcanFramework.pdf

Error in documentation

The default test in the R script (PrediXcanAssociation.R) is 'linear', not 'logistic'.
The following lines are copied from the script:

# Default TEST_TYPE: linear
if (is.null(argv$TEST_TYPE)) {
  argv$TEST_TYPE <- "linear"
}

getting error while running PrediXcan.py for Predicting/Imputing Expression

Hello Authors,

I am facing an error while I am running PrediXcan.py script for Predicting/Imputing Expression. Below is my command line:

python ./PrediXcan.py --predict --dosages dosages --dosages_prefix chr --sample dosages/sample.txt -- weights Liver_LASSO.db --output_dir .

Below are first 3 lines of my sample.txt file:
10878 10878 0 0 0 0 -9
11655 11655 0 0 0 0 -9
12201 12201 0 0 0 0 -9

Below is the log file content:

2016-04-14 14:06:55.609509 Preloading weights...
2016-04-14 14:06:56.201847 Processing chr1.dosage.data.txt.gz
2016-04-14 14:24:10.396559 Processing chr10.dosage.data.txt.gz
2016-04-14 14:35:32.422774 Processing chr11.dosage.data.txt.gz
2016-04-14 14:46:11.936055 Processing chr12.dosage.data.txt.gz
2016-04-14 14:56:42.306676 Processing chr13.dosage.data.txt.gz
2016-04-14 15:04:40.451957 Processing chr14.dosage.data.txt.gz
2016-04-14 15:12:00.289541 Processing chr15.dosage.data.txt.gz
2016-04-14 15:18:19.583822 Processing chr16.dosage.data.txt.gz
2016-04-14 15:25:31.726997 Processing chr17.dosage.data.txt.gz
2016-04-14 15:31:46.137181 Processing chr18.dosage.data.txt.gz
2016-04-14 15:38:07.463976 Processing chr19.dosage.data.txt.gz
2016-04-14 15:43:57.293840 Processing chr2.dosage.data.txt.gz
2016-04-14 16:01:55.391602 Processing chr20.dosage.data.txt.gz
2016-04-14 16:07:00.779805 Processing chr21.dosage.data.txt.gz
2016-04-14 16:10:12.819881 Processing chr22.dosage.data.txt.gz
2016-04-14 16:13:29.788277 Processing chr3.dosage.data.txt.gz
2016-04-14 16:28:40.400313 Processing chr4.dosage.data.txt.gz
2016-04-14 16:44:31.586482 Processing chr5.dosage.data.txt.gz
2016-04-14 16:58:26.106604 Processing chr6.dosage.data.txt.gz
2016-04-14 17:12:43.945125 Processing chr7.dosage.data.txt.gz
2016-04-14 17:25:40.343552 Processing chr8.dosage.data.txt.gz
2016-04-14 17:38:02.067445 Processing chr9.dosage.data.txt.gz
Traceback (most recent call last):
  File "/data/ssalimi/packages/PrediXcan/Software/PrediXcan.py", line 198, in <module>
    main()
  File "/data/ssalimi/packages/PrediXcan/Software/PrediXcan.py", line 181, in main
    transcription_matrix.save(PRED_EXP_FILE)
  File "/data/ssalimi/packages/PrediXcan/Software/PrediXcan.py", line 118, in save
    outfile.write('\t'.join(next(sample_generator)) + '\t' + '\t'.join(map(str, self.D[:,col]))+'\n')
StopIteration

I am not sure why I am having this error. Can you please help me to understand the issue and also help me to resolve it?

Thanks.

Best,
Tushar
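The `StopIteration` is raised by `next(sample_generator)` while the output is written one matrix column at a time, which would be consistent with the samples file having fewer rows than the dosage files have individuals. The sketch below compares the two counts. It assumes gzipped, whitespace-delimited dosage files with six fixed columns before the per-individual dosages and one individual per non-blank line in the samples file; the function names are hypothetical.

```python
import gzip

N_FIXED_COLUMNS = 6  # chromosome, rsID, position, allele1, allele2, MAF


def count_samples(samples_path):
    """Number of non-blank lines in the samples file."""
    with open(samples_path) as handle:
        return sum(1 for line in handle if line.strip())


def count_dosage_individuals(dosage_gz):
    """Number of dosage columns on the first non-blank line of a dosage file."""
    with gzip.open(dosage_gz, "rt") as handle:
        for line in handle:
            if line.strip():
                return len(line.split()) - N_FIXED_COLUMNS
    return 0


def check_sample_counts(samples_path, dosage_gz):
    """Return (n_samples, n_individuals, counts_match)."""
    n_samples = count_samples(samples_path)
    n_individuals = count_dosage_individuals(dosage_gz)
    return n_samples, n_individuals, n_samples == n_individuals
```

Running `check_sample_counts` against each chromosome file would show whether any file disagrees with the samples file before the long prediction run starts.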

A Query Regarding Sample size

Hi,

This is a question rather than a bug report. What is the minimum sample size (how many cases and how many controls) needed to predict gene expression with PrediXcan?

Thanks in advance,

Waqas.

Error with dosage/samples.txt files

My dosage files contain lines with no rs ID, and PrediXcan cannot handle these rows. I removed those lines and ran PrediXcan again, but received the error below:

ERROR: There are not enough rows in your sample file! Make sure dosage files and sample files have the same number of individuals in the same order.

Would you happen to have a solution for this?