hakyimlab / summary-gwas-imputation

harmonization, liftover, and imputation of summary statistics from GWAS

License: MIT License

Python 93.34% R 6.66%

summary-gwas-imputation's Introduction

This repository contains tools to harmonize GWAS summary statistics to a given reference. The main application is harmonizing a public GWAS's variants to those in the GTEx study and imputing summary statistics for missing variants.

You can also find tools to run colocalization (COLOC or Enloc) on them.

Harmonization and Imputation Overview

The first step consists of compiling a dataset with variants' definitions and genotypes in a given sample. A script in this repository will convert them to an internal format using Apache Parquet for fast querying.

The second step is harmonizing each GWAS to the reference data set, optionally including a liftover conversion of chromosomal coordinates.

The third step is imputing missing variants' summary statistics. Because of the long computation time, this step can be split across multiple executions to decrease total running time.
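As a rough illustration of how the imputation step can be split, the sketch below enumerates one job per chromosome and sub-batch. The flag names -chromosome, -sub_batches and -sub_batch are taken from the commands quoted in the issues further down this page; the helper names and the choice of 10 sub-batches are only assumptions for the example, and all other required arguments are omitted.

```python
from itertools import product


def imputation_jobs(chromosomes, sub_batches):
    """Return every (chromosome, sub_batch) pair; each pair is one independent run."""
    return list(product(chromosomes, range(sub_batches)))


def job_arguments(chromosome, sub_batch, sub_batches):
    """Build only the batching flags; every other required argument is left out."""
    return [
        "-chromosome", str(chromosome),
        "-sub_batches", str(sub_batches),
        "-sub_batch", str(sub_batch),
    ]


jobs = imputation_jobs(range(1, 23), 10)
print(len(jobs))  # 22 chromosomes x 10 sub-batches = 220 runs
print(" ".join(job_arguments(*jobs[0], 10)))
```

Each pair can then be submitted as a separate cluster job, and the per-job outputs are collected in the postprocessing step.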

The last step is a simple postprocessing that collects the harmonized GWAS and the imputed variants and produces a final product.

Prerequisites

The basic requirements for running GWAS summary-imputation are python>=3.5 with the following packages:

pandas=0.25.3
scipy=1.4.1
numpy=1.18.1
bgen_reader=3.0.2
cyvcf2=0.20.0
pyliftover=0.4
pyarrow=0.11.1

A quick-and-dirty solution to install these requirements is using Miniconda and the file src/conda_env.yaml in this repository to create a working environment.

conda env create -f /path/to/this/repo/src/conda_env.yaml
conda activate imlab
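Several of the issues further down this page involve package versions that differ from the ones listed above. A minimal sketch for comparing an active environment against that list follows; it assumes Python 3.8 or later (for importlib.metadata), and the function names are made up for this example. Some packages may be registered under a slightly different distribution name, in which case the lookup simply reports them as not found.

```python
from importlib import metadata  # available from Python 3.8

# Versions copied from the prerequisites list above.
PINNED = {
    "pandas": "0.25.3",
    "scipy": "1.4.1",
    "numpy": "1.18.1",
    "bgen_reader": "3.0.2",
    "cyvcf2": "0.20.0",
    "pyliftover": "0.4",
    "pyarrow": "0.11.1",
}


def installed_version(package):
    """Return the installed version string, or None if the package is not found."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def version_report(pinned, lookup=installed_version):
    """Map each package to (wanted, found, matches)."""
    report = {}
    for name, wanted in pinned.items():
        found = lookup(name)
        report[name] = (wanted, found, found == wanted)
    return report


for name, (wanted, found, matches) in version_report(PINNED).items():
    status = "ok" if matches else "differs"
    print(f"{name}: wanted {wanted}, found {found} ({status})")
```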

Other tools

Colocalization analysis via COLOC is supported on harmonized or imputed GWAS. We include a modified copy of COLOC for ease of use (original source at github). We also include a modified copy of CTIMP for ease of use (original source at github). See their respective licenses at their repositories, or here in 3rd_licenses.

Final comments

Please see the wiki for extensive, in-detail documentation.


summary-gwas-imputation's Issues

gtex_v8_eur_filtered.txt.gz

Thank you for sharing all this super useful code. Would you be willing to also share "gtex_v8_eur_filtered.txt.gz"?

Many thanks!

error when run_coloc.py

Hi when I run the run_coloc.py for the example data
python3.7 $REPO/run_coloc.py -gwas_mode pvalue -gwas $DATA/processed_summary_imputation/imputed_CARDIoGRAM_C4D_CAD_ADDITIVE.txt.gz -eqtl_mode bse -eqtl GTEx_Analysis_v8_eQTL/Whole-Blood.v8.signif-variant-gene-pairs.txt.gz -gwas_sample_size 149461 -eqtl_sample_size 670 -p1 2.4769e-05 -p2 0.00288782 -p12 1.264e-06 -parsimony 8 -output CARDIoGRAM_C4D_CAD_ADDITIVE_Whole_Blood.txt.gz

I got this error:
File "/scg/apps/software/summary-gwas-imputation/20220620/bin/run_coloc.py", line 67, in <module>
run(args)
File "/apps/software/summary-gwas-imputation/20220620/bin/run_coloc.py", line 12, in run
Coloc.initialize(args.coloc_script)
File "/apps/software/summary-gwas-imputation/20220620/bin/genomic_tools_lib/external_tools/coloc/Coloc.py", line 25, in initialize
coloc_r = importr('coloc').coloc_abf
File "/home/.conda/envs/Coloc/lib/python3.7/site-packages/rpy2/robjects/packages.py", line 453, in importr
env = _get_namespace(rname)

Do you know what is wrong?

No objects to concatenate in post processing step

I am using the gwas_summary_imputation_postprocess.py, and am getting a pandas.concat related error in what looks like the process_imputed function. At first, I thought this may be because one of the imputed files was empty, but I ran wc -l to check, and all the files have at least 100 lines. This is the code I am using:

I am running this on an HPC cluster, so it does create a tmp directory, but I still get this error when running on an interactive node.

for FILE in *.gz; do wc -l $FILE; done

mkdir output
mv Ischemic_Heart_Disease_ADDITIVE_*.gz output


ls output/imputed_files

printf "\n Starting actual work now.\n"


singularity exec ${CONTAINER} python3 /NGS_tools/summary-gwas-imputation/src/gwas_summary_imputation_postprocess.py \
-gwas_file ${HARMONIZED_FILE} \
-folder /tmp/tmp.oCEs6Orzvs/output \
-pattern Ischemic_Heart_Disease_ADDITIVE_chr*.gz \
-parsimony 7 \
-output ./imputed_${PHENOTYPE}_ADDITIVE.txt.gz

and this is what an imputation file looks like
Ischemic_Heart_Disease_ADDITIVE_chr9_sb9_reg0.1_ff0.01_by_region.txt.gz

Thank you
Harry
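One thing that may be worth checking here is how the -pattern value is interpreted. The sketch below compares shell-style (glob) matching with regular-expression matching on the file name quoted above. Whether the postprocessing script treats -pattern as a regular expression is an assumption, not something confirmed on this page; if it does, the glob-style pattern would match no files, which would be consistent with an empty concatenation.

```python
import fnmatch
import re

name = "Ischemic_Heart_Disease_ADDITIVE_chr9_sb9_reg0.1_ff0.01_by_region.txt.gz"
glob_style = "Ischemic_Heart_Disease_ADDITIVE_chr*.gz"
regex_style = r"Ischemic_Heart_Disease_ADDITIVE_chr.*\.gz"

# Read as a glob, "*" means "any run of characters".
glob_match = fnmatch.fnmatch(name, glob_style)

# Read as a regular expression, "r*" means "zero or more r" and "." is one character.
regex_match_of_glob = re.search(glob_style, name) is not None

# A regular expression with the meaning the glob was meant to have.
regex_match = re.search(regex_style, name) is not None

print(glob_match, regex_match_of_glob, regex_match)
```

Quoting the pattern on the command line also keeps the shell from expanding it before the script sees it.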

In the example data on summary statistics imputation, there is a .bed file eur_ld.bed.gz to run per-region imputation. Is summary statistics-based imputation by region applicable only for population-specific GWASes? Are there any ld block schemes applicable to multipopulation studies?

Error in gwas harmonization step

Hello -

I'm working through the tutorial and trying out the new COVID-19 gwas summary statistics data. I keep getting the error below:

python ./summary-gwas-imputation-master/src/gwas_parsing.py
-gwas_file ./Desktop/gwastools/covida2.preptest.txt
-liftover ./DATA/liftover/hg19ToHg38.over.chain.gz
-snp_reference_metadata ./DATA/reference_panel_1000G/variant_metadata.txt.gz METADATA
-output_column_map markername variant_id
-output_column_map noneffect_allele non_effect_allele
-output_column_map effect_allele effect_allele
-output_column_map beta effect_size
-output_column_map p_dgc pvalue
-output_column_map chr chromosome
--chromosome_format
-output_column_map bp_hg19 position
-output_column_map effect_allele_freq frequency
--insert_value sample_size 709010 --insert_value n_cases 5582
-output_order variant_id panel_variant_id chromosome position effect_allele non_effect_allele frequency pvalue zscore effect_size standard_error sample_size n_cases
-output ./Desktop/gwastools/covida2.harmonized.txt.gz

INFO - Parsing input GWAS
Traceback (most recent call last):
File "./summary-gwas-imputation-master/src/gwas_parsing.py", line 311, in <module>
run(args)
File "./summary-gwas-imputation-master/src/gwas_parsing.py", line 258, in run
enforce_numeric_columns=args.enforce_numeric_columns)
File "/Users/XYZ/Desktop/gwastools/summary-gwas-imputation-master/src/genomic_tools_lib/file_formats/gwas/GWAS.py", line 18, in load_gwas
d = _ensure_columns(d, input_pvalue_fix, enforce_numeric_columns)
File "/Users/XYZ/Desktop/gwastools/summary-gwas-imputation-master/src/genomic_tools_lib/file_formats/gwas/GWAS.py", line 31, in _ensure_columns
d[EFFECT_ALLELE] = d[EFFECT_ALLELE].str.upper()
File "/usr/local/anaconda3/envs/imlabtools/lib/python3.7/site-packages/pandas/core/generic.py", line 5175, in __getattr__
return object.__getattribute__(self, name)
File "/usr/local/anaconda3/envs/imlabtools/lib/python3.7/site-packages/pandas/core/accessor.py", line 175, in __get__
accessor_obj = self._accessor(obj)
File "/usr/local/anaconda3/envs/imlabtools/lib/python3.7/site-packages/pandas/core/strings.py", line 1917, in __init__
self._inferred_dtype = self._validate(data)
File "/usr/local/anaconda3/envs/imlabtools/lib/python3.7/site-packages/pandas/core/strings.py", line 1967, in _validate
raise AttributeError("Can only use .str accessor with string " "values!")
AttributeError: Can only use .str accessor with string values!

I've tried several ways of loading the gwastools packages (standard install, specifying python 3.5, etc) and tweaked the summary statistic files several times, too. Notably, I don't receive this error using the quick harmonization method outlined in the MetaXcan MASHR GTEx V8 tutorials, so I don't think it relates to the summary GWAS file. Any help and guidance would be greatly appreciated!

Very respectfully,
Dan
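The final error comes from pandas: the .str accessor is only available on columns holding strings. A minimal sketch of how this can be reproduced and diagnosed is below. The idea that the effect-allele column ended up entirely missing or non-string after loading is an assumption about this particular file, and non_string_columns is a hypothetical helper, not part of this repository.

```python
import numpy as np
import pandas as pd


def non_string_columns(frame, columns):
    """Return the requested columns that are empty or contain non-string values."""
    bad = []
    for column in columns:
        values = frame[column].dropna()
        if values.empty or not values.map(lambda v: isinstance(v, str)).all():
            bad.append(column)
    return bad


# Toy table: the effect allele column holds only missing values (float dtype).
gwas = pd.DataFrame({
    "effect_allele": [np.nan, np.nan],
    "non_effect_allele": ["A", "g"],
})
print(non_string_columns(gwas, ["effect_allele", "non_effect_allele"]))

raised = False
try:
    gwas["effect_allele"].str.upper()
except AttributeError as error:
    raised = True
    print(error)
```

Running the same check on the input file, after renaming columns the way the -output_column_map arguments do, would show whether an allele column is being read as something other than text.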

KeyError: "['zscore'] not in index"

Hi,
I am having trouble with gwas_summary_imputation.py after having harmonized my sumstats.
I get the following error:
KeyError: "['zscore'] not in index"

With the script:
python3 gwas_summary_imputation.py
-by_region_file eur_ld.bed.gz
-gwas_file meta_harm_200803.tab.gz
-parquet_genotype chr1.variants.parquet
-parquet_genotype_metadata variant_metadata.parquet
-window 100000
-parsimony 7
-chromosome 1
-regularization 0.1
-frequency_filter 0.01
-sub_batches 5
-sub_batch 0
--standardise_dosages
-output meta_harm_imput_200803_chr1_sb0.tab \

Is a zscore necessary for this step?
My input has the following columns:

variant_id panel_variant_id chromosome position effect_allele non_effect_allele frequency pvalue effect_size standard_error sample_size n_cases

Thank you for your help.
Best regards,
Laura.
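The error message suggests the script selects a zscore column, and other gwas_parsing.py commands quoted on this page list zscore in -output_order. If the harmonized file only has effect_size and standard_error, a z-score can be derived as their ratio. The sketch below assumes that definition and uses a hypothetical helper name and toy values.

```python
import pandas as pd


def add_zscore(frame):
    """Return a copy with zscore = effect_size / standard_error."""
    out = frame.copy()
    out["zscore"] = out["effect_size"] / out["standard_error"]
    return out


example = pd.DataFrame({
    "effect_size": [-0.0256, 0.0268],
    "standard_error": [0.0152, 0.0154],
})
print(add_zscore(example))
```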

liftover from hg19 -> hg38

Hi guys

Big fan of your work; it has made it super easy to liftover coordinates (38 -> 19).
However, I am finding myself in a struggle to do it the other way around. I of course change the chain file in the script, so it looks like the following:

module load tools anaconda3/4.4.0

python3 /path/1_liftover/src/gwas_parsing.py \
-gwas_file /path/liftover/liftover_b37.txt \
-liftover /path/hg19ToHg38.over.chain.gz \
-output_column_map POS position \
-output_column_map A0 non_effect_allele \
-output_column_map A1 effect_allele \
-output_column_map EAF frequency \
-output_column_map BETA effect_size \
-output_column_map PVAL pvalue \
-output_column_map CHR chromosome \
-output_column_map SNP variant_id \
-output_column_map SE standard_error \
-output_column_map N sample_size \
-output_order variant_id chromosome position effect_allele non_effect_allele frequency effect_size standard_error pvalue sample_size ID \
-output /path/liftover/liftover_b38.txt

The script is able to run and finishes in around 2 minutes. Logs look like this:

INFO - Parsing input GWAS
INFO - loaded 10101487 variants
INFO - Performing liftover
INFO - 10101487 variants after liftover
INFO - Saving...
INFO - Finished converting GWAS in 116.5172207057476 seconds

but when I read the output, there are no coordinates for chr or pos either. Just some good ol' NA's. Does the software support conversion from hg19 to 38? If yes, can you figure out where I am the fool?

Cheers :)

Soren
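To check a chain file independently of the parsing script, a single position can be lifted directly with pyliftover, which is among the listed prerequisites. The sketch below is only a rough aid: lift_position is a hypothetical helper, the stub class stands in for pyliftover.LiftOver so the example runs without a chain file, and the 1-based input convention is an assumption. It relies on convert_coordinate taking a 0-based position and returning None for an unknown chromosome or a list of (chromosome, position, strand, score) tuples.

```python
def lift_position(converter, chromosome, position):
    """Lift a 1-based position; return (chromosome, position) or None if unmapped."""
    name = str(chromosome)
    if not name.startswith("chr"):
        name = "chr" + name  # UCSC chain files name chromosomes "chr1", "chr2", ...
    hits = converter.convert_coordinate(name, int(position) - 1)  # 0-based input
    if not hits:  # None (unknown chromosome) or [] (no mapping)
        return None
    new_chromosome, new_position = hits[0][0], hits[0][1]
    return new_chromosome, new_position + 1  # back to 1-based


class StubConverter:
    """Stand-in for pyliftover.LiftOver: shifts chr10 by 100 bases, knows nothing else."""

    def convert_coordinate(self, chromosome, position):
        if chromosome != "chr10":
            return None
        return [(chromosome, position + 100, "+", 1)]


print(lift_position(StubConverter(), 10, 90127))
print(lift_position(StubConverter(), "chr1", 5))

# With the real library (not run here):
# from pyliftover import LiftOver
# converter = LiftOver("/path/hg19ToHg38.over.chain.gz")
# print(lift_position(converter, 10, 90127))
```

If positions lift correctly this way but come out as NA from the script, the difference is more likely in how the chromosome and position columns are named or formatted on input than in the chain file.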

post-processing of imputed chromosome files

Hi,

I work on Mac OS Mojave, under the 'IMLABTOOLS' conda environment with the original .yaml file (and I get the same issues with python 3.6 and numpy 1.15.4), and everything went fine so far, including performing prediXcan after quick harmonization using beta03.py. I wanted to use imputation. However, two issues prevent me from merging the 22 chromosome files generated using harmonization with gwas_parsing.py and imputed using gwas_summary_imputation.py.

Here is the head view of my harmonized gwas:

variant_id panel_variant_id chromosome position effect_allele non_effect_allele frequency pvalue zscore effect_size standard_error sample_size n_cases
rs11804171 NA chr1 788439 A T 0.057 0.09182 -1.6858743 -0.0256 0.0152 121604 121604
rs2977670 NA chr1 788511 C G 0.9428 0.0812 1.7437655 0.0268 0.0154 121604 121604
rs12138618 NA chr1 814855 A G 0.0657 0.10800000000000001 -1.607248 -0.0347 0.0216 121604 121604
rs3094315 NA chr1 817186 A G 0.7837 0.774 0.2871467 0.0013 0.0046 121604 121604

Here is the head view of my imputed summary stats (chr1):

variant_id panel_variant_id chromosome position effect_allele non_effect_allele frequency zscore variance imputation_status n n_indep most_extreme_z
rs141149254 chr1_54490_G_A_b38 chr1 54490 A G 0.17766990291262136 -0.9087151299792123 0.15151489339740898 imputed 177 177 -2.7700946
rs188486692 chr1_87021_T_C_b38 chr1 87021 C T 0.011650485436893204 -0.2811373416113958 0.14101801962612653 imputed 177 177 -2.7700946
rs112455420 chr1_263722_C_G_b38 chr1 263722 G C 0.12233009708737864 0.13789682101193423 0.18880870467012337 imputed 177 177 -2.7700946
rs144425991 chr1_594402_C_T_b38 chr1 594402 T C 0.02621359223300971 -0.1837089639958824 0.11961914002691187 imputed 177 177 -2.7700946

Using the following code for post-processing:

python ../summary-gwas-imputation-master/src/gwas_summary_imputation_postprocess.py
-gwas_file /Users/roicick/Desktop/HOT_STUFF/VGLUT_paper/draft_jan21/VGLUT_TWAS/AUD_harmo_final.txt.gz
-folder imputed/AUD_UKBB/
-pattern AUD_UKBB_imputed_chr*
-parsimony 7
--keep_criteria CHR_POS \ i used both options and also the code without this parameter
-output imputed/AUD_UKBB_imputed.txt.gz

I get two issues:

a known numpy issue, as follows: /Users/roicick/miniconda3/envs/imlabtools/lib/python3.6/site-packages/numpy/lib/arraysetops.py:522: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
mask |= (ar1 == a)

which does not occur when using the '--keep_criteria CHR_POS' parameter

INFO - Beginning process
INFO - Processing imputed AUD_UKBB_imputed_chr1
INFO - Processing imputed AUD_UKBB_imputed_chr10
INFO - Processing imputed AUD_UKBB_imputed_chr11
INFO - Processing imputed AUD_UKBB_imputed_chr12
INFO - Processing imputed AUD_UKBB_imputed_chr13
INFO - Processing imputed AUD_UKBB_imputed_chr14
INFO - Processing imputed AUD_UKBB_imputed_chr15
INFO - Processing imputed AUD_UKBB_imputed_chr16
INFO - Processing imputed AUD_UKBB_imputed_chr17
INFO - Processing imputed AUD_UKBB_imputed_chr18
INFO - Processing imputed AUD_UKBB_imputed_chr19
INFO - Processing imputed AUD_UKBB_imputed_chr2
INFO - Processing imputed AUD_UKBB_imputed_chr20
INFO - Processing imputed AUD_UKBB_imputed_chr21
INFO - Processing imputed AUD_UKBB_imputed_chr22
INFO - Processing imputed AUD_UKBB_imputed_chr3
INFO - Processing imputed AUD_UKBB_imputed_chr4
INFO - Processing imputed AUD_UKBB_imputed_chr5
INFO - Processing imputed AUD_UKBB_imputed_chr6
INFO - Processing imputed AUD_UKBB_imputed_chr7
INFO - Processing imputed AUD_UKBB_imputed_chr8
INFO - Processing imputed AUD_UKBB_imputed_chr9
INFO - Processed 663006 imputed variants
INFO - Processing GWAS file /Users/roicick/Desktop/HOT_STUFF/VGLUT_paper/draft_jan21/VGLUT_TWAS/AUD_harmo_final.txt.gz
INFO - Read 2462742 variants
INFO - Kept 2426345 variants as observed
INFO - 3089351 variants
INFO - Filling median
INFO - Sorting by chromosome-position
Traceback (most recent call last):
File "../summary-gwas-imputation-master/src/gwas_summary_imputation_postprocess.py", line 128, in <module>
run(args)
File "../summary-gwas-imputation-master/src/gwas_summary_imputation_postprocess.py", line 100, in run
process_original_gwas(args, imputed)
File "../summary-gwas-imputation-master/src/gwas_summary_imputation_postprocess.py", line 60, in process_original_gwas
g = Genomics.sort(g)
File "/Users/roicick/Desktop/DNA/summary-gwas-imputation-master/src/genomic_tools_lib/miscellaneous/Genomics.py", line 99, in sort
chr = [int(x.split("chr")[1]) if chr_re_.search(x) else None for x in d.chromosome]
File "/Users/roicick/Desktop/DNA/summary-gwas-imputation-master/src/genomic_tools_lib/miscellaneous/Genomics.py", line 99, in <listcomp>
chr = [int(x.split("chr")[1]) if chr_re_.search(x) else None for x in d.chromosome]
TypeError: expected string or bytes-like object
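The traceback shows a regular expression being applied to every value of the chromosome column. Python's re module raises "expected string or bytes-like object" when it receives a non-string, such as the float NaN pandas uses for missing entries. Whether missing chromosome values are the cause in this case is an assumption. The sketch below shows a parsing function that tolerates them; the names are hypothetical and this is not the repository's code.

```python
import re

CHROMOSOME_PATTERN = re.compile(r"^chr(\d+)$")


def chromosome_number(value):
    """Return the integer chromosome for values like "chr7", otherwise None."""
    if not isinstance(value, str):
        return None  # e.g. float NaN from a missing entry
    match = CHROMOSOME_PATTERN.match(value)
    return int(match.group(1)) if match else None


values = ["chr1", "chr10", float("nan"), "chrX", None]
print([chromosome_number(v) for v in values])  # [1, 10, None, None, None]
```

Counting the rows where this returns None is a quick way to see whether the merged table contains entries without a usable chromosome.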

Script output: gwas_summary_imputation.py

Hello,
Not really an issue, but a specific question regarding the output of gwas_summary_imputation.py script (https://github.com/hakyimlab/summary-gwas-imputation/blob/master/src/gwas_summary_imputation.py). Does this script give Zscores or additionally betas and SEs? Thank you.
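For reference, the imputed table quoted in an earlier issue on this page has zscore and variance columns but no effect_size or standard_error columns. If a p-value is needed for imputed variants, it can be derived from the z-score under a standard normal assumption. The sketch below uses only the standard library, and the function name is made up for this example.

```python
import math


def two_sided_pvalue(zscore):
    """Two-sided p-value of a z-score under the standard normal distribution."""
    return math.erfc(abs(zscore) / math.sqrt(2.0))


print(two_sided_pvalue(0.0))
print(two_sided_pvalue(-0.9087151299792123))  # a zscore from the imputed table quoted earlier
```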

Output problem with gwas_parsing.py

Hi.
I am having trouble with a liftover with gwas_parsing.py.
My input has the columns (example with first line):
CHROM POS ID REF ALT A1 A1_FREQ MACH_R2 TEST OBS_CT BETA SE Z_STAT P A2
10 90127 10:90127:C:T;rs79817489 C T T 0.0741934 0.918957 ADD 3657 -0.040217 0.1068 -0.376564 0.706497 C

I use the command:
python .../gwas_parsing.py
-gwas_file ../test.tab.gz
-output_column_map ID variant_id
-output_column_map A2 non_effect_allele
-output_column_map A1 effect_allele
-output_column_map A1_FREQ freq
-output_column_map BETA effect_size
-output_column_map P pvalue
-output_column_map SE standard_error
-output_column_map CHROM chromosome
-output_column_map POS position
-output_column_map OBS_CT sample_size
-output_order variant_id non_effect_allele effect_allele pvalue standard_error chromosome position freq sample_size effect_size
-liftover hg19ToHg38.over.chain.gz
-output test.hg38.tab

But get an output with NAs in "chromosome" and "position", e.g.:
variant_id non_effect_allele effect_allele pvalue standard_error chromosome position freq sample_size effect_size
10:90127:C:T;rs79817489 C T 0.706497 0.1068 NA NA 0.07419339999999999 3657 -0.040217

I have tried to add "chr" before chromosome number in my inputfile, but get the same results.
Do you have any suggestions?
Thank you for your help.

Best regards,
Laura.

Losing almost all variants after restricting to reference step in harmonization.

Hi,

I am using your harmonization script, and it runs without error, but I am noticing that it is filtering out almost all of the variants in my GWAS summary stats. I'm starting with ~45M and, after the "restricting to reference" step, ending up with ~400K (<1% of original data). The data is hg38, so I don't lose any variants in the liftover step.

This is the code I am using

-gwas_file step2_Ischemic_Heart_Disease.txt.gz \
-liftover hg19ToHg38.over.chain.gz \
-snp_reference_metadata variant_metadata.txt.gz METADATA \
-output_column_map ID variant_id \
-output_column_map ALLELE0 non_effect_allele \
-output_column_map ALLELE1 effect_allele \
-output_column_map BETA effect_size \
-output_column_map TEST test \
-output_column_map LOG10P pvalue \
-output_column_map CHROM chromosome \
--chromosome_format \
-output_column_map N sample_size \
-output_column_map SE standard_error \
-output_column_map INFO info \
-output_column_map CHISQ chisq \
-output_column_map EXTRA extra \
-output_column_map GENPOS position \
-output_column_map A1FREQ frequency \
-output_order variant_id panel_variant_id chromosome position effect_allele non_effect_allele frequency pvalue test effect_size chisq standard_error sample_size \
-output ./${PHENOTYPE}_ADDITIVE.txt.gz

sample_summary_stats.txt

I have also attached a sample of what my summary stats file looks like. These were generated using REGENIE V3.

Thank you!
Harry
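One detail in the command above: REGENIE's LOG10P column holds -log10 of the p-value, so mapping it straight to pvalue gives numbers outside the 0 to 1 range. A minimal conversion sketch follows, with a hypothetical function name. Whether this is related to the loss of variants is not established here.

```python
def pvalue_from_log10p(log10p):
    """Convert a -log10(p) value, as reported in a LOG10P column, to a p-value."""
    # Very large inputs (beyond roughly 323) underflow to 0.0 in double precision.
    return 10.0 ** (-log10p)


print(pvalue_from_log10p(2.0))
print(pvalue_from_log10p(7.30103))  # roughly 5e-08
```

Separately, the command passes -liftover hg19ToHg38.over.chain.gz although the data are described as hg38; that may be worth checking, since lifting coordinates that are already in hg38 could move them away from the reference positions.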

Error running run_coloc.py : dataset 1: missing required element(s) snp

Hi when I run run_coloc.py:

python summary-gwas-imputation/src/run_coloc.py -keep_intermediate_folder -gwas_mode bse -gwas test1.txt -eqtl_mode bse -eqtl test2.txt -gwas_sample_size 149461 -eqtl_sample_size 670 -p1 1e-05 -p2 1e-04 -p12 1e-06 -parsimony 1 -output test_output.txt

I got error:

INFO - Loading gwas
Level 9 - sanitizing gwas
INFO - Beggining process
Level 9 - Processing gene ENSG00000227232.5
Level 9 - sanitizing eqtl
WARNING - R[write to console]: Error in check_dataset(d = dataset1, 1) :
dataset 1: missing required element(s) snp

INFO - Exception running coloc:
Traceback (most recent call last):
File "/oak/stanford/scg/lab_lilab/jwu/vitiligo/GWAS/summary-gwas-imputation/src/genomic_tools_lib/external_tools/coloc/Coloc.py", line 169, in _coloc
c = coloc_r(dataset1=d1, dataset2=d2, p1=p1, p2=p2, p12=p12)
File "/home/jiewu23/.conda/envs/Coloc/lib/python3.7/site-packages/rpy2/robjects/functions.py", line 202, in __call__
.__call__(*args, **kwargs))
File "/home/jiewu23/.conda/envs/Coloc/lib/python3.7/site-packages/rpy2/robjects/functions.py", line 124, in __call__
res = super(Function, self).__call__(*new_args, **new_kwargs)
File "/home/jiewu23/.conda/envs/Coloc/lib/python3.7/site-packages/rpy2/rinterface_lib/conversion.py", line 45, in _
cdata = function(*args, **kwargs)
File "/home/jiewu23/.conda/envs/Coloc/lib/python3.7/site-packages/rpy2/rinterface.py", line 810, in __call__
raise embedded.RRuntimeError(_rinterface._geterrmessage())
rpy2.rinterface_lib.embedded.RRuntimeError: Error in check_dataset(d = dataset1, 1) :
dataset 1: missing required element(s) snp

(test data are based on https://github.com/hakyimlab/summary-gwas-imputation/wiki/Running-Coloc)
test1.txt:
panel_variant_id effect_size standard_error frequency sample_size
chr1_731718_T_C_b38 0.039186 0.033355 0.1336 42921
chr1_734349_T_C_b38 0.041351 0.034082 0.128868 42921
chr1_752566_G_A_b38 -0.018285 0.021551 0.845492 149758

test2.txt
gene_id variant_id tss_distance ma_samples ma_count maf pval_nominal slope slope_se
ENSG00000227232.5 chr1_13550_G_A_b38 -16003 19 19 0.0141791 0.84 0.15 0.07
ENSG00000227232.5 chr1_14671_G_C_b38 -14882 17 17 0.0126866 0.17 -0.028 0.58
ENSG00000227232.5 chr1_14677_G_A_b38 -14876 69 69 0.0514925 0.99 -0.99 0.99

Can you please help me on this?

OSError: Couldn't deserialize thrift: TProtocolException: Exceeded size limit

How may I solve this?

$ python gwas_summary_imputation.py -by_region_file $DATA/eur_ld.bed.gz -gwas_file harmonized.test.txt.gz -parquet_genotype $DATA/reference_panel_1000G/chr1.variants.parquet -parquet_genotype_metadata $DATA/reference_panel_1000G/variant_metadata.parquet -window 100000 -parsimony 7 -chromosome 1 -regularization 0.1 -frequency_filter 0.01 -sub_batches 10 -sub_batch 0 --standardise_dosages -output harmonized.test_chr1_sb0_reg0.1_ff0.01_by_region.txt.gz
INFO - Beginning process
INFO - Creating context by variant
INFO - Loading study
INFO - Loading variants' parquet file
Traceback (most recent call last):
File "/vf/users/HumanRNAProject/TYL/DNA/GWAS_compile/summary-gwas-imputation/src/gwas_summary_imputation.py", line 97, in <module>
run(args)
File "/vf/users/HumanRNAProject/TYL/DNA/GWAS_compile/summary-gwas-imputation/src/gwas_summary_imputation.py", line 60, in run
results = run_by_region(args)
File "/vf/users/HumanRNAProject/TYL/DNA/GWAS_compile/summary-gwas-imputation/src/gwas_summary_imputation.py", line 40, in run_by_region
context = SummaryImputationUtilities.context_by_region_from_args(args)
File "/vf/users/HumanRNAProject/TYL/DNA/GWAS_compile/summary-gwas-imputation/src/genomic_tools_lib/summary_imputation/Utilities.py", line 229, in context_by_region_from_args
study = load_study(args)
File "/vf/users/HumanRNAProject/TYL/DNA/GWAS_compile/summary-gwas-imputation/src/genomic_tools_lib/summary_imputation/Utilities.py", line 162, in load_study
study = Parquet.study_from_parquet(args.parquet_genotype, args.parquet_genotype_metadata, chromosome=args.chromosome)
File "/vf/users/HumanRNAProject/TYL/DNA/GWAS_compile/summary-gwas-imputation/src/genomic_tools_lib/file_formats/Parquet.py", line 218, in study_from_parquet
_v = pq.ParquetFile(variants)
File "/data/lint6/conda/lib/python3.10/site-packages/pyarrow/parquet/core.py", line 318, in __init__
self.reader.open(
File "pyarrow/_parquet.pyx", line 1470, in pyarrow._parquet.ParquetReader.open
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status

AttributeError("'pyarrow.lib.ChunkedArray' object has no attribute 'name'") when using gwas_summary_imputation.py

Hi,

Thank you for the great documentation and scripts! I am struggling to run the gwas_summary_imputation.py script after having harmonised all of my GWAS. I used the standard files you provide to compile the reference parquet files.

Example of a log file:

INFO - Beginning process
INFO - Creating context by variant
INFO - Loading study
INFO - Loading variants' parquet file
INFO - Loading variants metadata
Level 9 - Loading row group 21
INFO - Loading regions
Level 9 - Selecting target regions with specific chromosome
Level 9 - Selecting target regions from sub-batches
Level 9 - generating GWAS whitelist
INFO - Loading gwas
INFO - Acquiring filter tree for 17127 targets
INFO - Processing gwas source
Level 9 - Loaded 124 GWAS variants
Level 9 - Parsing GWAS
Level 9 - Processing region 1/3 [19924835.0, 22002927.0]
Level 8 - Roll out imputation
Level 8 - Preparing data
INFO - Error for region (22,19924835.0,22002927.0): AttributeError("'pyarrow.lib.ChunkedArray' object has no attribute 'name'")
Level 9 - Processing region 2/3 [22002927.0, 23370460.0]
Level 8 - Roll out imputation
Level 8 - Preparing data
INFO - Error for region (22,22002927.0,23370460.0): AttributeError("'pyarrow.lib.ChunkedArray' object has no attribute 'name'")
Level 9 - Processing region 3/3 [23370460.0, 24588236.0]
Level 8 - Roll out imputation
Level 8 - Preparing data
INFO - Error for region (22,23370460.0,24588236.0): AttributeError("'pyarrow.lib.ChunkedArray' object has no attribute 'name'")
INFO - Finished in 3.1158770509064198 seconds

Command used to run it:

python3 $REPO/gwas_summary_imputation.py \
-by_region_file $HOME/eur_ld.hg38.bed \
-gwas_file $DATA/${mydata}.txt.gz \
-parquet_genotype $HOME/genotype/gtex_v8_eur_filtered_maf0.01_monoallelic_variants.chr${chromosome}.variants.parquet \
-parquet_genotype_metadata $HOME/genotype/gtex_v8_eur_filtered_maf0.01_monoallelic_variants.variants_metadata.parquet \
-window 100000 \
-parsimony 7 \
-chromosome ${chromosome} \
-regularization 0.1 \
-frequency_filter 0.01 \
-sub_batches 10 \
-sub_batch ${mybatch} \
--standardise_dosages \
-output results_summary_imputation/${mydata}_chr${chromosome}_sb${mybatch}_reg0.1_ff0.01_by_region.txt.gz

I tried to narrow down where the issue might be, and I think the error is encountered at line 252 of SummaryInputation.py:
variants = _get_variants(context, ids)
I was a bit confused how get_variants gets defined and could not troubleshoot further. Any help would be appreciated!

I am running the pipeline using python 3.8.3, pyarrow 3.0.0 and numpy 1.20.1.

Failing harmonization

Hi,
I am running gwas_parsing.py to harmonize and liftover gwas summary stats like this:

python3 summary-gwas-imputation/src/gwas_parsing.py
-gwas_file all.saige.forMETAL_210617.txt.gz
-output_column_map rs.id variant_id
-output_column_map A2 non_effect_allele
-output_column_map A1 effect_allele
-output_column_map OR effect_size
-output_column_map pval pvalue
-output_column_map SE standard_error
-output_column_map chr chromosome
-output_column_map pos position
-output_column_map num sample_size
--chromosome_format
-liftover hg19ToHg38.over.chain.gz
-snp_reference_metadata saige_variants_b38_withoutchrX.txt.gz METADATA
-output all.saige.b38.harm.txt \

However, my sumstats are lifting, but without harmonizing:
INFO - Parsing input GWAS
INFO - loaded 7651549 variants
INFO - Performing liftover
INFO - 7651549 variants after liftover
INFO - Creating index to attach reference ids
INFO - Acquiring reference metadata
INFO - alligning alleles
INFO - 0 variants after restricting to reference variants
INFO - Ensuring variant uniqueness
INFO - 0 variants after ensuring uniqueness
INFO - Checking for missing frequency entries
INFO - Saving...
INFO - Finished converting GWAS in 142.95764788985252 seconds

And I get no error message.

My snp_reference meta-data looks like this:
chromosome position id allele_0 allele_1 allele_1_frequency rsid
1 1774334 chr1_1774334_A_G_b38 A G 0.948626151625909 chr1:1774334:SG
1 1774697 chr1_1774697_T_C_b38 T C 0.948619364368989 chr1:1774697:SG

And my sumstats:
id chr pos rs.id ref alt AF.alt mac num beta SE pval pval.noadj converged OR
339 chr1 693731 1:693731:A:G A G 0.106902117155606 2252 10533 -0.178771556135796 0.113841267586995 0.116331825589296 0.116331825589296 TRUE 0.836296924478001
349 chr1 705882 1:705882:G:A G A 0.0468052786480585 986 10533 -0.0471369452566297 0.167245157473858 0.778063584334615 0.778063584334615 TRUE 0.953956748792932

Thank you for your help.
Best regards,
Laura.
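One observation on the command above: it maps the OR column to effect_size, while the same file also carries a beta column. For a logistic model the effect size on the log-odds scale is the natural logarithm of the odds ratio, which can be checked against the first row of the quoted sumstats. The sketch below is only that check, with a made-up function name; it does not by itself explain the 0 variants after restricting to the reference.

```python
import math


def log_odds(odds_ratio):
    """Convert an odds ratio to an effect size on the log-odds scale."""
    return math.log(odds_ratio)


# First row of the sumstats above: OR and beta columns.
odds_ratio = 0.836296924478001
beta = -0.178771556135796
print(log_odds(odds_ratio), beta)
```

It may also be worth comparing chromosome naming between the two files quoted above: the reference metadata uses "1" while the sumstats use "chr1".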

KeyError: "['n_cases'] not in index"

Imputation error: "AttributeError: type object 'object' has no attribute 'dtype'"

Hi, I'm trying to run the imputation step using the example dataset provided (https://github.com/hakyimlab/MetaXcan/wiki/Tutorial:-GTEx-v8-MASH-models-integration-with-a-Coronary-Artery-Disease-GWAS), but am getting an error. The upstream harmonization step works. I'm using the conda environment imlabtools, as recommended.

Here's my input:

python $REPO/gwas_summary_imputation.py \
-by_region_file $DATA/eur_ld.bed.gz \
-gwas_file $DATA/harmonized_GWAS.txt.gz \
-parquet_genotype $DATA/chr22.variants.parquet \
-parquet_genotype_metadata $DATA/variant_metadata.parquet \
-window 100000 \
-parsimony 7 \
-chromosome 22 \
-regularization 0.1 \
-frequency_filter 0.01 \
-sub_batches 10 \
-sub_batch 0 \
--standardise_dosages \
-output $DATA/imputed_gwas_chr22.txt.gz

Here's the log/error that results:

INFO - Beginning process
INFO - Creating context by variant
INFO - Loading study
INFO - Loading variants' parquet file
INFO - Loading variants metadata
Level 9 - Loading row group 21
/ihome/jshaffer/eko8/.conda/envs/imlabtools/lib/python3.7/site-packages/pyarrow/pandas_compat.py:752: FutureWarning: .labels was deprecated in version 0.24.0. Use .codes instead.
  labels, = index.labels
INFO - Loading regions
Level 9 - Selecting target regions with specific chromosome
Level 9 - Selecting target regions from sub-batches
Level 9 - generating GWAS whitelist
INFO - Loading gwas
INFO - Acquiring filter tree for 35799 targets
INFO - Processing gwas source
Level 9 - Loaded 9070 GWAS variants
Level 9 - Parsing GWAS
Level 9 - Processing region 1/3 [15927607.0, 17193405.0]
Traceback (most recent call last):
  File "/indy/storage3/eko8/Diss/GWAS/GWAS_followup/test/summary-gwas-imputation/src/gwas_summary_imputation.py", line 97, in <module>
    run(args)
  File "/indy/storage3/eko8/Diss/GWAS/GWAS_followup/test/summary-gwas-imputation/src/gwas_summary_imputation.py", line 60, in run
    results = run_by_region(args)
  File "/indy/storage3/eko8/Diss/GWAS/GWAS_followup/test/summary-gwas-imputation/src/gwas_summary_imputation.py", line 46, in run_by_region
    _r = SummaryInputation.gaussian_by_region(context, region)
  File "/indy/storage3/eko8/Diss/GWAS/GWAS_followup/test/summary-gwas-imputation/src/genomic_tools_lib/summary_imputation/SummaryInputation.py", line 270, in gaussian_by_region
    results = dataframe_from_results([], [])
  File "/indy/storage3/eko8/Diss/GWAS/GWAS_followup/test/summary-gwas-imputation/src/genomic_tools_lib/summary_imputation/SummaryInputation.py", line 280, in dataframe_from_results
    d = Utilities.to_dataframe(r, list(Results._fields))
  File "/indy/storage3/eko8/Diss/GWAS/GWAS_followup/test/summary-gwas-imputation/src/genomic_tools_lib/Utilities.py", line 115, in to_dataframe
    data =pandas.DataFrame(columns=columns)
  File "/ihome/jshaffer/eko8/.conda/envs/imlabtools/lib/python3.7/site-packages/pandas/core/frame.py", line 411, in __init__
    mgr = init_dict(data, index, columns, dtype=dtype)
  File "/ihome/jshaffer/eko8/.conda/envs/imlabtools/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 242, in init_dict
    val = construct_1d_arraylike_from_scalar(np.nan, len(index), nan_dtype)
  File "/ihome/jshaffer/eko8/.conda/envs/imlabtools/lib/python3.7/site-packages/pandas/core/dtypes/cast.py", line 1221, in construct_1d_arraylike_from_scalar
    dtype = dtype.dtype
AttributeError: type object 'object' has no attribute 'dtype'

Am I doing something incorrectly, or could there be a versioning issue with some of the packages in the conda environment? Thank you in advance.

Coloc with GTEx data: WARNING - R[write to console]: Error in process.dataset(d = dataset1, suffix = " dataset df1: Length of snp names and beta vectors must match

Hi,
I am running the script below, but get this error:

WARNING - R[write to console]: Error in process.dataset(d = dataset1, suffix = "dataset df1: Length of snp names and beta vectors must match

python3 ../summary-gwas-imputation/src/run_coloc.py
-gwas_mode bse
-gwas ../metaXcan/meta_harm_200803.tab.gz
-eqtl_mode bse
-eqtl ../data/GTEx_Analysis_v8_eQTL/Heart_Atrial_Appendage.v8.egenes.txt.gz
-gwas_sample_size 32036
-eqtl_sample_size 372
-parsimony 8
-output test > /dev/null

Can you help me with this error message?
Best regards,
Laura.

AttributeError: 'DataFrame' object has no attribute 'sample_size'

this is my data format
variant_id chromosome base_pair_location effect_allele other_allele beta standard_error p_value variant_id_hg19 base_pair_location_grch38 sample_size frequency
rs144804129 chr10 100000122 A T -0.271 0.4979 0.5863 10_100000122_T_A 98240365 10000
rs6602381 chr10 10000018 G A -8e-04 0.0143 0.955 10_10000018_A_G 9958055 10000
rs11442554 chr10 100000554 AT A 0.0281 0.0217 0.1945 10_100000554_A_AT 98240797 10000
rs112832083 chr10 100000588 C T -0.5309 0.9955 0.5938 10_100000588_T_C 98240831 10000
rs7899632 chr10 100000625 G A -0.007 0.0153 0.6492 10_100000625_A_G 98240868 10000
rs61875309 chr10 100000645 C A 0.0042 0.0194 0.8272 10_100000645_A_C 98240888 10000
rs150203744 chr10 100001867 T C -0.0458 0.0893 0.6083 10_100001867_C_T 98242110 10000
rs8181398 chr10 100002399 G A -0.856 0.5069 0.09125 10_100002399_A_G 98242642 10000
rs145421501 chr10 100002418 G C -0.2714 0.4991 0.5866 10_100002418_C_G 98242661 10000

This is my code:
$python $GWAS_TOOLS/gwas_parsing.py \
    -gwas_file $DATA/GCST90132223_buildGRCh37_for_harm.txt \
    -snp_reference_metadata /home/data/t010406/TWAS/data/reference_panel_1000G/variant_metadata.txt.gz METADATA \
    -output_column_map variant_id rsid \
    -output_column_map other_allele non_effect_allele \
    -liftover /home/data/t010406/TWAS/data/liftover/hg19ToHg38.over.chain.gz \
    -output_column_map effect_allele effect_allele \
    -output_column_map p_value  pvalue \
    -output_column_map frequency frequency \
    -output_column_map chromosome chromosome \
    -output_column_map beta effect_size \
    --chromosome_format \
    -output_column_map base_pair_location position \
    -output_order  rsid panel_variant_id chromosome position effect_allele non_effect_allele frequency pvalue  effect_size standard_error  zscore  \
    -output $OUTPUT/harmonized_gwas/Multi_All_RA_harmonized.txt.gz

INFO - Parsing input GWAS
INFO - loaded 10000 variants
INFO - Performing liftover

INFO - 10000 variants after liftover
INFO - Creating index to attach reference ids
INFO - Acquiring reference metadata
INFO - alligning alleles
INFO - 0 variants after restricting to reference variants
INFO - Ensuring variant uniqueness
INFO - 0 variants after ensuring uniqueness
INFO - Checking for missing frequency entries
Traceback (most recent call last):
  File "/home/data/t010406/TWAS/summary-gwas-imputation-master/src/gwas_parsing.py", line 311, in <module>
    run(args)
  File "/home/data/t010406/TWAS/summary-gwas-imputation-master/src/gwas_parsing.py", line 283, in run
    d = clean_up(d)
  File "/home/data/t010406/TWAS/summary-gwas-imputation-master/src/gwas_parsing.py", line 243, in clean_up
    d = d.assign(sample_size=[int(x) if not math.isnan(x) else "NA" for x in d.sample_size])
  File "/home/data/t010406/miniconda3/envs/imlabtools/lib/python3.7/site-packages/pandas/core/generic.py", line 5179, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'sample_size'

May I ask why this error happened?
thanks,
jack
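
The traceback above ends in `clean_up`, which reads `d.sample_size` as an attribute; pandas raises `AttributeError` when the DataFrame has no column with that name. A minimal sketch of a more defensive version of that single step follows. It assumes nothing about the rest of `gwas_parsing.py`, and the function name `normalise_sample_size` is hypothetical, not part of the repository.

```python
import pandas as pd


def normalise_sample_size(d: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of d whose sample_size column holds ints or "NA".

    Hypothetical helper: it mirrors the list comprehension shown in the
    traceback, but tolerates a DataFrame that has no sample_size column.
    """
    if "sample_size" not in d.columns:
        # The column never reached this step (for example, it was not mapped
        # from the input file), so fill it with "NA" instead of failing.
        return d.assign(sample_size="NA")
    return d.assign(
        sample_size=["NA" if pd.isna(x) else int(x) for x in d["sample_size"]]
    )


if __name__ == "__main__":
    without_column = pd.DataFrame({"rsid": ["rs144804129", "rs6602381"]})
    print(normalise_sample_size(without_column))
```

Whether filling with "NA" is acceptable depends on what the downstream tools expect. The log above also reports 0 variants after restricting to reference variants, so the empty result is probably worth investigating before the missing column.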

ValueError: invalid literal for int() with base 10: '1_KI270766v1_alt'

I am trying to harmonize a UK Biobank GWAS (on hg19) for use with S-PrediXcan. However, after the liftover step, gwas_parsing.py crashes because some of my variants have been lifted over to alternative chromosomal assemblies. It's not clear to me how I can know a priori which ones will be problematic (unlike X chromosome variants). Any pointers on how to overcome this?

INFO - Parsing input GWAS
INFO - loaded 19400443 variants
INFO - Performing liftover
INFO - 19400443 variants after liftover
Traceback (most recent call last):
  File "/mnt/storage/bioinformatics/summary-gwas-imputation/src/gwas_parsing.py", line 311, in <module>
    run(args)
  File "/mnt/storage/bioinformatics/summary-gwas-imputation/src/gwas_parsing.py", line 283, in run
    d = clean_up(d)
  File "/mnt/storage/bioinformatics/summary-gwas-imputation/src/gwas_parsing.py", line 245, in clean_up
    d = Genomics.sort(d)
  File "/mnt/storage/bioinformatics/summary-gwas-imputation/src/genomic_tools_lib/miscellaneous/Genomics.py", line 94, in sort
    chr = [int(x.split("chr")[1]) if "chr" in x else None for x in d.chromosome]
  File "/mnt/storage/bioinformatics/summary-gwas-imputation/src/genomic_tools_lib/miscellaneous/Genomics.py", line 94, in <listcomp>
    chr = [int(x.split("chr")[1]) if "chr" in x else None for x in d.chromosome]
ValueError: invalid literal for int() with base 10: '1_KI270766v1_alt'
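
The failing line converts whatever follows `chr` to an integer, so a lifted-over contig such as `chr1_KI270766v1_alt` (and likewise `chrX`) raises `ValueError`. One possible workaround, sketched under the assumption that only numbered autosomes are wanted downstream, is to identify non-canonical contigs by name right after liftover and drop them before sorting. The helper names below are hypothetical and not part of the repository.

```python
import re

# Matches only plain numbered chromosomes such as "chr1" or "chr22".
# Names like "chr1_KI270766v1_alt", "chrX" or "chrUn_..." do not match.
CANONICAL = re.compile(r"^chr(\d+)$")


def chromosome_number(name):
    """Return the chromosome as an int, or None for non-canonical contigs."""
    match = CANONICAL.match(name)
    return int(match.group(1)) if match else None


def keep_canonical(chromosomes):
    """Split row indices into (kept, dropped) according to the contig name."""
    kept, dropped = [], []
    for i, name in enumerate(chromosomes):
        (kept if CANONICAL.match(name) else dropped).append(i)
    return kept, dropped


if __name__ == "__main__":
    names = ["chr1", "chr1_KI270766v1_alt", "chr22", "chrX"]
    kept, dropped = keep_canonical(names)
    print("kept rows:", kept)
    print("dropped rows:", dropped)
```

With a pandas DataFrame, the `kept` indices could be passed to `d.iloc[kept]` before the sort. This way the problematic variants do not need to be known in advance: they are recognised from the contig name that liftover assigns.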

Report "OSError: Invalid flatbuffers message." when reading parquet files.

Hi!
I am having trouble running the GWAS imputation.
First, I generated the parquet files myself (I'm currently working on cattle), using the command below:

python $REPO/model_training_genotype_to_parquet.py \
    -input_genotype_file $DATA/parquet_run7/Chr${chr}-Run7-TAU-Beagle-toDistribute.txt.gz \
    -snp_annotation_file $DATA/parquet_run7/Chr${chr}_maf0.01_monoallelic_variants.txt.gz METADATA \
    -parsimony 9 \
    --impute_to_mean \
    --split_by_chromosome \
    --only_in_key \
    -rsid_column rsid \
    -output_prefix $DATA/parquet_run7/Chr${chr}_maf0.01_monoallelic_variants

It reported this error:
ValueError: Table schema does not match schema used to create file:
table:
chromosome: int64
position: int64
id: null
allele_0: null
allele_1: null
allele_1_frequency: double
rsid: null vs.
file:
chromosome: int64
position: int64
id: string
allele_0: string
allele_1: string
allele_1_frequency: double
rsid: string

I still got the parquet file for the variants, but not for the metadata, so I generated the metadata parquet file myself.

Then I ran the GWAS imputation using the command below:

python $GWAS_TOOLS/gwas_summary_imputation.py \
    -gwas_file $OUTPUT/harmonized_gwas/HM2_${trait}.* \
    -by_region_file ${LD}/LD_blocks.txt.gz \
    -parquet_genotype ${Reference}/Chr${chr}_ARS_UCD1.2_maf0.01_monoallelic_variants.chr${chr}.variants.parquet \
    -parquet_genotype_metadata ${Reference}/variant_metadata.parquet \
    -window 100000 \
    -parsimony 7 \
    -chromosome ${chr} \
    -regularization 0.1 \
    -frequency_filter 0.01 \
    -sub_batches 10 \
    -sub_batch 0 \
    --standardise_dosages \
    -output $OUTPUT/summary_imputation/${trait}.chr${chr}.txt.gz

I got this error:
  File "/bin/summary-gwas-imputation/src/gwas_summary_imputation.py", line 97, in <module>
    run(args)
  File "/bin/summary-gwas-imputation/src/gwas_summary_imputation.py", line 62, in run
    results = run_by_variant(args)
  File "/bin/summary-gwas-imputation/src/gwas_summary_imputation.py", line 24, in run_by_variant
    context = SummaryImputationUtilities.context_from_args(args)
  File "/bin/summary-gwas-imputation/src/genomic_tools_lib/summary_imputation/Utilities.py", line 174, in context_from_args
    study = load_study(args)
  File "/bin/summary-gwas-imputation/src/genomic_tools_lib/summary_imputation/Utilities.py", line 162, in load_study
    study = Parquet.study_from_parquet(args.parquet_genotype, args.parquet_genotype_metadata, chromosome=args.chromosome)
  File "/bin/summary-gwas-imputation/src/genomic_tools_lib/file_formats/Parquet.py", line 218, in study_from_parquet
    _v = pq.ParquetFile(variants)
  File "/home/anaconda3/envs/imlabtools/lib/python3.7/site-packages/pyarrow/parquet.py", line 137, in __init__
    read_dictionary=read_dictionary, metadata=metadata)
  File "pyarrow/_parquet.pyx", line 1048, in pyarrow._parquet.ParquetReader.open
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: Invalid flatbuffers message.

Could you help me figure out what is wrong with my script?

Thank you very much for your help.

Best Regards,

Shuli
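
Since the conversion script stopped with an error before finishing, one cheap thing to rule out is that the variants file it left behind is incomplete. A Parquet file starts and ends with the 4-byte magic `PAR1`. The sketch below checks only that, using the standard library; the function name `has_parquet_magic` is hypothetical. It cannot diagnose the flatbuffers error itself, and a file that passes may still be unreadable for other reasons.

```python
import os

PARQUET_MAGIC = b"PAR1"


def has_parquet_magic(path):
    """Return True if the file starts and ends with the Parquet magic bytes.

    A quick sanity check only: a file that was not closed properly usually
    lacks the trailing magic, but passing this check does not prove the
    footer metadata is readable.
    """
    # Smallest plausible file: leading magic, 4-byte footer length, trailing magic.
    if os.path.getsize(path) < 12:
        return False
    with open(path, "rb") as handle:
        head = handle.read(4)
        handle.seek(-4, os.SEEK_END)
        tail = handle.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC


if __name__ == "__main__":
    import sys

    for file_path in sys.argv[1:]:
        print(file_path, has_parquet_magic(file_path))
```

Running it on both the variants file and the hand-made metadata file shows whether either one was truncated before looking further.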

Mismatched order of imputation column and header

The order of the imputation result columns does not match the expected header in SummaryInputation.py.

The order of effect_allele and non_effect_allele differs in: