jakelever / pgxmine

Text mining for pharmacogenomic associations for PharmGKB

License: MIT License

Python 26.33% Shell 2.01% R 20.55% HTML 0.13% TeX 50.97%

text-mining pharmacogenomics biocuration pharmgkb

pgxmine's Introduction

PGxMine

This is the codebase for the PGxMine project, which uses text mining to identify papers for curation into PharmGKB. It is a Python 3 project that uses the Kindred relation classifier along with the BioText project, which manages the download of PubMed/PMC and alignment with PubTator.

Viewing the Data

The data can be viewed through the Shiny app. It can be downloaded as TSV files at Zenodo.

To run a local instance of the PGxMine viewer, use the R Shiny code in shiny/, which also contains the installation instructions.

Software Dependencies

This project depends on Kindred, scispacy and snakemake, which can be installed with:

pip install -r requirements.txt

Data Dependencies

This project uses a variety of data sources.

A few need to be downloaded as described below. DrugBank is the exception and must be downloaded manually.

MeSH (only needed to update the drug list)
DrugBank (download manually, as an account is required, and name the file drugbank.xml)
PharmGKB (used for constructing the drug list and comparisons)

The prepareData.sh script downloads some of the data dependencies and runs some preprocessing to extract necessary data (such as gene name mappings). The commands that it runs are detailed below.

year=`date +"%Y"`

# Download MeSH, dbSNP, Entrez Gene metadata and pharmGKB drug info
sh downloadDataDependencies.sh

# Extract the gene names associated with rsIDs from dbSNP
python linkRSIDToGeneName.py --dbsnp <(zcat data/GCF_000001405.*.gz) --pubtator <(zcat data/bioconcepts2pubtatorcentral.gz) --outFile data/dbsnp_selected.tsv

# Create the drug list with mappings from MeSH IDs to PharmGKB IDs (with some filtering using DrugBank categories)
python createDrugList.py --meshC data/c$year.bin --meshD data/d$year.bin --drugbank drugbank.xml --pharmgkb data/drugs.tsv --outFile data/selected_chemicals.json

# Extract a mapping from Entrez Gene ID to name
zgrep -P "^9606\t" data/gene_info.gz | cut -f 2,3,10 -d $'\t' > data/gene_names.tsv

# Unzip the annotated training data of pharmacogenomics relations
gunzip -c annotations.variant_other.bioc.xml.gz > data/annotations.variant_other.bioc.xml
gunzip -c annotations.variant_star_rs.bioc.xml.gz > data/annotations.variant_star_rs.bioc.xml
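
The Entrez Gene step above (`zgrep -P ... | cut -f 2,3,10`) can also be written in Python, which may help on systems where `grep -P` is not supported. This is a minimal sketch and not part of the repository: the function name `extract_human_gene_names` is hypothetical, and it assumes `gene_info.gz` is tab-delimited with the taxonomy ID in the first column. Unlike `cut`, it skips rows that have fewer than ten columns.

```python
import gzip


def extract_human_gene_names(gene_info_path, out_path, tax_id="9606"):
    """Keep rows whose first column equals tax_id and write columns 2, 3 and 10.

    Returns the number of rows written.
    """
    kept = 0
    with gzip.open(gene_info_path, "rt", encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            fields = line.rstrip("\n").split("\t")
            # Skip other species and rows too short to have a tenth column.
            if fields[0] != tax_id or len(fields) < 10:
                continue
            dst.write("\t".join((fields[1], fields[2], fields[9])) + "\n")
            kept += 1
    return kept
```

A call such as `extract_human_gene_names("data/gene_info.gz", "data/gene_names.tsv")` would mirror the paths used by the shell command.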

Example Run

There is an example input file in the test_data directory which contains a PubMed abstract in BioC format. The run_example.sh script performs a full run that extracts chemical/variant associations, with comments explaining each step. The final output is three files: mini_unfiltered.tsv, mini_collated.tsv and mini_sentences.tsv. This is equivalent to the test run with snakemake shown below.

Running with Snakemake

To run a small example of the pipeline using snakemake, run the command below. This runs on the data in the test_data directory. It is equivalent to the commands in the run_example.sh script which provides some comments on what each step does. Snakemake is useful for running on the larger datasets with the full run commands further down.

MODE=test snakemake --cores 1

To do a full run, you need to set up a local instance of BioText with the biocxml format. The command below will run Snakemake on the BioText corpus. You must change BIOTEXT to point to the biocxml directory in your local instance of BioText. The run will take a while, so running on a cluster using snakemake's cluster support is recommended.

MODE=full BIOTEXT=/path/to/biotext/biocxml snakemake --cores 1
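
The two commands above differ only in the `MODE` and `BIOTEXT` environment variables. The sketch below shows one way such variables could be validated before a run. It is an assumption about the logic, not a copy of the project's Snakefile, and the function name `resolve_corpus_dir` is hypothetical.

```python
def resolve_corpus_dir(env):
    """Pick the input directory from a mapping of environment variables.

    Hypothetical helper: MODE=test reads test_data, MODE=full reads the
    directory named by BIOTEXT.
    """
    mode = env.get("MODE")
    if mode == "test":
        return "test_data"
    if mode == "full":
        biotext = env.get("BIOTEXT")
        if not biotext:
            raise ValueError("MODE=full requires BIOTEXT to point to the biocxml directory")
        return biotext
    raise ValueError("MODE must be 'test' or 'full', got %r" % (mode,))
```

Passing `os.environ` as `env` would check the variables of the current shell.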

Script Overview

Here is a summary of the main script files. The Snakefile manages the execution of these in the correct ordering.

Main scripts

findPGxSentences.py: Identify star alleles then find sentences that mention a chemical and variant
getRelevantMeSH.py: Extracts MeSH terms related to age groups that are used by additional analysis
createKB.py: Train and apply a relation classifier to extract pharmacogenomic chemical/variant associations
filterAndCollate.py: Filter the results to reduce false positives and collate the associations
utils/__init__.py: Large helper functions for variant normalization and outputting the formatted sentences
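
The order in which the main scripts run can be sketched as a list of command lines. This is an illustration only: the flags are taken from the commands quoted in the "Input files for pgxmine" issue on this page and may not match the current scripts, the function name `build_pipeline_commands` and its `stem` parameter are hypothetical, and the Snakefile remains the authoritative definition of the pipeline. The intermediate file names follow the ones used in that issue, since filterAndCollate.py appears to discover its inputs by scanning a directory.

```python
import os


def build_pipeline_commands(in_bioc, work_dir, stem="pubmed_test"):
    """Return the four main steps as argv lists, in execution order."""
    sentences = os.path.join(work_dir, stem + ".sentences.bioc.xml")
    mesh = os.path.join(work_dir, stem + ".mesh.json.gz")
    kb = os.path.join(work_dir, stem + ".kb.tsv")
    training = ("data/annotations.variant_star_rs.bioc.xml,"
                "data/annotations.variant_other.bioc.xml")
    return [
        # 1. Find sentences that mention a chemical and a variant.
        ["python", "findPGxSentences.py", "--inBioc", in_bioc,
         "--filterTermsFile", "pgx_filter_terms.txt", "--outBioc", sentences],
        # 2. Collect age-group MeSH terms for the same documents.
        ["python", "getRelevantMeSH.py", "--inBioc", in_bioc,
         "--outJSONGZ", mesh],
        # 3. Train the classifier and extract associations.
        ["python", "createKB.py", "--trainingFiles", training,
         "--inBioC", sentences,
         "--selectedChemicals", "data/selected_chemicals.json",
         "--dbsnp", "data/dbsnp_selected.tsv",
         "--variantStopwords", "stopword_variants.txt",
         "--genes", "data/gene_names.tsv",
         "--relevantMeSH", mesh, "--outKB", kb],
        # 4. Filter and collate everything found in the working directory.
        ["python", "filterAndCollate.py", "--inData", work_dir,
         "--outUnfiltered", os.path.join(work_dir, "mini_unfiltered.tsv"),
         "--outCollated", os.path.join(work_dir, "mini_collated.tsv"),
         "--outSentences", os.path.join(work_dir, "mini_sentences.tsv")],
    ]
```

Each entry could be passed to `subprocess.run(cmd, check=True)` so that a failing step stops the run.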

Other scripts

createDrugList.py: Creates the list of drugs and drug mappings from MeSH IDs to PharmGKB IDs with some filtering by categories
linkRSIDToGeneName.py: Extracts gene names from dbSNP associated with rsIDs
linkStarToRSID.py: Some rudimentary text mining to link star alleles with a specific rsID
prepareForAnnotation.py: Select sentences and output to the standoff format to be annotated
prCurve.py: Calculate PR curves for the classifiers

Paper

The paper can be recompiled from the dataset using Bookdown. All text and code for stats/figures are in the paper/ directory.

Supplementary Materials

Supplementary materials for the manuscript are found in supplementaryMaterials/.

pgxmine's Issues

Links in downloadDataDependencies.sh are broken

Apparently some links in downloadDataDependencies.sh are broken:

$ wget ftp://ftp.ncbi.nih.gov/snp/latest_release/VCF/GCF_000001405.39.gz
--2022-10-23 22:19:37--  ftp://ftp.ncbi.nih.gov/snp/latest_release/VCF/GCF_000001405.39.gz
           => ‘GCF_000001405.39.gz’
Resolving ftp.ncbi.nih.gov (ftp.ncbi.nih.gov)... failed: Temporary failure in name resolution.
wget: unable to resolve host address ‘ftp.ncbi.nih.gov’

wget ftp://ftp.ncbi.nih.gov/gene/DATA/gene_info.gz
--2022-10-23 22:20:54--  ftp://ftp.ncbi.nih.gov/gene/DATA/gene_info.gz
           => ‘gene_info.gz’
Resolving ftp.ncbi.nih.gov (ftp.ncbi.nih.gov)... failed: Temporary failure in name resolution.
wget: unable to resolve host address ‘ftp.ncbi.nih.gov’
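
The wget output above reports a name-resolution failure rather than a missing file, so it may be worth checking whether the host resolves from the machine in question before concluding that the links are broken. The following is a minimal sketch using only the Python standard library. The function name `host_resolves` is hypothetical, and the `resolver` parameter exists only so the check can be exercised without network access.

```python
import socket
from urllib.parse import urlparse


def host_resolves(url, resolver=socket.getaddrinfo):
    """Return True if the host name in url can be resolved, False otherwise."""
    host = urlparse(url).hostname
    try:
        resolver(host, None)
        return True
    except socket.gaierror:
        # Raised for failures such as "Temporary failure in name resolution".
        return False
```

If `host_resolves("ftp://ftp.ncbi.nih.gov/gene/DATA/gene_info.gz")` returns False, the problem is likely local DNS or network configuration rather than the URL path.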

Input files for pgxmine

@jakelever could you please elaborate on how to prepare input files for pgxmine? There is a pubmed_26736037.bioc.xml file in the "example" folder, but it is not clear how you obtained it. I tried to use the BioText project, but its output doesn't contain that file. What am I doing wrong?

$ snakemake --cores 1 downloaded.flag
$ snakemake --cores 1 converted.flag
$ snakemake --cores 1 pubtator_downloaded.flag
$ snakemake --cores 1 pubtator.flag

$ cd biocxml
$ grep '26736037' *.xml #nothing
$ cd ../pubtator
$ grep '26736037' *.xml #nothing
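
A plain grep matches the number anywhere in a file. A stricter check is to look only at document identifiers, as in the sketch below. It assumes the usual BioC XML layout in which each `<document>` element has an `<id>` child, and the function name `find_document_ids` is hypothetical.

```python
import xml.etree.ElementTree as ET


def find_document_ids(bioc_path, wanted_ids):
    """Return the subset of wanted_ids that appear as <document><id> values."""
    wanted = set(wanted_ids)
    found = set()
    # Stream the file so that large BioC collections are not loaded at once.
    for _event, elem in ET.iterparse(bioc_path, events=("end",)):
        if elem.tag == "document":
            doc_id = elem.findtext("id")
            if doc_id in wanted:
                found.add(doc_id)
            elem.clear()  # free memory held by the finished document
    return found
```

Running it over every file in the biocxml directory would show whether a given PMID is present in the local corpus at all.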

I'm trying to run pgxmine with some of the files output by BioText, but the result is empty:

$ ls -l example1
pubmed_test.bioc.xml -> ../../biotext/pubtator/pmc_baseline.oa_comm_xml.PMC008xxxxxx.baseline.2022-09-03_36.bioc.xml

$ python findPGxSentences.py --inBioc example1/pubmed_test.bioc.xml \
    --filterTermsFile pgx_filter_terms.txt \
    --outBioc example1/pubmed_test.sentences.bioc.xml

$ python getRelevantMeSH.py --inBioc example1/pubmed_test.bioc.xml \
    --outJSONGZ example1/pubmed_test.mesh.json.gz

$ python createKB.py \
    --trainingFiles data/annotations.variant_star_rs.bioc.xml,data/annotations.variant_other.bioc.xml \
    --inBioC example1/pubmed_test.sentences.bioc.xml \
    --selectedChemicals data/selected_chemicals.json \
    --dbsnp data/dbsnp_selected.tsv \
    --variantStopwords stopword_variants.txt \
    --genes data/gene_names.tsv \
    --relevantMeSH example1/pubmed_test.mesh.json.gz  \
    --outKB example1/pubmed_test.kb.tsv

$ python filterAndCollate.py \
    --inData example1 \
    --outUnfiltered example1/mini_unfiltered.tsv \
    --outCollated example1/mini_collated.tsv \
    --outSentences example1/mini_sentences.tsv

Output:

+ python findPGxSentences.py --inBioc example1/pubmed_test.bioc.xml --filterTermsFile pgx_filter_terms.txt --outBioc example1/pubmed_test.sentences.bioc.xml
Found 0 candidate sentences
+ python getRelevantMeSH.py --inBioc example1/pubmed_test.bioc.xml --outJSONGZ example1/pubmed_test.mesh.json.gz
Loaded PMIDs from corpus file...
Searching for MeSH terms in:  ['Adolescent', 'Adult', 'Aged', 'Birth Cohort', 'Child', 'Child, Preschool', 'Infant', 'Infant, Newborn', 'Middle Aged', 'Pediatrics', 'Young Adult']

Found 0 PubMed ID(s) with relevant MeSH terms
+ python createKB.py --trainingFiles data/annotations.variant_star_rs.bioc.xml,data/annotations.variant_other.bioc.xml --inBioC example1/pubmed_test.sentences.bioc.xml --selectedChemicals data/selected_chemicals.json --dbsnp data/dbsnp_selected.tsv --variantStopwords stopword_variants.txt --genes data/gene_names.tsv --relevantMeSH example1/pubmed_test.mesh.json.gz --outKB example1/pubmed_test.kb.tsv
Loaded chemical, gene and variant data
Loaded mesh PMIDs for pediatric/adult terms
Creating classifier for star_rs
Predicted 0 association(s) for star_rs variants
Creating classifier for other
Predicted 0 association(s) for other variants
+ python filterAndCollate.py --inData example1 --outUnfiltered example1/mini_unfiltered.tsv --outCollated example1/mini_collated.tsv --outSentences example1/mini_sentences.tsv
Found 1 PubMed files
Found 0 PMC files
0 records filtered to 0 sentences and collated to 0 chemical/variant associations
Written to example1/mini_sentences.tsv and example1/mini_collated.tsv

run_example.sh is outdated

createKB.py is called without --relevantMeSH:

createKB.py: error: the following arguments are required: --relevantMeSH
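
This message has the form that Python's argparse prints when an option declared as required is missing from the command line. The sketch below reproduces the behaviour under the assumption that createKB.py declares `--relevantMeSH` with `required=True`. It is not copied from the repository, and only two of the script's options are shown.

```python
import argparse


def build_parser():
    """Hypothetical, reduced stand-in for the createKB.py argument parser."""
    parser = argparse.ArgumentParser(prog="createKB.py")
    parser.add_argument("--relevantMeSH", required=True,
                        help="Gzipped JSON file of age-group MeSH terms")
    parser.add_argument("--outKB", required=True,
                        help="Output TSV of extracted associations")
    return parser
```

Calling `build_parser().parse_args(["--outKB", "kb.tsv"])` exits with status 2 and reports the missing `--relevantMeSH` argument, so a script that predates the option has to be updated to pass it.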

Mismatch in entity annotation

When I run run_example.sh I'm getting the following error:

$ ./prepareData.sh
$ ./run_example.sh
+ rm -f example/aligned.bioc.xml example/sentences.bioc.xml example/kb.tsv example/mini_unfiltered.tsv example/mini_collated.tsv example/mini_sentences.tsv
+ python align.py --inBioc example/input.bioc.xml --annotations /dev/fd/63 --outBioc example/aligned.bioc.xml
++ zcat data/bioconcepts2pubtatorcentral.gz
2022-09-28 14:18:27 0 26736037
Done!
+ python findPGxSentences.py --inBioc example/aligned.bioc.xml --filterTermsFile pgx_filter_terms.txt --outBioc example/sentences.bioc.xml
Traceback (most recent call last):
  File "findPGxSentences.py", line 65, in <module>
    for corpus in kindred.iterLoad('biocxml',args.inBioc):
  File "/home/natalia/git/kindred/kindred/loadFunctions.py", line 387, in iterLoad
    kindredDocs = convertBiocDocToKindredDocs(document)
  File "/home/natalia/git/kindred/kindred/loadFunctions.py", line 311, in convertBiocDocToKindredDocs
    assert entityText == a.text, "Mismatch in entity annotation between expected text (%s) and extracted text (%s) using offset info for passage with text: %s" % (a.text, entityText, text)
AssertionError: Mismatch in entity annotation between expected text (methamphetamine) and extracted text (hetamine use as) using offset info for passage with text: Although stimulant dependence is highly heritable, few studies have examined genetic influences on methamphetamine dependence. We performed a candidate gene study of 52 SNPs and pretreatment methamphetamine use frequency among 263 methamphetamine dependent Hispanic and Non-Hispanic White participants of several methamphetamine outpatient clinical trials in Los Angeles. One SNP, rs7591784 was significantly associated with pretreatment methamphetamine use frequency following Bonferroni correction (p < 0.001) in males but not females. We then examined rs7591784 and methamphetamine urine drug screen results during 12 weeks of outpatient treatment among males with treatment outcome data available (N = 94) and found rs7591784 was significantly associated with methamphetamine use during treatment controlling for pretreatment methamphetamine use. rs7591784 is near CREB1 and in a linkage disequilibrium block with rs2952768, previously shown to influence CREB1 expression. The CREB signaling pathway is involved in gene expression changes related to chronic use of multiple drugs of abuse including methamphetamine and these results suggest that variability in CREB signaling may influence pretreatment frequency of methamphetamine use as well as outcomes of outpatient treatment. Medications targeting the CREB pathway, including phosphodiesterase inhibitors, warrant investigation as pharmacotherapies for methamphetamine use disorders.

Am I missing something?
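
The assertion in the traceback compares each annotation's text with the substring found at its offset in the passage. The check can be reproduced in isolation as sketched below. The function name `check_annotation` is hypothetical and the code is a simplified illustration, not Kindred's implementation. The cause of the shifted offsets is not established here.

```python
def check_annotation(passage_text, passage_offset, ann_offset, ann_length, ann_text):
    """Return (matches, extracted) for one annotation.

    Offsets are assumed to be document-level character positions, so the
    passage offset is subtracted to index into the passage text.
    """
    start = ann_offset - passage_offset
    extracted = passage_text[start:start + ann_length]
    return extracted == ann_text, extracted


passage = "frequency of methamphetamine use as well as outcomes"
correct = passage.find("methamphetamine")

# A correct offset recovers the annotated text.
print(check_annotation(passage, 0, correct, 15, "methamphetamine"))
# An offset that is 7 characters too large yields the mismatch from the report.
print(check_annotation(passage, 0, correct + 7, 15, "methamphetamine"))
```

Running the same check over every annotation in example/aligned.bioc.xml would show whether all offsets in the passage are shifted or only some of them.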