vanboefer / cui_grouping

Group biomedical records from different sources (PubMed, CTgov, EMA drug register) based on the drug and disease entities mentioned in them.

License: GNU General Public License v3.0

CUI_grouping

The scripts in this repo are used to group biomedical records from different sources based on the drug and disease entities mentioned in them.

The method extracts and normalizes drug and disease names from the free text of various documents. It then groups together records that mention the same diseases and drugs.

The three data sources from which records are grouped are PubMed, CTgov and the EMA drug register.

A sample of 100 CTgov records, 100 PubMed records and 72 EMA records can be found in the data folder.

For details on the bigger experiment and its results, see here.

Step 1: BERN annotation

BERN is a state-of-the-art biomedical named entity recognition tool, which detects entities of types disease, drug, gene, species and mutation. Please refer to the paper by Kim et al. (2019) for further information.

The bernotate.py script creates a json file per record with all the BERN-detected biomedical entities.
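
As a rough illustration of the "one json file per record" layout, here is a minimal sketch. It is not the code of bernotate.py: the function name write_annotations, the (record_id, text) input shape and the annotate callable are all hypothetical, and the call to BERN itself is left to whatever client you pass in.

```python
import json
from pathlib import Path
from typing import Callable, Dict, Iterable, Tuple


def write_annotations(
    records: Iterable[Tuple[str, str]],
    annotate: Callable[[str], Dict],
    out_dir: Path,
) -> int:
    """Write one JSON file per (record_id, text) pair and return the file count."""
    out_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    for record_id, text in records:
        # `annotate` is a hypothetical stand-in for a BERN client call.
        result = annotate(text)
        out_path = out_dir / f"{record_id}.json"
        out_path.write_text(json.dumps(result), encoding="utf-8")
        count += 1
    return count
```

Passing the annotator in as a callable keeps the file-writing logic testable without network access.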

Step 2: Parse BERN json files

The bernparse.py script parses the json files created in Step 1 and stores the detected entities in pickled dataframes. The processing is done in batches of 10,000 json files.
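
The batching and flattening described above could be sketched as follows. This is an illustration rather than the code of bernparse.py: the helper names are hypothetical, the json key names ("text", "denotations", "obj", "span") are an assumption about the annotation layout, and the rows are kept as plain dictionaries instead of being written to pickled dataframes.

```python
import json
from itertools import islice
from pathlib import Path
from typing import Dict, Iterable, Iterator, List


def batched(paths: Iterable[Path], size: int = 10_000) -> Iterator[List[Path]]:
    """Yield consecutive lists of at most `size` paths."""
    iterator = iter(paths)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def parse_file(path: Path) -> List[Dict]:
    """Flatten one annotation file into one row per detected entity."""
    doc = json.loads(path.read_text(encoding="utf-8"))
    text = doc.get("text", "")
    rows = []
    # Key names below are assumed, not taken from the repository.
    for denotation in doc.get("denotations", []):
        span = denotation.get("span", {})
        begin, end = span.get("begin", 0), span.get("end", 0)
        rows.append(
            {
                "record": path.stem,
                "type": denotation.get("obj"),
                "mention": text[begin:end],
            }
        )
    return rows
```

Each batch of rows could then be turned into a dataframe and pickled, so that memory use stays bounded by the batch size.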

Step 3: Map BERN entities to UMLS CUIs

The UMLS metathesaurus maps synonymous medical terms to a concept unique identifier (CUI); for example, the disease terms Hb-SS disease, Herrick syndrome and sickle cell anemia (which all refer to the same disease) are mapped to the CUI C0002895. The tool used for mapping is QuickUMLS; please refer to the paper by Soldaini and Goharian (2016) for further information.

The get_cuis.py script processes the pickled dataframes created in Step 2. The disease and drug entities in the dataframes are mapped to CUIs. The results are stored in pickled dataframes.
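
To make the idea of CUI normalization concrete, here is a toy sketch that uses only the sickle cell example from above. It does not call QuickUMLS and is not the code of get_cuis.py: the table and function names are hypothetical, and the lookup is an exact match on lowercased text, whereas QuickUMLS performs approximate matching against the full UMLS data.

```python
from typing import Dict, List, Optional

# Toy synonym table built from the example in the text; a real run would
# query the licensed UMLS data instead.
TOY_CUI_TABLE: Dict[str, str] = {
    "hb-ss disease": "C0002895",
    "herrick syndrome": "C0002895",
    "sickle cell anemia": "C0002895",
}


def normalize(term: str) -> str:
    """Lowercase a term and collapse runs of whitespace."""
    return " ".join(term.lower().split())


def lookup_cui(term: str, table: Dict[str, str] = TOY_CUI_TABLE) -> Optional[str]:
    """Return the CUI for a term, or None if the term is not in the table."""
    return table.get(normalize(term))


def map_entities(mentions: List[str]) -> List[str]:
    """Return the sorted unique CUIs of all mentions found in the table."""
    cuis = set()
    for mention in mentions:
        cui = lookup_cui(mention)
        if cui is not None:
            cuis.add(cui)
    return sorted(cuis)
```

Because all three synonyms collapse to one CUI, two records that use different names for the same disease end up with the same identifier.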

NOTE: To run QuickUMLS, you will need to obtain a license from the National Library of Medicine and download the UMLS files. See instructions and links here. The script assumes that a folder called quickUMLS_eng, which contains the UMLS data for English, is located in the data folder.

Step 4: Grouping

The groupings.py script groups the records based on their disease and drug CUIs, using a similarity measure and a threshold (both set through the script's parameters; see the script).

The results of grouping the sample data using cosine similarity and a distance threshold of 0.4 (i.e. similarity threshold 0.6) can be found in data/groupings; these results are examined in the notebook results_explore.ipynb.
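
The relation between the distance threshold and the similarity threshold (distance = 1 - similarity) can be illustrated with a small sketch. It is not the algorithm of groupings.py: it assumes that each record is reduced to a set of CUIs treated as a binary vector, and it links records by simple single-linkage (any pair within the threshold ends up in the same group), which is an assumption made for illustration. The function names and record ids are hypothetical.

```python
import math
from typing import Dict, List, Set


def cosine_similarity(a: Set[str], b: Set[str]) -> float:
    """Cosine similarity of two CUI sets viewed as binary vectors."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def group_records(
    records: Dict[str, Set[str]], max_distance: float = 0.4
) -> List[List[str]]:
    """Link records whose cosine distance is <= max_distance (single linkage)."""
    ids = sorted(records)
    parent = {record_id: record_id for record_id in ids}

    def find(x: str) -> str:
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, left in enumerate(ids):
        for right in ids[i + 1:]:
            distance = 1.0 - cosine_similarity(records[left], records[right])
            if distance <= max_distance:
                parent[find(left)] = find(right)

    groups: Dict[str, List[str]] = {}
    for record_id in ids:
        groups.setdefault(find(record_id), []).append(record_id)
    return sorted(groups.values())
```

For example, two records that share both of their two CUIs have similarity 1.0 and are grouped, while a record with no CUI in common with them has similarity 0.0 (distance 1.0) and stays on its own. The pairwise loop is quadratic in the number of records, so this sketch only suits small samples.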
