Copyright (c) 2012, Phil Gooch. This software is licensed under the GNU Library General Public License Version 3, 29 June 2007.

See LICENSE.txt file for license details.


BADREX: Biomedical Abbreviation Expander

BADREX is a GATE plugin that identifies term-abbreviation pairs using dynamic regular expressions that generalise and extend the Schwartz-Hearst algorithm. In addition, it uses a subset of the inner-outer selection rules described in Ao & Takagi's ALICE algorithm. Rather than simply extracting terms and their abbreviations, it annotates them in situ and adds the corresponding long-form and short-form text as features on each annotation.
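
For orientation, the following is a minimal Java sketch of the classic Schwartz-Hearst long-form search that BADREX generalises. It is not BADREX's own code: BADREX uses dynamic regular expressions and additional ALICE-style rules that are not reproduced here. The class and method names are hypothetical, and the candidate text is assumed to be the words immediately preceding the parenthesised short form.

```java
// Hypothetical sketch of the classic Schwartz-Hearst matching step.
// This is NOT the BADREX implementation, only the baseline it extends.
final class SchwartzHearstSketch {

    /**
     * Scans the short form right to left and looks for each of its
     * alphanumeric characters, in order, in the candidate text.
     * The first character of the short form must match the first
     * character of a word in the candidate. Returns the matched long
     * form, or null if no match exists.
     */
    static String findBestLongForm(String shortForm, String candidate) {
        int lIndex = candidate.length() - 1;
        for (int sIndex = shortForm.length() - 1; sIndex >= 0; sIndex--) {
            char c = Character.toLowerCase(shortForm.charAt(sIndex));
            if (!Character.isLetterOrDigit(c)) {
                continue; // skip punctuation in the short form
            }
            // Move left until the character matches; for the first short-form
            // character, additionally require a word-initial position.
            while ((lIndex >= 0
                        && Character.toLowerCase(candidate.charAt(lIndex)) != c)
                    || (sIndex == 0 && lIndex > 0
                        && Character.isLetterOrDigit(candidate.charAt(lIndex - 1)))) {
                lIndex--;
            }
            if (lIndex < 0) {
                return null; // ran out of candidate text
            }
            lIndex--;
        }
        // Extend to the start of the word containing the first match.
        int start = candidate.lastIndexOf(' ', lIndex) + 1;
        return candidate.substring(start);
    }

    public static void main(String[] args) {
        String before = "patients with chronic obstructive pulmonary disease";
        // prints "chronic obstructive pulmonary disease"
        System.out.println(findBestLongForm("COPD", before));
    }
}
```

In this sketch a null result means that no long form was found in the candidate text; the original algorithm also applies length and validity checks to the short form, which are omitted here.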

BADREX can optionally expand all abbreviations in the text that match the short form of the most recently matched long-form/short-form pair. It can also annotate and classify common medical abbreviations from a list extracted from Wikipedia.
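
The expansion option can be pictured with the sketch below, which finds every standalone occurrence of a known short form and records its offsets together with the long form, in the spirit of in-situ annotation. This is an illustration under stated assumptions, not BADREX's code: the `Span` class, the method name, and the word-boundary rule (no adjacent letter or digit) are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: locate every standalone instance of a short form.
final class ShortFormExpansionSketch {

    /** Offsets of one short-form occurrence plus its expansion. */
    static final class Span {
        final int start;
        final int end;
        final String longForm;

        Span(int start, int end, String longForm) {
            this.start = start;
            this.end = end;
            this.longForm = longForm;
        }
    }

    static List<Span> findShortFormInstances(String text, String shortForm, String longForm) {
        // Assumed boundary rule: the short form must not touch a letter or digit.
        Pattern p = Pattern.compile(
            "(?<![\\p{L}\\p{N}])" + Pattern.quote(shortForm) + "(?![\\p{L}\\p{N}])");
        Matcher m = p.matcher(text);
        List<Span> spans = new ArrayList<>();
        while (m.find()) {
            spans.add(new Span(m.start(), m.end(), longForm));
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "Chronic obstructive pulmonary disease (COPD) is common. "
            + "COPD worsens over time; COPDX is not COPD.";
        for (Span s : findShortFormInstances(text, "COPD", "Chronic obstructive pulmonary disease")) {
            System.out.println(s.start + "-" + s.end + " -> " + s.longForm);
        }
    }
}
```

Recording offsets rather than rewriting the text keeps the original document intact, which matches the annotate-in-situ approach described above.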

Against the Medstract corpus it achieves precision and recall of 98% and 97% respectively.

You can download the corrected BioText gold standard markables and the corrected Medstract gold standard markables. BioText corpus reproduced with kind permission of Prof. Marti Hearst.

A white paper describing BADREX and its evaluation is available.

How to use BADREX

BADREX is compatible with GATE version 6.1 and higher. Load the plugin either via the GATE Java API or in GATE Developer: go to File->Manage CREOLE Plugins, click the 'Add new CREOLE repository' button, and select the 'BiomedicalAbbreviationExpander' directory.

Be sure to add a RegEx Sentence Splitter to your pipeline before running this plugin. (The ANNIE Sentence Splitter can also be used, although this also requires a Tokenizer.)

  • To favour precision over recall, set maxInner and maxOuter to low values (e.g. 5) and set threshold to 0.9 or 1.0.
  • To favour recall over precision, set maxInner and maxOuter to high values (e.g. 10) and set threshold to 0.75 or below.


Parameters

Init-time

  • configFileURL: Location of configuration file that lists the stop-words and lookup files
  • gazetteerListsURL: Location of gazetteer definition file for lists of common medical abbreviations

Run-time

  • inputASName: Input annotation set name
  • outputASName: Output annotation set name
  • sentenceType: Annotation type for Sentence annotations. Defaults to Sentence.
  • longType: Annotation type to mark the term's long form
  • longTypeFeature: Feature name to contain the expanded abbreviation on the short form annotation
  • shortType: Annotation type to mark the term's short form
  • shortTypeFeature: Feature name to contain the abbreviation on the long form annotation
  • expandAllShortFormInstances: Once a term-abbreviation pair has been identified, should all instances of that abbreviation be annotated and expanded? Defaults to false.
  • maxInner: Maximum length of inner string (text inside parentheses)
  • maxOuter: Maximum length of outer string (text before parentheses)
  • threshold: Fraction of short form characters that must match the long form to count as a match
  • swapShortest: Swap annotation types if the outer phrase is shorter than the inner phrase? Defaults to true (some datasets always annotate the outer phrase the same way, even if the inner phrase is the abbreviation)
  • useLookups: Set to true to run a gazetteer lookup of common medical abbreviations
  • useBidirectionMatch: If the first character of the inner does not match the first character of the outer, attempts to match the first character of the last word of the outer against the last character of the inner. Defaults to false. Setting it to true can boost recall but reduce precision.
  • underlyingAnnots: Use these annotations to tag the term and abbreviation if they contain or are contained in the matched long-form of the term. Clear this list to use the default longType and shortType annotations.
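
One way to read the threshold parameter is sketched below: count how many alphanumeric characters of the short form can be found, in order, in the long form, and divide by the total. This is an assumed interpretation for illustration only; the class and method names are hypothetical, and the greedy left-to-right scan is a simplification that may differ from BADREX's actual scoring.

```java
import java.util.Locale;

// Hypothetical sketch of a threshold check; not taken from BADREX.
final class ThresholdSketch {

    /**
     * Fraction of alphanumeric short-form characters found, in order,
     * in the long form (case-insensitive, greedy left-to-right scan).
     */
    static double matchedFraction(String shortForm, String longForm) {
        String lowerLong = longForm.toLowerCase(Locale.ROOT);
        int from = 0;
        int matched = 0;
        int total = 0;
        for (int i = 0; i < shortForm.length(); i++) {
            char c = Character.toLowerCase(shortForm.charAt(i));
            if (!Character.isLetterOrDigit(c)) {
                continue; // ignore punctuation such as hyphens
            }
            total++;
            int hit = lowerLong.indexOf(c, from);
            if (hit >= 0) {
                matched++;
                from = hit + 1; // later characters must appear further right
            }
        }
        return total == 0 ? 0.0 : (double) matched / total;
    }

    static boolean accept(String shortForm, String longForm, double threshold) {
        return matchedFraction(shortForm, longForm) >= threshold;
    }

    public static void main(String[] args) {
        String longForm = "chronic obstructive pulmonary disease";
        System.out.println(matchedFraction("COPD", longForm));
        System.out.println(accept("COPDX", longForm, 0.75));
        System.out.println(accept("COPDX", longForm, 0.9));
    }
}
```

Under this reading, a short form with one unmatched character out of five scores 0.8, so it passes a recall-oriented threshold of 0.75 but fails a precision-oriented threshold of 0.9.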
