abbreviation-extraction's Introduction

Extraction of abbreviation-definition pairs

Version: 0.2.5

This is a Python3 implementation of the Schwartz-Hearst algorithm for identifying abbreviations and their corresponding definitions in free text[1].

The original implementation is in Java, and Vincent Van Asch created a Python2 implementation at

http://www.cnts.ua.ac.be/~vincent/scripts/abbreviations.py

  • NB: As of March 2019 this link appears to be dead.

I have simplified and refactored it for Python 3, and added some tests.

This version outputs a Python dictionary of abbreviation:definition pairs.

Installation for command-line use

pip install -r requirements.txt

Usage

From the command line

python abbreviations/schwartz_hearst.py <input file>

Installation as a module

python3 setup.py install

or

pip install abbreviations

Usage

from abbreviations import schwartz_hearst

# By default, the most recently encountered definition for each term is returned
pairs = schwartz_hearst.extract_abbreviation_definition_pairs(doc_text='The emergency room (ER) was busy')
pairs = schwartz_hearst.extract_abbreviation_definition_pairs(file_path='<path_to_file>')

# If multiple definitions are encountered for each term, you might want to return the most common for each
pairs = schwartz_hearst.extract_abbreviation_definition_pairs(doc_text='...', most_common_definition=True)

# ... or you might want to return the first encountered definition for each
pairs = schwartz_hearst.extract_abbreviation_definition_pairs(doc_text='...', first_definition=True)

# when using a longer text, the format is line-separated sentences:
import nltk
sentences = nltk.sent_tokenize(longer_text)
pairs = schwartz_hearst.extract_abbreviation_definition_pairs(doc_text='\n'.join(sentences))
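
The three selection policies mentioned above (most recent, most common, first) can be illustrated with a small standalone sketch. This is not the library's internal code: `select_definitions` and its `policy` argument are hypothetical names used only to show what each policy means when one abbreviation is defined several times.

```python
from collections import Counter, defaultdict


def select_definitions(occurrences, policy="last"):
    """Collapse (abbreviation, definition) occurrences to one definition each.

    Hypothetical helper for illustration only.
    policy: "last" (most recent), "first", or "most_common".
    """
    # Group definitions by abbreviation, preserving encounter order.
    grouped = defaultdict(list)
    for abbreviation, definition in occurrences:
        grouped[abbreviation].append(definition)

    if policy == "last":
        return {abbr: defs[-1] for abbr, defs in grouped.items()}
    if policy == "first":
        return {abbr: defs[0] for abbr, defs in grouped.items()}
    if policy == "most_common":
        return {abbr: Counter(defs).most_common(1)[0][0]
                for abbr, defs in grouped.items()}
    raise ValueError(f"unknown policy: {policy!r}")


# Made-up occurrences, in the order they would appear in a document.
occurrences = [
    ("ER", "emergency room"),
    ("ER", "endoplasmic reticulum"),
    ("ER", "emergency room"),
    ("ER", "estrogen receptor"),
]

print(select_definitions(occurrences, "last"))
print(select_definitions(occurrences, "first"))
print(select_definitions(occurrences, "most_common"))
```

With these made-up occurrences, "last" picks the final definition, "first" picks the opening one, and "most_common" picks the definition that appears twice.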

[1] A. Schwartz and M. Hearst (2003) A Simple Algorithm for Identifying Abbreviation Definitions in Biomedical Text. Biocomputing, 451-462.

abbreviation-extraction's People

Contributors

aolieman, micahjsmith, phil-scholarcy, philgooch, renaud

abbreviation-extraction's Issues

[tests] Some results, for reference ...

This is fabulous; thank you! :-)

I ran some tests. As expected abbreviations in [ ] or { } are ignored (request: add these?), and as mentioned in Issue #12 it would be nice if reverse definitions such as n.s. (not significant) were included, along with the standard not significant (n.s.) occurrences.


## SINGLE-SENTENCE TESTS:

$ cat victoria-test-single_sentence.txt; python abbreviations/schwartz_hearst.py victoria-test-single_sentence.txt 

  Breast cancer susceptibility gene 1 (BRCA1) is a tumor suppressor protein.
  {'BRCA1': 'Breast cancer susceptibility gene 1'}

## Ditto:

  Breast cancer susceptibility gene 2 [BRCA2] is also a tumor suppressor protein.
  {}

  Victoria is from NS (Nova Scotia).
  {}

  Victoria is from N.S. (Nova Scotia).
  {}

  Victoria is from Nova Scotia (N.S.).
  {'N.S.': 'Nova Scotia'}

  Victoria is from Nova Scotia (NS).
  {'NS': 'Nova Scotia'}

  Breast cancer susceptibility gene 2 (BRCA2) is also a tumor suppressor protein.
  {'BRCA2': 'Breast cancer susceptibility gene 2'}

  Breast cancer susceptibility gene 2 {BRCA2} is also a tumor suppressor protein.
  {}

  Breast cancer susceptibility gene 2 -- BRCA2 -- is also a tumor suppressor protein.
  {}

## More complex tests:

  Victoria is from Nova Scotia (NoSc).
  {'NoSc': 'Nova Scotia'}

  Victoria is from Nova Scotia (ns).
  {'ns': 'Nova Scotia'}

  Victoria is from Nova Scotia (nosc).
  {'nosc': 'Nova Scotia'}

  Association of bioMediCal scientists of CanaDa (ABC).
  {'ABC': 'Association of bioMediCal scientists of CanaDa'}

I am impressed~ :-D
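
A possible workaround for the `[ ]` and `{ }` cases reported above, until the library handles them itself, is to rewrite short bracketed spans as parenthesized ones before extraction. The sketch below is an assumption, not part of the package: `normalize_brackets` is a hypothetical helper, and the 2 to 10 character window is borrowed from the candidate length limit quoted in the `conditions` docstring.

```python
import re

# Short spans in square or curly brackets, e.g. [BRCA2] or {BRCA2}.
# Opening and closing brackets must be of the same kind.
_BRACKETED = re.compile(
    r"\[([^\[\]{}()]{2,10})\]|\{([^\[\]{}()]{2,10})\}"
)


def normalize_brackets(text):
    """Rewrite short [..] and {..} spans as (..); leave everything else alone.

    Hypothetical pre-processing step, not part of the library.
    """
    return _BRACKETED.sub(
        lambda m: "(" + (m.group(1) or m.group(2)) + ")", text
    )


sentence = "Breast cancer susceptibility gene 2 [BRCA2] is also a tumor suppressor protein."
print(normalize_brackets(sentence))
```

The normalized string could then be passed as `doc_text` to `schwartz_hearst.extract_abbreviation_definition_pairs`. Single-character spans such as a `[1]` citation marker fall outside the length window and are left untouched, but longer numeric citations would be rewritten, so this is only suitable for text where that is acceptable.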

Is there any example

If I want to extract the introduction from a scientific paper, is there a nice tool?

Condition redundant

def conditions(candidate):
    """
    Based on Schwartz&Hearst
    2 <= len(str) <= 10
    len(tokens) <= 2
    re.search(r'\p{L}', str)
    str[0].isalnum()
    and extra:
    if it matches (\p{L}\.?\s?){2,}
    it is a good candidate.
    :param candidate: candidate abbreviation
    :return: True if this is a good candidate
    """
    viable = True
    if regex.match(r'(\p{L}\.?\s?){2,}', candidate.lstrip()):
        viable = True
    if len(candidate) < 2 or len(candidate) > 10:
        viable = False
    if len(candidate.split()) > 2:
        viable = False
    if not regex.search(r'\p{L}', candidate):
        viable = False
    if not candidate[0].isalnum():
        viable = False
    return viable

if regex.match(r'(\p{L}\.?\s?){2,}', candidate.lstrip()):

The above condition has no effect: "viable" is already True at that point, so the result is the same whether the match succeeds or fails.
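
For comparison, here is a minimal sketch of the same checks with the no-op branch dropped. It is not the maintainer's fix. Two deliberate differences from the quoted function: it uses `str.isalpha()` as a standard-library stand-in for `regex`'s `\p{L}` (an assumption that the two agree for the inputs of interest), and it returns early, so an empty string yields False instead of reaching `candidate[0]`.

```python
def conditions(candidate):
    """Return True if candidate looks like a viable abbreviation.

    Sketch only: the four remaining checks from the quoted function,
    written with early returns and no third-party dependency.
    """
    # Length limit: 2 <= len(candidate) <= 10
    if not 2 <= len(candidate) <= 10:
        return False
    # At most two whitespace-separated tokens
    if len(candidate.split()) > 2:
        return False
    # Must contain at least one Unicode letter
    # (str.isalpha() stands in for regex's \p{L} here)
    if not any(ch.isalpha() for ch in candidate):
        return False
    # First character must be a letter or digit
    return candidate[0].isalnum()


for sample in ["BRCA1", "N.S.", "A", "1234", "(ER"]:
    print(sample, conditions(sample))
```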

a few false positives

Hi Phil,

Thanks so much for this clean and easy to use implementation! I noticed a couple of minor false positives when running it through a long document about space warfare.

Text:

The “satellite” goal of the program was accomplished when China established a space presence with the launch of Dongfanghong I in 1970; although, it wasn’t until the 21st century that the PRC space program kicked into high gear, with the rapid development, buildup and deployment of rockets, satellites, and the first Taikonaut (astronaut) in October 2003. In fact, prior to 2010, the PRC had only conducted ten space launches, one of which put the satellite into orbit.

Once more, also for the Space Race, a strong transatlantic link could strengthen the path towards a peaceful and prosperous future for humankind and by consequence, a more secure period for our democracies: it is in our hands (and brains) to transform these ideas into a great reality.

Berlin is acknowledging the vulnerabilities that could potentially arise through hostile acts in space and set up its own space monitoring center, called the Air and Space Operations Center (ASOC) in September 2020 . 

Abbreviations proposed:

ASOC: and Space Operations Center
and brains: and by consequence, a more secure period for our democracies: it is in our hands
astronaut: and deployment of rockets, satellites, and the first Taikonaut

In case you feel like tweaking!

Fred
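
All three pairs reported above have a definition that begins with "and", so one cheap way to catch them after extraction is to set aside any pair whose definition starts with a function word. The sketch below is a hypothetical post-filter, not part of the library: `LEADING_STOPWORDS` and `split_suspect_pairs` are made-up names, and the word list is an assumption. It only flags pairs for review; it does not recover the intended definition (for example "Air and Space Operations Center").

```python
# Hypothetical stop-word list; an assumption, extend as needed.
LEADING_STOPWORDS = {"and", "or", "the", "of", "in", "a", "an"}


def split_suspect_pairs(pairs):
    """Split extracted pairs into (kept, suspect).

    A pair is suspect when its definition is empty or starts
    with a word from LEADING_STOPWORDS.
    """
    kept, suspect = {}, {}
    for abbreviation, definition in pairs.items():
        words = definition.split()
        if not words or words[0].lower() in LEADING_STOPWORDS:
            suspect[abbreviation] = definition
        else:
            kept[abbreviation] = definition
    return kept, suspect


# Pairs copied from the report above, plus one clean pair for contrast.
reported = {
    "ASOC": "and Space Operations Center",
    "and brains": "and by consequence, a more secure period for our democracies: it is in our hands",
    "astronaut": "and deployment of rockets, satellites, and the first Taikonaut",
    "BRCA1": "Breast cancer susceptibility gene 1",
}
kept, suspect = split_suspect_pairs(reported)
print(sorted(kept))
print(sorted(suspect))
```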

Release on PyPI?

You've already done the hard work of cutting releases on GitHub. It would be easier to install if the package were also released on PyPI. I'm happy to help out with this, with your blessing.

Improve abbreviation-term mapping when parentheses are not present

Abbreviation expansions of the form This Is A Term (TIAT) are not always present in a document. You may also see glossary lists such as

TIAT This Is A Term

where the abbreviation comes first, and abbreviation and term are separated by a tab or other whitespace.

Can we extend the algorithm to resolve such cases?
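
As a starting point for discussion, glossary lines of this shape could be handled by a separate line-based pass. The sketch below is one possible approach and not an extension of the Schwartz-Hearst matching itself. `parse_glossary` and `is_subsequence` are hypothetical names, and the acceptance test (same first character, abbreviation characters appearing in order in the term) is an assumption that will over-match on ordinary prose, so it is only meant for sections known to be glossaries.

```python
import re

# An abbreviation of 2-10 non-space characters, then spaces or tabs, then the term.
_GLOSSARY_LINE = re.compile(r"^\s*(\S{2,10})[ \t]+(\S.*?)\s*$")


def is_subsequence(short, long):
    """True if the alphanumeric characters of short occur in order in long."""
    remaining = iter(long.lower())
    return all(ch in remaining for ch in short.lower() if ch.isalnum())


def parse_glossary(text):
    """Collect abbreviation -> term pairs from glossary-style lines."""
    pairs = {}
    for line in text.splitlines():
        match = _GLOSSARY_LINE.match(line)
        if not match:
            continue
        abbreviation, definition = match.groups()
        # Require the same first character, ignoring case.
        if definition[0].lower() != abbreviation[0].lower():
            continue
        if is_subsequence(abbreviation, definition):
            pairs[abbreviation] = definition
    return pairs


glossary = "TIAT\tThis Is A Term\nER   emergency room\nHello world\n"
print(parse_glossary(glossary))
```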
