This is a Python 3 implementation of the Schwartz-Hearst algorithm for identifying abbreviations and their corresponding definitions in free text [1].
The original implementation is in Java. Vincent Van Asch created a Python 2 implementation, available at
http://www.cnts.ua.ac.be/~vincent/scripts/abbreviations.py
I took Vincent's code, simplified it a little, refactored it for Python 3, and added some tests.
This version outputs a Python dictionary of abbreviation:definition pairs.
As per Vincent's code, this version is licensed under GPLv3; see LICENSE.txt.
To install the dependencies:
pip install -r requirements.txt
To run it from the command line:
python3 abbreviations/schwartz_hearst.py <input file>
To install the package:
python3 setup.py install
or
pip install abbreviations
To use it from Python, pass either the text itself or the path to a file:
from abbreviations import schwartz_hearst
pairs = schwartz_hearst.extract_abbreviation_definition_pairs(doc_text='The emergency room (ER) was busy')
pairs = schwartz_hearst.extract_abbreviation_definition_pairs(file_path='<path_to_file>')
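For readers curious about what happens under the hood, here is a minimal sketch of the core matching step described in [1], for the "long form (SF)" pattern only. It is not the code in this package: the names find_best_long_form and extract_pairs_sketch are hypothetical, and the sketch leaves out the validity checks on short forms and the "SF (long form)" pattern that a full implementation needs. The idea is to walk the short form right to left, matching each character against the text preceding the parenthesis, and to require that the first character of the short form starts a word.

```python
import re


def find_best_long_form(short_form, candidate):
    """Return the shortest tail of `candidate` that matches `short_form`, or None.

    Illustrative re-statement of the matching loop from the paper [1];
    not the function used by this package.
    """
    s_index = len(short_form) - 1
    l_index = len(candidate) - 1
    while s_index >= 0:
        ch = short_form[s_index].lower()
        if not ch.isalnum():
            # Skip punctuation inside the short form.
            s_index -= 1
            continue
        # Move left until the character matches. The first character of the
        # short form must additionally sit at the start of a word.
        while (l_index >= 0 and candidate[l_index].lower() != ch) or (
            s_index == 0 and l_index > 0 and candidate[l_index - 1].isalnum()
        ):
            l_index -= 1
        if l_index < 0:
            return None
        l_index -= 1
        s_index -= 1
    # Extend the match back to the start of the word it begins in.
    start = candidate.rfind(" ", 0, l_index + 1) + 1
    return candidate[start:]


def extract_pairs_sketch(sentence):
    """Collect {short form: long form} for each 'long form (SF)' occurrence."""
    pairs = {}
    for match in re.finditer(r"\(([^()]+)\)", sentence):
        short_form = match.group(1).strip()
        if not short_form:
            continue
        words = sentence[: match.start()].split()
        # Search window from the paper: min(|A| + 5, |A| * 2) words.
        max_words = min(len(short_form) + 5, len(short_form) * 2)
        candidate = " ".join(words[-max_words:])
        long_form = find_best_long_form(short_form, candidate)
        if long_form:
            pairs[short_form] = long_form
    return pairs


print(extract_pairs_sketch("The emergency room (ER) was busy"))
# {'ER': 'emergency room'}
```

For anything beyond experimentation, use extract_abbreviation_definition_pairs as shown above, since it handles cases this sketch ignores.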
[1] A. Schwartz and M. Hearst (2003) A Simple Algorithm for Identifying Abbreviation Definitions in Biomedical Text. Biocomputing, 451-462.