
dhchenx / umls-similarity


Estimate similarity of medical concepts based on Unified Medical Language System (UMLS)

Home Page: https://pypi.org/project/umls-similarity/

License: MIT License

Python 6.93% Perl 93.07%
unified-medical-language-system umls medical-terminology medical-concept semantic-similarity umls-similarity umls-metathesaurus

umls-similarity's Introduction

UMLS-Similarity

Estimate the similarity of medical concepts based on Unified Medical Language System (UMLS) and WordNet

Installation

First, install a Perl environment (Strawberry Perl).

For UMLS use:

  1. Install MySQL and MySQL Workbench (the MySQL home folder must not have spaces in its path);

  2. Download the UMLS and extract the subset;

  3. Go to the UMLS META and NET folders and load the UMLS data into the MySQL database with the scripts;

  4. Install the necessary Perl libraries with the 'cpanm' command and the --force flag, as below:

    cpanm UMLS::Interface --force
    
    cpanm UMLS::Similarity --force
    

    Errors may occur during this process; you can ignore them.

  5. Check whether DBI and DBD::mysql are installed; install them if not;

    • Issue: mysql.xs.dll cannot be found; see the link for more details.

    • Solution: copy C:\strawberry\c\bin\libmysql.dll_ to c:\strawberry\perl\vendor\lib\auto\mysql

  6. Finished!
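
To confirm that the Perl modules from the steps above can actually be loaded, a minimal sketch along these lines may help. It relies on the standard `perl -MModule -e 1` idiom, which exits with status 0 only when the module loads. The helper names `build_check_command` and `perl_module_installed` are hypothetical and not part of umls-similarity, and the sketch assumes `perl` is on your PATH.

```python
import subprocess


def build_check_command(module, perl="perl"):
    """Return the command that asks Perl to load `module` and then exit."""
    # `perl -MSome::Module -e 1` exits with status 0 only if the module loads.
    return [perl, f"-M{module}", "-e", "1"]


def perl_module_installed(module, perl="perl"):
    """Return True if `perl` can load `module`, False otherwise."""
    try:
        result = subprocess.run(
            build_check_command(module, perl),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
    except OSError:
        # The Perl executable itself was not found or could not be started.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    for name in ("DBI", "DBD::mysql", "UMLS::Interface", "UMLS::Similarity"):
        status = "OK" if perl_module_installed(name) else "missing"
        print(f"{name}: {status}")
```

A module reported as missing here is worth reinstalling before moving on, since the Python package calls into these Perl libraries.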

For WordNet use (skip this part if you do not need WordNet Similarity):

  1. Download WordNet-2.1
  2. Set the WNHome environment variable
  3. Install WordNet::QueryData with the cpanm command
  4. Install WordNet::Similarity with the cpanm command
  5. Finished!
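
Before running anything, it can be useful to verify that the environment variable points at a usable WordNet folder. The sketch below is only an illustration: it assumes the variable is read as `WNHOME` (Windows treats environment variable names case-insensitively, so `WNHome` resolves to the same variable there) and that the WordNet installation contains a `dict` subfolder. The helper `find_wordnet_dict` is hypothetical and not part of umls-similarity.

```python
import os


def find_wordnet_dict(env=None, var_name="WNHOME"):
    """Return the path of the WordNet `dict` folder, or None if it is not usable."""
    env = os.environ if env is None else env
    home = env.get(var_name)
    if not home:
        # The variable is unset or empty.
        return None
    dict_dir = os.path.join(home, "dict")
    # Only report the folder if it really exists on disk.
    return dict_dir if os.path.isdir(dict_dir) else None


if __name__ == "__main__":
    found = find_wordnet_dict()
    print(found if found else "WordNet dict folder not found; check the environment variable")
```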

Finally, install our Python package umls-similarity via pip:

pip install umls-similarity

Available similarity measures

  • Leacock and Chodorow (1998) referred to as lch
  • Wu and Palmer (1994) referred to as wup
  • Zhong, et al. (2002) referred to as zhong
  • The basic path measure referred to as path
  • The undirected path measure referred to as upath
  • Rada, et al. (1989) referred to as cdist
  • Nguyen and Al-Mubaid (2006) referred to as nam
  • Resnik (1996) referred to as res
  • Lin (1998) referred to as lin
  • Jiang and Conrath (1997) referred to as jcn
  • The vector measure referred to as vector
  • Pekar and Staab (2002) referred to as pks
  • Pirro and Euzenat (2010) referred to as faith
  • Maedche and Staab (2001) referred to as cmatch
  • Batet, et al. (2011) referred to as batet
  • Sanchez, et al. (2012) referred to as sanchez
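
Since the measure is passed around as a short string, a typo is easy to make. The following sketch checks a name against the abbreviations listed above before it is used. `KNOWN_MEASURES` and `normalize_measure` are hypothetical names for illustration; whether the library itself accepts upper-case or padded names is not something this sketch assumes.

```python
# Abbreviations copied from the list of available measures.
KNOWN_MEASURES = (
    "lch", "wup", "zhong", "path", "upath", "cdist", "nam", "res",
    "lin", "jcn", "vector", "pks", "faith", "cmatch", "batet", "sanchez",
)


def normalize_measure(name):
    """Return the stripped, lower-cased measure name, or raise ValueError."""
    cleaned = name.strip().lower()
    if cleaned not in KNOWN_MEASURES:
        raise ValueError(
            f"Unknown measure {name!r}; expected one of: {', '.join(KNOWN_MEASURES)}"
        )
    return cleaned


if __name__ == "__main__":
    print(normalize_measure(" LCH "))
```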

Let the Code Speak

Example Code 1: Estimate similarity between two medical concepts using UMLS

from umls_similarity.umls import UMLSSimilarity
import os

if __name__ == "__main__":
    # define MySQL information that stores UMLS data in your computer
    mysql_info = {}
    mysql_info["database"] = "umls"
    mysql_info["username"] = "root"
    mysql_info["password"] = "{I am not gonna tell you}"
    mysql_info["hostname"] = "localhost"

    # The Perl binary's path is detected automatically by the library, but you can also specify it manually in the constructor
    # perl_bin_path = r"C:\Strawberry\perl\bin\perl"

    # create an instance
    umls_sim = UMLSSimilarity(mysql_info=mysql_info,
                              # perl_bin_path=''
                              )
    
    # show the names of all available measures so you can pass them into the following `measure` parameter
    measures = umls_sim.get_all_measures()
    print(measures)

    # Directly pass two CUIs into the function below:
    sims = umls_sim.similarity(cui1="C0017601", cui2="C0232197", measure="lch")
    print(sims[0])  # only one pair with two concepts
    
    # Or batch process many CUI pairs from a text file where each line is formatted like 'C0006949<>C0031507'
    current_path = os.path.dirname(os.path.realpath(__file__))
    # build the path with os.path.join so it does not depend on a hard-coded backslash
    sims = umls_sim.similarity_from_file(os.path.join(current_path, "cuis_umls_sim.txt"), measure="lch")
    for sim in sims:
        print(sim)
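
The batch file mentioned in the example uses one `CUI<>CUI` pair per line. A minimal sketch for producing such a file from Python could look like the following. The helpers `format_cui_pairs` and `write_cui_pairs` are hypothetical and not part of umls-similarity; the check relies on the usual UMLS CUI shape of the letter C followed by seven digits.

```python
import re

# UMLS CUIs consist of the letter C followed by seven digits.
CUI_PATTERN = re.compile(r"C\d{7}")


def format_cui_pairs(pairs):
    """Turn (cui1, cui2) tuples into 'CUI1<>CUI2' lines, validating each CUI."""
    lines = []
    for cui1, cui2 in pairs:
        for cui in (cui1, cui2):
            if not CUI_PATTERN.fullmatch(cui):
                raise ValueError(f"Not a valid CUI: {cui!r}")
        lines.append(f"{cui1}<>{cui2}")
    return lines


def write_cui_pairs(pairs, path):
    """Write the formatted pairs to `path`, one per line; return the line count."""
    lines = format_cui_pairs(pairs)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")
    return len(lines)


if __name__ == "__main__":
    example = [("C0006949", "C0031507"), ("C0017601", "C0232197")]
    print("\n".join(format_cui_pairs(example)))
```

Validating before writing means a malformed identifier is reported in Python rather than surfacing later inside the Perl scripts.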

Example Code 2: Estimate similarity between concepts using WordNet 2.1

from umls_similarity.wordnet import WNSimilarity

if __name__ == "__main__":

    wn_root_path = r"C:\Program Files (x86)\WordNet\2.1"
    # perl_bin_path=r"C:\Strawberry\perl\bin\perl"

    var1 = "dog#n#1"
    var2 = "orange#n#1"

    wn_sim = WNSimilarity(wn_root_path=wn_root_path)

    sims = wn_sim.similarity(var1, var2)
    print(sims)

    # print the position, the key, and the value stored under that key
    for i, key in enumerate(sims):
        print(i, '\t', key, '\t', sims[key])
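
The inputs `dog#n#1` and `orange#n#1` follow the `word#pos#sense` convention used by WordNet::Similarity: a word, a part-of-speech letter, and a 1-based sense number. The sketch below splits and validates such a string. `WordSense` and `parse_word_sense` are hypothetical names, and the set of accepted part-of-speech letters (n, v, a, r) is an assumption of this sketch, not a statement about what each measure supports.

```python
from typing import NamedTuple


class WordSense(NamedTuple):
    word: str
    pos: str
    sense: int


def parse_word_sense(text):
    """Split a 'word#pos#sense' string into its three parts."""
    parts = text.split("#")
    if len(parts) != 3:
        raise ValueError(f"Expected word#pos#sense, got {text!r}")
    word, pos, sense = parts
    if not word:
        raise ValueError(f"Missing word in {text!r}")
    if pos not in {"n", "v", "a", "r"}:
        raise ValueError(f"Unexpected part of speech {pos!r} in {text!r}")
    if not sense.isdecimal() or int(sense) < 1:
        raise ValueError(f"Sense number must be a positive integer in {text!r}")
    return WordSense(word, pos, int(sense))


if __name__ == "__main__":
    print(parse_word_sense("dog#n#1"))
```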

Credits

This project is a wrapper around the Perl libraries UMLS::Similarity and UMLS::Interface.

Note: Plenty of unexpected errors may occur while installing the UMLS::Similarity Perl library, possibly because I am not an expert in Perl and its libraries.

License

The umls-similarity Python package is provided by Donghua Chen.


umls-similarity's Issues

Question about CuiFinder.pm

Hello

Thank you for publishing this library.
I am not familiar with Perl.
So far I have downloaded UMLS and loaded it into MySQL with the scripts.
After running this command (cpanm UMLS::Interface --force) I get the following error:
[image: screenshot of the error]

I guess this error is related to the MySQL username and password.
But I cannot find CuiFinder.pm to add a fixed username and password at line 721.
Would you please help me with this issue?

Thank you very much

Indexing Issues with UMLS?

Hey there,

I'm having an issue that's causing the indexing process that's done in the umls-similarity.pl script to take an incredibly long time.

I've looked into the Perl source code and I understand that the UMLS::Similarity and UMLS::Interface modules need to first create a set of indexes in order to speed up subsequent path-dependent semantic similarity measures. I've checked their mailing list, which has unfortunately been inactive since 2019, and I know that this process is expected to take several hours and maybe even days. However, the first time I ran it on my machine, I left it running over the weekend, and after 48+ hours only about 1000 CUIs had been indexed, which means that at that rate it would take over 17 YEARS to index all of the 3.31 million concepts within the UMLS... The machine I'm running it on should definitely not be having these issues, at least from a hardware perspective (96 cores, 488 GB RAM, 2TB SSD).
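
The estimate in this report can be reproduced with a few lines of arithmetic. The sketch below assumes the indexing rate stays constant and takes the elapsed time as exactly 48 hours; the function name `estimate_total_years` is hypothetical.

```python
def estimate_total_years(indexed, elapsed_hours, total):
    """Extrapolate the total indexing time in years from the observed rate."""
    rate_per_hour = indexed / elapsed_hours  # concepts indexed per hour
    total_hours = total / rate_per_hour      # hours needed for all concepts
    return total_hours / 24 / 365            # convert hours to years


if __name__ == "__main__":
    # About 1000 CUIs in 48 hours, out of 3.31 million concepts.
    years = estimate_total_years(indexed=1000, elapsed_hours=48, total=3_310_000)
    print(f"{years:.1f} years")
```

With these inputs the result comes out at roughly 18 years, consistent with the "over 17 years" figure quoted in the report.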

I've been experimenting with the MySQL my.ini config file and was able to reduce that down to 3 years. I used the parameters delineated here as a starting point: https://www.nlm.nih.gov/research/umls/implementation_resources/scripts/README_RRF_MySQL_Output_Stream.html. A good improvement but it is still unreasonably long... I'm aware of the --realtime flag which I've turned on, however, that can take a VERY long time to process some concept pairs and I'd rather go through the long process of setting up the indexes once in order to significantly speed up any subsequent calculations.

The routine that's causing these delays seems to be the subroutine "_initializeDepthFirstSearch" that can be found here: https://github.com/bmcinnes/UMLS-Interface/blob/master/lib/UMLS/Interface/PathFinder.pm

Do you have any recommendations/suggestions? Could you share with me what is the exact configuration you used for your UMLS build as well as your MySQL parameters?

Here's some system info in case it helps:
OS: Windows Server 2019 Datacenter (64-bit OS)
MySQL version : 8.1.0
UMLS Version: 2023AA
Perl Version: Strawberry Perl 5.38.0.1-64bit
UMLS Interface Version: 1.51

I'm including some screenshots of what I'm seeing in case it helps:

Here you can see that the routine successfully connected to the MySQL database:
[image 1]

Here you can see that after a successful connection to the umls database in MySQL, there is a new schema called "umlsinterfaceindex":
[image 3]

Here are the contents of the file containing example CUI pairs that I'm looking to get a semantic similarity measure for:
[image 2]

Here you can see some of the outputs from running the routine with VERBOSE mode turned on:
[image 7]
[image 8]

Here you can see the contents of a table file that gets created alongside verbose mode that tracks the progress of the indexing process:
[image 9]

Please let me know if you have any thoughts on this.

Thanks!
