ML-EMBL-publication-project


Repository of the ML models and algorithms built for the EMBL publication project

Clone

git clone https://github.com/0AlphaZero0/ML-EMBL-publication-project.git

Introduction

This prototype detects every EMBL paper in a list of PMIDs. An EMBL paper is a paper with at least one affiliation to either an EMBL site or an EMBL partnership.

There are currently 6 sites and 2 partnerships:

  • Australia (partnership)
  • Barcelona
  • Hinxton (EMBL-EBI Cambridge)
  • Grenoble
  • Hamburg
  • Heidelberg
  • Nordic (partnership)
  • Rome

The prototype uses two machine learning models and two vectorizers built during the FREYA project. The work was carried out for deliverable 4.6.

The whole project is described in Confluence.

Running

As described above, the prototype needs a list of PMIDs. The list can come from a file provided by the user or from a string written directly in the script. Results are organized by search: each new search requires a .csv or .txt file containing PMIDs, and each search has its own folder in the searches folder. The results are a set of files, one for each site and one for all EMBL PMIDs detected. For each site, the file is a .csv table like the following:

PMIDs     EMBL  Member states  Worldwide  Partnership
30537516  TRUE  TRUE           TRUE       TRUE
30496853  TRUE  TRUE           FALSE      FALSE
29330484  TRUE  FALSE          TRUE       FALSE
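A per-site result table of this shape can be loaded with the standard library. The snippet below is a minimal sketch, assuming a comma-delimited file whose header matches the column names shown above; the delimiter, the exact header spelling, and the helper name `load_site_results` are assumptions, not taken from the script.

```python
import csv
import io

# Hypothetical sample mirroring the table above. The real delimiter and
# header spelling in the generated .csv files are assumptions.
SAMPLE = """PMIDs,EMBL,Member states,Worldwide,Partnership
30537516,TRUE,TRUE,TRUE,TRUE
30496853,TRUE,TRUE,FALSE,FALSE
29330484,TRUE,FALSE,TRUE,FALSE
"""


def load_site_results(handle):
    """Parse a per-site result table into {pmid: {column: bool}}."""
    results = {}
    for row in csv.DictReader(handle):
        pmid = row.pop("PMIDs")
        # Convert the TRUE/FALSE strings into Python booleans.
        results[pmid] = {
            name: value.strip().upper() == "TRUE" for name, value in row.items()
        }
    return results


results = load_site_results(io.StringIO(SAMPLE))
# Keep only the PMIDs flagged in the "Worldwide" column.
worldwide = [pmid for pmid, flags in results.items() if flags["Worldwide"]]
print(worldwide)
```

To read a real result file, pass `open(path, newline="")` instead of the in-memory sample.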

To run the prototype, use the following command:

python .\detect_EMBL.py

Three variables in the script must be set to run your search:

search_name="test"
search_file="test_pmid_EPMC.txt"
directory="./searches/"+search_name+"/"

search_name is a name of your choosing; it is also the name of the search's directory inside the searches directory. search_file is the file in that directory that contains the PMIDs you want to process.
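To illustrate how these variables fit together, here is a minimal sketch that builds the search directory and reads the PMIDs back from the search file. It assumes one PMID per line; the helper `read_pmids` is hypothetical and is not a function of detect_EMBL.py. A temporary directory stands in for the repository root so the example runs anywhere.

```python
import tempfile
from pathlib import Path


def read_pmids(directory, search_file):
    """Return the PMIDs found in the search file, assuming one per line.

    Blank lines and non-numeric lines (for example a .csv header) are skipped.
    """
    path = Path(directory) / search_file
    pmids = []
    for line in path.read_text(encoding="utf-8").splitlines():
        token = line.strip()
        if token.isdigit():
            pmids.append(token)
    return pmids


# Demo: recreate the ./searches/<search_name>/ layout in a temporary folder.
with tempfile.TemporaryDirectory() as root:
    search_name = "test"
    search_file = "test_pmid_EPMC.txt"
    directory = Path(root) / "searches" / search_name
    directory.mkdir(parents=True)
    (directory / search_file).write_text(
        "30537516\n30496853\n\n29330484\n", encoding="utf-8"
    )
    demo_pmids = read_pmids(directory, search_file)

print(demo_pmids)
```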

The algorithm uses multiprocessing so that it can process large numbers of PMIDs. The machine it runs on may therefore slow down while a search is in progress.
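The README does not show how the script sets up its worker processes. As a sketch of one way to keep the machine responsive, the example below caps the pool size so that at least one core stays free. `classify_pmid` is a hypothetical placeholder for the per-PMID work, and `pick_worker_count` and `classify_all` are illustrative names only.

```python
import os
from multiprocessing import Pool


def classify_pmid(pmid):
    """Hypothetical stand-in for the per-PMID work (fetch, then classify)."""
    return pmid, pmid.isdigit()  # placeholder result, not a real prediction


def pick_worker_count(reserved=1):
    """Use all cores except `reserved`, but always at least one worker."""
    return max(1, (os.cpu_count() or 1) - reserved)


def classify_all(pmids, processes=None):
    """Map classify_pmid over the PMIDs with a bounded process pool."""
    with Pool(processes=processes or pick_worker_count()) as pool:
        return dict(pool.map(classify_pmid, pmids, chunksize=64))


if __name__ == "__main__":
    # The guard is required on platforms that spawn worker processes.
    print(classify_all(["30537516", "30496853", "29330484"], processes=2))
```

Lowering the worker count trades throughput for responsiveness, which may be preferable on a shared workstation.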

Details

The prototype relies on two algorithms:

is_EMBL

This algorithm takes an affiliation string and returns a dictionary with prediction scores and the methods of prediction. It combines exact matches with predictions made either on the whole string or on sub-parts of the string.
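The control flow described above (exact match first, then a prediction on the whole string, then on sub-parts) can be sketched as follows. This is not the project's is_EMBL: the phrase list, the threshold, the dictionary keys, and the keyword-density scorer `toy_score`, which stands in for the trained models and vectorizers, are all assumptions made for illustration.

```python
import re

# Hypothetical exact-match phrases; the real list is not given in this README.
EXACT_PHRASES = ("european molecular biology laboratory", "embl")
KEYWORDS = {"european", "molecular", "biology", "laboratory"}


def toy_score(text):
    """Hypothetical stand-in for model + vectorizer: keyword density in [0, 1]."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    return sum(token in KEYWORDS for token in tokens) / len(tokens)


def is_embl_sketch(affiliation, threshold=0.6):
    """Return {"score": float, "method": str} for one affiliation string."""
    lowered = affiliation.lower()
    # 1. Exact match on a known phrase.
    for phrase in EXACT_PHRASES:
        if re.search(r"\b" + re.escape(phrase) + r"\b", lowered):
            return {"score": 1.0, "method": "exact_match"}
    # 2. Prediction on the whole string.
    whole = toy_score(affiliation)
    if whole >= threshold:
        return {"score": whole, "method": "whole_string"}
    # 3. Prediction on sub-parts, split on commas and semicolons.
    parts = [part.strip() for part in re.split(r"[,;]", affiliation) if part.strip()]
    best = max((toy_score(part) for part in parts), default=0.0)
    if best > whole:
        return {"score": best, "method": "sub_string"}
    return {"score": whole, "method": "whole_string"}


print(is_embl_sketch("EMBL Heidelberg, Germany"))
print(is_embl_sketch("Europ. Molecular Biology Laboratory, Meyerhofstrasse 1, 69117 Heidelberg, Germany"))
```

Scoring sub-parts helps when a long address dilutes the signal of the one segment that names the institution, as in the second call.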

get_geoloc_from

This algorithm takes an affiliation string and returns a dictionary with the geolocation information found in that string. It is not the best way to extract geolocation from a string, so it is one of the first places to look when trying to improve EMBL detection.
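For orientation only, a very small gazetteer lookup shows the kind of input and output involved. It is not the project's get_geoloc_from: the lookup table covers only the site cities listed in the introduction, and the `city` and `country` keys of the returned dictionary are assumptions.

```python
import re

# Hypothetical gazetteer limited to the site cities named in this README.
GAZETTEER = {
    "barcelona": "Spain",
    "hinxton": "United Kingdom",
    "grenoble": "France",
    "hamburg": "Germany",
    "heidelberg": "Germany",
    "rome": "Italy",
}


def get_geoloc_sketch(affiliation):
    """Return the first known city in the string and its country, or Nones."""
    for token in re.findall(r"[a-z]+", affiliation.lower()):
        if token in GAZETTEER:
            return {"city": token.title(), "country": GAZETTEER[token]}
    return {"city": None, "country": None}


print(get_geoloc_sketch("EMBL, Meyerhofstrasse 1, 69117 Heidelberg, Germany"))
```

A lookup like this misses multi-word place names, spelling variants, and cities outside the table, which is the kind of limitation the paragraph above points to.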
