NEDforNoisyText

Named Entity Disambiguation for Noisy Text

This repository contains code for reproducing the experiments in our "Named Entity Disambiguation for Noisy Text" submission to CoNLL 2017. Some manual steps are required to set up the data for the experiments:

  • Set up a MySQL schema with the page and redirect tables from a Wikipedia dump.
  • Place the Wikipedia pages-articles XML file at data/enwiki/pages-articles.xml.
  • Processing the Wikilinks files from UMass requires an installation of Scala.
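
The file and Scala prerequisites above can be checked before starting the long-running setup scripts. The following is a minimal sketch, not part of the repository: the function name `missing_prerequisites` is hypothetical, it assumes Python 3, and it assumes the Scala launcher is available on the PATH as `scala`. It does not check the MySQL schema.

```python
import os
import shutil


def missing_prerequisites(root="."):
    """Return descriptions of manual setup items that appear to be missing.

    `root` is assumed to be the repository's main directory.
    """
    missing = []

    # The README asks for the dump at data/enwiki/pages-articles.xml.
    dump = os.path.join(root, "data", "enwiki", "pages-articles.xml")
    if not os.path.isfile(dump):
        missing.append("Wikipedia dump not found at " + dump)

    # Assumption: the Scala installation exposes a `scala` executable.
    if shutil.which("scala") is None:
        missing.append("scala executable not found on PATH")

    return missing


if __name__ == "__main__":
    for problem in missing_prerequisites():
        print(problem)
```

Running the file from the repository root prints one line per missing item and prints nothing when both checks pass.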

Python libraries required for the project: keras, tensorflow, numpy, nltk, json, unicodedata, unidecode, mysql.connector, urllib, matplotlib, zipfile, ujson, pandas, urlparse, sklearn.
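
As a quick way to see which of these modules are importable in the current environment, here is a minimal sketch. It is not part of the repository, the helper name `missing_modules` is hypothetical, and it assumes Python 3. The README does not state which Python version the project targets; note that `urlparse` is a Python 2 standard-library module (its contents live in `urllib.parse` on Python 3), so it will normally be reported as missing on Python 3.

```python
import importlib.util

# Module names exactly as listed in the README.
REQUIRED = [
    "keras", "tensorflow", "numpy", "nltk", "json", "unicodedata",
    "unidecode", "mysql.connector", "urllib", "matplotlib", "zipfile",
    "ujson", "pandas", "urlparse", "sklearn",
]


def missing_modules(names):
    """Return the module names from `names` that cannot be located."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised for dotted names whose parent package is absent,
            # e.g. "mysql.connector" when "mysql" is not installed.
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    for name in missing_modules(REQUIRED):
        print("missing:", name)
```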

For running the CoNLL test

Setting up the data

  • Run ./setup_data.sh
  • To set up data for the WikilinksNED evaluation, run ./setup_wikilinks.sh <db_user> <db_pass> <db_schema> <db_ip>
  • To set up data for the CoNLL evaluation, run ./setup_conll.sh <db_user> <db_pass> <db_schema> <db_ip>

REMARK: As of now, you must use the script at https://github.com/NoamGit/Wiki2JSON/ to extract Wikilinks from its original format (Thrift) into a JSON format that is easier to work with. The resulting files should be stored at data/wikilinks/unprocessed. We are working to automate this process as soon as possible.

Running evaluations

  • To run the WikilinksNED evaluation, run ./evaluateWikilinksNED.sh <db_user> <db_pass> <db_schema> <db_ip>
  • To run the CoNLL evaluation, run ./pretrainOnWikipedia.sh <db_user> <db_pass> <db_schema> <db_ip> and then ./evaluateCoNLL.sh <db_user> <db_pass> <db_schema> <db_ip>
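
The order of the evaluation scripts matters for CoNLL, where pretraining comes first. Below is a minimal sketch of a wrapper that runs the scripts in the documented order and stops at the first failure. It is not part of the repository: the function names and the experiment keys "wikilinks" and "conll" are hypothetical, and it assumes Python 3 and that it is run from the main directory where the scripts live.

```python
import subprocess

# Script order per experiment, as described in the README.
SCRIPTS = {
    "wikilinks": ["./evaluateWikilinksNED.sh"],
    "conll": ["./pretrainOnWikipedia.sh", "./evaluateCoNLL.sh"],
}


def evaluation_commands(experiment, db_user, db_pass, db_schema, db_ip):
    """Build the argument lists for one experiment, in run order."""
    db_args = [db_user, db_pass, db_schema, db_ip]
    return [[script] + db_args for script in SCRIPTS[experiment]]


def run_evaluation(experiment, db_user, db_pass, db_schema, db_ip):
    """Run each script in turn; check=True raises if a script fails."""
    for cmd in evaluation_commands(experiment, db_user, db_pass, db_schema, db_ip):
        subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems if the database password contains spaces or shell metacharacters.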

Results are written to the evaluation.txt file in the main directory.

NOTES:

  • Setting up data and running experiments takes a very long time.
  • After setting up data for both experiments, the data folder can reach 300+ GB.
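
Given the size of the data folder, it may be worth checking free disk space before starting. This is a minimal sketch, not part of the repository: the helper names are hypothetical, it assumes Python 3, and the 300 GB default is only a rough threshold taken from the note above, counted in decimal gigabytes.

```python
import shutil


def free_gb(path="."):
    """Free space on the filesystem holding `path`, in decimal gigabytes."""
    return shutil.disk_usage(path).free / 1e9


def has_room(path=".", required_gb=300):
    """True if at least `required_gb` gigabytes are free at `path`."""
    return free_gb(path) >= required_gb


if __name__ == "__main__":
    print("free space: %.1f GB" % free_gb("."))
    print("enough for the data folder:", has_room("."))
```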

This code is provided as-is. Running this code or any part of it is at your own risk. We do not take any responsibility for running any of the code or usage of any of the data. Much of this code was written at the same time as I was learning Keras, TF and Theano. It has undergone many changes and was used for extensive experimentation. It is therefore probably full of design flaws and redundancies.

Contributors

yotam-happy
