ComplexEmbeddings

This project aims to generate complex word embeddings for out-of-vocabulary (OOV) entities. Once complete, the package will generate pre-trained embeddings, improve them by generating embeddings on the fly, evaluate benchmarks using the pre-trained embeddings, and run the same evaluations on the improved embeddings.

Requirements

All libraries used in this project are listed in requirements.txt. To install them, run:

$ pip install --upgrade pip
$ pip install -r requirements.txt

Downloading and cleaning wiki dump

The first 1 billion bytes of English Wikipedia (enwik9) can be downloaded from Matt Mahoney's website:

$ mkdir data
$ wget -c http://mattmahoney.net/dc/enwik9.zip -P data
$ unzip data/enwik9.zip -d data

This is a raw Wikipedia dump and needs to be cleaned because it contains a lot of HTML/XML markup. There are two ways to pre-process it. Here, I am using the wikifil.pl script bundled with FastText (originally developed by Matt Mahoney and available on his website).

$ perl src/package/wikifil.pl data/enwik9 > data/fil9
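Before training, it can help to confirm that the cleaning step worked. The following is a minimal sketch, assuming the cleaned output is restricted to lowercase letters a-z and spaces (a trailing newline is tolerated). The function name looks_clean and the sample size are hypothetical and not part of this repository.

```python
import string

# Characters expected in the cleaned dump (assumption: a-z, space, optional newline).
ALLOWED = set(string.ascii_lowercase + " \n")


def looks_clean(path, sample_chars=1_000_000):
    """Return True if the first sample_chars characters contain only allowed characters."""
    with open(path, "r", encoding="ascii", errors="replace") as handle:
        sample = handle.read(sample_chars)
    # An empty file is treated as not clean, since the cleaning step produced nothing.
    return bool(sample) and set(sample) <= ALLOWED
```

Calling looks_clean("data/fil9") should return True for a properly cleaned dump, and False for the raw data/enwik9 file, which still contains XML tags.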

Pre-Training and Evaluating Analogy using Google Dataset

The script pre-train.py takes the following arguments:

  • Files
    • Input File: Clean wiki dump
    • Output File: Saved model
  • Model Hyperparameters
    • Vector Size, -s: Size of the embedding vectors
    • Skipgram, -sg: Whether the model uses skip-gram or CBOW
    • Loss Function, -hs: Whether the loss function is Hierarchical Softmax or Negative Sampling
    • Epochs, -e: Number of training epochs

$ mkdir model
$ python src/pre-train.py -i data/fil9 -o model/pre_wiki -s 300 -sg 1 -hs 1 -e 5
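
For readers who want to see how these flags could map onto a training call, here is a hedged sketch. It is not the actual pre-train.py. It assumes gensim 4.x Word2Vec as the backend, which this README does not confirm, and it reads the single-line cleaned dump with gensim's Text8Corpus. The dest names, the defaults (copied from the example command), and the 0/1 encoding of -sg and -hs are assumptions.

```python
import argparse


def build_parser():
    """Build a parser for the flags listed above (dest names and defaults are assumptions)."""
    parser = argparse.ArgumentParser(description="Pre-train word embeddings")
    parser.add_argument("-i", dest="input", required=True, help="clean wiki dump")
    parser.add_argument("-o", dest="output", required=True, help="path of the saved model")
    parser.add_argument("-s", dest="size", type=int, default=300, help="embedding size")
    parser.add_argument("-sg", dest="skipgram", type=int, choices=(0, 1), default=1,
                        help="1 for skip-gram, 0 for CBOW (assumed encoding)")
    parser.add_argument("-hs", dest="hs", type=int, choices=(0, 1), default=1,
                        help="1 for hierarchical softmax, 0 for negative sampling (assumed)")
    parser.add_argument("-e", dest="epochs", type=int, default=5, help="number of epochs")
    return parser


def train(args):
    """Train and save a model. Assumes gensim >= 4.0; the real script may differ."""
    # Imported lazily so the parser can be used without gensim installed.
    from gensim.models import Word2Vec
    from gensim.models.word2vec import Text8Corpus

    model = Word2Vec(
        sentences=Text8Corpus(args.input),  # streams the one-line corpus in chunks
        vector_size=args.size,
        sg=args.skipgram,
        hs=args.hs,
        negative=0 if args.hs else 5,  # disable negative sampling when -hs is 1
        epochs=args.epochs,
    )
    model.save(args.output)
```

A script built this way would end with train(build_parser().parse_args()), so the command above would train a 300-dimensional skip-gram model with hierarchical softmax for 5 epochs.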

Next, we check how these pre-trained embeddings perform on the Google Analogy Task using analogy.py.
The script does not yet evaluate the model on the entire dataset; that update is still to be made.

$ python analogy.py -i data/questions-words.txt -m model/pre_wiki
Question: high is to higher as great is to ?
Answer: greater
Predicted: greater
Question: glendale is to arizona as akron is to ?
Answer: ohio
Predicted: ohio
Question: ethical is to unethical as comfortable is to ?
Answer: uncomfortable
Predicted: comfortably
Question: netherlands is to dutch as brazil is to ?
Answer: brazilian
Predicted: brazilian
Question: free is to freely as happy is to ?
Answer: happily
Predicted: happily
Question: luanda is to angola as monrovia is to ?
Answer: liberia
Predicted: liberia
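
The full-dataset evaluation mentioned above could be sketched as follows. This is not the code in analogy.py, only an illustration under assumptions: each question "a is to b as c is to ?" is answered with the word whose vector has the highest cosine similarity to b - a + c, excluding the three question words. Lines starting with ":" in questions-words.txt are section headers and are skipped. The vectors argument is a hypothetical plain dict mapping words to lists of floats, and words are lowercased to match the cleaned corpus.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def predict(vectors, a, b, c):
    """Return the word closest to b - a + c, excluding the question words."""
    target = [y - x + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


def read_questions(lines):
    """Yield (a, b, c, expected) tuples, skipping ':' section headers and blank lines."""
    for line in lines:
        parts = line.lower().split()
        if len(parts) == 4 and not line.lstrip().startswith(":"):
            yield tuple(parts)


def accuracy(vectors, questions):
    """Fraction of answerable questions predicted correctly (OOV questions are skipped)."""
    correct = total = 0
    for a, b, c, expected in questions:
        if not all(w in vectors for w in (a, b, c, expected)):
            continue
        total += 1
        correct += predict(vectors, a, b, c) == expected
    return correct / total if total else 0.0
```

With a trained model, vectors would be filled from the model's vocabulary and read_questions would be fed the lines of data/questions-words.txt. Scanning the whole vocabulary in pure Python is slow for large models, so a vectorized implementation is preferable in practice.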

Contributors

bharat-suri
