LAB 1: OpenFST Spell checker and familiarization with Word2vec

Examples

The examples folder contains the pre-lab examples we demonstrated in class.

Setup

To set up OpenFST on your machine, run:

bash install_openfst.sh

Leave OPENFST_VERSION=1.6.1 unchanged: later versions are not supported for this lab and contain breaking changes.

Install python dependencies with:

pip install -r requirements.txt

Fetch the NLTK Gutenberg corpus using the following script:

python scripts/fetch_gutenberg.py > data/gutenberg.txt

This script downloads and preprocesses the corpus.

Proposed code structure and provided resources

We propose the following structure to organize your code.

├── data                            # -> Train and test corpora
│   ├── spell_test.txt              # -> spell checker evaluation corpus
│   └── wiki.txt                    # -> Wikipedia word misspellings
├── fsts                            # -> Compiled FSTs and FST description files
├── install_openfst.sh              # -> OpenFST installation script
├── README.md                       # -> This file.
├── requirements.txt                # -> Python dependencies
├── scripts                         # -> Python and Bash scripts go here
│   ├── fetch_gutenberg.py          # -> Provided script to download the gutenberg corpus
│   ├── helpers.py                  # -> Provided helper functions
│   ├── mkfstinput.py               # -> Provided script to pass a word as input to the spell checker
│   ├── predict.sh                  # -> Provided script to run prediction for a word
│   ├── run_evaluation.py           # -> Provided script to run evaluation on the test corpus
│   ├── util.py                     # -> Stubs to fill in some of your utility functions. TODO
│   ├── w2v_sentiment_analysis.py   # -> Complete the code here for sentiment analysis using word2vec
│   ├── w2v_train.py                # -> Complete the code here to train a word2vec model on the gutenberg corpus
│   └── word_edits.sh               # -> Provided script to get the minimum edit distance edits between two words
└── vocab                           # -> Place your vocab and syms files here

We also suggest using the .fst suffix for FST description files and the .binfst suffix for compiled FSTs.

We recommend studying the provided code and running some basic examples before you start. This will give you a basic understanding of how to script OpenFST, and it will likely save you from reimplementing existing functionality.

Part 1: Spell checker using Finite state transducers

In this part you will implement a spell checker using finite-state transducers and the OpenFST library. The spell checker consists of two parts: a Levenshtein transducer and a word acceptor.
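
As a rough illustration of these two components, the sketch below generates both in OpenFST's text format: one arc per line (source state, destination state, input label, output label, optional weight) and a line holding only a state number to mark that state as final. The function names, the `<epsilon>` symbol name, and the uniform edit cost of 1 are assumptions made for this example, not the lab's required design. The acceptor here covers a single word, whereas a full lexicon acceptor would combine one such path per vocabulary word. Follow the handout for the actual weights and construction.

```python
EPS = "<epsilon>"  # assumed name of the epsilon symbol in your syms file


def levenshtein_fst(alphabet, cost=1.0):
    """Return the lines of a one-state edit-distance transducer description."""
    lines = []
    for a in alphabet:
        lines.append(f"0 0 {a} {a} 0")          # match: no cost
        lines.append(f"0 0 {a} {EPS} {cost}")   # deletion of a
        lines.append(f"0 0 {EPS} {a} {cost}")   # insertion of a
        for b in alphabet:
            if a != b:
                lines.append(f"0 0 {a} {b} {cost}")  # substitution a -> b
    lines.append("0")  # state 0 is both the start state and the final state
    return lines


def word_acceptor(word):
    """Return the lines of a linear FST that accepts exactly `word`.

    It is written as a transducer whose input and output labels are identical.
    """
    lines = [f"{i} {i + 1} {ch} {ch}" for i, ch in enumerate(word)]
    lines.append(str(len(word)))  # the last state is final
    return lines


if __name__ == "__main__":
    print("\n".join(word_acceptor("cat")))
```

Writing the returned lines to a .fst file gives a description that can then be compiled against your symbol table.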

Follow the instructions in the lab handouts (available through mycourses) to complete this part.

Spell checker evaluation

Once you have implemented a spell checker, e.g. fsts/MY_SPELL_CHECKER.binfst, you can evaluate it on the provided test set with the following script.

Run:

python scripts/run_evaluation.py fsts/MY_SPELL_CHECKER.binfst

The script will run the spell checker on the test set and print the model accuracy (percentage of misspelled words that are corrected appropriately).
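
The accuracy described above can be sketched in a few lines. This is not the code of scripts/run_evaluation.py. The `accuracy` function, the `correct` callback, and the toy word pairs are hypothetical and only illustrate how the metric is defined.

```python
def accuracy(pairs, correct):
    """Fraction of (misspelled, expected) pairs that `correct` fixes.

    `correct` is any callable mapping a misspelled word to a suggestion.
    """
    pairs = list(pairs)
    if not pairs:
        return 0.0
    hits = sum(1 for wrong, expected in pairs if correct(wrong) == expected)
    return hits / len(pairs)


# Toy stand-in for a spell checker: a fixed lookup table (made up for this example).
toy_corrections = {"teh": "the", "wrod": "word", "cta": "act"}
toy_pairs = [("teh", "the"), ("wrod", "word"), ("cta", "cat")]

if __name__ == "__main__":
    print(f"{100 * accuracy(toy_pairs, toy_corrections.get):.1f}%")
```

In the toy data two of the three words are corrected to the expected form, so the reported accuracy is two thirds.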

Part 2: Familiarization with word2vec and sentiment analysis

In this part you will use the Gutenberg corpus to train a word2vec model. You will find the most similar words, perform word analogies, and visualize the embeddings using https://projector.tensorflow.org/. Finally, you will use the embeddings you created to perform sentiment analysis on the IMDB movie reviews dataset.
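
To make "most similar words" and "word analogies" concrete, here is a minimal sketch that ranks words by cosine similarity and solves an analogy by vector arithmetic. The helper names and the two-dimensional toy vectors are invented for illustration. With a trained model, its learned vectors would take the place of the toy dictionary.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(vectors, query, topn=3, exclude=()):
    """Rank words by cosine similarity to the `query` vector."""
    skip = set(exclude)
    scored = [(w, cosine(query, v)) for w, v in vectors.items() if w not in skip]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:topn]


def analogy(vectors, a, b, c, topn=1):
    """Words closest to vec(b) - vec(a) + vec(c): 'a is to b as c is to ?'."""
    query = [y - x + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    return most_similar(vectors, query, topn, exclude=(a, b, c))


# Toy 2-d vectors, made up for illustration only (not trained embeddings).
toy = {
    "man": [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king": [3.0, 0.0],
    "queen": [3.0, 3.0],
    "apple": [-1.0, 0.5],
}

if __name__ == "__main__":
    print(analogy(toy, "man", "woman", "king"))
```

The three input words are excluded from the analogy candidates, since the query vector usually lies close to one of them.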

Follow the instructions in the lab handouts (available through mycourses) to complete this part.
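
For the sentiment analysis step, one common way to turn a review into a fixed-length feature vector is to average the embeddings of its words and feed the result to a classifier. The sketch below shows only the averaging. It is an assumption about a reasonable approach, not the method prescribed by the handout, and `review_vector` and the toy vectors are hypothetical.

```python
def review_vector(tokens, vectors, dim):
    """Average the vectors of known tokens; return a zero vector if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) iterates over the embedding dimensions column by column.
    return [sum(column) / len(known) for column in zip(*known)]


# Toy embeddings, made up for illustration only.
toy_vectors = {"good": [1.0, 0.0], "movie": [0.0, 1.0]}

if __name__ == "__main__":
    print(review_vector(["good", "movie", "unseenword"], toy_vectors, dim=2))
```

Out-of-vocabulary tokens are simply skipped, so a review made up entirely of unseen words maps to the zero vector.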

Contributors

olgabarlou, dimitrioskakouris
