Giter VIP home page Giter VIP logo

fasttext_torch's Introduction

This is an Torch implementation of fasttext based on A. Joulin's paper Bag of Tricks for Efficient Text Classification.

Author: Junwei Pan

Email: [email protected]

Requirements

This code is written in Lua and requires Torch. If you're on Ubuntu, installing Torch in your home directory may look something like:

$ curl -s https://raw.githubusercontent.com/torch/ezinstall/master/install-deps | bash
$ git clone https://github.com/torch/distro.git ~/torch --recursive
$ cd ~/torch
$ ./install.sh      # and enter "yes" at the end to modify your bashrc
$ source ~/.bashrc

This code also require the nn package:

$ luarocks install nn

Usage

First down load the text classification data mentioned in Xiang Zhang's paper: Character-level Convolutional Networks for Text Classification. We use the ag_news_csv dataset for training and evaluation.

Then run the following commands to train and evaluate the fasttext model:

$ th main.lua -corpus_train data/ag_news_csv/train.csv -corpus_test data/ag_news_csv/test.csv -dim 10 -minfreq 10 -stream 0 -epochs 5 -suffix 1 -n_classes 4 -n_gram 1 -decay 0 -lr 0.5

If the dataset is too large to fit in the memory, try to use the paratemer -stream 1.

The trained model can get an accuracy of 90.93% on the g_news_csv dataset using the above configuration.

Parameters

-corpus_train: path of the training data

-corpus_test: path of the testing data

-minfreq: only those words with frequence higher than this will be used as features, default 10

-dim: the embedding dimension, default 10

-lr: learning rate, default 0.5

-min_lr: the minimal learning rate, default 0.001

-decay: whether to decay learning rate, 1 for decay, 0 for no decay, default 0

-epochs: number of epochs to go through the training data, default 5

-stream: whether to stream the data: 1 for streaming, 0 for store all data in memory, default 0

-suffix: suffix of the model

-n_classes: number of classification categories

-n_gram: 1 for unigram, 2 for bigram, 3 for trigram, default 1

-title: whether use the title to generate features, default 1

-description: whether use the description to generate features, default 1

To be done

  1. Support hashtrick
  2. Efficiency improvement

Acknowledgements

This code is based on the word2vec_torch project, which extends Yoon Kim's word2vec_torch by implementing the Continuous Bag-of-words Model.

fasttext_torch's People

Contributors

junwei-pan avatar

Watchers

James Cloos avatar Elon avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.