word2mat

Word2Mat is a framework that learns sentence embeddings in a CBOW-word2vec style, but where the words and sentences are represented as matrices. Details of this method and results can be found in our ICLR paper.

Dependencies

  • Python3
  • PyTorch >= 0.4 with CUDA support
  • NLTK >= 3

Set up the Python 3 environment

Please install the python3 dependencies in your environment:

virtualenv -p python3 venv && source venv/bin/activate
pip install -r requirements.txt
python3 -c "import nltk; nltk.download('punkt')"

Download training data

In order to reproduce the results from our paper, which were trained on the UMBC corpus, download the UMBC corpus, extract the tar.gz file, and run the extract_umbc.py script in the following way:

python extract_umbc.py umbc_corpus/webbase_all <path_to_store_sentences>

This stores the sentences from the UMBC corpus in a format that is usable by our code: Each line in the resulting file contains a single sentence, whose (already pre-processed) tokens are separated by a whitespace character.
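
The one-sentence-per-line format described above can be consumed with a few lines of Python. This is a minimal sketch, not the repository's own loader: the function name `read_sentences` and the two sample sentences are made up for illustration.

```python
import io
from typing import Iterable, Iterator, List


def read_sentences(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield one list of tokens per non-empty line."""
    for line in lines:
        tokens = line.split()  # split on any run of whitespace
        if tokens:             # skip blank lines
            yield tokens


# Hypothetical sample standing in for the file written by extract_umbc.py.
sample = io.StringIO("the cat sat on the mat .\nword2mat learns matrices .\n")
sentences = list(read_sentences(sample))
print(len(sentences), sentences[0][:3])
```

For a real file, pass an open file handle instead of the `io.StringIO` sample.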

Running the experiments

Note: After further experiments, we observed that terminating training based on the validation loss produces unreliable results because of the relatively high variance in the validation loss. Hence, we recommend using the training loss as the stopping criterion, which is more stable.
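
A patience-based stopping rule on the training loss could look roughly like the sketch below. It is an assumption about the general technique, not a copy of what train_cbow.py does: the class name `PatienceStopper` and the loss values are hypothetical, and the rule shown (stop after `patience` consecutive checks without a new best loss) is only one common reading of a `--patience` setting.

```python
class PatienceStopper:
    """Stop once the monitored loss has not improved for `patience` checks."""

    def __init__(self, patience: int) -> None:
        self.patience = patience
        self.best = float("inf")
        self.bad_checks = 0

    def should_stop(self, loss: float) -> bool:
        if loss < self.best:
            # New best loss: remember it and reset the counter.
            self.best = loss
            self.bad_checks = 0
        else:
            self.bad_checks += 1
        return self.bad_checks >= self.patience


# Made-up training losses, one per evaluation step.
train_losses = [2.0, 1.5, 1.2, 1.3, 1.25, 1.21, 1.1]
stopper = PatienceStopper(patience=3)
stopped_at = None
for step, loss in enumerate(train_losses):
    if stopper.should_stop(loss):
        stopped_at = step
        break
print(stopped_at, stopper.best)
```

The same class works with either loss; only the value fed to `should_stop` changes.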

The results below were obtained with this stopping criterion and therefore differ slightly from the results reported in the ICLR paper. However, the conclusions remain the same: CMOW is much better than CBOW at capturing linguistic properties, with the exception of WordContent. Since CBOW is better at WordContent, it is superior on almost all downstream tasks except TREC. The Hybrid model retains the capabilities of both models: on all tasks it is either extremely close to the better of CBOW and CMOW, or better than both.

Probing tasks: All scores denote accuracy.

| Model | Depth | BigramShift | SubjNumber | Tense | CoordinationInversion | Length | ObjNumber | TopConstituents | OddManOut | WordContent |
|---|---|---|---|---|---|---|---|---|---|---|
| CBOW | 32.73 | 49.65 | 79.65 | 79.46 | 53.78 | 75.69 | 79.00 | 72.26 | 49.64 | 89.11 |
| CMOW | 34.40 | 72.44 | 82.08 | 80.32 | 62.05 | 82.93 | 79.70 | 74.25 | 51.33 | 65.15 |
| Hybrid | 35.38 | 71.22 | 81.45 | 80.83 | 59.17 | 87.00 | 79.37 | 72.88 | 50.53 | 86.97 |

Supervised downstream tasks: For STSBenchmark and SICKRelatedness, the score is the Spearman correlation coefficient. For all other tasks, the score is accuracy.

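| Model | SNLI | SUBJ | CR | MR | MPQA | TREC | SICKEntailment | SST2 | SST5 | MRPC | STSBenchmark | SICKRelatedness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CBOW | 67.76 | 90.45 | 79.76 | 74.32 | 87.23 | 84.4 | 79.58 | 78.14 | 41.72 | 72.17 | 0.619 | 0.721 |
| CMOW | 64.77 | 87.11 | 74.60 | 71.42 | 87.55 | 88.0 | 76.90 | 76.77 | 40.18 | 70.61 | 0.576 | 0.705 |
| Hybrid | 67.59 | 90.26 | 79.60 | 74.10 | 87.38 | 89.2 | 78.69 | 77.87 | 41.58 | 71.94 | 0.613 | 0.718 |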

Unsupervised downstream tasks: The score denotes Spearman correlation coefficient.

| Model | STS12 | STS13 | STS14 | STS15 | STS16 |
|---|---|---|---|---|---|
| CBOW | 0.458 | 0.497 | 0.556 | 0.637 | 0.630 |
| CMOW | 0.432 | 0.334 | 0.403 | 0.471 | 0.529 |
| Hybrid | 0.472 | 0.476 | 0.530 | 0.621 | 0.613 |
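
For readers unfamiliar with the metric, the sketch below shows how a Spearman correlation between predicted similarities and gold similarity scores can be computed. It assumes the usual unsupervised STS setup, in which the prediction for a sentence pair is the cosine similarity of the two sentence embeddings. The toy vectors and gold scores are made up, and the helper names are illustrative. In practice SentEval computes these scores.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    # Spearman correlation is the Pearson correlation of the ranks.
    return pearson(average_ranks(x), average_ranks(y))


# Toy sentence-embedding pairs and made-up gold similarity scores.
pairs = [([1.0, 0.0], [1.0, 0.1]), ([1.0, 0.0], [0.5, 0.5]), ([1.0, 0.0], [0.0, 1.0])]
gold = [4.8, 2.5, 0.2]
predicted = [cosine(u, v) for u, v in pairs]
rho = spearman(predicted, gold)
print(round(rho, 3))
```

Because only the ranking matters, the predicted similarities and the gold scores do not need to be on the same scale.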

Train CBOW, CMOW, and CBOW-CMOW hybrid model

To train a 784-dimensional CBOW model, run the following:

python train_cbow.py --w2m_type cbow --batch_size=1024 --outputdir=<path_to_save_model> --optimizer adam,lr=0.0003 --max_words=30000 --n_epochs=1000 --n_negs=20 --validation_frequency=1000 --mode=random --num_samples_per_item=30 --patience 10 --downstream_eval full --outputmodelname mode w2m_type word_emb_dim --validation_fraction=0.0001 --context_size=5 --word_emb_dim 784 --temp_path <some_directory_for_temp_files> --dataset_path=<path_to_parsed_UMBC_dataset> --num_workers 2 --output_file <path_to_output.csv> --num_docs 134442680 --stop_criterion train_loss

For CMOW:

python train_cbow.py --w2m_type cmow --batch_size=1024 --outputdir=<path_to_save_model> --optimizer adam,lr=0.0003 --max_words=30000 --n_epochs=1000 --n_negs=20 --validation_frequency=1000 --mode=random --num_samples_per_item=30 --patience 10 --downstream_eval full --outputmodelname mode w2m_type word_emb_dim --validation_fraction=0.0001 --context_size=5 --word_emb_dim 784 --temp_path <some_directory_for_temp_files> --dataset_path=<path_to_parsed_UMBC_dataset> --num_workers 2 --output_file <path_to_output.csv> --num_docs 134442680 --stop_criterion train_loss --initialization identity

And the CBOW-CMOW Hybrid:

python train_cbow.py --w2m_type hybrid --batch_size=1024 --outputdir=<path_to_save_model> --optimizer adam,lr=0.0003 --max_words=30000 --n_epochs=1000 --n_negs=20 --validation_frequency=1000 --mode=random --num_samples_per_item=30 --patience 10 --downstream_eval full --outputmodelname mode w2m_type word_emb_dim --validation_fraction=0.0001 --context_size=5 --word_emb_dim 400 --temp_path <some_directory_for_temp_files> --dataset_path=<path_to_parsed_UMBC_dataset> --num_workers 2 --output_file <path_to_output.csv> --num_docs 134442680 --stop_criterion train_loss --initialization identity
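
The three --w2m_type values differ in how word representations are composed into a sentence representation. As described in the paper, CBOW sums word vectors, CMOW multiplies word matrices in sentence order, and the hybrid concatenates the two results. The sketch below illustrates this with plain Python lists and toy 2x2 matrices. All function names and embedding values are made up for illustration, and the actual encoder in word2mat.py is written in PyTorch and may differ in detail.

```python
from typing import Dict, List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def flatten(m: Matrix) -> List[float]:
    return [x for row in m for x in row]


def cbow(vectors: List[List[float]]) -> List[float]:
    """Order-insensitive: element-wise sum of the word vectors."""
    return [sum(column) for column in zip(*vectors)]


def cmow(matrices: List[Matrix]) -> List[float]:
    """Order-sensitive: product of the word matrices, then flattened."""
    result = matrices[0]
    for m in matrices[1:]:
        result = matmul(result, m)
    return flatten(result)


# Hypothetical toy embeddings: a 4-dim vector and a 2x2 matrix per word.
word_vectors: Dict[str, List[float]] = {
    "a": [1.0, 0.0, 0.0, 1.0],
    "b": [0.0, 2.0, 1.0, 0.0],
}
word_matrices: Dict[str, Matrix] = {
    "a": [[1.0, 1.0], [0.0, 1.0]],
    "b": [[1.0, 0.0], [1.0, 1.0]],
}


def hybrid(tokens: List[str]) -> List[float]:
    """CBOW part first, CMOW part second."""
    return cbow([word_vectors[t] for t in tokens]) + cmow(
        [word_matrices[t] for t in tokens]
    )


ab = hybrid(["a", "b"])
ba = hybrid(["b", "a"])
print(ab[:4] == ba[:4], ab[4:] == ba[4:])
```

Swapping the word order leaves the CBOW half unchanged but changes the CMOW half, because matrix multiplication is not commutative. With this layout, a --word_emb_dim of 400 per component yields the 800-dimensional hybrid embedding mentioned in the next section.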

Evaluate components of hybrid model

In the paper, we show that jointly training the individual CBOW and CMOW components emphasizes their individual strengths. To assess the performance of the CBOW component, restrict the final embedding representation to the first half of the representations from the HybridEncoder (--included_features 0 400 in an 800-dimensional Hybrid encoder). To evaluate the CMOW component, restrict it to the second half (--included_features 400 800). For example, to evaluate the CMOW component, run:

python evaluate_word2mat.py --encoders <path_to_hybrid.encoder_file> --word_vocab <path_to_.vocab_file> --included_features 400 800 --outputdir <temp_path_to_save_encoder> --outputmodelname hybrid_constituent --downstream_eval full

Here, the 'encoder' and 'word_vocab' files are the ones saved in 'outputdir' after training the models.
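
Conceptually, --included_features selects a contiguous slice of the hybrid embedding. The sketch below assumes the two numbers form a half-open range [start, end), which matches the 0 400 / 400 800 split given above. The helper name `select_features` and the stand-in embedding are hypothetical.

```python
from typing import List, Sequence


def select_features(embedding: Sequence[float], start: int, end: int) -> List[float]:
    """Keep only the dimensions in the half-open range [start, end)."""
    return list(embedding[start:end])


# Stand-in for an 800-dimensional hybrid sentence embedding.
hybrid_embedding = [float(i) for i in range(800)]

cbow_part = select_features(hybrid_embedding, 0, 400)    # first half: CBOW
cmow_part = select_features(hybrid_embedding, 400, 800)  # second half: CMOW
print(len(cbow_part), len(cmow_part))
```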

Files

  • train_cbow.py Main training executable. Run python train_cbow.py --help for an overview of the training parameters.
  • cbow.py Contains the data preparation code as well as the neural architecture for CBOW, except the encoder.
  • word2mat.py Code for the word2mat encoder.
  • wrap_evaluation.py Wrapper script for SentEval to automatically evaluate the encoder after training.
  • evaluate_word2mat.py Script for evaluating sub-components of the hybrid encoder with SentEval.
  • mutils.py Helpers for saving results, hyperparameter optimization, and other utilities.

Reference

Please cite our ICLR paper [1] to reference our work or code.

CBOW Is Not All You Need: Combining CBOW with the Compositional Matrix Space Model (ICLR 2019)

[1] Mai, F., Galke, L., & Scherp, A. (2019). CBOW Is Not All You Need: Combining CBOW with the Compositional Matrix Space Model. In International Conference on Learning Representations (ICLR).

@inproceedings{mai2018cbow,
title={{CBOW} Is Not All You Need: Combining {CBOW} with the Compositional Matrix Space Model},
author={Florian Mai and Lukas Galke and Ansgar Scherp},
booktitle={International Conference on Learning Representations},
year={2019},
url={https://openreview.net/forum?id=H1MgjoR9tQ},
}

word2mat's Issues

Scores in the paper

I was going through your paper and was curious how the scores are calculated. Is it Pearson's r * 100?

Padded zeroes are not removed in CBOW vector

Is there a specific reason why the padded zeroes are included in the CBOW vector? The lookup table learns an embedding for index zero, and this embedding is added to the CBOW vector for every padded position and every missing word.

Code: word2mat.py, line 96
elif self.w2m_type == "cbow":
    cur_emb = torch.sum(word_matrices, 1)
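
To make the concern concrete, the sketch below compares an unmasked sum with a sum that skips the padding index. It is an illustration in plain Python with a made-up embedding table, not the repository's code and not a statement of how the maintainers resolved the issue.

```python
from typing import List, Optional, Sequence


def sum_embeddings(
    token_ids: Sequence[int],
    table: Sequence[Sequence[float]],
    pad_id: Optional[int] = None,
) -> List[float]:
    """Sum the embedding rows of token_ids, skipping pad_id if given."""
    total = [0.0] * len(table[0])
    for token_id in token_ids:
        if token_id == pad_id:
            continue
        for d, value in enumerate(table[token_id]):
            total[d] += value
    return total


# Hypothetical table: row 0 is the padding row with a learned, non-zero value.
table = [[0.5, -0.5], [1.0, 2.0], [3.0, 4.0]]
padded_sentence = [1, 2, 0, 0]  # two real tokens followed by two pads

with_padding = sum_embeddings(padded_sentence, table)
masked = sum_embeddings(padded_sentence, table, pad_id=0)
print(with_padding, masked)
```

In PyTorch, torch.nn.Embedding accepts a padding_idx argument that keeps the corresponding row out of gradient updates (it is initialized to zeros by default), which is another way to keep padding from contributing to the sum.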
