DAZER

The TensorFlow implementation of our ACL 2018 paper:
A Deep Relevance Model for Zero-Shot Document Filtering, Chenliang Li, Wei Zhou, Feng Ji, Yu Duan, Haiqing Chen. Paper URL: http://aclweb.org/anthology/P18-1214

Requirements

  • Python 3.5
  • Tensorflow 1.2
  • Numpy
  • Traitlets

Guide To Use

Prepare your dataset: first, prepare your own data. See Data Preparation.

Configure: then, configure the model through the config file. Configurable parameters are listed under Configurations below.

See the example: sample.config

In addition, you need to change the zero-shot label settings in get_label.py.

(Make sure get_label.py and model.py are in the same directory.)

Training: pass the config file, training data, and validation data as

python model.py config-file \
    --train \
    --train_file <path to training data> \
    --validation_file <path to validation data> \
    --checkpoint_dir <directory to store/load model checkpoints> \
    --load_model <True|False>

where --load_model controls whether to continue training an existing model (True) or start a new one (False).

See example: sample-train.sh

Testing: pass the config file and testing data as

python model.py config-file \
    --test \
    --test_file <path to testing data> \
    --test_size <number of testing samples> \
    --checkpoint_dir <directory to load the trained model from> \
    --output_score_file <file to write document scores to>

Relevance scores will be output to output_score_file, one score per line, in the same order as test_file.
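Because scores come back one per line in input order, recovering which score belongs to which document is a simple zip. A minimal sketch (the helper name is illustrative, not part of the repo):

```python
def pair_scores(test_lines, score_lines):
    """Pair each test line (seed_words \t document) with its relevance score.

    Scores are written one per line in the same order as the test file,
    so zipping the two files recovers the correspondence.
    """
    results = []
    for line, score in zip(test_lines, score_lines):
        seed_words, document = line.rstrip("\n").split("\t")
        results.append((seed_words, document, float(score)))
    return results

# Example with two in-memory "files":
# pair_scores(["1,2\t3,4,5\n"], ["0.87\n"]) -> [("1,2", "3,4,5", 0.87)]
```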

Data Preparation

All seed words and documents must be mapped into sequences of integer term ids. Term ids start at 1.

Training Data Format

Each training sample is a tuple of (seed words, positive document, negative document):

seed_words \t positive_document \t negative_document

Example: 334,453,768 \t 123,435,657,878,6,556 \t 443,554,534,3,67,8,12,2,7,9
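As a sanity check, a line in this format can be produced and parsed with a few lines of Python (the helper names here are illustrative, not part of the repo):

```python
def make_train_line(seed_ids, pos_ids, neg_ids):
    """Serialize one training sample: ids joined by commas, fields by tabs."""
    fields = (seed_ids, pos_ids, neg_ids)
    return "\t".join(",".join(str(i) for i in ids) for ids in fields)

def parse_train_line(line):
    """Inverse: recover the three integer-id lists from one line."""
    return [[int(i) for i in field.split(",")]
            for field in line.rstrip("\n").split("\t")]

line = make_train_line([334, 453, 768], [123, 435, 657], [443, 554, 534])
assert parse_train_line(line) == [[334, 453, 768],
                                  [123, 435, 657],
                                  [443, 554, 534]]
```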

Testing Data Format

Each testing sample is a tuple of (seed words, document)

seed_words \t document

Example: 334,453,768 \t 123,435,657,878,6,556

Validation Data Format

The format is the same as the training data format.

Label Dict File Format

Each line is a tuple of (label_name, seed_words)

label_name/seed_words

Example: alt.atheism/atheist christian atheism god islamic
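A label line splits on the first '/' only, since label names such as alt.atheism contain no slash but seed words are space-separated. A small parsing sketch (the function name is illustrative):

```python
def parse_label_line(line):
    """Split 'label_name/seed_words' into (label, list of seed words)."""
    label, seeds = line.rstrip("\n").split("/", 1)  # split on the first '/' only
    return label, seeds.split()

assert parse_label_line("alt.atheism/atheist christian atheism god islamic") == \
    ("alt.atheism", ["atheist", "christian", "atheism", "god", "islamic"])
```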

Word2id File Format

Each line is a tuple of (word, id)

word id

Example: world 123

Embedding File Format

Each line is a tuple of (id, embedding)

id embedding

Example: 1 0.3 0.4 0.5 0.6 -0.4 -0.2
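The word2id and embedding files above are both whitespace-separated, so loading them reduces to a split per line. A hedged sketch (helper names are illustrative, not the repo's loaders):

```python
def load_word2id(lines):
    """Parse 'word id' lines into a dict {word: int id}."""
    mapping = {}
    for line in lines:
        word, idx = line.split()
        mapping[word] = int(idx)
    return mapping

def load_embeddings(lines):
    """Parse 'id v1 v2 ...' lines into a dict {int id: list of floats}."""
    table = {}
    for line in lines:
        parts = line.split()
        table[int(parts[0])] = [float(v) for v in parts[1:]]
    return table

# Using the examples from this section:
# load_word2id(["world 123"]) -> {"world": 123}
# load_embeddings(["1 0.3 0.4 0.5 0.6 -0.4 -0.2"])[1] -> [0.3, 0.4, 0.5, 0.6, -0.4, -0.2]
```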

Configurations

Model Configurations

  • BaseNN.embedding_size: embedding dimension of word
  • BaseNN.max_q_len: max query length
  • BaseNN.max_d_len: max document length
  • DataGenerator.max_q_len: max query length. Should be the same as BaseNN.max_q_len
  • DataGenerator.max_d_len: max document length. Should be the same as BaseNN.max_d_len
  • BaseNN.vocabulary_size: vocabulary size
  • DataGenerator.vocabulary_size: vocabulary size
  • BaseNN.batch_size: batch size
  • BaseNN.max_epochs: max number of epochs to train
  • BaseNN.eval_frequency: evaluate the model on the validation set every this many epochs
  • BaseNN.checkpoint_steps: save a model checkpoint every this many epochs

Data

  • DAZER.emb_in: path of initial embeddings file
  • DAZER.label_dict_path: path of label dict file
  • DAZER.word2id_path: path of word2id file

Training Parameters

  • DAZER.epsilon: epsilon for Adam Optimizer
  • DAZER.embedding_size: embedding dimension of word
  • DAZER.vocabulary_size: vocabulary size of the dataset
  • DAZER.kernal_width: width of the convolution kernel
  • DAZER.kernal_num: number of kernels
  • DAZER.regular_term: weight of the L2 loss
  • DAZER.maxpooling_num: K of the K-max pooling
  • DAZER.decoder_mlp1_num: number of hidden units of the first MLP in the relevance aggregation part
  • DAZER.decoder_mlp2_num: number of hidden units of the second MLP in the relevance aggregation part
  • DAZER.model_learning_rate: learning rate for the model (as opposed to the adversarial classifier)
  • DAZER.adv_learning_rate: learning rate for the adversarial classifier
  • DAZER.train_class_num: number of classes at training time
  • DAZER.adv_term: weight of the adversarial loss when updating the model's parameters
  • DAZER.zsl_num: number of zero-shot labels
  • DAZER.zsl_type: type of zero-shot label setting (you may have multiple zero-shot settings with the same number of zero-shot labels; this indicates which setting to use for the experiment, see get_label.py for details)
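These parameters are set with traitlets-style `c.Class.trait = value` assignments in the config file, as the `c.DAZER.train_class_num = 18` setting mentioned in the issues below illustrates. A sketch of such a fragment follows; every value here is a placeholder, not the paper's setting — sample.config in the repo is the authoritative example:

```python
# Illustrative traitlets config fragment; values are placeholders only.
c = get_config()

c.BaseNN.embedding_size = 300
c.BaseNN.max_q_len = 10
c.BaseNN.max_d_len = 500
c.BaseNN.batch_size = 16

c.DAZER.kernal_num = 50
c.DAZER.regular_term = 0.01
c.DAZER.train_class_num = 18
```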

dazer's People

Contributors: lichenliang-whu

dazer's Issues

An end-to-end working example is appreciated

I just finished reading the paper and it's a great one! Very clearly written with solid experimental results.

It would greatly help people try out your model if you could provide an end-to-end working example starting from publicly available word embeddings and datasets. The current code requires the user to follow a specific data format, and it takes time to convert the data before feeding it to the model.

training data format

What are positive_document and negative_document in the training data format when training with the 20 Newsgroups dataset?

Unable to reproduce MAP numbers

Hello,

Thank you for the great work. Below are the steps I follow to run the code, where I assume task = space.

  1. Use https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html to formulate training data and ignore training records corresponding to categories = ['sci.space'] and ['comp.graphics']. This way, training_data_size = 10,134
  2. Use https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html to get the val/test data. This way, testing_data_size = 7,532
  3. Set c.DAZER.train_class_num = 18 in sample.config. Rest of settings remain same.
  4. Run sample-train.sh and sample-test.sh
  5. Relevance score file is produced.
  6. For the testing dataset, ignore documents corresponding to ['comp.graphics'], mark the documents as 1 for category ['sci.space'], and mark the documents as 0 for the rest of the categories.
  7. Use https://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html to calculate AP score for task = space where y_true is binary and y_score = relevance scores.

Following the above steps, I get MAP ~ 0.050, which is far from the reported number. Could you please let me know how you calculated the MAP scores? Additionally, please let me know if any of the above steps are incorrect. Thanks.
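For reference, the metric computed in step 7 (scikit-learn's average_precision_score over binary labels and relevance scores) follows the standard AP definition and can be reproduced in plain Python. This is a sketch of that definition, not code from the repo, and it matches sklearn only when all scores are distinct (ties are thresholded differently):

```python
def average_precision(y_true, y_score):
    """AP = sum over ranks of (recall delta) * precision, scores descending.

    Matches sklearn.metrics.average_precision_score for binary labels
    with distinct scores; y_true must contain at least one positive.
    """
    ranked = sorted(zip(y_score, y_true), key=lambda p: p[0], reverse=True)
    n_pos = sum(y_true)
    ap, tp, prev_recall = 0.0, 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        tp += label
        precision = tp / rank
        recall = tp / n_pos
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap

# average_precision([1, 0, 1], [0.9, 0.8, 0.7]) -> 5/6 ~ 0.8333
```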

The Data

I would like to know which public dataset you are using, or could you send me the data you used? I hope to run the program correctly. Thank you.
My email: [email protected]
