alexa / dialoglue

DialoGLUE: A Natural Language Understanding Benchmark for Task-Oriented Dialogue

Home Page: https://evalai.cloudcv.org/web/challenges/challenge-page/708/overview

License: Apache License 2.0

Python 98.21% Shell 1.79%

deep-learning natural-language-processing natural-language-understanding machinelearning

dialoglue's Issues

Multi-GPU training?

@mihail-amazon Does this model support multi-GPU training? As far as I can see, the code uses only one GPU for training.

Hi! For all datasets in the DialoGLUE benchmark, I can reproduce similar results except on MultiWOZ.
For ConvBERT-DG, your joint goal accuracy is around 58, but I can only get 56, which is the same as what the original TripPy reported.
I wonder whether you used different hyperparameters for TripPy. If so, could you share them?

Thank you!

The original hyperparameters for TripPy are as follows:

--do_lower_case \
--learning_rate=1e-4 \
--num_train_epochs=10 \
--max_seq_length=180 \
--per_gpu_train_batch_size=48 \
--per_gpu_eval_batch_size=1 \
--output_dir=${OUT_DIR} \
--save_epochs=2 \
--logging_steps=10 \
--warmup_proportion=0.1 \
--eval_all_checkpoints \
--adam_epsilon=1e-6 \
--label_value_repetitions \
--swap_utterances \
--append_history \
--use_history_labels \
--delexicalize_sys_utts \
--class_aux_feats_inform \
--class_aux_feats_ds \

Acc of CLINC150 and HWU64 on leaderboard are reversed.

reproducibility for slot filling task

dialoglue/data_utils/process_slot.py

Lines 47 to 49 in 42737da

slots = set([slot['slot'] for row in train_data for slot in row.get('labels', [])])
vocab = ["O"] + [prefix + slot for slot in slots for prefix in ["B-", "I-"]]
json.dump(vocab, open(dataset + "vocab.txt", "w+"))

The slot names are collected into a Python set, and the BIO labels are then built by iterating over that set in a list comprehension.
But a set is unordered, and in my experiments vocab.txt differed between two runs. So I changed the code to

slots = set([slot['slot'] for row in train_data for slot in row.get('labels', [])])
slots = sorted(list(slots))
vocab = ["O"] + [prefix + slot for slot in slots for prefix in ["B-", "I-"]]
json.dump(vocab, open(dataset + "vocab.txt", "w+"))

and get the same vocab.txt for every run.
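The same fix can be sketched as a small function. This is only an illustration: the function name and the sample rows are hypothetical, and only the row shape (a 'labels' list of dicts with a 'slot' key) is taken from the snippet above. The order of a set of strings can change between interpreter runs because Python randomizes string hashes per process, so sorting before building the label list removes that dependence.

```python
import json


def build_slot_vocab(train_data):
    """Return the BIO label list in a deterministic (sorted) order."""
    # Sorting the set removes the dependence on per-process hash ordering.
    slots = sorted({slot["slot"] for row in train_data for slot in row.get("labels", [])})
    return ["O"] + [prefix + slot for slot in slots for prefix in ("B-", "I-")]


# Hypothetical rows, shaped like the data process_slot.py reads.
rows = [
    {"labels": [{"slot": "time"}, {"slot": "people"}]},
    {"labels": [{"slot": "date"}]},
    {},  # a row without labels
]

vocab = build_slot_vocab(rows)
print(json.dumps(vocab))
```

Because the order no longer depends on the set, reordering the input rows gives the same label list.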

download_data.sh fails due to depublication of polyai-models

PolyAI has taken down the contents of the https://github.com/PolyAI-LDN/polyai-models repository. Due to this, download_data.sh fails to retrieve the intent data.

Cannot download pre-trained model

When running

python run.py \
        --train_data_path data_utils/dialoglue/hwu/train.csv \
        --val_data_path data_utils/dialoglue/hwu/val.csv \
        --test_data_path data_utils/dialoglue/hwu/test.csv \
        --token_vocab_path bert-base-uncased-vocab.txt \
        --train_batch_size 64 --dropout 0.1 --num_epochs 0 --learning_rate 6e-5 \
        --model_name_or_path convbert-dg --task intent --do_lowercase --max_seq_length 50 --mlm_pre --mlm_during --dump_outputs \

there is an error:

OSError: Model name 'convbert-dg' was not found in model name list. We assumed 'https://s3.amazonaws.com/models.huggingface.co/bert/convbert-dg/config.json' was a path, a model identifier, or url to a configuration file named config.json or a directory containing such a file but couldn't find any such file at this path or url.

How to download the checkpoints/pre-trained models stated in the README?

HWU64 odd number of samples

The HWU64 dataset contains 25k samples according to the original paper. The DialoGLUE paper states the same number of samples.

However, the README states 11k samples.

If I count the number of samples which are actually in the HWU64 part of DialoGLUE then I get 12,112 samples (12k).

My questions:

Is there a reason for the difference in numbers in the original HWU64 and in the DialoGLUE HWU64? Or is it a bug?
Did you compute the performance of the intent prediction models on 25k, 12k or 11k samples?

Thank you for your answers :)
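A count like the one above can be reproduced with a few lines of Python. This is a minimal sketch, not the script used for the numbers in this issue: the split file names follow the paths used elsewhere on this page (train.csv, val.csv, test.csv), the presence of a header row is an assumption, and the demo data is made up so the sketch runs without the real dataset.

```python
import csv
import os
import tempfile


def count_samples(csv_path, has_header=True):
    """Count data rows in one CSV file, optionally skipping a header row."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        n_rows = sum(1 for _ in csv.reader(f))
    return max(n_rows - 1, 0) if has_header else n_rows


def count_splits(data_dir, splits=("train.csv", "val.csv", "test.csv")):
    """Return {split_file: sample_count} for the split files found in data_dir."""
    counts = {}
    for name in splits:
        path = os.path.join(data_dir, name)
        if os.path.isfile(path):
            counts[name] = count_samples(path)
    return counts


# Tiny stand-in files (hypothetical content) in a temporary directory.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "train.csv"), "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(
        [["text", "category"], ["wake me at six", "alarm_set"], ["play jazz", "play_music"]]
    )
with open(os.path.join(demo_dir, "test.csv"), "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows([["text", "category"], ["what time is it", "datetime_query"]])

demo_counts = count_splits(demo_dir)
print(demo_counts, sum(demo_counts.values()))
```

Pointing count_splits at the real data directory and summing the values would give the total to compare against the 11k, 12k, and 25k figures.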

Will the imbalance of the label affect the final result?

For example, suppose there are two categories, A and B, and 100 samples in total, 99 of which are category A and one of which is category B. The final output is then likely to be category A, even if the input is more similar to category B.
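The 99-to-1 example can be worked through numerically. The sketch below is an illustration only, with hypothetical function names: it scores a classifier that always predicts A and compares plain accuracy with macro-averaged recall, which weights each category equally.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_recall(y_true, y_pred):
    """Average of per-category recall, so every category counts equally."""
    recalls = []
    for label in sorted(set(y_true)):
        indices = [i for i, t in enumerate(y_true) if t == label]
        hits = sum(y_pred[i] == label for i in indices)
        recalls.append(hits / len(indices))
    return sum(recalls) / len(recalls)


y_true = ["A"] * 99 + ["B"]  # 99 samples of A, one sample of B
y_pred = ["A"] * 100         # a classifier that always answers A

acc = accuracy(y_true, y_pred)
rec = macro_recall(y_true, y_pred)
print(acc, rec)
```

Accuracy comes out at 0.99 while macro-averaged recall is 0.5, because category B is never recovered. A per-category metric therefore exposes the effect that the question describes.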

What's the impact of running MLM during training?

Hi, thank you very much for sharing the code! In run.py, in the function train(), the "Run MLM during training" part runs in each training epoch if args.mlm_during is true. But it doesn't seem to change "model". Is that because of the line "model.bert_model = pre_model.bert_model.bert", so that "model" and "pre_model" share the same weights, since they refer to the same object in memory? Thank you very much!
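The reference-sharing behaviour asked about here can be illustrated in plain Python. This is an analogy, not the code in run.py: all class names are hypothetical, and a list stands in for the model parameters. Attribute assignment in Python binds a reference instead of copying, and assigning a torch.nn.Module to an attribute likewise stores the same module object, so an in-place update made through one owner is visible through the other.

```python
class Encoder:
    """Stand-in for a shared encoder; the list plays the role of its weights."""

    def __init__(self):
        self.weights = [0.0, 0.0]


class PreTrainModel:
    """Hypothetical stand-in for the MLM model that wraps an encoder."""

    def __init__(self, encoder):
        self.encoder = encoder

    def mlm_step(self):
        # Stand-in for an optimizer step: modifies the weights in place.
        self.encoder.weights[0] += 1.0


class TaskModel:
    """Hypothetical stand-in for the task model with its own encoder."""

    def __init__(self):
        self.encoder = Encoder()


pre_model = PreTrainModel(Encoder())
model = TaskModel()

# Binds a reference to the same Encoder object; nothing is copied.
model.encoder = pre_model.encoder

pre_model.mlm_step()
shared = model.encoder is pre_model.encoder
print(shared, model.encoder.weights)
```

Under that assumption, the MLM step would never need to touch "model" directly for its encoder weights to change.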

How to run TripPy DST using ConvBERT?

As mentioned in the README:

To train/evaluate the model using our modifications (i.e., MLM pre-training), you can use trippy/DO.example.advanced.

But trippy/DO.example.advanced uses BERT instead of ConvBERT. I wonder how to reproduce the results of ConvBERT and ConvBERT-DG. Thanks!

Error when downloading data

Error when running bash download_data.sh in the data_utils dir:

mkdir: dialoglue: File exists
Do you wish to download dataset hwu?
1) Yes
2) No
#? 1
Downloading dataset hwu into ../dialoglue/hwu
Getting train data
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 64/64 [01:10<00:00,  1.10s/it]
Getting test data
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 64/64 [01:08<00:00,  1.07s/it]
Creating categories.json file
Dataset has been downloaded
Creating train_10.csv, etc...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 64/64 [00:00<00:00, 4190.11it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 64/64 [00:00<00:00, 5320.93it/s]
Done!
Do you wish to download dataset clinc?
1) Yes
2) No
#? 1
Downloading dataset clinc into ../dialoglue/clinc
Dataset has been downloaded
Creating train_10.csv, etc...
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 150/150 [00:00<00:00, 4146.97it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 150/150 [00:00<00:00, 4657.23it/s]
Done!
Do you wish to download dataset banking?
1) Yes
2) No
#? 1
Downloading dataset banking into ../dialoglue/banking
Getting file: train.csv
Getting file: test.csv
Getting file: categories.json
Dataset has been downloaded
Creating train_10.csv, etc...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 77/77 [00:00<00:00, 4635.72it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 77/77 [00:00<00:00, 3855.20it/s]
Done!
/Users/zhuqi/Documents/share/research/Platform/dialoglue/data_utils
Processing dialoglue/hwu/
Processing dialoglue/banking/
Processing dialoglue/clinc/
Done downloading intent datasets
Cloning into 'task-specific-datasets'...
remote: Enumerating objects: 103, done.
remote: Counting objects: 100% (103/103), done.
remote: Compressing objects: 100% (58/58), done.
remote: Total 103 (delta 58), reused 77 (delta 45), pack-reused 0
Receiving objects: 100% (103/103), 1001.92 KiB | 339.00 KiB/s, done.
Resolving deltas: 100% (58/58), done.
Traceback (most recent call last):
  File "process_slot.py", line 14, in <module>
    train_data = json.load(open(dataset + "train_0.json"))
FileNotFoundError: [Errno 2] No such file or directory: 'dialoglue/restaurant8k/train_0.json'
Traceback (most recent call last):
  File "process_slot.py", line 30, in <module>
    sub_train_data = json.load(open(dataset + sub + "/train_0.json"))
FileNotFoundError: [Errno 2] No such file or directory: 'dialoglue/dstc8_sgd/Buses_1/train_0.json'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:01:15 --:--:--     0curl: (7) Failed to connect to fb.me port 80: Operation timed out
unzip:  cannot find or open sem.zip, sem.zip.zip or sem.zip.ZIP.
Traceback (most recent call last):
  File "process_top.py", line 97, in <module>
    data = read_data(data_file)
  File "process_top.py", line 16, in read_data
    f = open(data_file)
FileNotFoundError: [Errno 2] No such file or directory: 'top-dataset-semantic-parsing/train.tsv'
cp: top-dataset-semantic-parsing/train.txt: No such file or directory
cp: top-dataset-semantic-parsing/train_10.txt: No such file or directory
cp: top-dataset-semantic-parsing/eval.txt: No such file or directory
cp: top-dataset-semantic-parsing/test.txt: No such file or directory
cp: top-dataset-semantic-parsing/vocab.*: No such file or directory
rm: sem.zip: No such file or directory
Cloning into 'trippy-public'...
remote: Enumerating objects: 77, done.
remote: Counting objects: 100% (77/77), done.
remote: Compressing objects: 100% (55/55), done.
remote: Total 77 (delta 21), reused 60 (delta 12), pack-reused 0
Unpacking objects: 100% (77/77), done.
/Users/zhuqi/Documents/share/research/Platform/dialoglue/data_utils
mv: rename dialoglue/multiwoz/MULTIWOZ2.1 to dialoglue/multiwoz/MULTIWOZ2.1/MULTIWOZ2.1: Invalid argument
Traceback (most recent call last):
  File "merge_data.py", line 61, in <module>
    train += load_top("dialoglue/top/")
  File "merge_data.py", line 21, in load_top
    data = open(fn+"train.txt").readlines()
FileNotFoundError: [Errno 2] No such file or directory: 'dialoglue/top/train.txt'

DST

Hi,

Thanks for the nice work. As the proposed model is based on TripPy when evaluating dialogue state tracking, I wonder how MultiWOZ is preprocessed. I found that TripPy may change some ground-truth labels when using the original preprocessing script (e.g., the time-slot value "10:30" may be changed to just "10"). If convenient, could you please do a simple comparison between the ground-truth labels before and after the preprocessing? Since you are launching a leaderboard, I hope that the evaluation can be as precise as possible. Thanks!
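The requested comparison could be done with a small diff over slot-value dictionaries. The sketch below is illustrative only: the function name, the slot names, and the two dictionaries are hypothetical, and it does not reproduce TripPy's preprocessing. Only the "10:30" to "10" change is taken from the example in this issue.

```python
def diff_labels(before, after):
    """Return {slot: (value_before, value_after)} for every slot whose value changed."""
    changed = {}
    for slot in sorted(set(before) | set(after)):
        if before.get(slot) != after.get(slot):
            changed[slot] = (before.get(slot), after.get(slot))
    return changed


# Hypothetical ground-truth state for one turn, before and after preprocessing.
raw_state = {"restaurant-book time": "10:30", "restaurant-area": "centre"}
processed_state = {"restaurant-book time": "10", "restaurant-area": "centre"}

changes = diff_labels(raw_state, processed_state)
print(changes)
```

Running such a diff over every turn and counting the non-empty results would show how many ground-truth labels the preprocessing alters.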

About Observers in the paper

Hi Mehri,
Thanks for your awesome work 'Example-Driven Intent Prediction with Observers', and your open sourcing codebase.
How did you add observers to the BERT model in your codebase? I can't find anything related to [OBS]. Did you use [PAD] as [OBS]? And how did you make the observers tokens that are not attended to?
Looking forward to your reply.

ONNX conversion issue

@mihail-amazon

Following this thread, I tried to convert the convbert-dg model to ONNX format with the following snippet:

model = IntentBertModel('bert-base-uncased', dropout=0.1, num_intent_labels=len(intent_label_to_idx))
model.load_state_dict(torch.load(os.path.join(model_path, "model.pt"), map_location=torch.device('cpu')))
model.eval()

model_onnx_path = "model.onnx"
max_seq_length = 100
input_ids = torch.LongTensor(1, max_seq_length).to(device)
token_type_ids = torch.LongTensor(1, max_seq_length).to(device)
attention_mask = torch.LongTensor(1, max_seq_length).to(device)

dummy_input = (input_ids, attention_mask, token_type_ids)
input_names = ["input_ids", "attention_mask", "token_type_ids"]
output_names = ["output"]
torch.onnx.export(model, dummy_input, model_onnx_path, \
    input_names=input_names, output_names=output_names, \
    verbose=True)

and this throws:

RuntimeError: index out of range: Tried to access index -1457520640 out of table with 30521 rows.

Could you please correct me if there is anything wrong in the snippet I am using?

Error while using tokenizers

While running the following commands:
pip3 install -r requirements.txt

python run.py \
        --train_data_path data_utils/dialoglue/hwu/train.csv \
        --val_data_path data_utils/dialoglue/hwu/val.csv \
        --test_data_path data_utils/dialoglue/hwu/test.csv \
        --token_vocab_path bert-base-uncased-vocab.txt \
        --train_batch_size 64 --dropout 0.1 --num_epochs 100 --learning_rate 6e-5 \
        --model_name_or_path convbert-dg --task intent --do_lowercase --max_seq_length 50 --mlm_pre --mlm_during --dump_outputs \

I get the following error:

Namespace(adam_epsilon=1e-08, device=0, do_lowercase=True, dropout=0.1, dump_outputs=True, grad_accum=2, learning_rate=6e-05, logging_steps=100, max_grad_norm=-1.0, max_seq_length=100, mlm_data_path='', mlm_during=True, mlm_pre=True, model_name_or_path='convbert-dg', num_epochs=100, output_dir='', patience=5, repeat=1, seed=42, task='intent', test_data_path='data_utils/dialoglue/banking/test.csv', token_vocab_path='bert-base-uncased-vocab.txt', train_batch_size=32, train_data_path='data_utils/dialoglue/banking/train.csv', val_data_path='data_utils/dialoglue/banking/val.csv', weight_decay=0.0)
Errors: Os { code: 2, kind: NotFound, message: "No such file or directory" }
Traceback (most recent call last):
  File "run.py", line 566, in <module>
    scores.append(train(args, i))
  File "run.py", line 305, in train
    lowercase=args.do_lowercase)
  File "/home/kapilpathak/py37-venv/lib/python3.7/site-packages/tokenizers/implementations/bert_wordpiece.py", line 30, in __init__
    tokenizer = Tokenizer(WordPiece(vocab_file, unk_token=str(unk_token)))
Exception: Error while initializing WordPiece

The error is consistent even if I change tasks and datasets.
Any suggestions?
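The line Errors: Os { code: 2, kind: NotFound, ... } suggests that the tokenizer could not open a file, and one plausible cause is that the file passed as --token_vocab_path is not present in the working directory. A minimal pre-check is sketched below. The helper name is hypothetical and is not part of run.py, and the demo writes a two-line stand-in vocab file to a temporary directory.

```python
import os
import tempfile


def check_vocab_path(path):
    """Return the absolute path if the vocab file exists, else raise a clear error."""
    abs_path = os.path.abspath(path)
    if not os.path.isfile(abs_path):
        raise FileNotFoundError(f"WordPiece vocab file not found: {abs_path}")
    return abs_path


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in vocab file, only so that the check has something to find.
    good = os.path.join(tmp, "bert-base-uncased-vocab.txt")
    with open(good, "w", encoding="utf-8") as f:
        f.write("[PAD]\n[UNK]\n")
    found_ok = os.path.isfile(check_vocab_path(good))

    try:
        check_vocab_path(os.path.join(tmp, "missing-vocab.txt"))
        missing_raised = False
    except FileNotFoundError as err:
        missing_raised = True
        print(err)
```

Calling such a check on args.token_vocab_path before the tokenizer is constructed would turn the opaque "Error while initializing WordPiece" into a message that names the missing path.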

Reproducing few-shot experiments on Multiwoz2.1

Hi,

I am working on few-shot experiments on MultiWOZ2.1. However, I faced the same problem as in #7.

BERT + pre + multi trained on the few-shot dataset achieved ~0.49 JGA on the test set (with random seed 42).

I modified a small part of your code, and the diff is listed here (GitHub comparing changes). I ran the experiment directly with DO.example.advanced.

Environment

GPU: RTX 3090
PyTorch: 1.7.0+cu110

I wonder whether my training/evaluation process was wrong, since I got such high performance even in the few-shot setting.

Thanks for your reply in advance!

Some questions about the trippy dst

I have several questions:

The original performance of TripPy is 55.3% on MultiWOZ 2.1 (in the paper). Your BERT-base DST achieves 56.3%. So where does the improvement come from? I notice that the original TripPy repo mentions:

With a sequence length of 180, you should expect the following average JGA: 56% for MultiWOZ 2.1
Best performance can be achieved by using the maximum sequence length of 512.

Did you change the code, or does the TripPy repo simply perform better than the TripPy paper?
How to reproduce the dst experiments? I guess:

for BERT(56.3%): run DO.example.advanced without any modification (although the max seq length is set to 180)
for CONVBERT-DG(58.57%): run DO.example.advanced with --model_name_or_path="convbert-dg"

Look forward to your reply :)
