Comments (13)
I think `QG_main.py` does use augmented data. Here is how I see it: `QG_main.py` takes `<p, q, a>` tuples from the SQuAD dataset (SQuAD1.1-Zhou) and transforms them into `<p, q, a, c, s>` tuples in a pre-processing step (in `FQG_data.py:prepro()` and `FQG_data.py:get_loader()`). The clue info is generated in `FQG_data_augmentor.py:get_clue_info()`, and the style info in `FQG_data_utils.py:get_question_type()`. This data is then used to train the model. This seems to correspond directly to the figure you mentioned.

I'm currently working on a more self-contained version of the code, in case that helps: https://github.com/flackbash/ACS-QG
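To make that pre-processing step concrete, here is a toy sketch of what the clue/style extraction amounts to. This is not the repo's actual logic (the real `get_clue_info()` works on syntactic chunks and uses many more rules); the heuristics, the stopword list, and the example strings below are my own illustrative assumptions:

```python
# Toy sketch of the <p, q, a> -> <p, q, a, c, s> pre-processing step.
# The clue here is simply a question token that also occurs in the
# passage but not in the answer; the style is the question's wh-word.

WH_WORDS = {"what", "who", "when", "where", "why", "how", "which", "whose"}
STOPWORDS = {"is", "are", "was", "the", "a", "an", "of", "in", "to", "?", "."}

def get_question_type(question):
    """Toy stand-in for FQG_data_utils.get_question_type()."""
    for tok in question.lower().split():
        if tok in WH_WORDS:
            return tok
    return "other"  # e.g. yes/no questions

def get_clue_info(passage, question, answer):
    """Toy stand-in for FQG_data_augmentor.get_clue_info()."""
    p_tokens = set(passage.lower().split())
    a_tokens = set(answer.lower().split())
    for tok in question.lower().split():
        if (tok in p_tokens and tok not in a_tokens
                and tok not in WH_WORDS and tok not in STOPWORDS):
            return tok
    return ""

def make_acs_tuple(p, q, a):
    return (p, q, a, get_clue_info(p, q, a), get_question_type(q))

p = "The New York Amsterdam News is based in Manhattan ."
q = "where is the new york amsterdam news based ?"
a = "Manhattan"
print(make_acs_tuple(p, q, a)[3:])  # -> ('new', 'where')
```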
from acs-qg.
Without having examined this thoroughly, it seems to me the probability distribution is learned the first time `DA_main.py` is executed, i.e. when running `experiments_2_DA_file2sents.sh` here. `experiments_3_DA_sents2augsents.sh` then seems to reuse this probability distribution.

Regarding `experiments_4_QG_generate_seq2seq.sh`: yes, I think it uses the model trained in `experiments_1_QG_train_seq2seq.sh`.
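If that reading is correct, `DA_main.py` would follow the usual compute-once-then-cache pattern. A minimal stdlib sketch of that pattern (the cache file name and the shape of the distribution, a frequency table P(style | answer tag), are my assumptions, not the repo's):

```python
# Sketch of learning a sampling distribution on first use and caching it,
# as DA_main.py appears to do. The "distribution" here is just a toy
# frequency table P(style | answer tag) built from labelled tuples.

import os
import pickle
from collections import Counter, defaultdict

CACHE = "sample_probs.pkl"  # hypothetical cache file name

def learn_distribution(tuples):
    counts = defaultdict(Counter)
    for answer_tag, style in tuples:
        counts[answer_tag][style] += 1
    # normalise counts to probabilities
    return {
        tag: {s: c / sum(ctr.values()) for s, c in ctr.items()}
        for tag, ctr in counts.items()
    }

def get_distribution(tuples, cache=CACHE):
    if os.path.exists(cache):           # later runs: reuse
        with open(cache, "rb") as f:
            return pickle.load(f)
    probs = learn_distribution(tuples)  # first run: learn ...
    with open(cache, "wb") as f:        # ... and persist
        pickle.dump(probs, f)
    return probs

data = [("PERSON", "who"), ("PERSON", "who"), ("PERSON", "what"),
        ("GPE", "where")]
probs = get_distribution(data)
print(probs["PERSON"]["who"])  # 2 of 3 PERSON answers -> 0.666...
```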
Regarding `experiments_3_repeat_da_de.sh`: I also find this quite confusing, but I agree with you that this seems to be the pipeline that corresponds most closely to the one described in the paper. More specifically, I think the proper pipeline would be:

- `experiments_1_ET_train.sh`
- `experiments_1_QG_train_seq2seq.sh`
- `experiments_2_DA_file2sents.sh`
- `experiments_3_DA_sents2augsents.sh`
- `experiments_4_QG_generate_seq2seq.sh`
- `experiments_5_uniq_seq2seq.sh`
- (maybe `experiments_6_postprocess_seq2seq.sh`)
- and then, from the `experiments_3_repeat_da_de.sh` file: `run_glue.py` (I'm not sure what this does, but I assume it is part of the filtering process mentioned in the paper? I can't find any reference to a BERT-based QA model either. Maybe they switched to XLNet instead in the code, see here?) and `DE_main.py`, for the index ranges given e.g. in `experiments_3_DA_sents2augsents.sh`.
You seem to be spot-on with the sampling probabilities, thanks!

Concerning the pipeline: I agree, that's what I did for my recent tests as well, leaving out `experiments_6_postprocess_seq2seq.sh` and including `run_glue.py` and `DE_main.py` (which also adds values for perplexity and readability, although the metrics used for the latter seem a bit questionable).

I think `run_glue.py` is actually the entailment model (it is based on a GLUE benchmark task, MRPC, and they seem to have copy-pasted most of its code with some modifications). It gets trained in `experiments_1_ET_train.sh` (again via `run_glue.py`) and then applied to the generated questions (as seen in `experiments_3_repeat_da_de.sh`). As you mentioned, they switched this model from BERT to XLNet, but that still leaves out the separate QA model for filtering, unless it is somehow also contained in `run_glue.py`? I'll have to take a closer look at that. I ultimately want to try training this on German text, but before I start looking for appropriate datasets I definitely have to understand it a bit better...
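Just to make the filtering step concrete: as far as I can tell, `DE_main.py` attaches feature scores to each generated question and then filters on them. Below is a toy threshold filter in that spirit; all field names and thresholds are made up for illustration (the real features would come from the XLNet entailment model, a language model, and the readability metrics):

```python
# Toy sketch of feature-based post-generation filtering: each candidate
# question carries scores and is kept only if all thresholds pass.
# Field names and threshold values are illustrative, not from the repo.

from dataclasses import dataclass

@dataclass
class Candidate:
    question: str
    entailment: float   # score from the (XLNet-based) entailment model
    perplexity: float   # language-model perplexity, lower is better
    readability: float  # e.g. a Flesch-style score, higher is easier

def keep(c, min_entail=0.5, max_ppl=200.0, min_read=30.0):
    return (c.entailment >= min_entail
            and c.perplexity <= max_ppl
            and c.readability >= min_read)

cands = [
    Candidate("where is the paper based ?", 0.91, 35.2, 70.0),
    Candidate("what what is is the the ?", 0.12, 900.0, 55.0),
]
print([c.question for c in cands if keep(c)])  # keeps only the first
```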
I see... I hadn't even had a look at `run_squad.py` until now, as it is not included in any experiment.

I have contacted Bang Liu in the meantime, and he said that both the BERT-QA and BERT-based filter modules were implemented using the Huggingface Transformers library, but that this part was implemented by the second author of the paper, Haojie Wei, and is therefore not included in this repo. Unfortunately, Haojie Wei has not replied to my email so far. However, Bang also said it should be easy to implement using the Huggingface Transformers library.

This seems to be in line with your experiments and also matches the fact that `run_squad.py` seems to be mostly copied from the transformers library. So parts of it are there, but it's not really integrated into the experiments in this repo.
Good point you raise about sampling. Some answer phrases were not sampled in my case either. Perhaps it's because of the randomness in the sampling of `<a, c, s>`: I ran the code again with the same input and got different answer phrases in the output this time.

As far as epochs are concerned, I also trained for 10 epochs. The paper also mentions that they trained the Seq2Seq model for 10 epochs. However, I got the best model at the 8th epoch, so that is being used.

I guess the GPT2 model would give better results. As seen in Table 2 in the paper, only 40% of questions are reported to be well-formed for Seq2Seq, versus 74.5% for GPT2. That does seem to be the case: looking at the questions generated by the Seq2Seq model, many are syntactically as well as semantically incorrect.

Training the GPT2 model is taking some time; I'll update whenever I get the results.
Update: Questions generated by the GPT2 model are certainly way better in terms of syntactic structure than those from the Seq2Seq model. It generated exactly the same 2 of the 6 questions given in Fig. 5 of the paper. For the remaining questions, I did not see those exact answer phrases and question types sampled, so if the sampling also matched the examples, it would probably generate the same questions.
Thanks @oppasource for your update! I managed to train a GPT2-based model now as well, and I can report similar results: it's a lot better at generating coherent language and could reproduce some of the questions reported in the paper. Likewise, for the others, the sampler didn't pick the appropriate answers. It might be interesting to query the generation model with pre-made answers to test its capabilities separately from the sampler...
Thanks a lot for your answer @flackbash, that definitely helps! Clue and style indeed seem to be generated right from the start (i.e. in `experiments_1_QG_*`). I think what threw me off was the difference between training and inference data: only for the latter do good `<p, a, c, s>` candidates need to be sampled (whereas during training, they are extracted from the dataset via the two algorithms mentioned in the paper).

However, I'm then also not sure where the conditional probability distributions for data sampling are actually learned? I think I'm just still confused about what actually happens in the different experiment steps (ignoring the GPT2 variant for now):

- `experiments_1_ET_train.sh`: Trains the entailment model.
- `experiments_1_QG_train_seq2seq.sh`: Takes SQuAD 1.1, extracts clue and style, and with that trains the QG model.
- `experiments_2_DA_file2sents.sh`: We are "simulating" data augmentation on SQuAD 2.0 and Wiki1000 (instead of completely raw data). First step: get individual sentences.
- `experiments_3_DA_sents2augsents.sh`: Second step: sample augmented data from the previously extracted sentences, using probabilities obtained in `experiments_1_QG_train_seq2seq.sh` (?)
- `experiments_4_QG_generate_seq2seq.sh`: We now generate questions from the resulting augmented data. For some reason, this uses a different model file (`QG_augment_main.py`), but will use the trained model from `experiments_1_QG_train...` (?)
- `experiments_5_uniq_seq2seq.sh`: Throws out non-unique results.
- `experiments_6_postprocess_seq2seq.sh`: This just filters out duplicate words from the generated answer? Which seems a bit of a... brute-force improvement? ;)
- `experiments_3_repeat_da_de.sh`: This seems to be an alternative, combined pipeline: a combination of `experiments_3_DA_sents2augsents.sh`, `experiments_4_QG_generate_seq2seq.sh`, and `experiments_5_uniq_seq2seq.sh` (all for a certain data index range), PLUS entailment score calculation and filtering. The latter two only take place in this experiment and therefore seem not to be part of the other "pipeline"?
Additionally, the post-generation data filtering section of the paper (3.4) mentions a BERT-based "normal" QA model (in addition to the entailment model) for filtering generated questions. This, however, does not seem to be part of the code? At least `QG_postprocess_seq2seq.py` and `DE_main.py` (used in `experiments_3_repeat_da_de.sh`) don't contain anything like it as far as I can see (the latter does, however, apply some other filtering not mentioned in the paper, like a readability score)?

In summary, the pipeline in `experiments_3_repeat_da_de.sh` seems to represent the more complete path, akin to what's described in the paper? I'll definitely have a look at your repo too, @flackbash, maybe this will help me understand things a bit more clearly.
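Since the sampling step keeps coming up: my mental model of what `experiments_3_DA_sents2augsents.sh` does is a chained sampler, i.e. pick an answer chunk, then a style conditioned on the answer's type, then a clue. The sketch below is purely illustrative; the probability tables, the chunk format, and the conditioning order are my guesses, not the repo's code:

```python
# Toy sketch of chained <a, c, s> sampling at inference time:
# sample an answer chunk, then a style given the answer's NER tag,
# then a clue. All tables and data below are made-up placeholders.

import random

P_STYLE_GIVEN_TAG = {  # toy learned table P(style | answer NER tag)
    "GPE": {"where": 0.8, "what": 0.2},
    "PERSON": {"who": 0.9, "what": 0.1},
}

def sample_style(tag, rng):
    styles = list(P_STYLE_GIVEN_TAG[tag])
    weights = [P_STYLE_GIVEN_TAG[tag][s] for s in styles]
    return rng.choices(styles, weights=weights, k=1)[0]

def sample_acs(chunks, rng):
    """chunks: list of (answer_text, ner_tag, candidate_clues)."""
    answer, tag, clues = rng.choice(chunks)    # sample a
    style = sample_style(tag, rng)             # sample s given a
    clue = rng.choice(clues) if clues else ""  # sample c given a, s
    return answer, clue, style

rng = random.Random(0)  # seeded, so repeated runs give the same output
chunks = [("Manhattan", "GPE", ["based"]),
          ("Alice Smith", "PERSON", ["editor"])]
print(sample_acs(chunks, rng))
```

This would also explain why the sampler misses "obvious" answers on some runs: with a different random state, a different chunk gets drawn.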
I just finished some experiments, and I would now include `experiments_6_postprocess_seq2seq.sh` in the pipeline after all, since I get quite a lot of word duplications in the generated questions (might be brute force, but it seems to get the job done ;) ).

Ah ok, right, the entailment model is also supposed to be BERT-based (but is actually XLNet-based in the code). Thanks for the clarifications :)

Even after a closer look I can't find the QA model anywhere. Maybe they didn't include it in the code? If you find anything, please let me know.
It's true, it does filter out quite a lot! Still feels a bit like "cheating" to me, though, as I feel an adequate generative model shouldn't even make those kinds of errors. ;) At least not that many...

I experimented around a bit more, and I'm now pretty sure the QA model mentioned in the paper is contained in `run_squad.py`. At least, you can train a QA model on SQuAD with it. However, the inference needed for filtering doesn't really seem to be there (maybe it's unfinished?), as there is no implemented way to let a trained model generate answer spans for given inputs. I did some (very ugly) hacking around it to test it on sentence-answer pairs generated via the other experiments, but the results so far are pretty bad. That might also be down to bugs on my end, though.
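For what it's worth, the filtering criterion the paper seems to intend (keep a question only if a trained QA model recovers the sampled answer) would look something like the token-level F1 check used in SQuAD evaluation. Here `qa_predict` is just a stub, since that part isn't really wired up in the repo; a real version would call a model trained via `run_squad.py`:

```python
# Sketch of QA-based filtering: run a QA model on (passage, question),
# compare its predicted span to the sampled answer with token-level F1
# (the SQuAD-style metric), and keep the question above a threshold.

def token_f1(pred, gold):
    pred_toks, gold_toks = pred.lower().split(), gold.lower().split()
    common = 0
    gold_left = list(gold_toks)
    for t in pred_toks:          # count overlapping tokens, with multiplicity
        if t in gold_left:
            common += 1
            gold_left.remove(t)
    if common == 0:
        return 0.0
    precision = common / len(pred_toks)
    recall = common / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

def qa_predict(passage, question):
    """Stub standing in for a trained QA model's predicted answer span."""
    return "in Manhattan"

def keep_question(passage, question, sampled_answer, threshold=0.5):
    return token_f1(qa_predict(passage, question), sampled_answer) >= threshold

print(keep_question("The paper ...", "where is it based ?", "Manhattan"))  # True
```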
Thanks @flackbash and @redfarg for the discussion. It has been helpful for me to get to know the inner workings of the code.

I tried to regenerate the example questions given in Fig. 5 of the paper. A couple of the generated questions were kind of similar, but the quality was still bad. Here is what I did:

- Trained the QG model using `experiments_1_QG_train_seq2seq.sh`
- Used `experiments_3_DA_sents2augsents.sh` to augment the sentences with `<a, c, s>`
- Generated questions using `experiments_4_QG_generate_seq2seq.sh`
- Removed duplicates and did post-processing using `experiments_5_uniq_seq2seq.sh` and `experiments_6_postprocess_seq2seq.sh`

I didn't use `experiments_2_DA_file2sents.sh` and instead manually set the format for the two example sentences.

As far as I understand, the above pipeline is supposed to generate a superset of questions, which is then reduced by filtering. So I examined this superset, without doing the filtering, to see whether it contained the example questions from the figure.

I am yet to experiment with the GPT2 model and will do so soon.

Apart from that, is there anything I am doing wrong or missing? Can you try the two example sentences mentioned in the paper and verify the results?

Thanks.
Hi @oppasource, your pipeline seems right to me. You skipped training and applying the entailment model (in `experiments_1_ET_train.sh` and parts of `experiments_3_repeat_da_de.sh`), but that would only give you an additional feature to filter with, so the main generation part would be the same nonetheless.

I generated questions for the two mentioned sentences as well, and my results also look pretty bad. There are barely any viable questions among the output (and none that come close to the examples in the paper). However, on my side it seems to be at least partially the fault of the input sampler: for example, it never sampled "The New York Amsterdam News", "the United States", or "Manhattan" as possible answers (which, from a human perspective, seem like very obvious candidates). Did you observe better sampled answers?

Plus, if I may ask: how long did you train your QG model for? I trained mine for 10 epochs, which might just be too few?