bangliu / acs-qg
License: MIT License
Approximate wait time for the code?
Apologies if this request is not suitable for an issue.
I'm trying to train/reproduce this model (or rather collection of models), but I'm having a hard time gathering all the necessary datasets. In particular, I cannot find any dataset matching "SQuAD1.1-Zhou". I was only able to find this version of SQuAD1.1, but it doesn't seem to fit the expected input format: the preprocessor in FQG_data.py expects tab-separated lines of at least 10 fields, whereas this dataset only has 4.
Could anyone give me a pointer to the right dataset?
EDIT: Now also concerning the SQuAD2.0 data set, see my comment below.
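For anyone comparing candidate files against the 10-field expectation, here is a minimal sketch that counts tab-separated fields per line. The helper name field_count_histogram and the sample content are made up for illustration; this is not code from the repository.

```python
from collections import Counter
import io


def field_count_histogram(lines, sep="\t"):
    """Count how many lines have each number of separator-delimited fields."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        counts[len(line.split(sep))] += 1
    return counts


# Tiny in-memory stand-in for a dataset file (made-up content, 4 fields per line).
sample = io.StringIO("a\tb\tc\td\ne\tf\tg\th\n")
hist = field_count_histogram(sample)
print(dict(hist))  # {4: 2}
```

With a real file, pass the open file object instead of the in-memory sample; a histogram with no key of 10 or more would suggest the file is not in the format the preprocessor expects.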
I had several issues with this code. Here are the conditions under which they appeared and how I solved them.
The code uses benepar and needs the benepar_en2 (or benepar_en2.gz) model, so we must first run a setup file like this:
import nltk, benepar
nltk.download('punkt')
benepar.download('benepar_en2')
After that, when I run python config.py, an error says we must change several statements in the benepar library, in /usr/local/lib/<python_version>/dist-packages/benepar/base_parser.py. I changed them to this (I prefer the nano editor for editing files in the root or nltk directories):
graph = tf.Graph()
graph_def = tf.compat.v1.GraphDef()
Then another error appears, and we must disable TF2 behavior like this:
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
This works for me.
But we should think twice about it, because a future update of the benepar library could bring this issue back.
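Instead of editing the file by hand, the same replacement could be scripted. This is only a sketch: it assumes the original line in base_parser.py calls tf.GraphDef(), which I have not checked against every benepar version, and patch_graph_def is a made-up helper name.

```python
def patch_graph_def(source: str) -> str:
    """Rewrite TF1-style tf.GraphDef() calls to tf.compat.v1.GraphDef().

    Already-patched text does not contain the literal "tf.GraphDef()",
    so running the function twice leaves the result unchanged.
    """
    return source.replace("tf.GraphDef()", "tf.compat.v1.GraphDef()")


# Assumed shape of the original statements (not copied from benepar).
original = "graph = tf.Graph()\ngraph_def = tf.GraphDef()\n"
patched = patch_graph_def(original)
print(patched)
```

To apply it for real, you would read base_parser.py, pass its text through the function, and write it back, ideally after keeping a backup copy.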
Because we're using GloVe pretrained vectors with load_word2vec_format, we must first convert them from GloVe to word2vec format, like this:
import gensim
from gensim.scripts.glove2word2vec import glove2word2vec
...
GLOVE_TXT_PATH = DATA_PATH + "original/Glove/glove.6B.300d.txt"
GLOVE_OUT_PATH = DATA_PATH + "original/Glove/word2vec.txt"  # any output file name you want
...
glove2word2vec(GLOVE_TXT_PATH, GLOVE_OUT_PATH)
# glove2word2vec writes a text file, so load it with binary=False
GLOVE = gensim.models.KeyedVectors.load_word2vec_format(GLOVE_OUT_PATH, binary=False)
# or, only if the vectors were saved in word2vec binary format:
# GLOVE = gensim.models.KeyedVectors.load_word2vec_format(GLOVE_OUT_PATH, binary=True, encoding='utf-8', unicode_errors='ignore')
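The reason the conversion is needed: the word2vec text format starts with a header line giving the vocabulary size and the vector dimension, and GloVe text files do not have it. A minimal sketch of that difference, using made-up toy vectors and a hypothetical helper name (it is not gensim's implementation):

```python
def glove_to_word2vec_lines(glove_lines):
    """Prepend the '<vocab_size> <dim>' header that word2vec text format expects."""
    rows = [line.rstrip("\n") for line in glove_lines if line.strip()]
    if not rows:
        raise ValueError("no vectors found")
    dim = len(rows[0].split(" ")) - 1  # the first field is the token itself
    return [f"{len(rows)} {dim}"] + rows


# Two toy 3-dimensional vectors in GloVe text layout.
toy = ["the 0.1 0.2 0.3\n", "cat 0.4 0.5 0.6\n"]
converted = glove_to_word2vec_lines(toy)
print(converted[0])  # 2 3
```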
Also, don't forget to create the folders mentioned in common/constants.py, such as:
1. FQG/src/model/Fsctorized
2. FQG/output
3. FQG/output/checkpoint
4. FQG/output/figure
5. FQG/output/log
6. FQG/output/pkl
7. FQG/output/result
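The output folders can be created in one go. A small sketch, assuming the layout listed above; make_output_dirs is a made-up helper, and the folder under FQG/src/model is left out because it is not an output folder.

```python
from pathlib import Path
import tempfile

OUTPUT_SUBDIRS = ["checkpoint", "figure", "log", "pkl", "result"]


def make_output_dirs(project_root):
    """Create <project_root>/output and its subfolders; existing ones are kept."""
    created = []
    for name in OUTPUT_SUBDIRS:
        path = Path(project_root) / "output" / name
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


# Demo in a temporary directory, so nothing is written into a real checkout.
with tempfile.TemporaryDirectory() as tmp:
    dirs = make_output_dirs(Path(tmp) / "FQG")
    all_exist = all(d.is_dir() for d in dirs)
print(all_exist)
```

In a real checkout you would call make_output_dirs("FQG") from the repository root.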
If we have a CUDA device, I suggest using whichever CUDA device is available, like
DEVICE = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
or
DEVICE = torch.device("cuda:1" if torch.cuda.is_available() else "cpu")
There are also several common problems that have already been solved, such as in:
model/encoder.py - in line 50
model/beam_searcher.py - in line 138
Because spaCy doesn't support the "en" name anymore, we must change it to "en_core_web_sm".
Don't forget to download that model with this command, as in the reference here:
python -m spacy download en_core_web_sm
After installing allennlp==0.8.3, this import may not work:
from allennlp.modules.elmo import batch_to_ids
and an error like this one appears:
TypeError: Highway.forward: return type <class 'torch.Tensor'> is not a <class 'NoneType'>.
To solve it, change the version of overrides as in this reference:
pip uninstall overrides
pip install overrides==3.1.0
It works fine for me.
There's an error in /util/nlp_utils.py at line 225, because GLOVE is our model loaded as KeyedVectors after the GloVe to word2vec conversion mentioned in Issue 1. Change
if token in GLOVE.vocab:
    token_in_glove = token
elif token.lower() in GLOVE.vocab:
    token_in_glove = token.lower()
to
if token in GLOVE.key_to_index:
    token_in_glove = token
elif token.lower() in GLOVE.key_to_index:
    token_in_glove = token.lower()
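If the code has to run with both older and newer gensim versions, the lookup could be wrapped so it tries key_to_index first and falls back to vocab. This is a sketch with made-up helper names and stand-in classes instead of real KeyedVectors objects:

```python
def in_vocab(vectors, token):
    """Return True if token is known, for old-style or new-style KeyedVectors."""
    index = getattr(vectors, "key_to_index", None)  # newer gensim
    if index is None:
        index = vectors.vocab  # older gensim
    return token in index


def lookup_form(vectors, token):
    """Return the token form found in the vocabulary, or None if absent."""
    if in_vocab(vectors, token):
        return token
    if in_vocab(vectors, token.lower()):
        return token.lower()
    return None


# Stand-ins for the two KeyedVectors styles (illustration only).
class NewStyle:
    key_to_index = {"cat": 0}


class OldStyle:
    vocab = {"cat": object()}


print(lookup_form(NewStyle(), "Cat"), lookup_form(OldStyle(), "cat"))
```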
I ran into many issues when using run_glue.py.
My best solution was to not take any risk with local_rank,
because with it I couldn't use CUDA on a single device.
So I deactivated all the code that caused problems,
including the parts about loading config, tokenizer, and model from pretrained data.
I recommend downloading the pretrained data from the internet;
when we are not able to load the model offline, this is the best solution so far.
We should consider using code like this:
MODEL_CLASSES = {
    'bert': (BertConfig, BertForSequenceClassification, BertTokenizer, BertModel, 'bert-base-uncased'),
    'xlnet': (XLNetConfig, XLNetForSequenceClassification, XLNetTokenizer, XLNetModel, 'xlnet-base-cased'),
    'xlm': (XLMConfig, XLMForSequenceClassification, XLMTokenizer, XLMModel, 'xlm-mlm-enfr-1024'),
}
....
config_class, model_sequence_class, tokenizer_class, model_class, model_name = MODEL_CLASSES[args.model_type]
...
config = config_class.from_pretrained(model_name)
tokenizer = tokenizer_class.from_pretrained(model_name)
model = model_sequence_class(config)
or
model = model_class.from_pretrained(model_name)
But there were several errors during training,
so we must use code like:
from torch import nn
from pytorch_transformers import XLNetConfig, XLNetModel
config = XLNetConfig.from_pretrained("xlnet-base-cased")
if config.n_token <= 0 or config.d_model <= 0:
    # Modify config.n_token and config.d_model to be positive integers
    config.n_token = 32000
    config.d_model = 768
model = XLNetModel(config)
After I made this change, there were no more conflicts.
Loading the pretrained data offline sometimes causes several errors,
because some metadata is missing and I can't find it.
Please give me a proper link to the metadata that I can download.
Here are several links that I can use so far:
When I debug run_glue.py, several errors appear around this code:
inputs = {'input_ids': batch[0],
          'attention_mask': batch[1],
          'token_type_ids': batch[2] if args.model_type in ['bert', 'xlnet'] else None}  # XLM doesn't use segment_ids
outputs = model(**inputs)
loss = outputs[0]  # model outputs are always tuples in pytorch-transformers (see doc)
The error message, "forward() got an unexpected keyword argument 'labels'", indicates that the forward method of the model is being passed a keyword argument, labels, that it does not accept. The argument comes from the inputs dictionary.
To solve this, make sure the forward method accepts every key in the inputs dictionary. If labels is not needed, remove it from the inputs dictionary before calling the model, along with the loss calculation that depends on it. If labels is needed, add it to the signature of the forward method.
You also need to change how the outputs variable is used.
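One way to avoid passing unexpected keys is to filter the inputs dictionary against the signature of forward before the call. A minimal sketch with a plain function standing in for model.forward; filter_kwargs is a made-up helper, not part of pytorch-transformers:

```python
import inspect


def filter_kwargs(func, kwargs):
    """Keep only the keyword arguments that func's signature accepts."""
    params = inspect.signature(func).parameters
    # A **kwargs parameter accepts anything, so nothing needs to be dropped.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(kwargs)
    return {k: v for k, v in kwargs.items() if k in params}


# Stand-in for a model's forward method that does not accept 'labels'.
def forward(input_ids, attention_mask=None, token_type_ids=None):
    return (input_ids,)


inputs = {"input_ids": [1, 2], "attention_mask": [1, 1], "labels": [0]}
safe_inputs = filter_kwargs(forward, inputs)
outputs = forward(**safe_inputs)
print(sorted(safe_inputs))  # ['attention_mask', 'input_ids']
```

With a real model you would pass model.forward as func. Note that dropping labels also means the model no longer returns a loss, so it would have to be computed separately.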
If we get an error like
raise RuntimeError("grad can be implicitly created only for scalar outputs")
we change the code like this:
loss = outputs[0].mean()
The error message "grad can be implicitly created only for scalar outputs" means that the loss should be a scalar tensor, but it seems to be a tensor of shape (batch_size,), which is not a scalar tensor. You'll need to reduce the tensor to a scalar by taking the mean over the batch_size.
After that, if we hit this error:
AttributeError: 'tuple' object has no attribute 'detach'
we must change the code like this:
outputs = model(**inputs)
logits = outputs[0]
preds = logits.detach().cpu().numpy()
The error message suggests that the output from the model is a tuple, but the .detach() method is only supported for tensors, not tuples. To fix this, you need to extract the tensor component from the tuple that you want to use for computing the preds.
Note that the specific index used for outputs[0] may vary depending on the structure of the model output, so you may need to modify this based on your specific model.
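Since the output may or may not be a tuple depending on the model, the extraction could be wrapped in a small guard. A sketch with a made-up helper name, using plain Python values in place of tensors:

```python
def first_output(outputs):
    """Return the first element of a tuple/list output, or the object itself."""
    if isinstance(outputs, (tuple, list)):
        return outputs[0]
    return outputs


# Plain values stand in for tensors here.
print(first_output(("logits", "hidden_states")))  # logits
print(first_output("logits"))                     # logits
```

In the evaluation loop this would read logits = first_output(model(**inputs)), followed by the detach call as above. Index 0 is an assumption about where the logits sit, as noted.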
I would expect the tests to pass.
pip install torch==1.8.0+cu111 torchvision==0.9.0+cu111 torchaudio==0.8.0 -f https://download.pytorch.org/whl/torch_stable.html
Please let me know if there is any more information I can provide.
Can you please share the requirements.txt content? Or dump and share the pip freeze output here?
I can't get requirements.txt to install: some Python packages require one version of Python, while others require another.
I hit a roadblock while installing the requirements when I could not find the de-core-news-sm package on the PyPI website.
Can you please add some more information on which Python version is required? which OS tool has been tested on? etc.
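In case it helps whoever has a working setup, here is a rough sketch of a report with the Python version, the platform, and the installed packages, similar in spirit to pip freeze. It uses only the standard library; environment_report is a made-up name.

```python
import platform
import sys
from importlib import metadata


def environment_report():
    """List the Python version, the platform, and installed package versions."""
    lines = [
        f"python=={platform.python_version()}",
        f"platform=={sys.platform}",
    ]
    packages = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions with broken metadata
            packages.add(f"{name}=={dist.version}")
    return lines + sorted(packages, key=str.lower)


report = environment_report()
print("\n".join(report[:5]))
```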
When will the code be provided?
I'm trying to reproduce the results of the associated paper, but I have trouble making sense of how the code fits the text. In the paper, the pipeline seems fairly straightforward (fig. 2, p. 3). A QA data set is augmented to obtain ACS-aware datasets. With these, a QG model is trained and, in a third step, its results are refined via filtering.
In the code, various "experiments" exist, some resembling certain steps of the pipeline in the paper. However, what I don't really understand: one of the first experiments, experiments_1_QG_train_seq2seq.sh, trains a QG model via QG_main.py, but without the augmented data. Is this just for comparison (e.g. as a baseline)?
A later experiment, experiments_3_repeat_da_de.sh, seems closer to the pipeline in the paper. Augmented data is created and then used for another QG model, this one in QG_augment_main.py. However, this model doesn't actually seem to be trained at all. In the code, it only gets tested (see here). So where does the training step of a model with augmented data actually take place? Or am I just missing something?
Apologies if this is not really fitting for an issue. And great work on the paper!
Change:
father_beam_idx = best_output_accumulate_scores_id / vocab_plus_input_size
To:
father_beam_idx = best_output_accumulate_scores_id // vocab_plus_input_size
This ensures the father_beam_idx tensor contains integers and thus valid index values.
Otherwise the tensor holds float values, which leads to an error in the index_select() method in Torch used in the same function.
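The difference is easy to see with plain Python integers. This sketch only illustrates the arithmetic; it assumes the flat index is laid out as beam index times vocab_plus_input_size plus token index, and split_flat_index is a made-up helper rather than code from beam_searcher.py:

```python
def split_flat_index(flat_index, vocab_plus_input_size):
    """Recover (beam index, token index) from a flattened score index."""
    beam_idx = flat_index // vocab_plus_input_size  # floor division keeps an int
    token_idx = flat_index % vocab_plus_input_size
    return beam_idx, token_idx


print(split_flat_index(7, 3))  # (2, 1)
print(type(7 / 3).__name__)    # float
print(type(7 // 3).__name__)   # int
```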
Where can I get the en.wiki.bpe.op50000.model model?
Could you please share generated <q, a, c, s, p> results?