
Question Generation

This project was originally intended for an AI course at Sofia University. During its execution I was constrained by time and couldn't implement all the ideas I had, but I planned to continue working on it... and I did pick up the topic for my Master's thesis, using T5 Transformers to generate question-answer pairs along with distractors. Check it out in the Question-Generation-Transformers repository.

The approach for identifying keywords used as target answers was accepted at the RANLP2021 conference - Generating Answer Candidates for Quizzes and Answer-Aware Question Generators.

General idea

The idea is to generate multiple-choice answers from text by splitting this complex problem into simpler steps:

  • Identify keywords from the text and use them as answers to the questions.
  • Replace the answer in its sentence with a blank space and use that as the base for the question.
  • Transform the sentence with the blank space into a more question-like sentence.
  • Generate distractors, words that are similar to the answer, as incorrect answers.

(GIF: question generation, step by step)

Installation

Creating a virtual environment (optional)

To avoid conflicts with Python packages from other projects, it is good practice to create a virtual environment in which the packages will be installed. If you do not want to do this, you can skip the next commands and directly install the requirements.txt file.

Create a virtual environment:

python -m venv venv

Enter the virtual environment:

Windows:

. .\venv\Scripts\activate

Linux or MacOS:

source venv/bin/activate

Register the venv as an IPython kernel:

ipython kernel install --user --name=.venv

Install jupyter lab inside the venv:

pip install jupyterlab

Installing packages

pip install -r requirements.txt

Run jupyter

jupyter lab

Execution

Data Exploration

Before I could do anything, I wanted to understand more about how questions are made and what kinds of words their answers are.

I used the SQuAD 1.0 dataset, which has about 100 000 questions generated from Wikipedia articles.

You can read about the insights I've found in the Data Exploration jupyter notebook.

Identifying answers

My assumption was that words from the text would be great answers for questions. All I needed to do was to decide which words, or short phrases, are good enough to become answers.

I decided to do a binary classification on each word from the text. spaCy really helped me with the word tagging.

Feature engineering

I pretty much needed to create the entire dataset for the binary classification. I extracted each non-stop word from the paragraphs of each question in the SQuAD dataset and added features to it, such as:

  • Part of speech
  • Is it a Named entity
  • Are only alpha characters used
  • Shape - whether it contains only alpha characters, digits, or punctuation (xxxx, dddd, Xxx X. Xxxx)
  • Word count

And the label isAnswer - whether the word extracted from the paragraph is the same as, and in the same position as, the answer of the SQuAD question.
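To make the feature list concrete, here is a minimal sketch of word-level feature extraction with spaCy. The function name and exact feature columns are illustrative, not necessarily what the notebooks use.

import spacy

nlp = spacy.load("en_core_web_sm")

def extract_word_features(text):
    # one row of features per non-stop word, mirroring the list above
    rows = []
    for token in nlp(text):
        if token.is_stop or token.is_punct:
            continue
        rows.append({
            "word": token.text,
            "pos": token.pos_,                       # part of speech
            "isNamedEntity": token.ent_type_ != "",  # is it a named entity
            "isAlpha": token.is_alpha,               # only alpha characters used
            "shape": token.shape_,                   # e.g. xxxx, dddd, Xxx
            "wordCount": len(token.text.split()),    # 1 for single tokens, more for phrases
        })
    return rows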

Some other features, like the TF-IDF score and cosine similarity to the title, would be great, but I didn't have the time to add them.

Other than those, it's up to our imagination to create new features - maybe whether the word is at the start, middle or end of a sentence, information about the words surrounding it, and more... Though before adding more features, it would be nice to have a metric to assess whether a feature is going to be useful or not.

Model training

I found the problem similar to spam filtering, where a common approach is to tag each word of an email as coming from a spam or not a spam email.

I used scikit-learn's Gaussian Naive Bayes algorithm to classify whether each word is an answer.

The results were surprisingly good - at a quick glance, the algorithm classified most of the words as answers. The ones it didn't were in fact unfit.

The cool thing about Naive Bayes is that you get the probability for each word. In the demo I've used that to order the words from the most likely answer to the least likely.
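As an illustration, here is a minimal sketch of this step with scikit-learn, assuming a pandas DataFrame of the features above with an isAnswer label. The function and column names are placeholders, not the ones used in the notebooks.

import pandas as pd
from sklearn.naive_bayes import GaussianNB

def train_and_rank(train_df, words_df):
    # one-hot encode the categorical features; "isAnswer" is the label described above
    X_train = pd.get_dummies(train_df.drop(columns=["isAnswer", "word"]))
    y_train = train_df["isAnswer"]

    clf = GaussianNB()
    clf.fit(X_train, y_train)

    # encode the candidate words the same way and align them to the training columns
    X_new = pd.get_dummies(words_df.drop(columns=["word"]))
    X_new = X_new.reindex(columns=X_train.columns, fill_value=0)

    # probability of the positive class (classes_ is [False, True] for a boolean label)
    probabilities = clf.predict_proba(X_new)[:, 1]

    # order the words from the most likely answer to the least likely, as in the demo
    return words_df.assign(probability=probabilities).sort_values("probability", ascending=False)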

Creating questions

Another assumption I had was that the sentence of an answer could easily be turned into a question. Just by placing a blank space in the position of the answer in the text, I get a "cloze" question (a sentence with a blank space for the missing word).

Answer: Oxygen

Question: _____ is a chemical element with symbol O and atomic number 8.

I decided it wasn't worth it to transform the cloze question to a more question-looking sentence, but I imagine it could be done with a seq2seq neural network, similarly to the way text is translated from one language to another.
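The cloze step itself is tiny. A minimal sketch (the function name is just for illustration):

def make_cloze_question(sentence, answer, blank="_____"):
    # replace the answer with a blank inside its own sentence
    return sentence.replace(answer, blank)

print(make_cloze_question(
    "Oxygen is a chemical element with symbol O and atomic number 8.",
    "Oxygen"))
# _____ is a chemical element with symbol O and atomic number 8.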

Generating incorrect answers

This part turned out really well.

For each answer, I generate its most similar words using word embeddings and cosine similarity.

(Figure: most similar words to "oxygen")

Most of the words are just fine and could easily be mistaken for the correct answer. But there are some which are obviously not appropriate.

Since I didn't have a dataset with incorrect answers I fell back on a more classical approach.

I removed the words that weren't the same part of speech or the same named entity as the answer, and added some more context from the question.
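Here is a rough sketch of the distractor step, assuming a pre-trained gensim KeyedVectors model is available (the notebooks may load different embeddings) and leaving out the extra context from the question. most_similar returns the nearest words by cosine similarity, and the filtering mirrors the part-of-speech and named-entity check described above.

import spacy
from gensim.models import KeyedVectors

nlp = spacy.load("en_core_web_sm")

def generate_distractors(answer, vectors, count=3):
    # nearest neighbours of the answer by cosine similarity
    # (assumes the answer is in the embedding vocabulary)
    candidates = vectors.most_similar(answer.lower(), topn=20)
    answer_token = nlp(answer)[0]

    distractors = []
    for word, similarity in candidates:
        token = nlp(word)[0]
        # keep only candidates with the same part of speech and entity type as the answer
        if token.pos_ == answer_token.pos_ and token.ent_type_ == answer_token.ent_type_:
            distractors.append(word)
        if len(distractors) == count:
            break
    return distractors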

I would like to find a dataset with multiple-choice answers and see if I can create an ML model for generating better incorrect answers.

Results

After adding the Demo project, the generated questions aren't really fit to go into a classroom instantly, but they aren't bad either.

The cool thing is the simplicity and modularity of the approach: you can find where it's doing badly (say, it wrongly classifies verbs as answers) and plug a fix into it.

Having a complex neural network (like all the papers on the topic do) would probably do better, especially in the age we're living in. But the great thing I found out about this approach is that it's like a gateway for a software engineer, with a software engineering mindset, to get into the field of AI and see meaningful results.

Future work (updated)

I find this topic quite interesting and full of potential, and I will probably continue working in this field.

I even enrolled in a Master's in Data Mining and will probably do some similar projects. I will link anything useful here.

I've already put some more time into finishing the project, but I would like to transform it further into a tutorial about getting into the field of AI, while keeping the ability to easily extend it with new custom features.

Updates

Update - 29.12.19: The repository has become pretty popular, so I added a new notebook (Demo.ipynb) that combines all the modules and generates questions for any text. I reordered the other notebooks and documented the code (a bit better).

Update - 09.03.21: Added a requirements.txt file with instructions to run a virtual environment and fixed a bug with ValueError: operands could not be broadcast together with shapes (230, 121) (83,).

I have also started working on my Master's thesis with a similar topic of Question Generation.

Update - 27.10.21: I have uploaded the code for my Master's thesis in the Question-Generation-Transformers repository. I highly encourage you to check it out.

Additionally, the approach of using a classifier to pick the answer candidates was accepted as a student paper at the RANLP2021 conference. Paper here.


question-generation's Issues

ValueError: operands could not be broadcast together with shapes (230, 121) (83, )

So first of all, I cloned your repo to try out the code. The Demo.ipynb worked fine as expected (without any change to the input text). However, after giving my own text as input, I am getting an error in sklearn's naive_bayes file.
The error:

(screenshot of the traceback)

I tried a lot to debug the issue; however, since I am new to ML and data science in general, I was not able to get it fixed. Also, the error is in sklearn's files, so I am not sure what's causing it. I even tried downgrading my Python and all the modules to match yours and rebuilt the model, but it still gave the same error message in Predict.ipynb. I saw a lot of people facing the same issue in the 'Issues' section of your repo.

It would be greatly helpful if you could point me in the right direction so that I can resolve this issue.
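One common cause of this kind of broadcast error is a mismatch between the one-hot encoded feature columns produced for the new text and the columns the pickled model was trained on. That is only a guess here, not a confirmed diagnosis; a minimal sketch of aligning the prediction features to the training columns (the train_columns list is hypothetical and would need to be saved alongside the model):

import pandas as pd

def align_features(words_df, train_columns):
    # add missing columns as zeros and drop unseen ones, keeping the training column order
    return words_df.reindex(columns=train_columns, fill_value=0)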

Database

Hello, could you add a database file or SQL script that would help store the created MCQs in a database?

ValueError: operands could not be broadcast together with shapes (2448,122) (121,)

Hi,
I cloned your project and made some changes, like reading from PDF as well as TXT files dynamically, and wrote an API. But when I try to read a PDF file and process it, I get an error: ValueError: operands could not be broadcast together with shapes (2448,122) (121,).
I have seen this same issue in the resolved issues section and tried the requirements provided there, but I was unable to resolve it.

Thanks and Regards,
Manikantha Sekhar...

Regarding Predicting Accuracy [ y_test vs y_pred ]

On training:

y_test.value_counts() is:
False    828
True      40

i.e. 40 of 868 are marked 'True' in the test set.

y_pred_series.value_counts() is:
True     657
False    211

That shows 657 values have been marked True, which reflects a lot of predicted True values and, in fact, a lot of false positives (619), i.e. 657 = 38 (actual True, i.e. TP) + 619 (predicted True but actually False, i.e. FP).

In fact the confusion_matrix(y_test, y_pred) shows the same:
array([[209, 619],
[ 2, 38]])

It's quite clear that the number of correct answers predicted is very high, i.e. 38/40.

In light of the above results:
a. I am unable to understand your following conclusion, as the overall accuracy, considering the True Negatives (209), is not good.

"Seems like I'm super biased for towards correct answers." But as I found during the Data exploration, there are a lot more answer-worthy words that are just not labeled since, I guess the Mechanical Turks had the job to label just 5. So, who knows, maybe I did some black magic and managed to extract all the answer worthy words!

b. Secondly, can you elaborate on what your aim was in regards to predicting 'isAnswer'? Is it related to finding only the relevant answers of the original dataset, i.e. the 40? Or did you intend to extract new words which could serve as answers for more questions, and if so, which part of the confusion matrix results do you consider to be potential relevant answers beyond what is provided in the dataset?
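For reference, the metrics implied by the confusion matrix quoted above can be computed directly (an illustrative snippet, not part of the notebooks):

import numpy as np

# rows = actual [False, True], columns = predicted [False, True]
cm = np.array([[209, 619],
               [  2,  38]])

tn, fp = cm[0]
fn, tp = cm[1]

recall = tp / (tp + fn)          # 38 / 40  = 0.95
precision = tp / (tp + fp)       # 38 / 657 ≈ 0.058
accuracy = (tp + tn) / cm.sum()  # 247 / 868 ≈ 0.285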

Linux Problem

I am wondering why the code does not generate any "incorrect answers" when run on Linux.
I tried the same code on Windows and on Linux; Windows works fine, but Linux gives none. Any idea?

Proper Documentation needed.

Proper documentation is needed to execute this project. What are the requirements? How should one execute it?

ValueError: operands could not be broadcast together with shapes (60,121) (77,)

Getting a ValueError inside the predictWords function at line 6,

y_pred = predictor.predict_proba(wordsDf)

and inside the naive_bayes.py

<ipython-input-6-55c8aeacc0c5> in predictWords(wordsDf, df)
      4     predictor = loadPickle(predictorPickleName)
      5 
----> 6     y_pred = predictor.predict_proba(wordsDf)
      7 
      8     labeledAnswers = []

~/env/lib/python3.6/site-packages/sklearn/naive_bayes.py in predict_log_proba(self, X)
     95         check_is_fitted(self)
     96         X = self._check_X(X)
---> 97         jll = self._joint_log_likelihood(X)
     98         # normalize by P(x) = P(f_1, ..., f_n)
     99         log_prob_x = logsumexp(jll, axis=1)

~/env/lib/python3.6/site-packages/sklearn/naive_bayes.py in _joint_log_likelihood(self, X)
    449             jointi = np.log(self.class_prior_[i])
    450             n_ij = - 0.5 * np.sum(np.log(2. * np.pi * self.sigma_[i, :]))
--> 451             n_ij -= 0.5 * np.sum(((X - self.theta_[i, :]) ** 2) /
    452                                  (self.sigma_[i, :]), 1)
    453             joint_log_likelihood.append(jointi + n_ij)

Requirements used:
attrs==19.3.0
backcall==0.1.0
bleach==3.1.1
blis==0.4.1
boto==2.49.0
boto3==1.12.5
botocore==1.15.5
catalogue==1.0.0
certifi==2019.11.28
chardet==3.0.4
cycler==0.10.0
cymem==2.0.3
decorator==4.4.1
defusedxml==0.6.0
docutils==0.15.2
en-core-web-md==2.2.5
en-core-web-sm==2.2.5
entrypoints==0.3
gensim==3.8.1
idna==2.9
importlib-metadata==1.5.0
ipykernel==5.1.4
ipython==7.12.0
ipython-genutils==0.2.0
ipywidgets==7.5.1
jedi==0.16.0
Jinja2==2.11.1
jmespath==0.9.4
joblib==0.14.1
jsonschema==3.2.0
jupyter==1.0.0
jupyter-client==5.3.4
jupyter-console==6.1.0
jupyter-core==4.6.3
kiwisolver==1.1.0
MarkupSafe==1.1.1
matplotlib==3.1.3
mistune==0.8.4
murmurhash==1.0.2
nbconvert==5.6.1
nbformat==5.0.4
nltk==3.4.5
notebook==6.0.3
numpy==1.18.1
pandas==1.0.1
pandocfilters==1.4.2
parso==0.6.1
pexpect==4.8.0
pickleshare==0.7.5
pkg-resources==0.0.0
plac==1.1.3
preshed==3.0.2
prometheus-client==0.7.1
prompt-toolkit==3.0.3
ptyprocess==0.6.0
Pygments==2.5.2
pyparsing==2.4.6
pyrsistent==0.15.7
python-dateutil==2.8.1
pytz==2019.3
pyzmq==18.1.1
qtconsole==4.6.0
requests==2.23.0
s3transfer==0.3.3
scikit-learn==0.22.1
scipy==1.4.1
Send2Trash==1.5.0
six==1.14.0
sklearn==0.0
smart-open==1.9.0
spacy==2.2.3
srsly==1.0.1
terminado==0.8.3
testpath==0.4.4
thinc==7.3.1
tornado==6.0.3
tqdm==4.43.0
traitlets==4.3.3
urllib3==1.25.8
wasabi==0.6.0
wcwidth==0.1.8
webencodings==0.5.1
widgetsnbextension==3.5.1
zipp==3.0.0
