
Question Generation

This project was originally intended for an AI course at Sofia University. During its execution I was constrained by time and couldn't implement all the ideas I had, but I planned to continue working on it... and I did pick up the topic for my Master's thesis, using T5 Transformers to generate question-answer pairs along with distractors. Check it out in the Question-Generation-Transformers repository.

The approach for identifying keywords used as target answers was accepted at the RANLP2021 conference - Generating Answer Candidates for Quizzes and Answer-Aware Question Generators.

General idea

The idea is to generate multiple-choice answers from text by splitting this complex problem into simpler steps:

  • Identify keywords from the text and use them as answers to the questions.
  • Replace the answer in its sentence with a blank space and use that as the base for the question.
  • Transform the sentence with the blank space into a more question-like sentence.
  • Generate distractors, words that are similar to the answer, as incorrect answers.

(GIF: question generation, step by step)

Installation

Creating a virtual environment (optional)

To avoid conflicts with Python packages from other projects, it is good practice to create a virtual environment in which the packages will be installed. If you do not want to do this, you can skip the next commands and directly install the requirements.txt file.

Create a virtual environment:

python -m venv venv

Enter the virtual environment:

Windows:

. .\venv\Scripts\activate

Linux or MacOS:

source venv/bin/activate

Register the venv as an IPython kernel:

ipython kernel install --user --name=.venv

Install jupyter lab inside the venv:

pip install jupyterlab

Installing packages

pip install -r requirements.txt

Run jupyter

jupyter lab

Execution

Data Exploration

Before I could do anything, I wanted to understand more about how questions are made and what kinds of words their answers are.

I used the SQuAD 1.0 dataset, which has about 100 000 questions generated from Wikipedia articles.

You can read about the insights I've found in the Data Exploration jupyter notebook.

Identifying answers

My assumption was that words from the text would be great answers for questions. All I needed to do was to decide which words, or short phrases, are good enough to become answers.

I decided to do a binary classification on each word from the text. spaCy really helped me with the word tagging.

Feature engineering

I pretty much needed to create the entire dataset for the binary classification. I extracted each non-stop word from the paragraphs of each question in the SQuAD dataset and added features to it, such as:

  • Part of speech
  • Is it a Named entity
  • Are only alpha characters used
  • Shape - whether it contains only alpha characters, digits, or punctuation (xxxx, dddd, Xxx X. Xxxx)
  • Word count

And the label isAnswer - whether the word extracted from the paragraph is the same as, and in the same position as, the answer of the SQuAD question.
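To make the feature list concrete, here is a minimal sketch of word-level feature extraction with spaCy. The function name and exact feature columns are illustrative, not necessarily what the notebooks use.

import spacy

nlp = spacy.load("en_core_web_sm")

def extract_word_features(text):
    # one row of features per non-stop word, mirroring the list above
    rows = []
    for token in nlp(text):
        if token.is_stop or token.is_punct:
            continue
        rows.append({
            "word": token.text,
            "pos": token.pos_,                       # part of speech
            "isNamedEntity": token.ent_type_ != "",  # is it a named entity
            "isAlpha": token.is_alpha,               # only alpha characters used
            "shape": token.shape_,                   # e.g. xxxx, dddd, Xxx
            "wordCount": len(token.text.split()),    # 1 for single tokens, more for phrases
        })
    return rows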

Some other features, like the TF-IDF score and cosine similarity to the title, would be great, but I didn't have the time to add them.

Other than those, it's up to our imagination to create new features - maybe whether the word is at the start, middle or end of a sentence, information about the words surrounding it, and more... Though before adding more features, it would be nice to have a metric to assess whether a feature is going to be useful or not.

Model training

I found the problem similar to spam filtering, where a common approach is to tag each word of an email as coming from a spam or not a spam email.

I used scikit-learn's Gaussian Naive Bayes algorithm to classify whether each word is an answer.

The results were surprisingly good - at a quick glance, the algorithm classified most of the words as answers. The ones it didn't were in fact unfit.

The cool thing about Naive Bayes is that you get the probability for each word. In the demo I've used that to order the words from the most likely answer to the least likely.
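As an illustration, here is a minimal sketch of this step with scikit-learn, assuming a pandas DataFrame of the features above with an isAnswer label. The function and column names are placeholders, not the ones used in the notebooks.

import pandas as pd
from sklearn.naive_bayes import GaussianNB

def train_and_rank(train_df, words_df):
    # one-hot encode the categorical features; "isAnswer" is the label described above
    X_train = pd.get_dummies(train_df.drop(columns=["isAnswer", "word"]))
    y_train = train_df["isAnswer"]

    clf = GaussianNB()
    clf.fit(X_train, y_train)

    # encode the candidate words the same way and align them to the training columns
    X_new = pd.get_dummies(words_df.drop(columns=["word"]))
    X_new = X_new.reindex(columns=X_train.columns, fill_value=0)

    # probability of the positive class (classes_ is [False, True] for a boolean label)
    probabilities = clf.predict_proba(X_new)[:, 1]

    # order the words from the most likely answer to the least likely, as in the demo
    return words_df.assign(probability=probabilities).sort_values("probability", ascending=False)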

Creating questions

Another assumption I had was that the sentence of an answer could easily be turned into a question. Just by placing a blank space in the position of the answer in the text, I get a "cloze" question (a sentence with a blank space for the missing word).

Answer: Oxygen

Question: _____ is a chemical element with symbol O and atomic number 8.

I decided it wasn't worth it to transform the cloze question to a more question-looking sentence, but I imagine it could be done with a seq2seq neural network, similarly to the way text is translated from one language to another.
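The cloze step itself is tiny. A minimal sketch (the function name is just for illustration):

def make_cloze_question(sentence, answer, blank="_____"):
    # replace the answer with a blank inside its own sentence
    return sentence.replace(answer, blank)

print(make_cloze_question(
    "Oxygen is a chemical element with symbol O and atomic number 8.",
    "Oxygen"))
# _____ is a chemical element with symbol O and atomic number 8.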

Generating incorrect answers

This part turned out really well.

For each answer, I generate its most similar words using word embeddings and cosine similarity.

(Figure: most similar words to "oxygen")

Most of the words are just fine and could easily be mistaken for the correct answer. But there are some which are obviously not appropriate.

Since I didn't have a dataset with incorrect answers I fell back on a more classical approach.

I removed the words that weren't the same part of speech or the same named entity as the answer, and added some more context from the question.
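Here is a rough sketch of the distractor step, assuming a pre-trained gensim KeyedVectors model is available (the notebooks may load different embeddings) and leaving out the extra context from the question. most_similar returns the nearest words by cosine similarity, and the filtering mirrors the part-of-speech and named-entity check described above.

import spacy
from gensim.models import KeyedVectors

nlp = spacy.load("en_core_web_sm")

def generate_distractors(answer, vectors, count=3):
    # nearest neighbours of the answer by cosine similarity
    # (assumes the answer is in the embedding vocabulary)
    candidates = vectors.most_similar(answer.lower(), topn=20)
    answer_token = nlp(answer)[0]

    distractors = []
    for word, similarity in candidates:
        token = nlp(word)[0]
        # keep only candidates with the same part of speech and entity type as the answer
        if token.pos_ == answer_token.pos_ and token.ent_type_ == answer_token.ent_type_:
            distractors.append(word)
        if len(distractors) == count:
            break
    return distractors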

I would like to find a dataset with multiple-choice answers and see if I can create an ML model for generating better incorrect answers.

Results

After adding the Demo project, the generated questions aren't really fit to go into a classroom instantly, but they aren't bad either.

The cool thing is the simplicity and modularity of the approach: you can find where it's doing badly (say, it wrongly classifies verbs as answers) and plug a fix into it.

Having a complex neural network (like all the papers on the topic do) would probably do better, especially in the age we're living in. But the great thing I found out about this approach is that it's like a gateway for a software engineer, with a software engineering mindset, to get into the field of AI and see meaningful results.

Future work (updated)

I find this topic quite interesting and full of potential, and I will probably continue working in this field.

I even enrolled in a Master's in Data Mining and will probably do some similar projects. I will link anything useful here.

I've already put some more time into finishing the project, but I would like to transform it further into a tutorial about getting into the field of AI, while keeping the ability to easily extend it with new custom features.

Updates

Update - 29.12.19: The repository has become pretty popular, so I added a new notebook (Demo.ipynb) that combines all the modules and generates questions for any text. I reordered the other notebooks and documented the code (a bit better).

Update - 09.03.21: Added a requirements.txt file with instructions to run a virtual environment and fixed a bug with ValueError: operands could not be broadcast together with shapes (230, 121) (83,).

I have also started working on my Master's thesis with a similar topic of Question Generation.

Update - 27.10.21: I have uploaded the code for my Master's thesis in the Question-Generation-Transformers repository. I highly encourage you to check it out.

Additionally, the approach of using a classifier to pick the answer candidates was accepted as a student paper at the RANLP2021 conference. Paper here.


question-generation's Issues

ValueError: operands could not be broadcast together with shapes (230, 121) (83, )

So first of all, I cloned your repo to try out the code. The Demo.ipynb worked fine as expected (without any change to the input text). However, after giving my own text as input, I am getting an error in sklearn's naive_bayes file.
The error:

(screenshot of the traceback)

I tried a lot to debug the issue; however, since I am new to ML and data science in general, I was not able to get it fixed. Also, the error is in sklearn's files, so I am not sure what's causing it. I even tried downgrading my Python and all the modules to match yours and rebuilt the model, but it still gave the same error message in Predict.ipynb. I saw a lot of people facing the same issue in the 'Issues' section of your repo.

It would be greatly helpful if you could point me in the right direction so that I can resolve this issue.
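One common cause of this kind of broadcast error is a mismatch between the one-hot encoded feature columns produced for the new text and the columns the pickled model was trained on. That is only a guess here, not a confirmed diagnosis; a minimal sketch of aligning the prediction features to the training columns (the train_columns list is hypothetical and would need to be saved alongside the model):

import pandas as pd

def align_features(words_df, train_columns):
    # add missing columns as zeros and drop unseen ones, keeping the training column order
    return words_df.reindex(columns=train_columns, fill_value=0)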

Database

Hello, could you add a database file or SQL script that would help store the created MCQs in a database?

ValueError: operands could not be broadcast together with shapes (2448,122) (121,)

Hi,
I cloned your project and made some changes, like reading from PDF as well as TXT files dynamically, and wrote an API. But when I try to read a PDF file and process it, I get an error: ValueError: operands could not be broadcast together with shapes (2448,122) (121,).
I have seen this same issue in the resolved issues section and tried the requirements provided there, but I was unable to resolve it.

Thanks and Regards,
Manikantha Sekhar...

Regarding Predicting Accuracy [ y_test vs y_pred ]

On training:

y_test.value_counts() is:
False    828
True      40

i.e. 40 of 868 are marked 'True' in the test set.

y_pred_series.value_counts() is:
True     657
False    211

That shows 657 values have been marked True, which reflects a lot of predicted True values and, in fact, a lot of false positives (619), i.e. 657 = 38 (actual True, i.e. TP) + 619 (predicted True but actually False, i.e. FP).

In fact the confusion_matrix(y_test, y_pred) shows the same:
array([[209, 619],
[ 2, 38]])

It's quite clear that the number of correct answers predicted is very high, i.e. 38/40.

In light of the above results:
a. I am unable to understand your following conclusion, as the overall accuracy, considering the True Negatives (209), is not good.

"Seems like I'm super biased for towards correct answers." But as I found during the Data exploration, there are a lot more answer-worthy words that are just not labeled since, I guess the Mechanical Turks had the job to label just 5. So, who knows, maybe I did some black magic and managed to extract all the answer worthy words!

b. Secondly, can you elaborate on what your aim was in regards to predicting 'isAnswer'? Is it related to finding only the relevant answers of the original dataset, i.e. the 40? Or did you intend to extract new words which could serve as answers for more questions, and if so, which part of the confusion matrix results do you consider to be potential relevant answers beyond what is provided in the dataset?
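For reference, the metrics implied by the confusion matrix quoted above can be computed directly (an illustrative snippet, not part of the notebooks):

import numpy as np

# rows = actual [False, True], columns = predicted [False, True]
cm = np.array([[209, 619],
               [  2,  38]])

tn, fp = cm[0]
fn, tp = cm[1]

recall = tp / (tp + fn)          # 38 / 40  = 0.95
precision = tp / (tp + fp)       # 38 / 657 ≈ 0.058
accuracy = (tp + tn) / cm.sum()  # 247 / 868 ≈ 0.285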

Linux Problem

I am wondering why the code does not generate any "incorrect answers" when run on Linux.
I tried the same code on Windows and on Linux; Windows works fine, but Linux gives none. Any idea?

Proper Documentation needed.

Proper documentation is needed to execute this project. What are the requirements? How should one execute it?

ValueError: operands could not be broadcast together with shapes (60,121) (77,)

Getting a ValueError inside the predictWords function at line 6,

y_pred = predictor.predict_proba(wordsDf)

and inside the naive_bayes.py

<ipython-input-6-55c8aeacc0c5> in predictWords(wordsDf, df)
      4     predictor = loadPickle(predictorPickleName)
      5 
----> 6     y_pred = predictor.predict_proba(wordsDf)
      7 
      8     labeledAnswers = []

~/env/lib/python3.6/site-packages/sklearn/naive_bayes.py in predict_log_proba(self, X)
     95         check_is_fitted(self)
     96         X = self._check_X(X)
---> 97         jll = self._joint_log_likelihood(X)
     98         # normalize by P(x) = P(f_1, ..., f_n)
     99         log_prob_x = logsumexp(jll, axis=1)

~/env/lib/python3.6/site-packages/sklearn/naive_bayes.py in _joint_log_likelihood(self, X)
    449             jointi = np.log(self.class_prior_[i])
    450             n_ij = - 0.5 * np.sum(np.log(2. * np.pi * self.sigma_[i, :]))
--> 451             n_ij -= 0.5 * np.sum(((X - self.theta_[i, :]) ** 2) /
    452                                  (self.sigma_[i, :]), 1)
    453             joint_log_likelihood.append(jointi + n_ij)

Requirements used:
attrs==19.3.0
backcall==0.1.0
bleach==3.1.1
blis==0.4.1
boto==2.49.0
boto3==1.12.5
botocore==1.15.5
catalogue==1.0.0
certifi==2019.11.28
chardet==3.0.4
cycler==0.10.0
cymem==2.0.3
decorator==4.4.1
defusedxml==0.6.0
docutils==0.15.2
en-core-web-md==2.2.5
en-core-web-sm==2.2.5
entrypoints==0.3
gensim==3.8.1
idna==2.9
importlib-metadata==1.5.0
ipykernel==5.1.4
ipython==7.12.0
ipython-genutils==0.2.0
ipywidgets==7.5.1
jedi==0.16.0
Jinja2==2.11.1
jmespath==0.9.4
joblib==0.14.1
jsonschema==3.2.0
jupyter==1.0.0
jupyter-client==5.3.4
jupyter-console==6.1.0
jupyter-core==4.6.3
kiwisolver==1.1.0
MarkupSafe==1.1.1
matplotlib==3.1.3
mistune==0.8.4
murmurhash==1.0.2
nbconvert==5.6.1
nbformat==5.0.4
nltk==3.4.5
notebook==6.0.3
numpy==1.18.1
pandas==1.0.1
pandocfilters==1.4.2
parso==0.6.1
pexpect==4.8.0
pickleshare==0.7.5
pkg-resources==0.0.0
plac==1.1.3
preshed==3.0.2
prometheus-client==0.7.1
prompt-toolkit==3.0.3
ptyprocess==0.6.0
Pygments==2.5.2
pyparsing==2.4.6
pyrsistent==0.15.7
python-dateutil==2.8.1
pytz==2019.3
pyzmq==18.1.1
qtconsole==4.6.0
requests==2.23.0
s3transfer==0.3.3
scikit-learn==0.22.1
scipy==1.4.1
Send2Trash==1.5.0
six==1.14.0
sklearn==0.0
smart-open==1.9.0
spacy==2.2.3
srsly==1.0.1
terminado==0.8.3
testpath==0.4.4
thinc==7.3.1
tornado==6.0.3
tqdm==4.43.0
traitlets==4.3.3
urllib3==1.25.8
wasabi==0.6.0
wcwidth==0.1.8
webencodings==0.5.1
widgetsnbextension==3.5.1
zipp==3.0.0
