question_generator

Question Generator is an NLP system for generating reading comprehension-style questions from texts such as news articles or excerpts from books. The system is built using pretrained models from HuggingFace Transformers. There are two models: the question generator itself, and the QA evaluator, which ranks and filters the question-answer pairs based on their acceptability.

Usage

The easiest way to generate some questions is to clone the GitHub repo and then run run_qg.py like this:

!git clone https://github.com/amontgomerie/question_generator
!python 'question_generator/run_qg.py' --text_dir 'question_generator/articles/twitter_hack.txt'

This will generate 10 question-answer pairs of mixed style (full-sentence and multiple choice) based on the article specified in --text_dir and print them to the console. For more information see the qg_commandline_example notebook.

The QuestionGenerator class can also be instantiated and used like this:

from questiongenerator import QuestionGenerator
qg = QuestionGenerator()
qg.generate(text, num_questions=10)

This will generate 10 questions of mixed style and return a list of dictionaries containing question-answer pairs. In the case of multiple choice questions, the answer will contain a list of dictionaries containing the answers and a boolean value stating if the answer is correct or not. The output can be easily printed using the print_qa() function. For more information see the question_generation_example notebook.
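
As a rough illustration of the return structure described above, the sketch below walks a hand-written list shaped like that description. The key names ("question", "answer", "correct") and the sample questions are assumptions made for this example, and format_qa is a hypothetical helper; for real output, inspect what qg.generate() returns or use print_qa().

```python
# Hand-written sample shaped like the description above; not real model output.
# The key names "question", "answer" and "correct" are assumptions.
sample_qa_list = [
    {
        "question": "What does the system generate?",
        "answer": "The system generates reading comprehension-style questions.",
    },
    {
        "question": "Which library provides the pretrained models?",
        "answer": [
            {"answer": "HuggingFace Transformers", "correct": True},
            {"answer": "spaCy", "correct": False},
        ],
    },
]


def format_qa(qa_list):
    """Return one printable line per question and per answer option."""
    lines = []
    for index, qa in enumerate(qa_list, start=1):
        lines.append(f"{index}) Q: {qa['question']}")
        answer = qa["answer"]
        if isinstance(answer, list):
            # Multiple choice: a list of dictionaries with a correctness flag.
            for option in answer:
                marker = "(correct)" if option["correct"] else ""
                lines.append(f"   - {option['answer']} {marker}".rstrip())
        else:
            # Full-sentence answer: a plain string.
            lines.append(f"   A: {answer}")
    return lines


for line in format_qa(sample_qa_list):
    print(line)
```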

Choosing the number of questions

The desired number of questions can be passed as a command line argument using --num_questions or as an argument when calling qg.generate(text, num_questions=20). If the chosen number of questions is too large, the model may not be able to generate enough. The maximum number of questions depends on the length of the input text, or more specifically on the number of sentences and named entities contained in the text. Note that the quality of some of the outputs will decrease for larger numbers of questions: the QA Evaluator ranks the generated questions and returns the best ones, so a larger request includes lower-ranked questions.
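
The effect of ranking on output quality can be seen in a small sketch. The scores below are made up and select_top is a hypothetical helper, not part of the repository; it only illustrates that asking for more questions means accepting lower-ranked ones, and that fewer may come back than were requested.

```python
def select_top(qa_pairs, scores, num_questions):
    """Keep the num_questions highest-scoring pairs, best first."""
    ranked = sorted(zip(scores, qa_pairs), key=lambda item: item[0], reverse=True)
    return [pair for _, pair in ranked[:num_questions]]


# Made-up candidates and evaluator scores, for illustration only.
candidates = ["Q1", "Q2", "Q3", "Q4"]
scores = [0.91, 0.15, 0.78, 0.42]

print(select_top(candidates, scores, 2))   # only the two best-scored candidates
print(select_top(candidates, scores, 10))  # only four exist, so four are returned
```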

Answer styles

The system can generate questions with full-sentence answers ('sentences'), questions with multiple-choice answers ('multiple_choice'), or a mix of both ('all'). This can be selected using the --answer_style or qg.generate(answer_style=<style>) arguments.

Models

Question Generator

The question generator model takes a text as input and outputs a series of question and answer pairs. The answers are sentences and phrases extracted from the input text. The extracted phrases can be either full sentences or named entities extracted using spaCy. Named entities are used for multiple-choice answers. The wrong answers will be other entities of the same type found in the text. The questions are generated by concatenating the extracted answer with the full text (up to a maximum of 512 tokens) as context in the following format:

answer_token <extracted answer> context_token <context>

The concatenated string is then encoded and fed into the question generator model. The model architecture is t5-base. The pretrained model was finetuned as a sequence-to-sequence model on a dataset made up of several well-known QA datasets (SQuAD, RACE, CoQA, and MSMARCO). The datasets were restructured by concatenating the answer and context fields into the previously mentioned format. The concatenated answer and context were then used as the input for training, and the question field became the target.
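
The restructuring described above can be sketched as follows. This is a minimal illustration, not the project's preprocessing code: the marker strings are copied literally from the format shown earlier and may differ from the special tokens the trained model expects, the record field names are assumptions, and truncation to 512 tokens (which happens at encoding time) is left out.

```python
# Placeholder markers copied from the documented format; the literal special
# tokens used by the trained model may differ.
ANSWER_TOKEN = "answer_token"
CONTEXT_TOKEN = "context_token"


def build_qg_input(answer, context):
    """Concatenate an extracted answer and its context into one input string."""
    return f"{ANSWER_TOKEN} {answer} {CONTEXT_TOKEN} {context}"


def to_training_pair(record):
    """Turn a record with context/question/answer fields into (input, target)."""
    source = build_qg_input(record["answer"], record["context"])
    target = record["question"]
    return source, target


# Hand-written record; the field names are assumed for this example.
record = {
    "context": "The Eiffel Tower is in Paris.",
    "question": "Where is the Eiffel Tower?",
    "answer": "Paris",
}

source, target = to_training_pair(record)
print(source)
print(target)
```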

The datasets can be found here.

QA Evaluator

The QA evaluator takes a question-answer pair as input and outputs a value representing its prediction about whether the input is a valid question-answer pair. The model is bert-base-cased with a sequence classification head. The pretrained model was finetuned on the same data as the question generator model, but with the context removed. For 50% of the examples, the original question and answer were concatenated unchanged. For the other 50%, a corruption operation was performed first (either swapping the answer for an unrelated answer, or copying part of the question into the answer). The model was then trained to predict whether the input sequence represented one of the original QA pairs or a corrupted input.
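
A minimal sketch of how such training examples could be built is shown below. It is not the project's data pipeline: make_evaluator_example is a hypothetical helper, the even split between the two corruption types, the choice to copy the first half of the question, and the 1/0 label convention are all assumptions.

```python
import random


def make_evaluator_example(question, answer, all_answers, rng):
    """Return (question, answer, label); label 1 = original, 0 = corrupted."""
    if rng.random() < 0.5:
        # Keep the original pair half of the time.
        return question, answer, 1

    others = [a for a in all_answers if a != answer]
    if others and rng.random() < 0.5:
        # Corruption 1: swap in an answer that belongs to a different pair.
        corrupted = rng.choice(others)
    else:
        # Corruption 2: copy part of the question into the answer
        # (here the first half of its words, an arbitrary choice).
        words = question.split()
        corrupted = " ".join(words[: max(1, len(words) // 2)])
    return question, corrupted, 0


rng = random.Random(0)  # seeded so the demo is repeatable
answers = ["Paris", "1889", "Gustave Eiffel"]
for _ in range(5):
    print(make_evaluator_example("Where is the Eiffel Tower?", "Paris", answers, rng))
```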

The input for the QA evaluator follows the format for BertForSequenceClassification, using the question and answer as the two sequences. It has the following format:

[CLS] <question> [SEP] <answer> [SEP]
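
To make the layout concrete, the sketch below assembles it as a plain string. build_evaluator_input is a hypothetical helper written only for illustration; in practice a BERT tokenizer from HuggingFace Transformers inserts the [CLS] and [SEP] tokens itself when it is given the two sequences as a pair, so the string would not normally be built by hand.

```python
def build_evaluator_input(question, answer):
    """Show the two-sequence layout as a readable string (illustration only)."""
    return f"[CLS] {question} [SEP] {answer} [SEP]"


print(build_evaluator_input("Where is the Eiffel Tower?", "Paris"))
```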

Contributors

amontgomerie, bgmartins
