
natural-question-answering's Introduction

Overview

In this repository you will find the code for the TensorFlow 2.0 Question Answering Challenge. My team finished 2nd out of 1239 teams with a micro F1-score of 0.71 on the private test set. The challenge was to develop better algorithms for natural question answering. You can find more details about the task and the dataset on the Kaggle competition page: https://www.kaggle.com/c/tensorflow2-question-answering

Important: This solution uses cloud instances. Don't forget to stop / delete them when you no longer need them; otherwise you may incur unwanted costs.

  • GUI: To stop / delete your VMs go to: https://console.cloud.google.com -> Compute Engine -> VM instances. To stop / delete your TPU instances go to https://console.cloud.google.com -> Compute Engine -> TPUs.

  • CLI: To stop your VMs: sudo shutdown -h now. To stop your TPU instances: gcloud compute tpus stop nq2 --zone europe-west4-a

You can always check your running instances via: gcloud compute instances list or using the console GUI.

Setup

From a machine that has gcloud installed:

# Download the script to start the instances
wget https://raw.githubusercontent.com/see--/natural-question-answering/master/start_instance.py
python3 start_instance.py

This will create a VM named "nq2" with Python-3.6.9 and all the required libraries installed. It will also create a "v3-8" TPU in the zone "europe-west4-a" with the same name.

For the following commands, please ssh into your VM. You can get the command from https://console.cloud.google.com -> Compute Engine -> VM instances -> Connect -> View gcloud command.

It should be similar to:

gcloud beta compute --project "MYPROJECT" ssh --zone "europe-west4-a" "nq2"

Get the data

sudo pip install --upgrade kaggle
mkdir -p ~/.kaggle
# Replace "MYUSER" and "MYKEY" with your credentials. You can create them on:
# `https://www.kaggle.com` -> `My Account` -> `Create New API Token`
echo '{"username":"MYUSER","key":"MYKEY"}' > ~/.kaggle/kaggle.json
chmod 600 ~/.kaggle/kaggle.json
kaggle competitions download -c tensorflow2-question-answering
for f in *.zip; do unzip "$f"; done
rm *.zip

Training / Inference

# get the code
git clone https://github.com/see--/natural-question-answering.git
# install the bundled `transformers` fork
cd natural-question-answering/transformers_repo
sudo python3 setup.py install
cd ..
# move the data into the repository directory
mv ../*.jsonl .
# run the training and evaluation
python3 nq_to_squad.py; python3 train_eval.py
# run inference on the test set
python3 nq_to_squad.py --fn simplified-nq-test.jsonl; python3 train_eval.py --do_not_train --predict_fn nq-test-v1.0.1.json

The training and evaluation should finish within 5 hours, giving a local validation score of ~0.72 and a public LB score of ~0.73. The weights are stored in nq_bert_uncased_0/checkpoint-015400/weights.h5. For inference, please check my Kaggle Notebook.

TensorFlow Hub

The trained model can be loaded from TensorFlow Hub. A simple usage example is given in the demo script:

import os
import tensorflow as tf
import tensorflow_hub as hub
from transformers import BertTokenizer


def main():
  # Download and unpack the tokenizer files released with the model
  os.system('wget https://github.com/see--/natural-question-answering/releases/download/v0.0.1/tokenizer_tf2_qa.zip')
  os.system('unzip tokenizer_tf2_qa.zip')
  tokenizer = BertTokenizer.from_pretrained('tokenizer_tf2_qa')
  model = hub.load(
      'https://github.com/see--/natural-question-answering/releases/download/v0.0.1/model.tar.gz')
  questions = [
      'How long did it take to find the answer?',
      'What\'s the answer to the great question?',
      'What\'s the name of the computer?']
  paragraph = '''<p>The computer is named Deep Thought.</p>.
                 <p>After 46 million years of training it found the answer.</p>
                 <p>However, nobody was amazed. The answer was 42.</p>'''

  for question in questions:
    # Build the standard BERT input: [CLS] question [SEP] paragraph [SEP]
    question_tokens = tokenizer.tokenize(question)
    paragraph_tokens = tokenizer.tokenize(paragraph)
    tokens = ['[CLS]'] + question_tokens + ['[SEP]'] + paragraph_tokens + ['[SEP]']
    input_word_ids = tokenizer.convert_tokens_to_ids(tokens)
    input_mask = [1] * len(input_word_ids)
    # Segment ids: 0 for '[CLS]' + question + '[SEP]', 1 for paragraph + '[SEP]'
    input_type_ids = [0] * (1 + len(question_tokens) + 1) + [1] * (len(paragraph_tokens) + 1)

    # Add a batch dimension and convert everything to int32 tensors
    input_word_ids, input_mask, input_type_ids = map(lambda t: tf.expand_dims(
        tf.convert_to_tensor(t, dtype=tf.int32), 0), (input_word_ids, input_mask, input_type_ids))
    outputs = model([input_word_ids, input_mask, input_type_ids])
    # Position 0 holds the '[CLS]' logit, which acts as the no-answer score;
    # slicing with `[1:]` skips it and thus forces the model to return an answer.
    short_start = tf.argmax(outputs[0][0][1:]) + 1
    short_end = tf.argmax(outputs[1][0][1:]) + 1
    answer_tokens = tokens[short_start: short_end + 1]
    answer = tokenizer.convert_tokens_to_string(answer_tokens)
    print(f'Question: {question}')
    print(f'Answer: {answer}')


if __name__ == '__main__':
  main()
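If the downloads succeed and the model predicts correctly, the script should print one answer per question, along these lines (answers are lower-cased by the uncased tokenizer):

Question: How long did it take to find the answer?
Answer: 46 million years
Question: What's the answer to the great question?
Answer: 42
Question: What's the name of the computer?
Answer: deep thought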

Notice

Parts of this solution are copied or modified from other open-source repositories. Thanks to the authors!


natural-question-answering's Issues

Memory leak in model loaded from tf-hub

First of all, thank you for the great work. Your Q&A model rocks. Really interesting to see what is possible for next-level Q&A.

I played around with the model in tf-hub and noticed it has a memory leak.
Here is my code:

import tensorflow as tf
import tensorflow_hub as hub
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("tokenizer_tf2_qa")
model = hub.load("https://tfhub.dev/see--/bert-uncased-tf2-qa/1")

# `data` is an iterable of (question, context) string pairs
for question, context in data:
    
    # create input vector representation
    encoded = tokenizer.encode_plus(question, context, add_special_tokens=True)
    input_word_ids = encoded["input_ids"]
    input_mask = encoded["attention_mask"]
    input_type_ids = encoded["token_type_ids"]

    # convert to tf.int32 and pass through model
    input_word_ids, input_mask, input_type_ids = map(
        lambda t: tf.expand_dims(tf.convert_to_tensor(t, dtype=tf.int32), 0),
        (input_word_ids, input_mask, input_type_ids),
    )
    outputs = model([input_word_ids, input_mask, input_type_ids])

I tested with both TensorFlow 2.1.0 and 2.2.0 on a CPU machine.

I wonder if this warning is related to the memory leak:

WARNING:tensorflow:5 out of the last 5 calls to <function recreate_function.<locals>.restored_function_body at 0x7f23a70d9680> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings is likely due to passing python objects instead of tensors. Also, tf.function has experimental_relax_shapes=True option that relaxes argument shapes that can avoid unnecessary retracing. Please refer to https://www.tensorflow.org/tutorials/customization/performance#python_or_tensor_args and https://www.tensorflow.org/api_docs/python/tf/function for more details.

Any idea what could be the problem?
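The warning above points at a plausible cause: every (question, context) pair produces tensors of a different length, so the restored tf.function re-traces on each call and the traces accumulate. A minimal sketch of one common mitigation, assuming the model accepts padded inputs: pad every example to one fixed shape (MAX_LEN = 512 is an assumption based on BERT's position-embedding limit) so the function is traced only once.

import tensorflow as tf

MAX_LEN = 512  # assumed maximum; BERT position embeddings cover indices 0..511


def to_fixed_shape(ids, mask, type_ids, max_len=MAX_LEN):
  # Truncate, then pad all three inputs to one shared length so the
  # restored tf.function sees a single input signature.
  ids, mask, type_ids = ids[:max_len], mask[:max_len], type_ids[:max_len]
  pad = max_len - len(ids)
  ids = ids + [0] * pad        # 0 is BERT's '[PAD]' token id
  mask = mask + [0] * pad      # padded positions are masked out
  type_ids = type_ids + [0] * pad
  return tuple(tf.expand_dims(tf.convert_to_tensor(t, dtype=tf.int32), 0)
               for t in (ids, mask, type_ids))

With constant input shapes the model should stop retracing, which in similar reports also stopped the memory growth.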

Error while running demo.py

When I try to run the demo.py file on Google Colab, I get the following error from the tokenizer.

ValueError: Non-consecutive added token 'td_colspan' found. Should have index 30522 but has index 1 in saved vocabulary.

Please help me resolve this error.

Model

Is there a pretrained model available?

Error while executing demo.py on cpu?

Hello,
I am getting the following error while executing demo.py with a longer custom document text on CPU. It works fine on GPU though.

tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[0,512] = 512 is not in [0, 512)
[[{{node StatefulPartitionedCall/StatefulPartitionedCall/tf_bert_for_natural_question_answering/StatefulPartitionedCall/bert/StatefulPartitionedCall/embeddings/position_embeddings/embedding_lookup}}]] [Op:__inference_restored_function_body_89164]

Thanks
Mahesh
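The error message itself explains the failure: position index 512 falls outside the embedding table [0, 512), i.e. the tokenized input is longer than BERT's 512-token maximum. A minimal sketch of a truncation guard for the demo's input-building step (build_inputs is a hypothetical helper, not part of the repository; longer documents would need to be split into windows instead):

MAX_LEN = 512  # BERT position embeddings cover indices 0..511


def build_inputs(question_tokens, paragraph_tokens, max_len=MAX_LEN):
  # Reserve room for '[CLS]' and two '[SEP]' markers, then truncate the
  # paragraph so the total sequence never exceeds the 512-token limit.
  budget = max_len - len(question_tokens) - 3
  return (['[CLS]'] + question_tokens + ['[SEP]']
          + paragraph_tokens[:budget] + ['[SEP]'])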

Trying to Run Tokenizer

Hi see--

I am trying to add tokens by
tokenizer.add_tokens(add_tokens, offset=offset)

But I got error

TypeError: add_tokens() got an unexpected keyword argument 'offset'

Are you using anything different?

Regards,
Ankur

Unable to run tfhub sample

I'm trying to run the tfhub sample, got the following error
Model name 'tokenizer_tf2_qa' was not found in tokenizers model name list
(I did pip install transformers),
can you help?
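BertTokenizer.from_pretrained('tokenizer_tf2_qa') resolves that name as a local directory, so the released tokenizer archive has to be downloaded and unpacked first, exactly as demo.py above does:

import os
from transformers import BertTokenizer

# Fetch and unpack the tokenizer files released with the model (same steps as demo.py)
os.system('wget https://github.com/see--/natural-question-answering/releases/download/v0.0.1/tokenizer_tf2_qa.zip')
os.system('unzip tokenizer_tf2_qa.zip')
tokenizer = BertTokenizer.from_pretrained('tokenizer_tf2_qa')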

TypeError: add_tokens() got an unexpected keyword argument 'offset'

Hi everyone,
I try to run this repo, but I met error below:

Traceback (most recent call last):  
  File "train_eval.py", line 481, in <module>
    main()
  File "train_eval.py", line 437, in main
    num_added = tokenizer.add_tokens(add_tokens, offset=offset)
TypeError: add_tokens() got an unexpected keyword argument 'offset' 

My environment:

  • Python: 3.7.0
  • Transformers: 2.2.0
  • Tensorflow: 2.0

I checked every released version of transformers, but I haven't found one whose add_tokens method has an offset argument.
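A likely resolution, judging from the repository's own setup steps: the offset keyword presumably comes from the modified transformers bundled in transformers_repo, not from any PyPI release. Installing the bundled copy (the same commands as in the Training / Inference section) should make the argument available:

cd natural-question-answering/transformers_repo
sudo python3 setup.py install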
