
natural-question-answering's Introduction

Overview

In this repository you will find the code for the TensorFlow 2.0 Question Answering Challenge. My team finished 2nd out of 1239 teams with a micro F1-score of 0.71 on the private test set. The challenge was to develop better algorithms for natural question answering. You can find more details about the task and the dataset on the Kaggle competition page: https://www.kaggle.com/c/tensorflow2-question-answering

Important: This solution uses cloud instances. Don't forget to stop / delete them when you no longer need them; otherwise you may incur unwanted costs.

  • GUI: To stop / delete your VMs go to: https://console.cloud.google.com -> Compute Engine -> VM instances. To stop / delete your TPU instances go to https://console.cloud.google.com -> Compute Engine -> TPUs.

  • CLI: To stop your VMs: sudo shutdown -h now. To stop your TPU instances: gcloud compute tpus stop nq2 --zone europe-west4-a

You can always check your running instances via: gcloud compute instances list or using the console GUI.

Setup

From a machine that has gcloud installed:

# Download the script to start the instances
wget https://raw.githubusercontent.com/see--/natural-question-answering/master/start_instance.py
python3 start_instance.py

This will create a VM named "nq2" with Python-3.6.9 and all the required libraries installed. It will also create a "v3-8" TPU in the zone "europe-west4-a" with the same name.

For the following commands, please ssh into your VM. You can get the command from https://console.cloud.google.com -> Compute Engine -> VM instances -> Connect -> View gcloud command.

It should be similar to:

gcloud beta compute --project "MYPROJECT" ssh --zone "europe-west4-a" "nq2"

Get the data

sudo pip install --upgrade kaggle
mkdir -p ~/.kaggle
# Replace "MYUSER" and "MYKEY" with your credentials. You can create them on:
# `https://www.kaggle.com` -> `My Account` -> `Create New API Token`
echo '{"username":"MYUSER","key":"MYKEY"}' > ~/.kaggle/kaggle.json
chmod 600 ~/.kaggle/kaggle.json
kaggle competitions download -c tensorflow2-question-answering
for f in *.zip; do unzip "$f"; done
rm *.zip

Training / Inference

# get the code
git clone https://github.com/see--/natural-question-answering.git
# install the bundled `transformers` fork
cd natural-question-answering/transformers_repo
sudo python3 setup.py install
cd ..
# move the data into the repository directory
mv ../*.jsonl .
# run the training and evaluation
python3 nq_to_squad.py; python3 train_eval.py
# run inference on the test set
python3 nq_to_squad.py --fn simplified-nq-test.jsonl; python3 train_eval.py --do_not_train --predict_fn nq-test-v1.0.1.json

The training and evaluation should finish within 5 hours, giving a local validation score of ~0.72 and a public LB score of ~0.73. The weights are stored in nq_bert_uncased_0/checkpoint-015400/weights.h5. For inference, please check my Kaggle Notebook.

TensorFlow Hub

The trained model can be loaded from TensorFlow Hub. A simple usage example is given in the demo script:

import os
import tensorflow as tf
import tensorflow_hub as hub
from transformers import BertTokenizer


def main():
  # Download and unpack the tokenizer files released with the model
  os.system('wget https://github.com/see--/natural-question-answering/releases/download/v0.0.1/tokenizer_tf2_qa.zip')
  os.system('unzip tokenizer_tf2_qa.zip')
  tokenizer = BertTokenizer.from_pretrained('tokenizer_tf2_qa')
  model = hub.load(
      'https://github.com/see--/natural-question-answering/releases/download/v0.0.1/model.tar.gz')
  questions = [
      'How long did it take to find the answer?',
      'What\'s the answer to the great question?',
      'What\'s the name of the computer?']
  paragraph = '''<p>The computer is named Deep Thought.</p>.
                 <p>After 46 million years of training it found the answer.</p>
                 <p>However, nobody was amazed. The answer was 42.</p>'''

  for question in questions:
    # Build the standard BERT input: [CLS] question [SEP] paragraph [SEP]
    question_tokens = tokenizer.tokenize(question)
    paragraph_tokens = tokenizer.tokenize(paragraph)
    tokens = ['[CLS]'] + question_tokens + ['[SEP]'] + paragraph_tokens + ['[SEP]']
    input_word_ids = tokenizer.convert_tokens_to_ids(tokens)
    input_mask = [1] * len(input_word_ids)
    # Segment ids: 0 for '[CLS]' + question + '[SEP]', 1 for paragraph + '[SEP]'
    input_type_ids = [0] * (1 + len(question_tokens) + 1) + [1] * (len(paragraph_tokens) + 1)

    # Add a batch dimension and convert everything to int32 tensors
    input_word_ids, input_mask, input_type_ids = map(lambda t: tf.expand_dims(
        tf.convert_to_tensor(t, dtype=tf.int32), 0), (input_word_ids, input_mask, input_type_ids))
    outputs = model([input_word_ids, input_mask, input_type_ids])
    # Position 0 holds the '[CLS]' logit, which acts as the no-answer score;
    # slicing with `[1:]` skips it and thus forces the model to return an answer.
    short_start = tf.argmax(outputs[0][0][1:]) + 1
    short_end = tf.argmax(outputs[1][0][1:]) + 1
    answer_tokens = tokens[short_start: short_end + 1]
    answer = tokenizer.convert_tokens_to_string(answer_tokens)
    print(f'Question: {question}')
    print(f'Answer: {answer}')


if __name__ == '__main__':
  main()
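If the downloads succeed and the model predicts correctly, the script should print one answer per question, along these lines (answers are lower-cased by the uncased tokenizer):

Question: How long did it take to find the answer?
Answer: 46 million years
Question: What's the answer to the great question?
Answer: 42
Question: What's the name of the computer?
Answer: deep thought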

Notice

Parts of this solution are copied or modified from other open-source repositories. Thanks to the authors!


natural-question-answering's Issues

Memory leak in model loaded from tf-hub

First of all, thank you for the great work. Your Q&A model rocks. Really interesting to see what is possible for next-level Q&A.

I played around with the model in tf-hub and noticed it has a memory leak.
Here is my code:

import tensorflow as tf
import tensorflow_hub as hub
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("tokenizer_tf2_qa")
model = hub.load("https://tfhub.dev/see--/bert-uncased-tf2-qa/1")

# `data` is an iterable of (question, context) string pairs
for question, context in data:
    
    # create input vector representation
    encoded = tokenizer.encode_plus(question, context, add_special_tokens=True)
    input_word_ids = encoded["input_ids"]
    input_mask = encoded["attention_mask"]
    input_type_ids = encoded["token_type_ids"]

    # convert to tf.int32 and pass through model
    input_word_ids, input_mask, input_type_ids = map(
        lambda t: tf.expand_dims(tf.convert_to_tensor(t, dtype=tf.int32), 0),
        (input_word_ids, input_mask, input_type_ids),
    )
    outputs = model([input_word_ids, input_mask, input_type_ids])

I tested with both TensorFlow 2.1.0 and 2.2.0 on a CPU machine.

I wonder if this warning is related to the memory leak:

WARNING:tensorflow:5 out of the last 5 calls to <function recreate_function.<locals>.restored_function_body at 0x7f23a70d9680> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings is likely due to passing python objects instead of tensors. Also, tf.function has experimental_relax_shapes=True option that relaxes argument shapes that can avoid unnecessary retracing. Please refer to https://www.tensorflow.org/tutorials/customization/performance#python_or_tensor_args and https://www.tensorflow.org/api_docs/python/tf/function for more details.

Any idea what could be the problem?
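The warning above points at a plausible cause: every (question, context) pair produces tensors of a different length, so the restored tf.function re-traces on each call and the traces accumulate. A minimal sketch of one common mitigation, assuming the model accepts padded inputs: pad every example to one fixed shape (MAX_LEN = 512 is an assumption based on BERT's position-embedding limit) so the function is traced only once.

import tensorflow as tf

MAX_LEN = 512  # assumed maximum; BERT position embeddings cover indices 0..511


def to_fixed_shape(ids, mask, type_ids, max_len=MAX_LEN):
  # Truncate, then pad all three inputs to one shared length so the
  # restored tf.function sees a single input signature.
  ids, mask, type_ids = ids[:max_len], mask[:max_len], type_ids[:max_len]
  pad = max_len - len(ids)
  ids = ids + [0] * pad        # 0 is BERT's '[PAD]' token id
  mask = mask + [0] * pad      # padded positions are masked out
  type_ids = type_ids + [0] * pad
  return tuple(tf.expand_dims(tf.convert_to_tensor(t, dtype=tf.int32), 0)
               for t in (ids, mask, type_ids))

With constant input shapes the model should stop retracing, which in similar reports also stopped the memory growth.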

Error while running demo.py

When I try to run the demo.py file on Google Colab, I get the following error from the tokenizer.

ValueError: Non-consecutive added token 'td_colspan' found. Should have index 30522 but has index 1 in saved vocabulary.

Please help me resolve this error.

Model

Is there a pretrained model available?

Error while executing demo.py on cpu?

Hello,
I am getting the following error while executing demo.py with a longer custom document text on CPU. It works fine on GPU though.

tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[0,512] = 512 is not in [0, 512)
[[{{node StatefulPartitionedCall/StatefulPartitionedCall/tf_bert_for_natural_question_answering/StatefulPartitionedCall/bert/StatefulPartitionedCall/embeddings/position_embeddings/embedding_lookup}}]] [Op:__inference_restored_function_body_89164]

Thanks
Mahesh
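The error message itself explains the failure: position index 512 falls outside the embedding table [0, 512), i.e. the tokenized input is longer than BERT's 512-token maximum. A minimal sketch of a truncation guard for the demo's input-building step (build_inputs is a hypothetical helper, not part of the repository; longer documents would need to be split into windows instead):

MAX_LEN = 512  # BERT position embeddings cover indices 0..511


def build_inputs(question_tokens, paragraph_tokens, max_len=MAX_LEN):
  # Reserve room for '[CLS]' and two '[SEP]' markers, then truncate the
  # paragraph so the total sequence never exceeds the 512-token limit.
  budget = max_len - len(question_tokens) - 3
  return (['[CLS]'] + question_tokens + ['[SEP]']
          + paragraph_tokens[:budget] + ['[SEP]'])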

Trying to Run Tokenizer

Hi see--

I am trying to add tokens by
tokenizer.add_tokens(add_tokens, offset=offset)

But I got error

TypeError: add_tokens() got an unexpected keyword argument 'offset'

Are you using anything different?

Regards,
Ankur

Unable to run tfhub sample

I'm trying to run the tfhub sample, got the following error
Model name 'tokenizer_tf2_qa' was not found in tokenizers model name list
(I did pip install transformers),
can you help?
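BertTokenizer.from_pretrained('tokenizer_tf2_qa') resolves that name as a local directory, so the released tokenizer archive has to be downloaded and unpacked first, exactly as demo.py above does:

import os
from transformers import BertTokenizer

# Fetch and unpack the tokenizer files released with the model (same steps as demo.py)
os.system('wget https://github.com/see--/natural-question-answering/releases/download/v0.0.1/tokenizer_tf2_qa.zip')
os.system('unzip tokenizer_tf2_qa.zip')
tokenizer = BertTokenizer.from_pretrained('tokenizer_tf2_qa')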

TypeError: add_tokens() got an unexpected keyword argument 'offset'

Hi everyone,
I try to run this repo, but I met error below:

Traceback (most recent call last):  
  File "train_eval.py", line 481, in <module>
    main()
  File "train_eval.py", line 437, in main
    num_added = tokenizer.add_tokens(add_tokens, offset=offset)
TypeError: add_tokens() got an unexpected keyword argument 'offset' 

My environment:

  • Python: 3.7.0
  • Transformers: 2.2.0
  • Tensorflow: 2.0

I checked every released version of transformers, but I haven't found one whose add_tokens method has an offset argument.
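A likely resolution, judging from the repository's own setup steps: the offset keyword presumably comes from the modified transformers bundled in transformers_repo, not from any PyPI release. Installing the bundled copy (the same commands as in the Training / Inference section) should make the argument available:

cd natural-question-answering/transformers_repo
sudo python3 setup.py install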
