drqa's Introduction

DrQA

This is a PyTorch implementation of the DrQA system described in the ACL 2017 paper Reading Wikipedia to Answer Open-Domain Questions.

Machine Reading at Scale

DrQA is a system for reading comprehension applied to open-domain question answering. In particular, DrQA is targeted at the task of "machine reading at scale" (MRS). In this setting, we are searching for an answer to a question in a potentially very large corpus of unstructured documents (that may not be redundant). Thus the system has to combine the challenges of document retrieval (finding the relevant documents) with that of machine comprehension of text (identifying the answers from those documents).

Our experiments with DrQA focus on answering factoid questions while using Wikipedia as the unique knowledge source for documents. Wikipedia is a well-suited source of large-scale, rich, detailed information. In order to answer any question, one must first retrieve the few potentially relevant articles among more than 5 million, and then scan them carefully to identify the answer.

Note that DrQA treats Wikipedia as a generic collection of articles and does not rely on its internal graph structure. As a result, DrQA can be straightforwardly applied to any collection of documents, as described in the retriever README.

This repository includes code, data, and pre-trained models for processing and querying Wikipedia as described in the paper -- see Trained Models and Data. We also list several different datasets for evaluation, see QA Datasets. Note that this work is a refactored and more efficient version of the original code. Reproduction numbers are very similar but not exact.

Quick Start: Demo

Install DrQA and download our models to start asking open-domain questions!

Run python scripts/pipeline/interactive.py to drop into an interactive session. For each question, the top span and the Wikipedia paragraph it came from are returned.

>>> process('What is question answering?')

Top Predictions:
+------+----------------------------------------------------------------------------------------------------------+--------------------+--------------+-----------+
| Rank |                                                  Answer                                                  |        Doc         | Answer Score | Doc Score |
+------+----------------------------------------------------------------------------------------------------------+--------------------+--------------+-----------+
|  1   | a computer science discipline within the fields of information retrieval and natural language processing | Question answering |    1917.8    |   327.89  |
+------+----------------------------------------------------------------------------------------------------------+--------------------+--------------+-----------+

Contexts:
[ Doc = Question answering ]
Question Answering (QA) is a computer science discipline within the fields of
information retrieval and natural language processing (NLP), which is
concerned with building systems that automatically answer questions posed by
humans in a natural language.
>>> process('What is the answer to life, the universe, and everything?')

Top Predictions:
+------+--------+---------------------------------------------------+--------------+-----------+
| Rank | Answer |                        Doc                        | Answer Score | Doc Score |
+------+--------+---------------------------------------------------+--------------+-----------+
|  1   |   42   | Phrases from The Hitchhiker's Guide to the Galaxy |    47242     |   141.26  |
+------+--------+---------------------------------------------------+--------------+-----------+

Contexts:
[ Doc = Phrases from The Hitchhiker's Guide to the Galaxy ]
The number 42 and the phrase, "Life, the universe, and everything" have
attained cult status on the Internet. "Life, the universe, and everything" is
a common name for the off-topic section of an Internet forum and the phrase is
invoked in similar ways to mean "anything at all". Many chatbots, when asked
about the meaning of life, will answer "42". Several online calculators are
also programmed with the Question. Google Calculator will give the result to
"the answer to life the universe and everything" as 42, as will Wolfram's
Computational Knowledge Engine. Similarly, DuckDuckGo also gives the result of
"the answer to the ultimate question of life, the universe and everything" as
42. In the online community Second Life, there is a section on a sim called
"42nd Life." It is devoted to this concept in the book series, and several
attempts at recreating Milliways, the Restaurant at the End of the Universe, were made.
>>> process('Who was the winning pitcher in the 1956 World Series?')

Top Predictions:
+------+------------+------------------+--------------+-----------+
| Rank |   Answer   |       Doc        | Answer Score | Doc Score |
+------+------------+------------------+--------------+-----------+
|  1   | Don Larsen | New York Yankees |  4.5059e+06  |   278.06  |
+------+------------+------------------+--------------+-----------+

Contexts:
[ Doc = New York Yankees ]
In 1954, the Yankees won over 100 games, but the Indians took the pennant with
an AL record 111 wins; 1954 was famously referred to as "The Year the Yankees
Lost the Pennant". In , the Dodgers finally beat the Yankees in the World
Series, after five previous Series losses to them, but the Yankees came back
strong the next year. On October 8, 1956, in Game Five of the 1956 World
Series against the Dodgers, pitcher Don Larsen threw the only perfect game in
World Series history, which remains the only perfect game in postseason play
and was the only no-hitter of any kind to be pitched in postseason play until
Roy Halladay pitched a no-hitter on October 6, 2010.

Try some of your own! Of course, DrQA might provide alternative facts, so enjoy the ride.

Installing DrQA

Setting up DrQA is easy!

DrQA requires Linux/OSX and Python 3.5 or higher. It also requires installing PyTorch version 1.0. Its other dependencies are listed in requirements.txt. CUDA is strongly recommended for speed, but not necessary.

Run the following commands to clone the repository and install DrQA:

git clone https://github.com/facebookresearch/DrQA.git
cd DrQA; pip install -r requirements.txt; python setup.py develop

Note: requirements.txt includes a subset of all the possible required packages. Depending on what you want to run, you might need to install an extra package (e.g. spacy).

If you use the CoreNLPTokenizer or SpacyTokenizer you also need to download the Stanford CoreNLP jars and spaCy en model, respectively. If you use Stanford CoreNLP, have the jars in your java CLASSPATH environment variable, or set the path programmatically with:

import drqa.tokenizers
drqa.tokenizers.set_default('corenlp_classpath', '/your/corenlp/classpath/*')

IMPORTANT: The default tokenizer is CoreNLP so you will need that in your CLASSPATH to run the README examples.

Ex: export CLASSPATH=$CLASSPATH:/path/to/corenlp/download/*.

If you do not already have a CoreNLP download you can run:

./install_corenlp.sh

Verify that it runs:

from drqa.tokenizers import CoreNLPTokenizer
tok = CoreNLPTokenizer()
tok.tokenize('hello world').words()  # Should complete immediately

For convenience, the Document Reader, Retriever, and Pipeline modules will try to load default models if no model argument is given. See below for downloading these models.

Trained Models and Data

To download all provided trained models and data for Wikipedia question answering, run:

./download.sh

Warning: this downloads a 7.5GB tarball (25GB untarred) and will take some time.

This stores the data in data/ at the file paths specified in the various modules' defaults. This top-level directory can be modified by setting a DRQA_DATA environment variable to point to somewhere else.
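
For example, assuming the default paths are resolved when drqa is first imported (an assumption about import-time behavior), the variable can be set in Python before the import; the path below is a placeholder:

import os

# Assumption: DRQA_DATA is read when drqa is first imported, so it must be
# set beforehand. Replace the placeholder path with your own data directory.
os.environ['DRQA_DATA'] = '/mnt/storage/drqa_data'

import drqa.reader  # imported after setting the variable, on purpose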

Default directory structure (see embeddings for more info on additional downloads for training):

DrQA
├── data (or $DRQA_DATA)
    ├── datasets
    │   ├── SQuAD-v1.1-<train/dev>.<txt/json>
    │   ├── WebQuestions-<train/test>.txt
    │   ├── freebase-entities.txt
    │   ├── CuratedTrec-<train/test>.txt
    │   └── WikiMovies-<train/test/entities>.txt
    ├── reader
    │   ├── multitask.mdl
    │   └── single.mdl
    └── wikipedia
        ├── docs.db
        └── docs-tfidf-ngram=2-hash=16777216-tokenizer=simple.npz

Default model paths for the different modules can also be modified programmatically in the code, e.g.:

import drqa.reader
drqa.reader.set_default('model', '/path/to/model')
reader = drqa.reader.Predictor()  # Default model loaded for prediction

Document Retriever

TF-IDF model using Wikipedia (unigrams and bigrams, 2^24 bins, simple tokenization), evaluated on multiple datasets (test sets, dev set for SQuAD):

Model         SQuAD P@5  CuratedTREC P@5  WebQuestions P@5  WikiMovies P@5  Size
TF-IDF model  78.0       87.6             75.0              69.8            ~13GB

P@5 here is defined as the % of questions for which the answer segment appears in one of the top 5 documents.
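
As a rough sketch of the metric (not the project's evaluation code; scripts/retriever/eval.py is the real implementation, and `retrieve` / `has_answer` below are placeholder hooks):

def precision_at_k(questions, retrieve, has_answer, k=5):
    """% of questions whose answer appears in one of the top-k docs.

    `retrieve(question, k)` returns the top-k doc ids from the ranker;
    `has_answer(answers, doc_id)` checks whether any answer string
    appears in that document. Both are placeholders here.
    """
    hits = 0
    for q in questions:
        top_docs = retrieve(q['question'], k)
        if any(has_answer(q['answer'], doc_id) for doc_id in top_docs):
            hits += 1
    return 100.0 * hits / len(questions)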

Document Reader

Model trained only on SQuAD, evaluated in the SQuAD setting:

Model         SQuAD Dev EM  SQuAD Dev F1  Size
Single model  69.4          78.9          ~130MB

Model trained with distant supervision without NER/POS/lemma features, evaluated on multiple datasets (test sets, dev set for SQuAD) in the full Wikipedia setting:

Model            SQuAD EM  CuratedTREC EM  WebQuestions EM  WikiMovies EM  Size
Multitask model  29.5      27.2            18.5             36.9           ~270MB

Wikipedia

Our full-scale experiments were conducted on the 2016-12-21 dump of English Wikipedia. The dump was processed with the WikiExtractor and filtered for internal disambiguation, list, index, and outline pages (pages that are typically just links). We store the documents in an sqlite database for which drqa.retriever.DocDB provides an interface.
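
A minimal sketch of reading documents back out through that interface (method names follow drqa/retriever/doc_db.py; the path assumes the default download location):

from drqa.retriever import DocDB

db = DocDB(db_path='data/wikipedia/docs.db')
doc_ids = db.get_doc_ids()               # all article ids (~5 million)
print(len(doc_ids))
print(db.get_doc_text(doc_ids[0])[:80])  # first characters of one article
db.close()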

Database   Num. Documents  Size
Wikipedia  5,075,182       ~13GB

QA Datasets

The datasets used for DrQA training and evaluation are SQuAD, WebQuestions, CuratedTrec, and WikiMovies (downloaded by download.sh into data/datasets). They are used in two formats:

Format A

The retriever/eval.py, pipeline/eval.py, and distant/generate.py scripts expect the datasets as a .txt file where each line is a JSON encoded QA pair, like so:

'{"question": "q1", "answer": ["a11", ..., "a1i"]}'
...
'{"question": "qN", "answer": ["aN1", ..., "aNi"]}'

Scripts to convert SQuAD and WebQuestions to this format are included in scripts/convert. This is automatically done in download.sh.

Format B

The reader directory scripts expect the datasets as a .json file where the data is arranged like SQuAD:

file.json
├── "data"
│   └── [i]
│       ├── "paragraphs"
│       │   └── [j]
│       │       ├── "context": "paragraph text"
│       │       └── "qas"
│       │           └── [k]
│       │               ├── "answers"
│       │               │   └── [l]
│       │               │       ├── "answer_start": N
│       │               │       └── "text": "answer"
│       │               ├── "id": "<uuid>"
│       │               └── "question": "paragraph question?"
│       └── "title": "document id"
└── "version": 1.1
Entity lists

Some datasets have (potentially large) candidate lists for selecting answers. For example, WikiMovies' answers are OMDb entries while WebQuestions is based on Freebase. If we have known candidates, we can impose that all predicted answers must be in this list by discarding any higher scoring spans that are not.
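
A hedged sketch of that filtering step (the function and variable names here are illustrative, not the pipeline's actual code):

def filter_by_candidates(spans, candidate_file):
    """Keep only predicted (text, score) spans found in the candidate list."""
    with open(candidate_file) as f:
        candidates = {line.strip().lower() for line in f}
    return [(text, score) for text, score in spans
            if text.lower() in candidates]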

DrQA Components

Document Retriever

DrQA is not tied to any specific type of retrieval system -- as long as it effectively narrows the search space and focuses on relevant documents.

Following classical QA systems, we include an efficient (non-machine learning) document retrieval system based on sparse, TF-IDF weighted bag-of-word vectors. We use bags of hashed n-grams (here, unigrams and bigrams).
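
As an illustrative approximation of that hashing trick (the real code lives in drqa/retriever and uses a different hash function; the 2^24 bins match the released model):

from collections import Counter
import hashlib

NUM_BINS = 2 ** 24  # number of hash buckets, as in the released model

def hash_ngrams(tokens):
    """Map unigram and bigram counts into a fixed number of hash bins."""
    ngrams = list(tokens) + [' '.join(p) for p in zip(tokens, tokens[1:])]
    return Counter(
        int(hashlib.md5(g.encode()).hexdigest(), 16) % NUM_BINS
        for g in ngrams
    )  # {bin index: count}: a sparse bag-of-(hashed)-words vector

print(hash_ngrams(['reading', 'wikipedia', 'to', 'answer', 'questions']))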

To see how to build your own such model on new documents, see the retriever README.

To interactively query Wikipedia:

python scripts/retriever/interactive.py --model /path/to/model

If model is left out our default model will be used (assuming it was downloaded).

To evaluate the retriever accuracy (% match in top 5) on a dataset:

python scripts/retriever/eval.py /path/to/format/A/dataset.txt --model /path/to/model

Document Reader

DrQA's Document Reader is a multi-layer recurrent neural network machine comprehension model trained to do extractive question answering. That is, the model tries to find the answer to any question as a text span in one of the returned documents.

The Document Reader was inspired by, and primarily trained on, the SQuAD dataset. It can also be used standalone on such SQuAD-like tasks where a specific context is supplied with the question, the answer to which is contained in the context.
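
For standalone use, a minimal sketch with the Predictor interface (default model assumed downloaded; the return format, a ranked list of span/score pairs, is an assumption based on the reader scripts):

from drqa.reader import Predictor

predictor = Predictor()  # loads the default Document Reader model
document = ('Question Answering (QA) is a computer science discipline '
            'within the fields of information retrieval and natural '
            'language processing.')
question = 'What is question answering?'
predictions = predictor.predict(document, question, top_n=1)
print(predictions[0])  # assumed to be an (answer span, score) pair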

To see how to train the Document Reader on SQuAD, see the reader README.

To interactively ask questions about text with a trained model:

python scripts/reader/interactive.py --model /path/to/model

Again, here model is optional; a default model will be used if it is left out.

To run model predictions on a dataset:

python scripts/reader/predict.py /path/to/format/B/dataset.json --model /path/to/model

DrQA Pipeline

The full system is linked together in drqa.pipeline.DrQA.
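
A minimal sketch of driving it as a library instead of through the interactive script (default models assumed downloaded; process mirrors the signature shown in the demo above, and the prediction dict keys are an assumption based on the demo's output table):

from drqa import pipeline

# Uses the default ranker, reader, and doc db; the default tokenizer is
# CoreNLP, so the CoreNLP jars must be on the Java CLASSPATH.
drqa_system = pipeline.DrQA()
predictions = drqa_system.process(
    'What is question answering?', candidates=None, top_n=1, n_docs=5
)
for p in predictions:
    # Keys like 'span' and 'doc_id' are assumptions from the demo table.
    print(p['span'], p['doc_id'], p['span_score'], p['doc_score'])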

To interactively ask questions using the full DrQA:

python scripts/pipeline/interactive.py

Optional arguments:

--reader-model    Path to trained Document Reader model.
--retriever-model Path to Document Retriever model (tfidf).
--doc-db          Path to Document DB.
--tokenizer       String option specifying tokenizer type to use (e.g. 'corenlp').
--candidate-file  List of candidates to restrict predictions to, one candidate per line.
--no-cuda         Use CPU only.
--gpu             Specify GPU device id to use.

To run predictions on a dataset:

python scripts/pipeline/predict.py /path/to/format/A/dataset.txt

Optional arguments:

--out-dir             Directory to write prediction file to (<dataset>-<model>-pipeline.preds).
--reader-model        Path to trained Document Reader model.
--retriever-model     Path to Document Retriever model (tfidf).
--doc-db              Path to Document DB.
--embedding-file      Expand dictionary to use all pretrained embeddings in this file (e.g. all glove vectors to minimize UNKs at test time).
--candidate-file      List of candidates to restrict predictions to, one candidate per line.
--n-docs              Number of docs to retrieve per query.
--top-n               Number of predictions to make per query.
--tokenizer           String option specifying tokenizer type to use (e.g. 'corenlp').
--no-cuda             Use CPU only.
--gpu                 Specify GPU device id to use.
--parallel            Use data parallel (split across GPU devices).
--num-workers         Number of CPU processes (for tokenizing, etc).
--batch-size          Document paragraph batching size (Reduce in case of GPU OOM).
--predict-batch-size  Question batching size (Reduce in case of CPU OOM).

Distant Supervision (DS)

DrQA's performance improves significantly in the full-setting when provided with distantly supervised data from additional datasets. Given question-answer pairs but no supporting context, we can use string matching heuristics to automatically associate paragraphs to these training examples.

Question: What U.S. state’s motto is “Live free or Die”?

Answer: New Hampshire

DS Document: Live Free or Die “Live Free or Die” is the official motto of the U.S. state of New Hampshire, adopted by the state in 1945. It is possibly the best-known of all state mottos, partly because it conveys an assertive independence historically found in American political philosophy and partly because of its contrast to the milder sentiments found in other state mottos.
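
A sketch of the core matching heuristic under those assumptions (the real generator in scripts/distant/generate.py is more involved; it works on tokenized text and applies additional filters):

def distant_examples(question, answers, paragraphs):
    """Pair a QA example with every retrieved paragraph containing an answer."""
    for para in paragraphs:
        for ans in answers:
            if ans.lower() in para.lower():
                yield {'question': question, 'answer': ans, 'context': para}

examples = list(distant_examples(
    'What U.S. state\'s motto is "Live free or Die"?',
    ['New Hampshire'],
    ['"Live Free or Die" is the official motto of the U.S. state of '
     'New Hampshire, adopted by the state in 1945.'],
))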

The scripts/distant directory contains code to generate and inspect such distantly supervised data. More information can be found in the distant supervision README.

Tokenizers

We provide a number of different tokenizer options for convenience. Each has its own pros/cons based on how many dependencies it requires, overhead for running it, speed, and performance. For our reported experiments we used CoreNLP (but results are all similar).

Available tokenizers:

  • CoreNLPTokenizer: Uses Stanford CoreNLP (option: 'corenlp'). We used v3.7.0. Requires Java 8.
  • SpacyTokenizer: Uses spaCy (option: 'spacy').
  • RegexpTokenizer: Custom regex-based PTB-style tokenizer (option: 'regexp').
  • SimpleTokenizer: Basic alpha-numeric/non-whitespace tokenizer (option: 'simple').

See the list of mappings between string option names and tokenizer classes.
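
For example, the dependency-free SimpleTokenizer can be exercised directly (a minimal sketch; the same tokenize(...).words() interface is shown for CoreNLP above):

from drqa.tokenizers import SimpleTokenizer

# SimpleTokenizer needs no Java or spaCy setup, unlike CoreNLP/spaCy.
tok = SimpleTokenizer()
print(tok.tokenize('Reading Wikipedia to answer open-domain questions.').words())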

Citation

Please cite the ACL paper if you use DrQA in your work:

@inproceedings{chen2017reading,
  title={Reading {Wikipedia} to Answer Open-Domain Questions},
  author={Chen, Danqi and Fisch, Adam and Weston, Jason and Bordes, Antoine},
  booktitle={Association for Computational Linguistics (ACL)},
  year={2017}
}

DrQA Elsewhere

Connection with ParlAI

This implementation of the DrQA Document Reader is closely related to the one found in ParlAI. Here, however, the work is extended to interact with the Document Retriever in the open-domain setting. On the other hand, the implementation in ParlAI is more general, and follows the appropriate API to work in more QA/Dialog settings.

Web UI

Hamed Zaghaghi has provided a wrapper for a Web UI.

License

DrQA is BSD-licensed.

drqa's People

Contributors

ajfisch, dfenglei, hpasapp, jaseweston, lbaligand, ousou, pemazare, rashoodkhan, stephenroller, sundeep

drqa's Issues

Distant Supervision

Hi,
Love your work! I would like to know more about getting Distant Supervision to work, especially specifics on how to generate and train on distantly supervised data.
Thanks,
Joe

Timeouts

I'm getting timeouts when I try to run the pipeline:

[amos:~/projects/DrQA] [pytorch] master* ± python ./scripts/pipeline/interactive.py --no-cuda --tokenizer=corenlp
07/26/2017 08:27:04 PM: [ Running on CPU only. ]
07/26/2017 08:27:04 PM: [ Initializing pipeline... ]
07/26/2017 08:27:04 PM: [ Initializing document ranker... ]
07/26/2017 08:27:04 PM: [ Loading /Users/amos/projects/DrQA/data/wikipedia/docs-tfidf-ngram=2-hash=16777216-tokenizer=simple.npz ]
07/26/2017 08:27:49 PM: [ Initializing model... ]
07/26/2017 08:27:49 PM: [ Loading model /Users/amos/projects/DrQA/data/reader/multitask.mdl ]
07/26/2017 08:27:52 PM: [ Initializing tokenizers and document retrievers... ]

Interactive DrQA
>> process(question, candidates=None, top_n=1, n_docs=5)
>> usage()

>>> process("How many states in the United States?")
07/26/2017 08:28:25 PM: [ Processing 1 queries... ]
07/26/2017 08:28:25 PM: [ Retrieving top 5 docs... ]
Process ForkPoolWorker-1:
Process ForkPoolWorker-2:
Process ForkPoolWorker-3:
Process ForkPoolWorker-4:
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
  File "/usr/local/anaconda3/envs/pytorch/lib/python3.5/site-packages/pexpect/expect.py", line 99, in expect_loop
    incoming = spawn.read_nonblocking(spawn.maxread, timeout)
  File "/usr/local/anaconda3/envs/pytorch/lib/python3.5/site-packages/pexpect/pty_spawn.py", line 462, in read_nonblocking
    raise TIMEOUT('Timeout exceeded.')
  File "/usr/local/anaconda3/envs/pytorch/lib/python3.5/site-packages/pexpect/expect.py", line 99, in expect_loop
    incoming = spawn.read_nonblocking(spawn.maxread, timeout)
  File "/usr/local/anaconda3/envs/pytorch/lib/python3.5/site-packages/pexpect/expect.py", line 99, in expect_loop
    incoming = spawn.read_nonblocking(spawn.maxread, timeout)
  File "/usr/local/anaconda3/envs/pytorch/lib/python3.5/site-packages/pexpect/pty_spawn.py", line 462, in read_nonblocking
    raise TIMEOUT('Timeout exceeded.')
Traceback (most recent call last):
pexpect.exceptions.TIMEOUT: Timeout exceeded.
pexpect.exceptions.TIMEOUT: Timeout exceeded.
  File "/usr/local/anaconda3/envs/pytorch/lib/python3.5/site-packages/pexpect/expect.py", line 99, in expect_loop
    incoming = spawn.read_nonblocking(spawn.maxread, timeout)

During handling of the above exception, another exception occurred:

  File "/usr/local/anaconda3/envs/pytorch/lib/python3.5/site-packages/pexpect/pty_spawn.py", line 462, in read_nonblocking
    raise TIMEOUT('Timeout exceeded.')
  File "/usr/local/anaconda3/envs/pytorch/lib/python3.5/site-packages/pexpect/pty_spawn.py", line 462, in read_nonblocking
    raise TIMEOUT('Timeout exceeded.')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
Traceback (most recent call last):
pexpect.exceptions.TIMEOUT: Timeout exceeded.
pexpect.exceptions.TIMEOUT: Timeout exceeded.
  File "/usr/local/anaconda3/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()

During handling of the above exception, another exception occurred:

  File "/usr/local/anaconda3/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/anaconda3/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/usr/local/anaconda3/envs/pytorch/lib/python3.5/multiprocessing/pool.py", line 103, in worker
    initializer(*initargs)
Traceback (most recent call last):
  File "/Users/amos/projects/DrQA/drqa/pipeline/drqa.py", line 39, in init
    PROCESS_TOK = tokenizer_class(**tokenizer_opts)
  File "/usr/local/anaconda3/envs/pytorch/lib/pytho  File "/usr/local/anaconda3/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
n3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/amos/projects/DrQA/drqa/tokenizers/corenlp_tokenizer.py", line 33, in __init__
    self._launch()
  File "/usr/local/anaconda3/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/anaconda3/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/usr/local/anaconda3/envs/pytorch/lib/python3.5/multiprocessing/pool.py", line 103, in worker
    initializer(*initargs)
  File "/Users/amos/projects/DrQA/drqa/pipeline/drqa.py", line 39, in init
    PROCESS_TOK = tokenizer_class(**tokenizer_opts)
  File "/usr/local/anaconda3/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/amos/projects/DrQA/drqa/tokenizers/corenlp_tokenizer.py", line 33, in __init__
    self._launch()
  File "/usr/local/anaconda3/envs/pytorch/lib/python3.5/multiprocessing/pool.py", line 103, in worker
    initializer(*initargs)
  File "/usr/local/anaconda3/envs/pytorch/lib/python3.5/multiprocessing/pool.py", line 103, in worker
    initializer(*initargs)
  File "/Users/amos/projects/DrQA/drqa/tokenizers/corenlp_tokenizer.py", line 61, in _launch
    self.corenlp.expect_exact('NLP>', searchwindowsize=100)
  File "/Users/amos/projects/DrQA/drqa/tokenizers/corenlp_tokenizer.py", line 61, in _launch
    self.corenlp.expect_exact('NLP>', searchwindowsize=100)
  File "/usr/local/anaconda3/envs/pytorch/lib/python3.5/site-packages/pexpect/spawnbase.py", line 390, in expect_exact
    return exp.expect_loop(timeout)
  File "/Users/amos/projects/DrQA/drqa/pipeline/drqa.py", line 39, in init
    PROCESS_TOK = tokenizer_class(**tokenizer_opts)
  File "/usr/local/anaconda3/envs/pytorch/lib/python3.5/site-packages/pexpect/expect.py", line 107, in expect_loop
    return self.timeout(e)
  File "/Users/amos/projects/DrQA/drqa/tokenizers/corenlp_tokenizer.py", line 33, in __init__
    self._launch()
  File "/usr/local/anaconda3/envs/pytorch/lib/python3.5/site-packages/pexpect/spawnbase.py", line 390, in expect_exact
    return exp.expect_loop(timeout)
  File "/Users/amos/projects/DrQA/drqa/pipeline/drqa.py", line 39, in init
    PROCESS_TOK = tokenizer_class(**tokenizer_opts)
  File "/usr/local/anaconda3/envs/pytorch/lib/python3.5/site-packages/pexpect/expect.py", line 107, in expect_loop
    return self.timeout(e)
  File "/Users/amos/projects/DrQA/drqa/tokenizers/corenlp_tokenizer.py", line 61, in _launch
    self.corenlp.expect_exact('NLP>', searchwindowsize=100)
  File "/usr/local/anaconda3/envs/pytorch/lib/python3.5/site-packages/pexpect/expect.py", line 70, in timeout
    raise TIMEOUT(msg)
  File "/usr/local/anaconda3/envs/pytorch/lib/python3.5/site-packages/pexpect/spawnbase.py", line 390, in expect_exact
    return exp.expect_loop(timeout)
  File "/usr/local/anaconda3/envs/pytorch/lib/python3.5/site-packages/pexpect/expect.py", line 70, in timeout
    raise TIMEOUT(msg)
  File "/Users/amos/projects/DrQA/drqa/tokenizers/corenlp_tokenizer.py", line 33, in __init__
    self._launch()
  File "/usr/local/anaconda3/envs/pytorch/lib/python3.5/site-packages/pexpect/expect.py", line 107, in expect_loop
    return self.timeout(e)
  File "/Users/amos/projects/DrQA/drqa/tokenizers/corenlp_tokenizer.py", line 61, in _launch
    self.corenlp.expect_exact('NLP>', searchwindowsize=100)
pexpect.exceptions.TIMEOUT: Timeout exceeded.
<pexpect.pty_spawn.spawn object at 0x425d10c18>
command: /bin/bash
args: ['/bin/bash']
buffer (last 100 chars): b'm~/projects/DrQA\x1b(B\x1b[m] [\x1b[36mpytorch\x1b(B\x1b[m] \x1b[32mmaster\x1b[31m*\x1b[31m\x1b(B\x1b[m \x1b[35m1\x1b(B\x1b[m \x1b[1m\xc2\xb1\x1b(B\x1b[m '
before (last 100 chars): b'm~/projects/DrQA\x1b(B\x1b[m] [\x1b[36mpytorch\x1b(B\x1b[m] \x1b[32mmaster\x1b[31m*\x1b[31m\x1b(B\x1b[m \x1b[35m1\x1b(B\x1b[m \x1b[1m\xc2\xb1\x1b(B\x1b[m '
after: <class 'pexpect.exceptions.TI  File "/usr/local/anaconda3/envs/pytorch/lib/python3.5/site-packages/pexpect/expect.py", line 70, in timeout
    raise TIMEOUT(msg)
MEOUT'>
match: None
match_index: None
exitstatus: None
flag_eof: False
pid: 41203
child_fd: 16
closed: False
timeout: 60
delimiter: <class 'pexpect.exceptions.EOF'>
logfile: None
logfile_read: None
logfile_send: None
maxread: 100000
ignorecase: False
searchwindowsize: None
delaybeforesend: 0
delayafterclose: 0.1
delayafterterminate: 0.1
searcher: searcher_string:
    0: "b'NLP>'"
pexpect.exceptions.TIMEOUT: Timeout exceeded.

Is this a configuration issue? Any suggestions?

Problems in preprocess.py -> process_dataset and train.py

I want to train the DocReader, following the instructions in 'DrQA/scripts/reader/',
but when I run preprocess.py this happens:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/pexpect/expect.py", line 97, in expect_loop
    incoming = spawn.read_nonblocking(spawn.maxread, timeout)
  File "/usr/lib/python3/dist-packages/pexpect/pty_spawn.py", line 452, in read_nonblocking
    raise TIMEOUT('Timeout exceeded.')
pexpect.exceptions.TIMEOUT: Timeout exceeded.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 103, in worker
    initializer(*initargs)
  File "preprocess.py", line 30, in init
    TOK = tokenizer_class(**options)
  File "/home/dzj/facebook_mc/DrQA/drqa/tokenizers/corenlp_tokenizer.py", line 33, in __init__
    self._launch()
  File "/home/dzj/facebook_mc/DrQA/drqa/tokenizers/corenlp_tokenizer.py", line 61, in _launch
    self.corenlp.expect_exact('NLP>', searchwindowsize=100)
  File "/usr/lib/python3/dist-packages/pexpect/spawnbase.py", line 384, in expect_exact
    return exp.expect_loop(timeout)
  File "/usr/lib/python3/dist-packages/pexpect/expect.py", line 104, in expect_loop
    return self.timeout(e)
  File "/usr/lib/python3/dist-packages/pexpect/expect.py", line 68, in timeout
    raise TIMEOUT(msg)
pexpect.exceptions.TIMEOUT: Timeout exceeded.
<pexpect.pty_spawn.spawn object at 0x7fd4215bb588>
command: /bin/bash
args: ['/bin/bash']
searcher: None
buffer (last 100 chars): b'plearning2: /facebook_mc/DrQA/scripts/reader\x07root@deeplearning2:/facebook_mc/DrQA/scripts/reader# '
before (last 100 chars): b'plearning2: /facebook_mc/DrQA/scripts/reader\x07root@deeplearning2:/facebook_mc/DrQA/scripts/reader# '
after: <class 'pexpect.exceptions.TIMEOUT'>
match: None
match_index: None
exitstatus: None
flag_eof: False
pid: 76522
child_fd: 10
closed: False
timeout: 10
delimiter: <class 'pexpect.exceptions.EOF'>
logfile: None
logfile_read: None
logfile_send: None
maxread: 100000
ignorecase: False
searchwindowsize: None
delaybeforesend: 0
delayafterclose: 0.1
delayafterterminate: 0.1

Though when I do this in the DrQA or DrQA/scripts/reader/ directory, this bug doesn't happen:

from drqa import tokenizers
tok = tokenizers.CoreNLPTokenizer()
tok.tokenize(text).words()

So I rewrote a (not so good) preprocess.py, changing these two functions:

def tokenizeSingle(aWorker, text):
    """Tokenize the input text with the given tokenizer instance."""
    tokens = aWorker.tokenize(text)

    output = {
        'words': tokens.words(),
        'offsets': tokens.offsets(),
        'pos': tokens.pos(),
        'lemma': tokens.lemmas(),
        'ner': tokens.entities(),
    }
    return output
def process_dataset_test(data, tokenizer, workers=None):
    """Iterate processing (tokenize, parse, etc) dataset multithreaded."""
    aWorker = tokenizers.CoreNLPTokenizer()
    q_tokens = []
    for aStr in data['questions']:
        q_tokens.append(tokenizeSingle(aWorker,aStr))
    c_tokens = []
    for aStr in data['contexts']:
        c_tokens.append(tokenizeSingle(aWorker,aStr))
    for idx in range(len(data['qids'])):
        question = q_tokens[idx]['words']
        qlemma = q_tokens[idx]['lemma']
        document = c_tokens[data['qid2cid'][idx]]['words']
        offsets = c_tokens[data['qid2cid'][idx]]['offsets']
        lemma = c_tokens[data['qid2cid'][idx]]['lemma']
        pos = c_tokens[data['qid2cid'][idx]]['pos']
        ner = c_tokens[data['qid2cid'][idx]]['ner']
        ans_tokens = []
        if len(data['answers']) > 0:
            for ans in data['answers'][idx]:
                found = find_answer(offsets,
                                    ans['answer_start'],
                                    ans['answer_start'] + len(ans['text']))
                if found:
                    ans_tokens.append(found)
        yield {
            'id': data['qids'][idx],
            'question': question,
            'document': document,
            'offsets': offsets,
            'answers': ans_tokens,
            'qlemma': qlemma,
            'lemma': lemma,
            'pos': pos,
            'ner': ner,
        }

But I found that in the output the values of 'pos' and 'ner' are null.
I don't know whether I have written process_dataset() correctly.

Then when I run train.py, I get this bug...

09/30/2017 03:15:53 PM: [ Starting training... ]
THCudaCheck FAIL file=/b/wheel/pytorch-src/torch/lib/THC/THCCachingHostAllocator.cpp line=258 error=77 : an illegal memory access was encountered
Traceback (most recent call last):
  File "train.py", line 562, in <module>
    main(args)
  File "train.py", line 500, in main
    train(args, train_loader, model, stats)
  File "train.py", line 218, in train
    train_loss.update(*model.update(ex))
  File "/home/dzj/facebook_mc/DrQA/drqa/reader/model.py", line 218, in update
    score_s, score_e = self.network(*inputs)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/dzj/facebook_mc/DrQA/drqa/reader/rnn_reader.py", line 110, in forward
    training=self.training)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/functional.py", line 366, in dropout
    return functions.dropout.Dropout(p, training, inplace)(input)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/functions/dropout.py", line 29, in forward
    self.noise.bernoulli_(1 - self.p).div_(1 - self.p)
RuntimeError: Creating MTGP constants failed. at /b/wheel/pytorch-src/torch/lib/THC/THCTensorRandom.cu:33

1. Can you give me a sample of the output of preprocess.py?
2. Why does this TIMEOUT('Timeout exceeded.') happen? My test has passed...
3. What is the last bug?

Error running download.sh

On running download.sh I am getting the following error. Any insight into this?

data/
data/datasets/
data/datasets/CuratedTrec-test.txt
data/datasets/CuratedTrec-train.txt
data/datasets/WikiMovies-entities.txt
data/datasets/WikiMovies-test.txt
data/datasets/WikiMovies-train.txt
data/reader/
data/reader/multitask.mdl
data/reader/single.mdl
data/wikipedia/
data/wikipedia/docs-tfidf-ngram=2-hash=16777216-tokenizer=simple.npz
tar: Skipping to next header
tar: Exiting with failure status due to previous errors

DrQA integration with Solr

Hi Adam,

Thanks for the excellent QA system; we installed it and it is working well on the Wikipedia dataset.
We want to use DrQA with Solr as a backend. We already have a Solr server with crawled data and need to integrate it with the DrQA system. Can you please let us know what needs to be changed in the pipeline?

Thanks & Regards,
M Swathi Mithran

CoreNLP tokenizer never quits

When using DrQA as a library configured to use the CoreNLP tokenizer, each prediction made will start a new Java CoreNLP process. These processes never quit until the parent Python process quits.

#48 is a pull request to fix this.

Use of manual features in the model.py

Hello,

Could you please explain how the manual features extracted by build_feature_dict in utils.py are used in model.py?

I can see they are passed as an argument to RnnDocReader, but I couldn't figure out how they are used in the model or in rnn_reader.

Thanks

MemoryError

Hi. First of all, thanks for sharing this project. I am actually learning a lot from it.

I guess memory is a big problem when dealing with a large document collection. I was trying to run it on a PubMed corpus which has 800k+ documents.

When running the script to build the TF-IDF matrix (scripts/retriever/build_tfidf.py), it gives me a MemoryError as below:

Traceback (most recent call last):
  File "code/DrQA/scripts/retriever/build_tfidf.py", line 183, in <module>
    args, 'sqlite', {'db_path': args.db_path}
  File "code/DrQA/scripts/retriever/build_tfidf.py", line 123, in get_count_matrix
  File "/usr/local/lib/python3.5/dist-packages/scipy/sparse/compressed.py", line 51, in __init__
    other = self.__class__(coo_matrix(arg1, shape=shape))
  File "/usr/local/lib/python3.5/dist-packages/scipy/sparse/coo.py", line 149, in __init__
    self.row = np.array(row, copy=copy, dtype=idx_dtype)
MemoryError

I've tried to reduce the size of the dtype with no success. It either fails at sp.csr_matrix() or count_matrix.sum_duplicates(). Is there a way to properly handle this issue?

I think I have plenty of memory space, but not sure about the count matrix.

cat /proc/meminfo
MemTotal: 131914996 kB
MemFree: 56948712 kB
MemAvailable: 69331336 kB

ImportError: No module named 'pexpect'

(VENVpytorch3) mldl@mldlUB1604:~/ub16_prj/DrQA$ sudo -H pip3 install pexpect
Requirement already satisfied: pexpect in /usr/lib/python3/dist-packages
(VENVpytorch3) mldl@mldlUB1604:~/ub16_prj/DrQA$ python3
Python 3.5.2 (default, Nov 17 2016, 17:05:23)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> import drqa.reader
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/mldl/ub16_prj/DrQA/drqa/__init__.py", line 20, in <module>
    from . import tokenizers
  File "/home/mldl/ub16_prj/DrQA/drqa/tokenizers/__init__.py", line 20, in <module>
    from .corenlp_tokenizer import CoreNLPTokenizer
  File "/home/mldl/ub16_prj/DrQA/drqa/tokenizers/corenlp_tokenizer.py", line 14, in <module>
    import pexpect
ImportError: No module named 'pexpect'

MemoryError

Tried running interactive demo. Got the below error.

08/02/2017 03:41:14 PM: [ Initializing pipeline... ]
08/02/2017 03:41:14 PM: [ Initializing document ranker... ]
08/02/2017 03:41:14 PM: [ Loading /home/ritwik/smartron/rd/DrQA/data/wikipedia/docs-tfidf-ngram=2-hash=16777216-tokenizer=simple.npz ]
Traceback (most recent call last):
  File "scripts/pipeline/interactive.py", line 70, in <module>
    tokenizer=args.tokenizer
  File "/home/ritwik/smartron/rd/DrQA/drqa/pipeline/drqa.py", line 109, in __init__
    self.ranker = ranker_class(**ranker_opts)
  File "/home/ritwik/smartron/rd/DrQA/drqa/retriever/tfidf_doc_ranker.py", line 37, in __init__
    matrix, metadata = utils.load_sparse_csr(tfidf_path)
  File "/home/ritwik/smartron/rd/DrQA/drqa/retriever/utils.py", line 34, in load_sparse_csr
    matrix = sp.csr_matrix((loader['data'], loader['indices'],
  File "/usr/local/lib/python3.5/dist-packages/numpy/lib/npyio.py", line 233, in __getitem__
    pickle_kwargs=self.pickle_kwargs)
  File "/usr/local/lib/python3.5/dist-packages/numpy/lib/format.py", line 664, in read_array
    array = numpy.ndarray(count, dtype=dtype)
MemoryError

Clear doubts

Hi, can anyone please tell me the difference between the reader sections in DrQA/drqa/reader and DrQA/scripts/reader? What exactly is going on in these two sections, and what is the aim of each?

Scoring

Thanks very much for open sourcing DrQA. It's a great piece of work and I'm learning a lot.

I have a question about reader scoring. In the reader, are the scores for different questions on the same document comparable?

I.e., if I use the document https://en.wikipedia.org/wiki/Apple_Inc. and ask "Was Steve Jobs in charge of Apple?" and "Was John Sculley in charge of Apple?", is it reasonable to look at the scores of the answers to both of those and assume the one with the higher score is the better answer (in some way)?

Develop in Windows

AttributeError: '_io.TextIOWrapper' object has no attribute 'split' in setup.py at reqs.strip().split('\n')

  1. Changed
     with open('README.md') as f: readme = f.read()
     to
     readme = open('README.md', encoding="utf8")

How to get DrQA answers in JSON format?

Currently, I am getting answers using python scripts/pipeline/interactive.py and the format is something like this:

>>> process('What is question answering?')

Top Predictions:
+------+----------------------------------------------------------------------------------------------------------+--------------------+--------------+-----------+
| Rank |                                                  Answer                                                  |        Doc         | Answer Score | Doc Score |
+------+----------------------------------------------------------------------------------------------------------+--------------------+--------------+-----------+
|  1   | a computer science discipline within the fields of information retrieval and natural language processing | Question answering |    1917.8    |   327.89  |
+------+----------------------------------------------------------------------------------------------------------+--------------------+--------------+-----------+

Contexts:
[ Doc = Question answering ]
Question Answering (QA) is a computer science discipline within the fields of
information retrieval and natural language processing (NLP), which is
concerned with building systems that automatically answer questions posed by
humans in a natural language.

But it is not of much use for my project.
So can anyone tell me how to get the answers in JSON format, or in one line which I can....

Option to load multiple DrQA pipelines

Hi,
First of all, tons of thanks for this repo. I would like to load multiple DrQA pipelines to serve specific users separately; i.e., there are many individual groups of documents, and I should serve answers to specific people from a specific group of documents. So I planned to load multiple DrQA pipelines behind a web server, and chose Flask/Tornado to serve the DrQA results. But due to the multiple DrQA instances, resources are exhausted (each core is allocated to one DrQA pipeline) and I couldn't load any more pipelines.

Is there any option to use DrQA with a web server along with multi-model loading?

-Raisudeen.

Doc Score

I am trying to retrain the model using my own dataset and would like to know the performance of the re-trained model by calculating the answer and doc scores. Is there any formula that I can refer to?

By the way, it's a robust system; thanks for sharing!

cuda execution failed

python scripts/pipeline/interactive.py
08/12/2017 04:22:03 PM: [ CUDA enabled (GPU -1) ]
08/12/2017 04:22:03 PM: [ Initializing pipeline... ]
08/12/2017 04:22:03 PM: [ Initializing document ranker... ]
08/12/2017 04:22:03 PM: [ Loading /home/chenguanghui/DrQA/data/wikipedia/docs-tfidf-ngram=2-hash=16777216-tokenizer=simple.npz ]
08/12/2017 04:22:29 PM: [ Initializing model... ]
08/12/2017 04:22:29 PM: [ Loading model /home/chenguanghui/DrQA/data/reader/multitask.mdl ]
08/12/2017 04:22:36 PM: [ Initializing tokenizers and document retrievers... ]

Interactive DrQA
>> process(question, candidates=None, top_n=1, n_docs=5)
>> usage()

>>> process('Mars')
08/12/2017 04:25:06 PM: [ Processing 1 queries... ]
08/12/2017 04:25:06 PM: [ Retrieving top 5 docs... ]
08/12/2017 04:25:08 PM: [ Reading 421 paragraphs... ]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "scripts/pipeline/interactive.py", line 81, in process
    question, candidates, top_n, n_docs, return_context=True
  File "/home/chenguanghui/DrQA/drqa/pipeline/drqa.py", line 184, in process
    top_n, n_docs, return_context
  File "/home/chenguanghui/DrQA/drqa/pipeline/drqa.py", line 264, in process_batch
    handle = self.reader.predict(batch, async_pool=self.processes)
  File "/home/chenguanghui/DrQA/drqa/reader/model.py", line 287, in predict
    score_s, score_e = self.network(*inputs)
  File "/home/chenguanghui/anaconda3/lib/python3.5/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/chenguanghui/DrQA/drqa/reader/rnn_reader.py", line 122, in forward
    doc_hiddens = self.doc_rnn(torch.cat(drnn_input, 2), x1_mask)
  File "/home/chenguanghui/anaconda3/lib/python3.5/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/chenguanghui/DrQA/drqa/reader/layers.py", line 61, in forward
    output = self._forward_padded(x, x_mask)
  File "/home/chenguanghui/DrQA/drqa/reader/layers.py", line 137, in _forward_padded
    outputs.append(self.rnns[i](rnn_input)[0])
  File "/home/chenguanghui/anaconda3/lib/python3.5/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/chenguanghui/anaconda3/lib/python3.5/site-packages/torch/nn/modules/rnn.py", line 162, in forward
    output, hidden = func(input, self.all_weights, hx)
  File "/home/chenguanghui/anaconda3/lib/python3.5/site-packages/torch/nn/_functions/rnn.py", line 351, in forward
    return func(input, *fargs, **fkwargs)
  File "/home/chenguanghui/anaconda3/lib/python3.5/site-packages/torch/autograd/function.py", line 284, in _do_forward
    flat_output = super(NestedIOFunction, self)._do_forward(*flat_input)
  File "/home/chenguanghui/anaconda3/lib/python3.5/site-packages/torch/autograd/function.py", line 306, in forward
    result = self.forward_extended(*nested_tensors)
  File "/home/chenguanghui/anaconda3/lib/python3.5/site-packages/torch/nn/_functions/rnn.py", line 293, in forward_extended
    cudnn.rnn.forward(self, input, hx, weight, output, hy)
  File "/home/chenguanghui/anaconda3/lib/python3.5/site-packages/torch/backends/cudnn/rnn.py", line 319, in forward
    ctypes.c_void_p(workspace.data_ptr()), workspace.size(0)
  File "/home/chenguanghui/anaconda3/lib/python3.5/site-packages/torch/backends/cudnn/__init__.py", line 255, in check_error
    raise CuDNNError(status)
torch.backends.cudnn.CuDNNError: 8: b'CUDNN_STATUS_EXECUTION_FAILED'

BrokenPipeError

While building the TF-IDF model I get a BrokenPipeError after batch 9 of 11.
Any ideas as to what could be going wrong?

System: MacBook Pro
macOS: Sierra/3.3GHz intel core i7/16G memory

No answer, just top ranked documents

Hi,

I tried this model, but I don't seem to be able to get an "answer"; I can only see "Rank; Doc Id; Doc Score".

BTW, I used SpaCy as the tokenizer.

>>> process('Who was the winning pitcher in the 1956 World Series?')
+------+-------------------------+-----------+
| Rank |          Doc Id         | Doc Score |
+------+-------------------------+-----------+
|  1   | Yankees–Red Sox rivalry |   290.44  |
+------+-------------------------+-----------+
>>> process('Who won english primier league in 2015')
+------+--------------+-----------+
| Rank |    Doc Id    | Doc Score |
+------+--------------+-----------+
|  1   | Reza Bahmaei |   172.38  |
+------+--------------+-----------+
>>> process('What is the answer to life, the universe, and everything?')
+------+---------------------------------------------------+-----------+
| Rank |                       Doc Id                      | Doc Score |
+------+---------------------------------------------------+-----------+
|  1   | Phrases from The Hitchhiker's Guide to the Galaxy |   141.26  |
+------+---------------------------------------------------+-----------+
>>> quit()

`TIMEOUT: Timeout exceeded` error trying `tok = CoreNLPTokenizer()`

When I try

>>> from drqa.tokenizers import CoreNLPTokenizer
>>> tok = CoreNLPTokenizer()
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/pexpect/expect.py", line 99, in expect_loop
    incoming = spawn.read_nonblocking(spawn.maxread, timeout)
  File "/usr/local/lib/python3.5/dist-packages/pexpect/pty_spawn.py", line 462, in read_nonblocking
    raise TIMEOUT('Timeout exceeded.')
pexpect.exceptions.TIMEOUT: Timeout exceeded.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ritwik/rd/DrQA/drqa/tokenizers/corenlp_tokenizer.py", line 33, in __init__
    self._launch()
  File "/home/ritwik/rd/DrQA/drqa/tokenizers/corenlp_tokenizer.py", line 61, in _launch
    self.corenlp.expect_exact('NLP>', searchwindowsize=100)
  File "/usr/local/lib/python3.5/dist-packages/pexpect/spawnbase.py", line 390, in expect_exact
    return exp.expect_loop(timeout)
  File "/usr/local/lib/python3.5/dist-packages/pexpect/expect.py", line 107, in expect_loop
    return self.timeout(e)
  File "/usr/local/lib/python3.5/dist-packages/pexpect/expect.py", line 70, in timeout
    raise TIMEOUT(msg)
pexpect.exceptions.TIMEOUT: Timeout exceeded.
<pexpect.pty_spawn.spawn object at 0x7ff89a70f128>
command: /bin/bash
args: ['/bin/bash']
buffer (last 100 chars): b'@stagwiki: ~/rd/DrQA/data/corenlp\x07\x1b[01;32mritwik@stagwiki\x1b[00m:\x1b[01;34m~/rd/DrQA/data/corenlp\x1b[00m$ '
before (last 100 chars): b'@stagwiki: ~/rd/DrQA/data/corenlp\x07\x1b[01;32mritwik@stagwiki\x1b[00m:\x1b[01;34m~/rd/DrQA/data/corenlp\x1b[00m$ '
after: <class 'pexpect.exceptions.TIMEOUT'>
match: None
match_index: None
exitstatus: None
flag_eof: False
pid: 17048
child_fd: 5
closed: False
timeout: 60
delimiter: <class 'pexpect.exceptions.EOF'>
logfile: None
logfile_read: None
logfile_send: None
maxread: 100000
ignorecase: False
searchwindowsize: None
delaybeforesend: 0
delayafterclose: 0.1
delayafterterminate: 0.1
searcher: searcher_string:
    0: "b'NLP>'"

CLASSPATH is set properly

corenlp$ echo $CLASSPATH
/home/ritwik/rd/DrQA/data/corenlp/ejml-0.23.jar /home/ritwik/rd/DrQA/data/corenlp/javax.json-api-1.0-sources.jar /home/ritwik/rd/DrQA/data/corenlp/javax.json.jar /home/ritwik/rd/DrQA/data/corenlp/joda-time-2.9-sources.jar /home/ritwik/rd/DrQA/data/corenlp/joda-time.jar /home/ritwik/rd/DrQA/data/corenlp/jollyday-0.4.9-sources.jar /home/ritwik/rd/DrQA/data/corenlp/jollyday.jar /home/ritwik/rd/DrQA/data/corenlp/protobuf.jar /home/ritwik/rd/DrQA/data/corenlp/slf4j-api.jar /home/ritwik/rd/DrQA/data/corenlp/slf4j-simple.jar /home/ritwik/rd/DrQA/data/corenlp/stanford-corenlp-3.8.0.jar /home/ritwik/rd/DrQA/data/corenlp/stanford-corenlp-3.8.0-javadoc.jar /home/ritwik/rd/DrQA/data/corenlp/stanford-corenlp-3.8.0-models.jar /home/ritwik/rd/DrQA/data/corenlp/stanford-corenlp-3.8.0-sources.jar /home/ritwik/rd/DrQA/data/corenlp/xom-1.2.10-src.jar /home/ritwik/rd/DrQA/data/corenlp/xom.jar

Other tokenizers are not working

With spacy:

The spaCy tokenizer fails to import (some library error).

With simple and regexp, the below error is coming:
Traceback (most recent call last):
  File "/usr/lib/python3.5/code.py", line 91, in runcode
    exec(code, self.locals)
  File "<console>", line 1, in <module>
  File "scripts/reader/interactive.py", line 62, in process
    predictions = predictor.predict(document, question, candidates, top_n)
  File "/usr/local/lib/python3.5/dist-packages/drqa-0.1.0-py3.5.egg/drqa/reader/predictor.py", line 86, in predict
    results = self.predict_batch([(document, question, candidates,)], top_n)
  File "/usr/local/lib/python3.5/dist-packages/drqa-0.1.0-py3.5.egg/drqa/reader/predictor.py", line 126, in predict_batch
    batch_exs = batchify([vectorize(e, self.model) for e in examples])
  File "/usr/local/lib/python3.5/dist-packages/drqa-0.1.0-py3.5.egg/drqa/reader/predictor.py", line 126, in <listcomp>
    batch_exs = batchify([vectorize(e, self.model) for e in examples])
  File "/usr/local/lib/python3.5/dist-packages/drqa-0.1.0-py3.5.egg/drqa/reader/vector.py", line 33, in vectorize
    q_lemma = {w for w in ex['qlemma']} if args.use_lemma else None
TypeError: 'NoneType' object is not iterable

Transfer Learning

Fantastic work so far! I'm curious if you have done any work introducing transfer learning to DrQA, in a similar style to how this has been done so well for ImageNet?

I could see many use-cases of slightly customizing Wikipedia with a smaller corpus (a law firm's documents, medical dataset, etc) that aren't big enough by themselves to learn from.

TypeError: 'NoneType' object is not subscriptable when building the TF-IDF N-grams

When I try to build the TF-IDF N-grams using build_tfidf.py, I get this weird error.

(py3) C:\Users\cguzelha\PycharmProjects\DrQA>python scripts\retriever\build_tfidf.py data\manuals.db data\ --tokenizer spacy --num-workers 1
09/01/2017 01:59:27 PM: [ Counting words... ]
09/01/2017 01:59:27 PM: [ Mapping... ]
09/01/2017 01:59:27 PM: [ -------------------------Batch 1/10------------------------- ]
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "C:\Users\cguzelha\AppData\Local\Continuum\Anaconda3\envs\py3\lib\multiprocessing\pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "C:\Users\cguzelha\PycharmProjects\DrQA\scripts\retriever\build_tfidf.py", line 81, in count
    col.extend([DOC2IDX[doc_id]] * len(counts))
TypeError: 'NoneType' object is not subscriptable
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "scripts\retriever\build_tfidf.py", line 183, in <module>
    args, 'sqlite', {'db_path': args.db_path}
  File "scripts\retriever\build_tfidf.py", line 114, in get_count_matrix
    for b_row, b_col, b_data in workers.imap_unordered(_count, batch):
  File "C:\Users\cguzelha\AppData\Local\Continuum\Anaconda3\envs\py3\lib\multiprocessing\pool.py", line 735, in next
    raise value
TypeError: 'NoneType' object is not subscriptable

When I print the type of DOC2IDX in get_count_matrix, it is still a dictionary. However, when I print it again in count, it becomes NoneType. The input database manuals.db is built using build_db.py.

Run interactive.py error

Run this script:

python3 scripts/retriever/interactive.py --model data/reader/single.mdl

Got error:

➜  DrQA git:(master) python3 scripts/retriever/interactive.py --model data/reader/single.mdl 
07/30/2017 10:44:31 AM: [ Initializing ranker... ]
07/30/2017 10:44:31 AM: [ Loading data/reader/single.mdl ]
Traceback (most recent call last):
  File "scripts/retriever/interactive.py", line 27, in <module>
    ranker = retriever.get_class('tfidf')(tfidf_path=args.model)
  File "/media/jintian/Netac/CodeSpace/AISpace/pytorch_space/other_repos/DrQA/drqa/retriever/tfidf_doc_ranker.py", line 37, in __init__
    matrix, metadata = utils.load_sparse_csr(tfidf_path)
  File "/media/jintian/Netac/CodeSpace/AISpace/pytorch_space/other_repos/DrQA/drqa/retriever/utils.py", line 34, in load_sparse_csr
    matrix = sp.csr_matrix((loader['data'], loader['indices'],
TypeError: 'int' object is not subscriptable

Any idea?

python scripts/reader/interactive.py timed out?

I installed the dependencies (pip3 install -r requirements.txt) and ran the setup.py script with "develop". The data was also downloaded with download.sh. However, when I ran

python scripts/reader/interactive.py

I got the following error.

pexpect.exceptions.TIMEOUT: Timeout exceeded.
<pexpect.pty_spawn.spawn object at 0x7fd5912b47f0>

The 'retriever' did work. Can you please provide some pointers about it?

Thanks!

CC-BY-NC

Any plan or chance I could use it for a commercial application? Thanks.

OSError: [Errno 12] Cannot allocate memory

When trying to build the TF-IDF model, after the 8th batch I get the above-mentioned error:

System Information: Data Science Virtual Machine running Linux on Azure with 56G memory and 330G Hard drive.

I am running everything from a directory mounted on /dev/sdb1

The following is the descriptive error message:

10/24/2017 03:32:38 PM: [ -------------------------Batch 7/11------------------------- ]
10/24/2017 03:36:43 PM: [ -------------------------Batch 8/11------------------------- ]
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/anaconda/envs/py35/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "/anaconda/envs/py35/lib/python3.5/threading.py", line 862, in run
    self._target(*self._args, **self._kwargs)
  File "/anaconda/envs/py35/lib/python3.5/multiprocessing/pool.py", line 366, in _handle_workers
    pool._maintain_pool()
  File "/anaconda/envs/py35/lib/python3.5/multiprocessing/pool.py", line 240, in _maintain_pool
    self._repopulate_pool()
  File "/anaconda/envs/py35/lib/python3.5/multiprocessing/pool.py", line 233, in _repopulate_pool
    w.start()
  File "/anaconda/envs/py35/lib/python3.5/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/anaconda/envs/py35/lib/python3.5/multiprocessing/context.py", line 267, in _Popen
    return Popen(process_obj)
  File "/anaconda/envs/py35/lib/python3.5/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/anaconda/envs/py35/lib/python3.5/multiprocessing/popen_fork.py", line 67, in _launch
    self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory

There was a previous issue in the log that was closed stating that it was a Windows machine. Any pointers on this one?

UnicodeDecodeError

I was able to create a sqlite db earlier, as described in the retriever README, as a test, but after deleting the initial test db, I tried again several hours later and got a UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf9 in position 98: invalid start byte.

The text document has the same format as in the Readme: {"id": "doc1", "text": "text of doc1"}

Is there something I missed?

Full error message:

$ python build_db.py /home/HT/DrQA/data/test/ /home/HT/DrQA/data/test/test.db
10/02/2017 10:52:12 AM: [ Reading into database... ]
0%| | 0/2 [00:00<?, ?it/s]
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/home/HT/anaconda3/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "build_db.py", line 72, in get_contents
    for line in f:
  File "/home/HT/anaconda3/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf9 in position 98: invalid start byte
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "build_db.py", line 136, in <module>
    args.data_path, args.save_path, args.preprocess, args.num_workers
  File "build_db.py", line 109, in store_contents
    for pairs in tqdm(workers.imap_unordered(get_contents, files)):
  File "/home/HT/anaconda3/lib/python3.6/site-packages/tqdm/_tqdm.py", line 872, in __iter__
    for obj in iterable:
  File "/home/HT/anaconda3/lib/python3.6/multiprocessing/pool.py", line 699, in next
    raise value
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf9 in position 98: invalid start byte
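Since build_db.py fails on whichever file contains the bad byte, a minimal standalone sketch like the following (not part of DrQA) can locate the offending file:

import os

root = '/home/HT/DrQA/data/test/'  # the directory passed to build_db.py
for dirpath, _, filenames in os.walk(root):
    for name in filenames:
        path = os.path.join(dirpath, name)
        try:
            with open(path, encoding='utf-8') as f:
                f.read()
        except UnicodeDecodeError as err:
            print(path, err)  # this is the file the db build chokes on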

When I ran python setup.py develop, errors occurred

When I ran python setup.py develop, I got errors like this:
root@e93927f380d3:~/DrQA# python setup.py develop
Traceback (most recent call last):
  File "setup.py", line 12, in <module>
    readme = f.read()
  File "/root/anaconda3/lib/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 8661: ordinal not in range(128)

How can I fix this problem?
Thank you.
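This error usually means the interpreter's default encoding is ASCII (a locale issue, common in Docker containers) while README.md contains UTF-8 bytes. Exporting a UTF-8 locale before running (export LC_ALL=C.UTF-8) is one fix; another is passing the encoding explicitly where setup.py reads the readme. A minimal sketch, assuming the f.read() at line 12 comes from a plain open('README.md'):

with open('README.md', encoding='utf-8') as f:
    readme = f.read()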

answering from the given candidates

Suppose I want to supply a passage and do not want any other docs from Wikipedia to be considered. How should I use the model?

process(question, candidates) does not seem to look at the candidates at all; it still fetches answers from the Wikipedia docs.

Thanks
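For reference: in the pipeline, candidates restrict which answer spans the reader may return; document retrieval still runs over the whole database. If the goal is to read only a passage you supply yourself, the reader's interactive script is the more direct route. Its usage, if I recall the banner correctly, takes the document first:

>>> text = 'your passage here'
>>> process(text, 'your question', top_n=1)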

Training with Hindi wikipedia text

Hi Adam,

Thanks for the excellent QA system. We installed it and it is working well on the Wikipedia dataset.
We need to train the system on Hindi Wikipedia. Please let us know whether this is feasible and, if so, how we should proceed.

Thanks & Regards,
M Swathi Mithran

cuda runtime error (77)

I tried to run the demo on my local machine (Ubuntu 16.04.1 LTS (GNU/Linux 4.4.0-89-generic x86_64), 64 GB RAM, 2x TITAN X (Pascal)) using the following command:

python3 scripts/pipeline/interactive.py  

The command above succeeded. Following the instructions shown in the interactive env:

Interactive DrQA
>> process(question, candidates=None, top_n=1, n_docs=5)
>> usage()

I input:

process("who is bob dylan?")

and then I encountered the following exception prompt:

09/07/2017 03:06:52 PM: [ Processing 1 queries... ]
09/07/2017 03:06:52 PM: [ Retrieving top 5 docs... ]
09/07/2017 03:07:14 PM: [ Reading 459 paragraphs... ]
THCudaCheck FAIL file=/b/wheel/pytorch-src/torch/lib/THC/THCCachingHostAllocator.cpp line=258 error=77 : an illegal memory access was encountered
Traceback (most recent call last):
  File "/usr/lib/python3.5/code.py", line 91, in runcode
    exec(code, self.locals)
  File "<console>", line 1, in <module>
  File "scripts/pipeline/interactive.py", line 81, in process
    question, candidates, top_n, n_docs, return_context=True
  File "/home/zmx/facebook_mc/DrQA/drqa/pipeline/drqa.py", line 184, in process
    top_n, n_docs, return_context
  File "/home/zmx/facebook_mc/DrQA/drqa/pipeline/drqa.py", line 252, in process_batch
    for batch in self._get_loader(examples, num_loaders):
  File "/usr/local/lib/python3.5/dist-packages/torch/utils/data/dataloader.py", line 192, in __next__
    batch = pin_memory_batch(batch)
  File "/usr/local/lib/python3.5/dist-packages/torch/utils/data/dataloader.py", line 124, in pin_memory_batch
    return [pin_memory_batch(sample) for sample in batch]
  File "/usr/local/lib/python3.5/dist-packages/torch/utils/data/dataloader.py", line 124, in <listcomp>
    return [pin_memory_batch(sample) for sample in batch]
  File "/usr/local/lib/python3.5/dist-packages/torch/utils/data/dataloader.py", line 118, in pin_memory_batch
    return batch.pin_memory()
  File "/usr/local/lib/python3.5/dist-packages/torch/tensor.py", line 78, in pin_memory
    return type(self)().set_(storage.pin_memory()).view_as(self)
  File "/usr/local/lib/python3.5/dist-packages/torch/storage.py", line 84, in pin_memory
    return type(self)(self.size(), allocator=allocator).copy_(self)
RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /b/wheel/pytorch-src/torch/lib/THC/THCCachingHostAllocator.cpp:258

I noticed that the RAM almost ran out while GPU RAM usage stayed under 600 MB, so I tried to reduce the n_docs parameter and input:

process("who is bob dylan?", candidates=None, top_n=1, n_docs=1)

But it didn't work.

However, after I used --no-cuda, it finally worked.

python3 scripts/pipeline/interactive.py --no-cuda

The interaction is as follows:

09/07/2017 03:51:01 PM: [ Running on CPU only. ]
09/07/2017 03:51:01 PM: [ Initializing pipeline... ]
09/07/2017 03:51:01 PM: [ Initializing document ranker... ]
09/07/2017 03:51:01 PM: [ Loading /home/zmx/facebook_mc/DrQA/data/wikipedia/docs-tfidf-ngram=2-hash=16777216-tokenizer=simple.npz ]
09/07/2017 03:52:12 PM: [ Initializing document reader... ]
09/07/2017 03:52:12 PM: [ Loading model /home/zmx/facebook_mc/DrQA/data/reader/multitask.mdl ]
09/07/2017 03:52:18 PM: [ Initializing tokenizers and document retrievers... ]

Interactive DrQA
>> process(question, candidates=None, top_n=1, n_docs=5)
>> usage()

>>> process("who is bob dylan?")
09/07/2017 03:52:40 PM: [ Processing 1 queries... ]
09/07/2017 03:52:40 PM: [ Retrieving top 5 docs... ]
09/07/2017 03:52:42 PM: [ Reading 459 paragraphs... ]
09/07/2017 03:52:54 PM: [ Processed 1 queries in 14.5980 (s) ]
Top Predictions:
+------+------------+---------------------------+--------------+-----------+
| Rank |   Answer   |            Doc            | Answer Score | Doc Score |
+------+------------+---------------------------+--------------+-----------+
|  1   | songwriter | Another Side of Bob Dylan |    1169.7    |   254.78  |
+------+------------+---------------------------+--------------+-----------+

Contexts:
[ Doc = Another Side of Bob Dylan ]
Another Side of Bob Dylan is the fourth studio album by American singer and songwriter Bob Dylan, released on August 8, 1964 by Columbia Records.

Can anybody help with this? --no-cuda only works around the problem temporarily; it is too slow for interaction.
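For anyone else seeing this: error 77 on the pin_memory path is usually a PyTorch-build/driver mismatch rather than something in this repo. A quick standalone check (no DrQA involved) that exercises the same pinned host-to-device copy:

import torch

print(torch.__version__)
print(torch.cuda.is_available())
x = torch.randn(1000).pin_memory().cuda()  # the same pinned-memory path that failed above
print(x.sum())

If this snippet also raises error 77, the fix is likely a different PyTorch wheel or driver, not a DrQA change.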

Cannot allocate memory when the pipeline is used

Hello,

I am trying to use the pipeline with my own database, which contains 50 files. When I execute the script for the pipeline, it works fine. However, when I try to process a question, it crashes.

(py36) can@can-VirtualBox:~/DrQA$ python scripts/pipeline/interactive.py --reader-model data/reader/multitask.mdl --retriever-model data/manuals-tfidf-ngram\=2-hash\=16777216-tokenizer\=corenlp.npz --doc-db data/manuals.db --tokenizer corenlp
09/25/2017 04:20:50 PM: [ Running on CPU only. ]
09/25/2017 04:20:50 PM: [ Initializing pipeline... ]
09/25/2017 04:20:50 PM: [ Initializing document ranker... ]
09/25/2017 04:20:50 PM: [ Loading data/manuals-tfidf-ngram=2-hash=16777216-tokenizer=corenlp.npz ]
09/25/2017 04:20:52 PM: [ Initializing document reader... ]
09/25/2017 04:20:52 PM: [ Loading model data/reader/multitask.mdl ]
09/25/2017 04:20:58 PM: [ Initializing tokenizers and document retrievers... ]

Interactive DrQA  
>> process(question, candidates=None, top_n=1, n_docs=5)
>> usage()

>>> process('Which one of the following would stop the machine if two plugs at the same time would enter into one of the channels located on the hopper drum?')
09/25/2017 04:21:14 PM: [ Processing 1 queries... ]
09/25/2017 04:21:14 PM: [ Retrieving top 5 docs... ]
09/25/2017 04:21:31 PM: [ Reading 5 paragraphs... ]
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/home/can/anaconda3/envs/py36/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/can/anaconda3/envs/py36/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/can/anaconda3/envs/py36/lib/python3.6/multiprocessing/pool.py", line 405, in _handle_workers
    pool._maintain_pool()
  File "/home/can/anaconda3/envs/py36/lib/python3.6/multiprocessing/pool.py", line 246, in _maintain_pool
    self._repopulate_pool()
  File "/home/can/anaconda3/envs/py36/lib/python3.6/multiprocessing/pool.py", line 239, in _repopulate_pool
    w.start()
  File "/home/can/anaconda3/envs/py36/lib/python3.6/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/home/can/anaconda3/envs/py36/lib/python3.6/multiprocessing/context.py", line 277, in _Popen
    return Popen(process_obj)
  File "/home/can/anaconda3/envs/py36/lib/python3.6/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/home/can/anaconda3/envs/py36/lib/python3.6/multiprocessing/popen_fork.py", line 67, in _launch
    self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory

From other issues posted here, and from the error itself, I understand that the allocated memory is not enough. I am using an Ubuntu VirtualBox to run the script, and the system has access to 4 GB of RAM.

In one post, it is said that >20 GB of RAM would be needed to use the Wikipedia articles. Most of that is allocated to load the TF-IDF model, isn't it? My TF-IDF model is 205 MB. Is it possible to estimate the memory usage of the Document Reader module?

Thank you.
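A rough answer, hedged: the >20 GB figure is for the full Wikipedia TF-IDF matrix. With a 205 MB ranker, the bulk of the usage is the reader model plus the pool of tokenizer and retriever worker processes, and 4 GB in a VirtualBox guest can still be too little for the os.fork() calls that spawn them. Adding swap inside the guest is a common workaround (assuming root access):

sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile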

Timeout exceeded while running the interactive demo for the reader QA model

08/02/2017 10:12:38 AM: [ CUDA enabled (GPU -1) ]
08/02/2017 10:12:38 AM: [ Initializing model... ]
08/02/2017 10:12:38 AM: [ Loading model /usr/local/lib/python3.5/dist-packages/drqa-0.1.0-py3.5.egg/data/reader/single.mdl ]
08/02/2017 10:12:44 AM: [ Initializing tokenizer... ]
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/pexpect/expect.py", line 99, in expect_loop
    incoming = spawn.read_nonblocking(spawn.maxread, timeout)
  File "/usr/local/lib/python3.5/dist-packages/pexpect/pty_spawn.py", line 462, in read_nonblocking
    raise TIMEOUT('Timeout exceeded.')
pexpect.exceptions.TIMEOUT: Timeout exceeded.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "scripts/reader/interactive.py", line 50, in <module>
    predictor = Predictor(args.model, args.tokenizer, num_workers=0)
  File "/usr/local/lib/python3.5/dist-packages/drqa-0.1.0-py3.5.egg/drqa/reader/predictor.py", line 82, in __init__
    self.tokenizer = tokenizer_class(annotators=annotators)
  File "/usr/local/lib/python3.5/dist-packages/drqa-0.1.0-py3.5.egg/drqa/tokenizers/corenlp_tokenizer.py", line 33, in __init__
    self._launch()
  File "/usr/local/lib/python3.5/dist-packages/drqa-0.1.0-py3.5.egg/drqa/tokenizers/corenlp_tokenizer.py", line 61, in _launch
    self.corenlp.expect_exact('NLP>', searchwindowsize=100)
  File "/usr/local/lib/python3.5/dist-packages/pexpect/spawnbase.py", line 390, in expect_exact
    return exp.expect_loop(timeout)
  File "/usr/local/lib/python3.5/dist-packages/pexpect/expect.py", line 107, in expect_loop
    return self.timeout(e)
  File "/usr/local/lib/python3.5/dist-packages/pexpect/expect.py", line 70, in timeout
    raise TIMEOUT(msg)
pexpect.exceptions.TIMEOUT: Timeout exceeded.
<pexpect.pty_spawn.spawn object at 0x7f2aa70a0a58>
command: /bin/bash
args: ['/bin/bash']
buffer (last 100 chars): b'0;ubuntu@ip-172-31-55-21: /DrQA/DrQA\x07\x1b[01;32mubuntu@ip-172-31-55-21\x1b[00m:\x1b[01;34m/DrQA/DrQA\x1b[00m$ '
before (last 100 chars): b'0;ubuntu@ip-172-31-55-21: /DrQA/DrQA\x07\x1b[01;32mubuntu@ip-172-31-55-21\x1b[00m:\x1b[01;34m/DrQA/DrQA\x1b[00m$ '
after: <class 'pexpect.exceptions.TIMEOUT'>
match: None
match_index: None
exitstatus: None
flag_eof: False
pid: 15216
child_fd: 12
closed: False
timeout: 60
delimiter: <class 'pexpect.exceptions.EOF'>
logfile: None
logfile_read: None
logfile_send: None
maxread: 100000
ignorecase: False
searchwindowsize: None
delaybeforesend: 0
delayafterclose: 0.1
delayafterterminate: 0.1
searcher: searcher_string:
0: "b'NLP>'"

Difference between the downloadable model and the one used for published results?

>>> process('What is question answering?')
07/27/2017 11:57:54 AM: [ Processing 1 queries... ]
07/27/2017 11:57:54 AM: [ Retrieving top 5 docs... ]
07/27/2017 11:57:54 AM: [ Reading 106 paragraphs... ]
07/27/2017 11:57:56 AM: [ Processed 1 queries in 1.7293 (s) ]
Top Predictions:
+------+--------+----------------------------+--------------+-----------+
| Rank | Answer |            Doc             | Answer Score | Doc Score |
+------+--------+----------------------------+--------------+-----------+
|  1   |   QA   | Social information seeking |    2964.4    |   212.63  |
+------+--------+----------------------------+--------------+-----------+

Contexts:
[ Doc = Social information seeking ]
Social information seeking is often materialized in online question-answering (QA) websites, which are driven by a community. Such QA sites have emerged in the past few years as an enormous market, so to speak, for the fulfillment of information needs. Estimates of the volume of questions answered are difficult to come by, but it is likely that the number of questions answered on social/community QA (cQA) sites far exceeds the number of questions answered by library reference services, which until recently were one of the few institutional sources for such question answering. cQA sites make their content – questions and associated answers submitted on the site – available on the open web, and indexable by search engines, thus enabling web users to find answers provided for previously asked questions in response to new queries.

Time out exceeded

This test command raised an error, even though I have already installed CoreNLP:

>>> tok = CoreNLPTokenizer()
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/pexpect/expect.py", line 97, in expect_loop
    incoming = spawn.read_nonblocking(spawn.maxread, timeout)
  File "/usr/lib/python3/dist-packages/pexpect/pty_spawn.py", line 452, in read_nonblocking
    raise TIMEOUT('Timeout exceeded.')
pexpect.exceptions.TIMEOUT: Timeout exceeded.
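The tokenizer launches CoreNLP in a pseudo-terminal and waits for its interactive NLP> prompt, so a missing Java install or missing jars produce exactly this timeout. One way to verify CoreNLP outside Python, assuming the jars were installed under data/corenlp:

java -cp "data/corenlp/*" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit

If the NLP> prompt never appears here either, the problem is the CoreNLP installation or CLASSPATH, not DrQA.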

Document Retriever: UnicodeDecodeError- Error and Resolution

I am experimenting with the code to work with German Wikipedia documents on a MacBook Pro. When loading the extracted documents into the sqlite db, I use prep_wikipedia.py as a preprocessing routine, as described.

There was, however, an error that I kept getting: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 3131: invalid start byte.

Since the original extracted documents are UTF-8 encoded, this error was misleading. Digging a bit deeper, I discovered that the script build_db.py cycles through a list of directories in the data path. When the script is run on a MacBook, in addition to the directories in the data path, the hidden system file .DS_Store is also returned. This was the cause of the error.

To fix this, I simply changed build_db.py to ignore all entries that start with a '.' (a sketch of the idea is below).
It worked after that.
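A minimal sketch of that filter (a hypothetical helper, not the repo's actual code):

import os

def iter_visible_files(root):
    """Yield file paths under root, skipping hidden entries such as .DS_Store."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames[:] = [d for d in dirnames if not d.startswith('.')]
        for name in filenames:
            if not name.startswith('.'):
                yield os.path.join(dirpath, name)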

Numpy memory error

When I run the python scripts/retriever/interactive.py command, it shows me the error below.
root@ubuntu-2gb-nyc3-01:~/DrQA# python scripts/retriever/interactive.py
08/21/2017 08:13:28 AM: [ Initializing ranker... ]
08/21/2017 08:13:28 AM: [ Loading /root/DrQA/data/wikipedia/docs-tfidf-ngram=2-hash=16777216-tokenizer=simple.npz ]
Traceback (most recent call last):
  File "scripts/retriever/interactive.py", line 27, in <module>
    ranker = retriever.get_class('tfidf')(tfidf_path=args.model)
  File "/root/DrQA/drqa/retriever/tfidf_doc_ranker.py", line 37, in __init__
    matrix, metadata = utils.load_sparse_csr(tfidf_path)
  File "/root/DrQA/drqa/retriever/utils.py", line 34, in load_sparse_csr
    matrix = sp.csr_matrix((loader['data'], loader['indices'],
  File "/root/anaconda3/lib/python3.6/site-packages/numpy/lib/npyio.py", line 233, in __getitem__
    pickle_kwargs=self.pickle_kwargs)
  File "/root/anaconda3/lib/python3.6/site-packages/numpy/lib/format.py", line 664, in read_array
    array = numpy.ndarray(count, dtype=dtype)
MemoryError

I am using it without a GPU, and my system information is below.
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 4
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
Stepping: 1
CPU MHz: 2199.998
BogoMIPS: 4399.99
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 30720K
NUMA node0 CPU(s): 0-3
Can someone help me resolve this problem?

Thank You

Building a JavaScript Wrapper for DrQA

I can only write JavaScript, so I would like to build a wrapper around DrQA so that it can easily be used in Node.js applications in production too. Is this possible?
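One plausible route, sketched under assumptions (Flask and DrQA installed, default models downloaded; none of this is part of the repo): expose the pipeline behind a small HTTP service and call it from Node.js with fetch or axios.

from flask import Flask, jsonify, request
from drqa import pipeline  # assumes `python setup.py develop` has been run

app = Flask(__name__)
drqa = pipeline.DrQA(cuda=False)  # constructor options mirror scripts/pipeline/interactive.py

@app.route('/ask')
def ask():
    question = request.args.get('q', '')
    predictions = drqa.process(question, top_n=1, n_docs=5)
    # scores may be numpy types; cast so jsonify can serialize them
    return jsonify([{'span': p['span'], 'doc_id': p['doc_id'],
                     'score': float(p['span_score'])} for p in predictions])

if __name__ == '__main__':
    app.run(port=5000)

A Node.js app can then GET http://localhost:5000/ask?q=... without touching Python directly.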

./download.sh error with 'Connection reset by peer'

Awesome project! I really want to try it. But an error occurred when I ran ./download.sh:

45800K .......... .......... .......... .......... .......... 0% 14.3K 4d1h
45850K .......... .......... .......... .......... .......... 0% 14.3K 4d1h
45900K .......... .......... .......... .......... .......... 0% 26.2K 4d1h
45950K .......... .......... .......... .......... .......... 0% 1.90K 4d4h
46000K .......... .......... .......... .......... .......... 0% 2.60K 4d6h
46050K .......... .......... .......... .......... .......... 0% 3.50K 4d7h
46100K .......... .......... .......... .......... .......... 0% 4.33K 4d8h
46150K ... 0% 70.2M=15m8s

2017-09-03 13:18:44 (20.9 KB/s) - Read error at byte 47261019/8104538193 (Connection reset by peer).

My network environment is in China. Could you provide another way to get the training data?
Thank you!
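One hedged workaround: download.sh is essentially a sequence of wget calls, and wget's -c flag resumes a partial download, so copying the interrupted file's URL out of the script and re-fetching with resume enabled may get past the reset (the placeholder below is hypothetical):

wget -c <url-copied-from-download.sh>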

cannot allocate memory

When I run interactive.py, there are some errors:
08/04/2017 09:36:03 AM: [ CUDA enabled (GPU -1) ]
08/04/2017 09:36:03 AM: [ Initializing pipeline... ]
08/04/2017 09:36:03 AM: [ Initializing document ranker... ]
08/04/2017 09:36:03 AM: [ Loading /home/li/software/DrQA/data/wikipedia/docs-tfidf-ngram=2-hash=16777216-tokenizer=simple.npz ]
08/04/2017 09:42:02 AM: [ Initializing model... ]
08/04/2017 09:42:04 AM: [ Loading model /home/li/software/DrQA/data/reader/multitask.mdl ]
08/04/2017 09:43:42 AM: [ Initializing tokenizers and document retrievers... ]
Traceback (most recent call last):
  File "interactive.py", line 70, in <module>
    tokenizer=args.tokenizer
  File "/home/li/software/DrQA/drqa/pipeline/drqa.py", line 140, in __init__
    initargs=(tok_class, tok_opts, db_class, db_opts, fixed_candidates)
  File "/usr/lib/python3.5/multiprocessing/context.py", line 118, in Pool
    context=self.get_context())
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 168, in __init__
    self._repopulate_pool()
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 233, in _repopulate_pool
    w.start()
  File "/usr/lib/python3.5/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/usr/lib/python3.5/multiprocessing/context.py", line 267, in _Popen
    return Popen(process_obj)
  File "/usr/lib/python3.5/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/usr/lib/python3.5/multiprocessing/popen_fork.py", line 67, in _launch
    self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory
I would appreciate it if you could give me some advice. Thanks!
