
berkeley-entity

The Berkeley Entity Resolution System jointly solves the problems of named entity recognition, coreference resolution, and entity linking with a feature-rich discriminative model.

Preamble

The Berkeley Entity Resolution System is described in:

"A Joint Model for Entity Analysis: Coreference, Typing, and Linking" Greg Durrett and Dan Klein. TACL 2014.

The coreference portion is described in:

"Easy Victories and Uphill Battles in Coreference Resolution." Greg Durrett and Dan Klein. EMNLP 2013.

See http://www.eecs.berkeley.edu/~gdurrett/ for papers and BibTeX.

Questions? Bugs? Email me at [email protected]

License

Copyright (c) 2013-2015 Greg Durrett. All Rights Reserved.

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/

Setup

Models

Models are not included in GitHub due to their large size. Download the latest models from http://nlp.cs.berkeley.edu/projects/entity.shtml

Datasets

See the CoNLL 2012 shared task page for more information about the data formats. All of our files (input and output) follow this standard; when we have subsets of annotations, the corresponding columns are simply left blank (i.e. no coreference chunks or NER chunks, vacuous trees, etc.). Entity links are included in a standoff file so that we avoid modifying these files: they are presented as an extra column with the same specification as NER chunks, with the exception that they can overlap.
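As an illustration (not part of the system), the bracketed chunk notation used for NER chunks and for the standoff entity-link column can be decoded into spans roughly as follows. One simplifying assumption here: closing brackets are matched to the most recently opened chunk, stack-style.

```python
import re

def parse_chunk_column(tags):
    """Decode a CoNLL-style chunk column (e.g. NER or the standoff entity-link
    column) into (start, end_exclusive, label) spans. Each "(Label" opens a
    chunk at that token; each ")" closes the most recently opened chunk."""
    spans, open_stack = [], []
    for i, tag in enumerate(tags):
        # Open a chunk for every "(Label" occurrence on this token.
        for label in re.findall(r"\(([^()*]+)", tag):
            open_stack.append((i, label))
        # Close one chunk per ")" on this token (innermost first).
        for _ in range(tag.count(")")):
            start, label = open_stack.pop()
            spans.append((start, i + 1, label))
    return spans
```

For example, the column `["(United Kingdom*", "*", "*)"]` decodes to a single three-token span labeled `United Kingdom`.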

The system generally takes directories for input and outputs single files with all documents concatenated. Note that a directory can contain a single file of this form. For training, files are required to have auto_conll and gold_conll suffixes as appropriate; for testing, you can filter the documents to read with -docSuffix.

Flattened directories of CoNLL files can be produced from the CoNLL shared task data as follows:

find . -path "*conll" | while read file; do
  cp "$file" path/to/flattened/directory
done

We also require the number and gender data produced by Shane Bergsma and Dekang Lin in "Bootstrapping Path-Based Pronoun Resolution" (the system expects this data at data/gender.data by default) and [Brown clusters](http://people.csail.mit.edu/maestro/papers/bllip-clusters.gz) (default path: data/bllip-clusters). pull-datasets.sh should pull these datasets for you and put them in the appropriate locations.
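For reference, each line of the Bergsma and Lin number/gender data pairs a phrase with usage counts. The exact layout assumed below (a tab between the phrase and four space-separated counts for masculine, feminine, neuter, and plural) is our assumption, so check it against your copy of data/gender.data:

```python
def parse_gender_line(line):
    """Parse one line of the number/gender data, assuming the format
    'phrase<TAB>masc fem neut plural' (an assumption; verify locally)."""
    phrase, counts = line.rstrip("\n").split("\t")
    masc, fem, neut, plural = (int(c) for c in counts.split())
    return phrase, {"masc": masc, "fem": fem, "neut": neut, "plural": plural}
```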

CoNLL Scorer

Available at https://code.google.com/p/reference-coreference-scorers/

The download contains three things: scorer.pl, CorScorer.pm, and a directory called Algorithm. Put Algorithm and CorScorer.pm in the directory you run the jar from, or in lib/ under that directory; this way they'll be located for scoring. scorer.pl can go anywhere as long as you pass in the appropriate path with -conllEvalScriptPath; the system expects it at scorer/v7/scorer.pl by default.
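A quick hypothetical helper (not part of the system) to sanity-check that the scorer files sit where the system looks for them, based on the layout described above:

```python
from pathlib import Path

def missing_scorer_files(run_dir, scorer_path="scorer/v7/scorer.pl"):
    """Return the scorer pieces the system would fail to find: CorScorer.pm
    and the Algorithm/ directory must sit next to the jar or under lib/, and
    scorer.pl at the path passed via -conllEvalScriptPath (default shown)."""
    run_dir = Path(run_dir)
    missing = []
    for name in ("CorScorer.pm", "Algorithm"):
        if not (run_dir / name).exists() and not (run_dir / "lib" / name).exists():
            missing.append(name)
    if not (run_dir / scorer_path).exists():
        missing.append(scorer_path)
    return missing
```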

Again, pull-datasets.sh will do all this for you.

Note that all results in the paper come from version 7 of the CoNLL scorer. Other versions of the scorer may return different results.

Running the system

The main class is edu.berkeley.nlp.entity.Driver; running the system is documented more thoroughly there. It supports running pretrained models on raw text as well as training and evaluating new models.

An example run on new data is included in run-test.sh.

Note that this example runs purely from raw text and follows the CoNLL annotation standards. Because the CoNLL dataset does not contain supervised entity linking data, the entity linking component of the model does not give the performance indicated in the paper. If you're particularly interested in entity linking, you should pre-extract mentions from your dataset according to the ACE standard and use the ACE version of the model.

A trained model includes not just feature specifications and weights for the joint model, but also trained coarse models for coreference and NER.

To reproduce CoNLL results, run:

java -Xmx8g -jar berkeley-entity-1.0.jar ++config/base.conf -execDir scratch -mode PREDICT_EVALUATE -testPath data/conll-2012-en/test \
  -modelPath "models/joint-onto.ser.gz" -wikipediaPath "models/wiki-db-onto.ser.gz" \
  -docSuffix auto_conll

To reproduce ACE results, run with:

java -Xmx4g -jar berkeley-entity-1.0.jar ++config/base.conf -execDir scratch -mode PREDICT_EVALUATE_ACE -testPath data/ace05/test \
  -modelPath "models/joint-ace.ser.gz" -wikipediaPath "models/wiki-db-ace.ser.gz" \
  -doConllPostprocessing false -useGoldMentions -wikiGoldPath data/ace05/ace05-all-conll-wiki

Note that this requires the ACE data to be in the CoNLL standard with standoff Wikipedia annotations in ace05-all-conll-wiki. This whole process is sensitive to tokenization and sentence-splitting. If you're interested in reproducing these results, please contact me.

Preprocessing

The system is runnable from raw text as input. It runs a sentence splitter (Gillick, 2009), tokenizer (Penn Treebank), and parser (Berkeley parser), or a subset of these. See edu.berkeley.nlp.entity.preprocess.PreprocessingDriver for more information about these tools and command line options. See run-test.sh for an example usage.
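Conceptually the raw-text pipeline runs in a fixed order; the sketch below uses hypothetical stand-in functions, not the system's actual API:

```python
def preprocess(raw_text, sentence_splitter, tokenizer, parser):
    """Run the raw-text pipeline in order: split sentences, tokenize each
    sentence, then parse. Passing an identity function for a stage mimics
    running only a subset of the tools."""
    sentences = sentence_splitter(raw_text)
    tokens = [tokenizer(s) for s in sentences]
    return [parser(t) for t in tokens]
```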

Training

The system expects automatic annotations in files ending with auto_conll (i.e. parses) and gold annotations (i.e. coref and NER) in gold_conll files. Currently the OntoNotes version of the system cannot take gold entity links as supervision; email me if you are interested in such functionality.
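The naming convention can be illustrated with a small hypothetical helper (not part of the system) that pairs each auto_conll file with its gold_conll counterpart:

```python
from pathlib import Path

def paired_documents(train_dir):
    """Pair each file ending in auto_conll (automatic parses) with the
    same-named file ending in gold_conll (gold coref/NER annotations),
    mirroring the suffix convention the trainer expects."""
    autos = {p.name[:-len("auto_conll")]: p
             for p in Path(train_dir).glob("*auto_conll")}
    golds = {p.name[:-len("gold_conll")]: p
             for p in Path(train_dir).glob("*gold_conll")}
    return {stem: (autos[stem], golds[stem]) for stem in autos if stem in golds}
```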

To train a CoNLL model, run:

java -Xmx47g -jar berkeley-entity-1.0.jar ++config/base.conf -execDir scratch -mode TRAIN_EVALUATE \
  -trainPath data/conll-2012-en/train -testPath data/conll-2012-en/test -modelPath models/joint-new-onto.ser.gz \
  -wikipediaPath models/cached/wiki-db-onto.ser.gz \
  -pruningStrategy build:models/cached/corefpruner-onto.ser.gz:-5:5 \
  -nerPruningStrategy build:models/cached/nerpruner-onto.ser.gz:-9:5 \
  -numItrs 30

To train an ACE model run:

java -Xmx35g -jar berkeley-entity-1.0.jar ++config/base.conf -execDir scratch -mode TRAIN_EVALUATE_ACE \
  -trainPath data/ace05/train -testPath data/ace05/test -modelPath models/cached/joint-new-ace.ser.gz \
  -wikipediaPath models/wiki-db-ace.ser.gz \
  -pruningStrategy build:models/cached/corefpruner-ace.ser.gz:-5:5 \
  -doConllPostprocessing false -useGoldMentions -wikiGoldPath data/ace05/ace05-all-conll-wiki \
  -lossFcn customLoss-1-1-1 \
  -numItrs 20

Note that because ACE NER mentions are synchronous with the coreference mentions, the NER layer is much simpler (isolated random variables rather than a sequence model) and so NER pruning is not necessary here.

Building from source

The easiest way to build is with SBT: https://github.com/harrah/xsbt/wiki/Getting-Started-Setup

then run

sbt assembly

which will compile everything and build a runnable jar.

You can also import it into Eclipse and use the Scala IDE plug-in for Eclipse: http://scala-ide.org

Adding features

Features can be specified on the command line and are instantiated in a few different places.

Coreference: edu.berkeley.nlp.entity.coref.PairwiseIndexingFeaturizerJoint, control with -pairwiseFeats

NER: edu.berkeley.nlp.entity.ner.NerFeaturizer, control with -nerFeatureSet

Linking: edu.berkeley.nlp.entity.wiki.QueryChoiceComputer

Joint: edu.berkeley.nlp.entity.joint.JointFeaturizerShared, control with -corefNerFeatures, -wikiNerFeatures, -corefWikiFeatures

Note that turning off all of the joint features by passing in empty strings to each yields the results for independent models (INDEP results from the TACL 2014 paper).

The methods to instantiate features are extensible. Additional information sources can either be passed to the featurizers or accessed in a static fashion.

Troubleshooting

Calling the coreference scorer (in TRAIN_EVALUATE mode) may cause an out-of-memory error: under the hood, Java forks the process, and if you're running with a lot of memory, the fork may crash. To avoid this, you can run the coreference system in COREF_PREDICT or COREF_TRAIN_PREDICT mode and then evaluate separately.

berkeley-entity's People

Contributors

gregdurrett, kpich, strubell


berkeley-entity's Issues

conll2003 running

With the CoNLL 2003 NER labels and the AIDA-YAGO entity linking annotations, how do I get the joint sparse features? I am a bit confused about the format (there is no coreference column data).

java OutOfMemoryError: heap space

Hi Greg,
I keep getting java OutOfMemoryError when I try to train a new model. I already switched to COREF_TRAIN_PREDICT mode as you suggest in the readme file, but the error is still there. Increasing the heap size does not help. Any suggestions how to fix that would be very much appreciated!
Thanks!
Yulia

UPD: I found the solution. It happened because my CoNLL file was somehow missing document boundaries and was read in as a single document; adding document boundaries ("#begin document...") solved the problem.

Mention Detection not 100% with gold mentions

java -Xmx47g -jar berkeley-entity-1.0.jar ++config/base.conf -execDir scratch -mode COREF_PREDICT -testPath /home/development/darsh/MusOntoLearning/ground_truth/semeval/auto_small/ -modelPath models/joint-new-onto.ser.gz -numItrs 30 -outputPath /tmp/darsh/test_output_small/ -useGoldMentions

This is the command with which I ran the code. I have given the -useGoldMentions flag. I am getting very poor accuracy and the mention detection part is also not 100%.

File not found problem with model file

Hi Greg,
I recompiled the source files and created the berkeley-entity jar in the target directory. Jar creation succeeds, but when running the system with this jar I get the following error. Please help.

ERROR: java.lang.RuntimeException: Can't write to models/cached/corefpruner-onto.ser.gz:
edu.berkeley.nlp.entity.GUtil$.save(GUtil.scala:28)
edu.berkeley.nlp.entity.coref.CorefPruner$.trainAndSaveKFoldModels(CorefPruner.scala:103)
edu.berkeley.nlp.entity.coref.CorefPruner$.buildPruner(CorefPruner.scala:93)
edu.berkeley.nlp.entity.coref.CorefSystem$.runTrain(CorefSystem.scala:140)
edu.berkeley.nlp.entity.coref.CorefSystem$.runTrainPredict(CorefSystem.scala:108)
edu.berkeley.nlp.entity.coref.CorefSystem.runTrainPredict(CorefSystem.scala)
edu.berkeley.nlp.entity.Driver.run(Driver.java:343)
edu.berkeley.nlp.futile.fig.exec.Execution.runWithObjArray(Execution.java:479)
edu.berkeley.nlp.futile.fig.exec.Execution.run(Execution.java:432)
edu.berkeley.nlp.entity.Driver.main(Driver.java:319)

Thanks,
Joe

Command: java -Xmx8g -jar target/scala-2.11/berkeley-entity-assembly-1.jar ++config/base.conf -execDir scratch -mode COREF_TRAIN_PREDICT -testPath /tmp/test_input/ -docSuffix auto_conll -trainPath ./small_train/ -modelPath "models/joint-onto.ser.gz" -wikipediaPath "models/wiki-db-onto.ser.gz" -useGoldMentions -pruningStrategy build:models/cached/corefpruner-onto.ser.gz:-5:5 -nerPruningStrategy build:models/cached/nerpruner-onto.ser.gz:-9:5 -outputPath /tmp/test_output/

Coreference resolution with provided entities

Hi Greg,

I'm trying to use your code to generate coreferences of a document. What I have is a document and a list of entities [w1,w2,w3,w4,...,wn]. I want to find references of all these entities in the document. Is there a way to achieve this? I appreciate it a lot if you can provide some guidance.

Thanks,
Ming

odd wikification behavior

Hi,
Thanks for making this code available! I'm trying it on some fake text and getting unexpected results in the wikification. I have the following meaningless blah.txt (just playing around with different types of entities):

Michael Jackson was born in the United Kingdom and his dog was born in Japan.  He became president of Microsoft in March 2016.  Jackson owns a golf course in the UK and loves to listen to Freaky Girl.

I made a WikipediaInterface that includes a bunch of entities, including most of those in blah.txt.

Running the Driver produces the following output-wiki.conll:

#begin document (test2/text/blah.txt); part 000
(Michael Jackson*
*)
*
*
*
(United Kingdom*
*
*)
*
(Dog -LRB-zodiac-RRB-(-EXCLUDE-*)
*)
*
*
*
(Japan*)
*

(-EXCLUDE-*)
*
(President of the United States*
*
(-NIL-*))
*
(-NIL-*
*)
*

(Lauren Jackson*)
*
(-NIL-*
*
*
*
(-NIL-*
*))
*
*
*
*
*
(-NIL-*
*)
*

#end document

Questions:

  1. Why would it guess "Lauren Jackson" for the last "Jackson"? The coreference system knows that these are the same reference id, so I could feasibly resolve there. But I'm also wondering why it might pick Lauren Jackson, given my wikipediaInterface -- here's what queryDisambigs is giving:
ArrayBuffer([Jackson, Mississippi : 1,269, Jackson, Michigan : 357, Edwin Jackson : 346, Jackson, Tennessee : 315, Lauren Jackson : 269, Jackson County, Missouri : 227, Jackson County, West Virginia : 146, ...])
  2. Similarly, not sure how "Dog (zodiac)" skipped over "Dog", given queryDisambigs:
ArrayBuffer([], [Dog : 927, Dog (zodiac) : 173, Hurricane Dog (1950) : 7, Dog (film) : 4, Dog (album) : 4, Police dog : 3, Dog meat : 3, Dog (single) : 2, Dog (band) : 2], [])
  3. "President of the United States" is wrong, and it misses "the UK" ...
  4. Do you have code that gives the single most likely wikipedia entity for all references with a particular id? e.g. "Michael Jackson", "his", "He", and "Jackson" are all resolved by coreference, but with different wikiChunks (Michael Jackson vs Lauren Jackson). I would think, since you're doing coref & NER jointly, you'd have that functionality but I haven't found it.

I've been trying to debug, but it gets a bit opaque once I get into the BP nodes.

Can the training be done with gold_conll files alone ?

Hi,

I was trying to train with the OntoNotes training data (*gold_conll) I have. When I give the train path to this data, the system asks for auto_conll files as well. Can the training be done with gold_conll files only?
Correct me if there is some problem with my understanding.

Thanks,
Joe

"Loading -1 docs from /home/joe/music_ontology/MusOntoLearning/ground_truth/ontonotes/train/ ending with auto_conll"

Couldn't parse even with backoff parser!

Hi, I am trying to retrieve the entities in Austen's Pride and Prejudice, and I get the error "Couldn't parse even with backoff parser!" on a sentence in Chapter 6:

"No, indeed. I do not wish to avoid the walk. The distance is nothing, when one has a motive; only three miles. I shall be back by dinner."

The error instead does NOT happen if the sentence is unquoted:

No, indeed. I do not wish to avoid the walk. The distance is nothing, when one has a motive; only three miles. I shall be back by dinner.

What could be the problem? How could I fix it?

-useGoldMentions not functioning

Hi Greg,

I tried the -useGoldMentions option to do coreference resolution with gold mentions, but it seems this is not happening. When I evaluate the output CoNLL file, the mention detection is below 100%. I tried this with some OntoNotes CoNLL files, and I even tried setting the useGoldMentions variable to true within the code.

Can you please check or am I doing something wrong ?

Thanks,
Joe

are the mention ids in order of their occurrence in the document ?

Hi Greg,

I was using the i, j indices in prunedEdges to determine the order of the anaphoric and antecedent mentions in the document. To extend the pruning mechanism, I am making use of an external file with mention ids that correspond to the order of the mentions in the document (the first mention has id 1).

After looking at the pruned arcs in this system, I felt that this id doesn't correspond to the order of the mention in the document, so I cannot link it with the mention id in the external file. Is there a way to find the order of the mentions?

Thanks,
Joe

it reports error on windows

Hi! When I run sbt assembly, it reports a problem along the lines of not being able to find wikipediaInterface. Are there any suggestions?

Mention pair pruning

Hi Greg,

When prunedEdges(i)(j) is true, does that mean the mention pair of the ith mention and the jth mention is ignored (i.e. excluded from further processing)?
I got confused when I printed the mention pairs after pruning.

code snippet

def printPrunedEdges(docGraphs: Seq[DocumentGraph]) = {
  for (i <- 0 until docGraphs.size) {
    println("PRUNED EDGES")
    for (j1 <- 0 until docGraphs(i).prunedEdges.size) {
      for (j2 <- 0 until docGraphs(i).prunedEdges(j1).size) {
        if (docGraphs(i).prunedEdges(j1)(j2)) {
          println(j1 + " " + docGraphs(i).getMention(j1).words + ": " + j2 + " " + docGraphs(i).getMention(j2).words)
        }
      }
    }
  }
}

Escaped Characters

Is there a list of characters which are escaped using \ in the coreference resolution model?

Mention detection (question)

Hi,

I was trying to use the mention detection module of this project. Can you please help me with pointers on which class and function perform the mention detection? Even if it is not quite straightforward, some directions would help.

Thanks,
Joe
