
open-sesame's Introduction

Open-SESAME

A frame-semantic parser for automatically detecting FrameNet frames and their frame elements from sentences. The model is based on softmax-margin segmental recurrent neural nets, described in our paper Frame-Semantic Parsing with Softmax-Margin Segmental RNNs and a Syntactic Scaffold. An example of a frame-semantic parse is shown below.

Frame-semantics example

Installation

This project is built on python==3.7.9 and the DyNet library. Additionally, it uses some packages from NLTK.

$ pip install dynet==2.0.3
$ pip install nltk==3.5
$ python -m nltk.downloader averaged_perceptron_tagger wordnet
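
A quick, optional sanity check that the dependencies installed correctly (importing DyNet should print its initialization banner without errors):

$ python -c "import dynet, nltk"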

Data Preprocessing

This codebase only handles data in the XML format used in FrameNet releases. The default version is FrameNet 1.7, but the codebase is backward compatible with versions 1.6 and 1.5.

As a first step, the data is preprocessed into a more readable format.

  1. Clone the repository.

$ git clone https://github.com/swabhs/open-sesame.git
$ cd open-sesame/

  2. Create a directory for the data, $DATA, containing the (extracted) FrameNet version 1.7 data. This should be under $DATA/fndata-1.7/.

  3. Download the pretrained GloVe word embeddings of 100 dimensions, trained on 6B tokens, and extract them under $DATA/embeddings_glove/.

  4. Optionally, adjust the configurations in configurations/global_config.json if you are using a different version of FrameNet, different pretrained embeddings, and so on.

  5. Preprocess the data by executing the command below. In this repository, data is converted to a format similar to CoNLL 2009, but with BIO tags, which is easier to read than the original XML format. See sample CoNLL formatting here.

$ python -m sesame.preprocess

The above script writes the train, dev and test files in the required format into the data/neural/fn1.7/ directory. A large fraction of the annotations are either incomplete, or inconsistent. Such annotations are discarded, but logged under preprocess-fn1.7.log, along with the respective error messages. To include exemplars, use the option --exemplar with the above command.
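
To sanity-check the preprocessing output (an optional step; the exact file names under data/neural/fn1.7/ depend on the FrameNet version configured), you can peek at the converted training file and the log of discarded annotations:

$ head -n 20 data/neural/fn1.7/fn1.7.fulltext.train.syntaxnet.conll
$ less preprocess-fn1.7.log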

Training

Frame-semantic parsing involves target identification, frame identification and argument identification; each step is trained independently of the others. Details can be found in our paper and below.

To train a model, execute:

$ python -m sesame.$MODEL --mode train --model_name $MODEL_NAME

The $MODELs are specified below. Training saves the model that performs best on validation data to logs/$MODEL_NAME/best-$MODEL-1.7-model. The same directory will also contain a configurations.json file with the current model configuration.

If training gets interrupted, it can be restarted from the last saved checkpoint by specifying --mode refresh.
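
For example, to train the frame identification model under a custom name and later resume it after an interruption (the model name here is just an example):

$ python -m sesame.frameid --mode train --model_name fn1.7-frameid-exp
$ python -m sesame.frameid --mode refresh --model_name fn1.7-frameid-exp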

Pre-trained Models

The downloads need to be placed under the base directory. On extraction, these will create a logs/ directory containing pre-trained models for target identification, frame identification using gold targets, and argument identification using gold targets and frames.

Note: there is a known open issue where the pretrained models do not replicate the reported performance on a different machine. It is recommended to train and test from scratch; the numbers reported below can be reproduced within a small margin of error.

            FN 1.5 Dev  FN 1.5 Test  FN 1.5 Models  FN 1.7 Dev  FN 1.7 Test  FN 1.7 Models
Target ID   79.85       73.23        Download       80.26       73.25        Download
Frame ID    89.27       86.40        Download       89.74       86.55        Download
Arg ID      60.60       59.48        Download       61.21       61.36        Download

Test

The models for target identification, frame identification and argument identification need to be executed in that order. To test under a given model, execute:

$ python -m sesame.$MODEL --mode test --model_name $MODEL_NAME

The output, in a CoNLL 2009-like format, will be written to logs/$MODEL_NAME/predicted-1.7-$MODEL-test.conll, and, for frame and argument identification, also to logs/$MODEL_NAME/predicted-1.7-$MODEL-test.fes in the frame-elements file format.

1. Target Identification

$MODEL = targetid

A bidirectional LSTM model takes into account the lexical unit index in FrameNet to identify targets. This model has not been described in the paper. Moreover, FN 1.7 exemplars cannot be used for target identification.

2. Frame Identification

$MODEL = frameid

Frame identification is based on a bidirectional LSTM model. Targets and their respective lexical units need to be identified before this step. At test time, example-wise analysis is logged in the model directory. Exemplars can be used for frame identification using the --exemplar flag during training, but do not help (in fact reduce performance to 88.17).

3. Argument (Frame-Element) Identification

$MODEL = argid

Argument identification is based on a segmental recurrent neural net, used as the baseline in the paper. Targets and their respective lexical units need to be identified, and frames corresponding to the LUs predicted before this step. At test time, example-wise analysis is logged in the model directory. Exemplars can be used for argument identification using the --exemplar flag during training.
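
For example, using the pre-trained model names from the table above, the full test pipeline would be run as follows (a sketch; substitute your own model names if you trained from scratch):

$ python -m sesame.targetid --mode test --model_name fn1.7-pretrained-targetid
$ python -m sesame.frameid --mode test --model_name fn1.7-pretrained-frameid
$ python -m sesame.argid --mode test --model_name fn1.7-pretrained-argid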

Prediction on unannotated data

For predicting targets, frames and arguments on unannotated data, pretrained models are needed. Input needs to be specified in a file containing one sentence per line. The following steps result in the full frame-semantic parsing of the sentences:

$ python -m sesame.targetid --mode predict \
                            --model_name fn1.7-pretrained-targetid \
                            --raw_input sentences.txt
$ python -m sesame.frameid --mode predict \
                           --model_name fn1.7-pretrained-frameid \
                           --raw_input logs/fn1.7-pretrained-targetid/predicted-targets.conll
$ python -m sesame.argid --mode predict \
                         --model_name fn1.7-pretrained-argid \
                         --raw_input logs/fn1.7-pretrained-frameid/predicted-frames.conll

The resulting frame-semantic parses will be written to logs/fn1.7-pretrained-argid/predicted-args.conll in the same CoNLL 2009-like format.
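
Since the raw input must contain one sentence per line, untokenized text has to be split into sentences first. Below is a minimal sketch using NLTK's sentence tokenizer; this is not part of open-sesame, the input file name is hypothetical, and it assumes the punkt model has been downloaded (python -m nltk.downloader punkt).

# split_text.py: write one sentence per line, as expected by --raw_input
import nltk

with open("document.txt", encoding="utf-8") as fin:          # hypothetical raw text file
    text = fin.read()

with open("sentences.txt", "w", encoding="utf-8") as fout:
    for sentence in nltk.sent_tokenize(text):
        fout.write(sentence.strip() + "\n")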

Contact and Reference

For questions and usage issues, please contact [email protected]. If you use open-sesame for research, please cite our paper as follows:

@article{swayamdipta:17,
  title={{Frame-Semantic Parsing with Softmax-Margin Segmental RNNs and a Syntactic Scaffold}},
  author={Swabha Swayamdipta and Sam Thomson and Chris Dyer and Noah A. Smith},
  journal={arXiv preprint arXiv:1706.09528},
  year={2017}
}

Copyright [2018] [Swabha Swayamdipta]

open-sesame's People

Contributors

ayush-pancholy, sammthomson, swabhs


open-sesame's Issues

Magnitude of gradient is bad: inf

Now I am trying to train 'Arg Identification' using the code "segrnn-argid.py"

When I trained 'target identification' using the code "frameid.py", everything was fine.

However, "segrnn-argid.py" code returns an error:

RuntimeError: Magnitude of gradient is bad: inf

How can I handle it?

Add target identification

I have trained the biLSTM model successfully, but it's not clear to me how I could use this on other data. Is there a way to input either a raw sentence or a simply tagged CoNLL file (Token, Lemma, POS) for tagging? identify_frames depends on the indexes of frame elements, which are available in the Framenet data, but not for an unseen sentence. Am I missing something obvious, like a method to achieve this?

Exception: ('Rule not defined for part-of-speech word', u'wp', u'who') + Fix for it

Hi,
I came across a bug when retraining open-sesame on FrameNet 1.7 data:
Exception: ('Rule not defined for part-of-speech word', u'wp', u'who')

I can see that this is fixable by adding 'wp' to the POS tag mappings in targetid.py, I added it to the noun mapping as this seems to be the best fit?

Just raising this issue in case anyone else stumbles across it, and because it can easily be fixed.

Thanks!

Dimensionality Mismatch While Trying to Run Prediction

Hi,

Thanks for this work. I was trying to test on unannotated data using the predict mode. But when I try to run targetid, it throws the following error. I think this has been reported before, but I am not sure how to solve it. Also, I just downloaded the code and model, so they are up to date.
Dimensions of lookup parameter /_0 lookup up from file ({100,400574}) do not match parameters to be populated ({100,400001})
Thanks

Got an Error while Running a Basic Prediction Task

Hello, thank you for the great program.
I was trying to predict a sample sentence.

Hoover Dam played a major role in preventing Las Vegas from drying up.

I saved this sentence in sample.txt and executed the following command.

$ python -m sesame.frameid --mode predict \
                           --model_name fn1.7-pretrained-frameid \
                           --raw_input sample.txt

However, I got the following error.

Reading logs/example.txt ...
0it [00:00, ?it/s]
Traceback (most recent call last):
  File "/usr/local/opt/[email protected]/Frameworks/Python.framework/Versions/3.8/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/opt/[email protected]/Frameworks/Python.framework/Versions/3.8/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/open-sesame/sesame/frameid.py", line 110, in <module>
    instances, _, _ = read_conll(options.raw_input)
  File "/open-sesame/sesame/dataio.py", line 73, in read_conll
    elements.append(CoNLL09Element(l, read_depsyn))
  File "/open-sesame/sesame/conll09.py", line 40, in __init__
    self.id = int(ele[0])
ValueError: invalid literal for int() with base 10: 'Hoover Dam played a major role in preventing Las Vegas from drying up.'

Did I miss any steps?

training question

In both targetid and frameid training, I eventually bail out with "Ran out of patience, ending training." before I get to the end of the defined epochs. Is this a hyperparameter problem, i.e. not finding a local minimum?

Arguments prefix

Hi everybody.

I am just starting to learn frame-semantic parsing, and I have a question:

Given the follow example:

1	she	_	She	_	PRP	4	_	_	_	_	_	_	_	S-Ingestor
2	ate	_	eat	_	VBD	4	_	_	_	_	_	eat.v	Ingestion	O
3	porridge	_	UNK	_	NN	4	_	_	_	_	_	_	_	S-Ingestibles
4	.	_	.	_	.	4	_	_	_	_	_	_	_	O

What means "S" before of Ingestor, for example?

PS: English is not my mother language, so excuse me if I sound odd sometimes.

Thank you.

problem with python 3.6

Dimensions of lookup parameter /_0 lookup up from file ({100,400574}) do not match parameters to be populated ({100,410050})

get postags and lemmas of sentences all zero when running prediction

I run python -m sesame.targetid --mode predict --model_name fn1.7-pretrained-targetid --raw_input sentences.txt and find that the output in logs/fn1.7-pretrained-targetid/predicted-targets.conll is completely blank. It turned out that instance.postags and instance.lemmas are all lists of zeros at line 435 of sesame/targetid.py, which is a result of lines 28 and 30 of sesame/conll09.py.
I wonder how I can solve this problem.

dynet problem: RuntimeError: Magnitude of gradient is bad: inf

I use:

python -m sesame.$MODEL --model_name sample-$MODEL --mode train

where MODEL=segrnn-argid. But I got:

Traceback (most recent call last):
  File "/gds/miniconda2/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/gds/miniconda2/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/workbench/open-sesame/sesame/segrnn-argid.py", line 956, in <module>
    adam.update()
  File "_dynet.pyx", line 5728, in _dynet.Trainer.update
  File "_dynet.pyx", line 5733, in _dynet.Trainer.update
RuntimeError: Magnitude of gradient is bad: inf

Could anyone suggest a fix for this problem? I wonder whether it is caused by the version of DyNet.

I can't install dynet for python27.

I can't install dynet for python27. (It is OK in Python 3.)
error: make not found, and MAKE is not set
Do you have any suggestions for how I can use open-sesame?

Framenet issues, missing glove, and Type Error

Hello,

I am trying to use open-sesame, but I'm running into a few issues. First, I downloaded the most recent version of FrameNet (1.7), and the paths do not match up. I fixed that, and attach a modified globalconfig.py file which has flexibility between FrameNet 1.5 and 1.7. Next, there is no glove file or information on how to get it. I inferred from the error that Stanford's GloVe file was used and downloaded glove.6B.100d. It seems that it should be filtered somehow, though, and I'm not sure how to go about that. Lastly, with those two things somewhat fixed, I'm getting a non-obvious error during arg identification. I'll include the error file for that, but here is the exception.

Traceback (most recent call last):
  File "segrnn-argid.py", line 917, in <module>
    goldfes=trex.invertedfes)
  File "segrnn-argid.py", line 795, in identify_fes
    embpos_x = get_base_embeddings(trainmode, unkdtoks, tg_start, sentence)
  File "segrnn-argid.py", line 285, in get_base_embeddings
    dist_x = [scalarInput(abs(i - tg_start) + 1) for i in xrange(sentlen)]
TypeError: Argument 'x' has incorrect type (expected _dynet.Expression, got int)

globalconfig.py.txt
err.txt
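
Regarding filtering the GloVe file: below is a minimal illustration of one way to restrict glove.6B.100d.txt to a given vocabulary. This is not open-sesame's own filtering code; the vocabulary file name is hypothetical, and the output name simply matches the FILTERED_WVECS_FILE name (glove.6B.100d.framenet.txt) mentioned in a later issue.

# filter_glove.py: keep only embedding rows whose word appears in a vocabulary file
vocab = set()
with open("framenet_vocab.txt", encoding="utf-8") as fin:     # hypothetical: one word per line
    for line in fin:
        vocab.add(line.strip())

with open("glove.6B.100d.txt", encoding="utf-8") as fin, \
     open("glove.6B.100d.framenet.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        if line.split(" ", 1)[0] in vocab:
            fout.write(line)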

/opt/conda/lib/python3.6/runpy.py:125: RuntimeWarning: 'nltk.downloader' found in sys.modules after import of package 'nltk', but prior to execution of 'nltk.downloader'; this may result in unpredictable behaviour

I realize that this project is implemented in python2.7 but I am running it in python3.6 because the former is ~deprecated.

Getting a warning indicating a double import of a dependency. See here:
https://stackoverflow.com/questions/43393764/python-3-6-project-structure-leads-to-runtimewarning

If I get to it, I will submit a PR to address this.

Thanks
--aaron

annotation hack, framenet, glove and dynet

Hi, I tried it out, but I failed.
First, the preprocess script broke because "annotation_id" was not defined (I don't think it was actually undefined, but that was the error). So I deleted the hack and it worked.

Second, I tried it with the newest version of FrameNet, 1.7, and it failed, somehow looking for version 1.5, which was obviously not there. So I tried version 1.5 and got it working.

Then I tried to train the parser, but it automatically looked for the GloVe vectors in the data directory, which weren't there (although I didn't tell it to use any GloVe vectors). After I put the files there, it did what it should.

But now it complains:
AttributeError: '_dynet.ParameterCollection' object has no attribute 'load'

I installed DyNet version 2. Hope that this is right. I can't see anything else, but the Model() and .load() calls seem to be deprecated. I tried to change them to dynet.ParamCollection() and .populate(), but then it couldn't read the model: "RuntimeError: Could not read model from model.frameid.1.5"

Evaluation

When I test the trained argument identification model using
python segrnn-argid.py

It returns three types of F1 scores:
wf1, uf1, and lf1

I cannot find documentation for each evaluation metric.
Could you explain them?

And my other question is: is it evaluated with gold frames or predicted frames?
If it is gold frames, how can I evaluate the end-to-end performance shown in Table 5 of your paper?

Run Pip?

Could you make this installable and usable via pip?
Thanks!

RuntimeError: Magnitude of gradient is bad: -nan when trying to train frameid

I am trying to train the frameid model, but I get this error at the very beginning of training. I am using the latest version of dynet (2.1). I have ported open-sesame to Python 3, and I am using the Python 3 version for training, but even with the Python 2.7 version, I am still getting the same error.

Traceback (most recent call last):
  File "/home/anaconda3/envs/pytorch_dynet_copy/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/anaconda3/envs/pytorch_dynet_copy/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/testframenet/open-sesame/sesame/frameid.py", line 295, in <module>
    trainer.update()
  File "_dynet.pyx", line 6198, in _dynet.Trainer.update
  File "_dynet.pyx", line 6203, in _dynet.Trainer.update
RuntimeError: Magnitude of gradient is bad: -nan

Normal for prediction on unannotated data to take a long time?

When I run

python -m sesame.targetid --mode predict \
                        --model_name fn1.7-pretrained-targetid \
                        --raw_input myinput.txt

it just keeps running (the longest I let it run was about 3.5 hours before stopping it). The output looks like

[lr=0.01 clips=94 updates=100] epoch = 2.700 loss = 6.098552 train f1 = 0.7560
[lr=0.01 clips=94 updates=100] epoch = 2.800 loss = 6.038111 train f1 = 0.7584
[lr=0.01 clips=93 updates=100] epoch = 2.900 loss = 6.046757 train f1 = 0.7628
[dev epoch=2] loss = 3.448724 p = 0.7872 (1779.0/2260.0) r = 0.7513 (1779.0/2368.0) f1 = 0.7688 -- saving to logs/predict/best-targetid-1.7-model
[lr=0.01 clips=96 updates=100] epoch = 2.1000 loss = 6.093396 train f1 = 0.7125
[lr=0.01 clips=94 updates=100] epoch = 2.1100 loss = 6.180117 train f1 = 0.7512
[lr=0.01 clips=93 updates=100] epoch = 2.1200 loss = 6.178361 train f1 = 0.7239
[dev epoch=2] loss = 3.113013 p = 0.7686 (1857.0/2416.0) r = 0.7842 (1857.0/2368.0) f1 = 0.7763 -- saving to logs/predict/best-targetid-1.7-model

and keeps spilling over.

Is this normal for the first time? I'm using the pretrained models, so I don't know why it's running forever like this (I assume that training the models is what takes forever).

Error While Training: no attribute '__reduce_cython__'

I encountered a strange error while training some of the models: 'scipy.interpolate.interpnd.array' has no attribute '__reduce_cython__'

Some googling for similar errors indicated that adjusting the version of Cython might resolve the issue, so I reverted to Cython==0.29.12. This seems to have solved the problem -- I hope this helps anyone else who might be running into something similar!

Google colab

Is it possible to install this project to google colab?

Token size mismatch for pre-trained model

The issue I met with the new pre-trained model is that there is still a RuntimeError saying:

RuntimeError: Dimensions of lookup parameter /_0 lookup up from file ({100,400574}) do not match parameters to be populated ({100,400575})

It seems like the vocabulary used for the pre-trained models (400575 tokens in total) has one more token than the current code implementation (400574). This happens for all three models (targetid, frameid and argid) in test or predict mode.

I got this error when using the commit e5b8ad4b8d9ea76473365558cb234e8bad4af874.

How to run the SemEval script on your system's output

Hi,

Thank you very much for this contribution! I just read your paper, and I am now trying out the code.

I am wondering how to put the CoNLL output of the frame semantic parser into the SemEval XML format so I can run the script provided by them as you did in your paper.

Thank you in advance for your time,
Breno

Framenet test set for comparison

Hi, I saw that in your paper about open-sesame you said that you used the same test set as Das and Smith 2011, which is a subset of FrameNet 1.5 fulltext annotations. However, in this code it seems that you're dealing with the 1.7 version.
Is there any way to get the test set annotation IDs from 1.5 that you (and Das and Smith 2011) used, to run some comparable experiments without converting everything to formats that might not be needed?

Thanks

Understanding the output

For each sentence, the argid model produces two CoNLL 2009 blocks that can contain conflicting values for the ROLE column. Can someone explain why there are two CoNLL 2009 entries for each sentence? Why would they conflict with each other? See the example below:

1	the	_	The	_	DT	0	_	_	_	_	_	_	_	O
2	stock	_	stock	_	NN	0	_	_	_	_	_	stock.n	Store	S-Supply
3	was	_	be	_	VBD	0	_	_	_	_	_	_	_	O
4	bought	_	buy	_	VBN	0	_	_	_	_	_	_	_	O
5	by	_	by	_	IN	0	_	_	_	_	_	_	_	O
6	bob	_	Bob	_	NNP	0	_	_	_	_	_	_	_	O
7	.	_	.	_	.	0	_	_	_	_	_	_	_	O

1	the	_	The	_	DT	0	_	_	_	_	_	_	_	B-Goods
2	stock	_	stock	_	NN	0	_	_	_	_	_	_	_	I-Goods
3	was	_	be	_	VBD	0	_	_	_	_	_	_	_	O
4	bought	_	buy	_	VBN	0	_	_	_	_	_	buy.v	Commerce_buy	O
5	by	_	by	_	IN	0	_	_	_	_	_	_	_	B-Goods
6	bob	_	Bob	_	NNP	0	_	_	_	_	_	_	_	I-Goods
7	.	_	.	_	.	0	_	_	_	_	_	_	_	O

1	bob	_	Bob	_	NNP	1	_	_	_	_	_	_	_	S-Goods
2	bought	_	buy	_	VBD	1	_	_	_	_	_	buy.v	Commerce_buy	O
3	the	_	the	_	DT	1	_	_	_	_	_	_	_	B-Goods
4	stock	_	stock	_	NN	1	_	_	_	_	_	_	_	I-Goods
5	.	_	.	_	.	1	_	_	_	_	_	_	_	O

1	bob	_	Bob	_	NNP	1	_	_	_	_	_	_	_	O
2	bought	_	buy	_	VBD	1	_	_	_	_	_	_	_	O
3	the	_	the	_	DT	1	_	_	_	_	_	_	_	O
4	stock	_	stock	_	NN	1	_	_	_	_	_	stock.n	Store	S-Supply
5	.	_	.	_	.	1	_	_	_	_	_	_	_	O

Couple questions

  1. is there a resume function to training that enables one to not start from scratch each time?
  2. approximately how long should I expect to train each network frame, target and arg on a GPU with 3GB memory?
  3. can you point to any visualization tools for looking at the training data and predicted data? I think they are all in CONLL2009 format
  4. can you recommend a good way think about the number of epochs per training round? How was 5 chosen?
  5. why does predict mode for frameid.py require me to read the training data?

Thanks
--aaron

output description

I predicted on unannotated data using pre-trained models.
But I don't understand the result.

Can I get description of the columns in predicted-args.conll?

KeyError on `factorexprs` in `get_loss`

  File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/ws/nbs/frame-sem-parse/open-sesame/sesame/argid.py", line 924, in <module>
    goldfes=trex.invertedfes)
  File "/ws/nbs/frame-sem-parse/open-sesame/sesame/argid.py", line 813, in identify_fes
    segrnnloss = get_loss(factor_exprs, goldfes, valid_fes, sentlen)
  File "/ws/nbs/frame-sem-parse/open-sesame/sesame/argid.py", line 674, in get_loss
    numeratorexprs = [factorexprs[gf] for gf in goldfactors]
  File "/ws/nbs/frame-sem-parse/open-sesame/sesame/argid.py", line 674, in <listcomp>
    numeratorexprs = [factorexprs[gf] for gf in goldfactors]
KeyError: <sesame.housekeeping.Factor object at 0x7f320dac0f98>

Are the keys of `factorexprs` supposed to be `sesame.housekeeping.Factor` objects?
Any idea what is going on here?  The line right above 
`numeratorexprs = [factorexprs[gf] for gf in goldfactors]`
is where the Factor objects are instantiated:
`goldfactors = [Factor(span[0], span[1], feid) for feid in gold_fes for span in gold_fes[feid]]`

Clarify required DyNet version

The README says DyNet-v2, but using 2.0 (from pip) fails:

AttributeError: '_dynet.ParameterCollection' object has no attribute 'load'

Also, 2.0 doesn't require Boost.
Is it actually 1.1?

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 1: ordinal not in range(128)

Hi,

I tried using a pretrained model to annotate a corpus. I first tried a small example where the sentences.txt file has only 5 sentences, and it worked well.
Then I switched to my own dataset, which is a lot bigger, and I am getting this error in the first step, when running targetid prediction:

Any suggestion?

_____________________
COMMAND: /home/hannah/open-sesame/sesame/targetid.py --mode predict --model_name fn1.7-pretrained-targetid --raw_input stories.dev
MODEL FOR TEST / PREDICTION:    logs/fn1.7-pretrained-targetid/best-targetid-1.7-model
PARSING MODE:   predict
_____________________


Reading data/neural/fn1.7/fn1.7.fulltext.train.syntaxnet.conll ...
# examples in data/neural/fn1.7/fn1.7.fulltext.train.syntaxnet.conll : 19391 in 3413 sents
# examples with missing arguments : 526
Combined 19391 instances in data into 3413 instances.

Reading the lexical unit index file: data/fndata-1.7/luIndex.xml
# unique targets = 9421
# total targets = 13572
# targets with multiple LUs = 4151
# max LUs per target = 5


Reading pretrained embeddings from data/glove.6B.100d.txt ...
Traceback (most recent call last):
  File "/home/hannahbrahman/anaconda3/envs/py27/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/home/hannahbrahman/anaconda3/envs/py27/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/home/hannah/open-sesame/sesame/targetid.py", line 87, in <module>
    instances = [make_data_instance(line, i) for i,line in enumerate(fin)]
  File "sesame/raw_data.py", line 18, in make_data_instance
    for i in range(len(tokenized))]
  File "/home/hannahbrahman/anaconda3/envs/py27/lib/python2.7/site-packages/nltk/stem/wordnet.py", line 41, in lemmatize
    lemmas = wordnet._morphy(word, pos)
  File "/home/hannahbrahman/anaconda3/envs/py27/lib/python2.7/site-packages/nltk/corpus/reader/wordnet.py", line 1909, in _morphy
    forms = apply_rules([form])
  File "/home/hannahbrahman/anaconda3/envs/py27/lib/python2.7/site-packages/nltk/corpus/reader/wordnet.py", line 1889, in apply_rules
    if form.endswith(old)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 1: ordinal not in range(128)

Using pretrained model to get argids produces the same output as frameid identification

Hi,

I was able to run the first two steps of annotating a corpus using pretrained model, i.e. target id and frame id identification.
But when I run the code for the argid identification, it produces the same output as its raw_input which is: logs/fn1.7-pretrained-frameid/predicted-frames.conll

Has anyone encountered the same issue? What am I doing wrong?

Thanks

Problem using a pre-trained model

I run :

python -m sesame.targetid --model_name pretrained-targetid --mode predict --raw_input sentences.txt

But got:

Traceback (most recent call last):
  File "/gds/miniconda2/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/gds/miniconda2/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/workbench/open-sesame/sesame/targetid.py", line 432, in <module>
    model.populate(model_file_name)
  File "_dynet.pyx", line 1061, in _dynet.ParameterCollection.populate
  File "_dynet.pyx", line 1116, in _dynet.ParameterCollection.populate_from_textfile
RuntimeError: Dimensions of lookup parameter /_0 lookup up from file ({100,74152}) do not match parameters to be populated ({100,10411})

How can I solve that?

Missing sentences

Currently, if no targets are found for a particular input sentence, no output at all will be produced for this sentence. This causes problems if you are trying to get frame annotations for some raw input, and then try to map these frame annotations back to the corresponding sentences in the input, since the number of input sentences and output sentences won't match.

The relevant code is:

def print_as_conll(gold_examples, predicted_target_dict):
    """
    Creates a CoNLL object with predicted target and lexical unit.
    Spits out one CoNLL for each LU.
    """
    with codecs.open(out_conll_file, "w", "utf-8") as conll_file:
        for gold, pred in zip(gold_examples, predicted_target_dict):
            for target in sorted(pred):
                result = gold.get_predicted_target_conll(target, pred[target][0]) + "\n"
                conll_file.write(result)
        conll_file.close()

If pred is empty for a particular instance, nothing will be written.

Would it be possible to (optionally, to not break the CONLL format) write some kind of null output (e.g. "SKIPPED_SENTENCE"), so that, when processing the output, it's easier to find which sentences are missing?
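
A minimal sketch of what such an optional placeholder could look like, based on the function quoted above (the mark_skipped flag and the SKIPPED_SENTENCE string are hypothetical, not part of open-sesame; the codecs import and out_conll_file are assumed to come from the same surrounding module):

def print_as_conll(gold_examples, predicted_target_dict, mark_skipped=False):
    """
    Creates a CoNLL object with predicted target and lexical unit.
    Spits out one CoNLL per LU; optionally marks sentences with no predicted targets.
    """
    with codecs.open(out_conll_file, "w", "utf-8") as conll_file:
        for gold, pred in zip(gold_examples, predicted_target_dict):
            if mark_skipped and not pred:
                # hypothetical marker so input and output sentences can be realigned
                conll_file.write("SKIPPED_SENTENCE\n\n")
                continue
            for target in sorted(pred):
                result = gold.get_predicted_target_conll(target, pred[target][0]) + "\n"
                conll_file.write(result)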

Error in Prediction on unannotated data

Hi! I want to run prediction on unannotated data.
I made "sentences.txt" in the root directory.
I made a "logs" directory in the root directory and extracted the pre-trained models into it.

I run this command :
python -m sesame.targetid --mode predict --model_name fn1.7-pretrained-targetid --raw_input sentences.txt

result logs :

DATA_DIRECTORY: data/
DEBUG_MODE:     False
EMBEDDINGS_FILE:        data/glove.6B.100d.txt
VERSION:        1.7
_____________________
COMMAND: /data/open-sesame/sesame/targetid.py --mode predict --model_name fn1.7-pretrained-targetid --raw_input sentences.txt
MODEL FOR TEST / PREDICTION:    logs/fn1.7-pretrained-targetid/best-targetid-1.7-model
PARSING MODE:   predict
_____________________


Reading data/neural/fn1.7/fn1.7.fulltext.train.syntaxnet.conll ...
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/data/open-sesame/sesame/targetid.py", line 62, in <module>
    train_examples, _, _ = read_conll(train_conll)
  File "sesame/dataio.py", line 32, in read_conll
    with codecs.open(conll_file, "r", "utf-8") as cf:
  File "/usr/lib/python2.7/codecs.py", line 896, in open
    file = __builtin__.open(filename, mode, buffering)
IOError: [Errno 2] No such file or directory: u'data/neural/fn1.7/fn1.7.fulltext.train.syntaxnet.conll'

Did I miss anything?
Thanks!

Error in readme

Running:

python -m sesame.targetid --mode predict --model_name pretrained-targetid --raw_input sentences.txt

Returns targetid.py: error: option --mode: invalid choice: 'predict' (choose from 'convert_conll_to_fe', 'count_frame_elements', 'compare_fefiles')

This may be env related or something else I overlooked.
Python 2.7 / FN 1.7

Using evaluation.py

Hello,

Thank you very much for this contribution.
I would like to test open-sesame in new available FrameNet corpus.

I'm at the evaluation step, but I'm having a hard time using the evaluation.py script. I see that it takes as input a file like the ones auto-generated in
"./data/neural/fn1.6/fn1.6.test.syntaxnet.conll",
but it fails while loading the default TEST_FILE. I get the following stack trace:

File "evaluation.py", line 216, in
goldexamples, _, _ = read_conll(goldfile)
File "/FRAME_PARSERS/open-sesame/src/dataio.py", line 38, in read_conll
e = CoNLL09Example(sentence, elements)
File "/FRAME_PARSERS/open-sesame/src/conll09.py", line 120, in init
if FEDICT.getid(NOTANFE) in self.invertedfes:
File "/FRAME_PARSERS/open-sesame/src/housekeeping.py", line 52, in getid
raise Exception("not in dictionary, but can be added", id)
Exception: ('not in dictionary, but can be added', )

Any ideas on how to fix this issue?
Kind regards,
Gabriel M

Using Open-Sesame To Parse Multiple Inputs?

I'm currently on a project that's attempting to use Open-Sesame to parse multiple text inputs at once. Calling Open-Sesame to predict on all these sentences one by one from the command line has proved prohibitively slow, but I've noticed that most of the time spent predicting on a sentence seems to come from model loading rather than from actual prediction. Is there any way to have open-sesame loaded and continuously running somewhere, so that it can be called to predict on text inputs without having to be loaded every time? I believe this was done for another ASRL package, SEMAFOR, whose website (although it's currently down) seemed to be running SEMAFOR on a separate server, where it was always loaded, and which was then queried to get parses for inputs without the degree of delay that calling open-sesame from the command line has. Is that possible to replicate here?

Mismatched parameters between loaded and populated

When trying to either test or predict, I run into an error. I'm new to DyNet, but something similar happens here: clab/dynet#1221, to which @neubig suggests:

This sort of error normally happens when you have a different model defined at training and test time. I'd make sure that you're calling exactly the same constructor code during training and test.

ERROR:


Reading model from logs/fn1.7-pretrained-targetid/best-targetid-1.7-model ...
Traceback (most recent call last):
  File "/usr/lib64/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/nas/home/thawani/MCS/open-sesame/sesame/targetid.py", line 431, in <module>
    model.populate(model_file_name)
  File "_dynet.pyx", line 1461, in _dynet.ParameterCollection.populate
  File "_dynet.pyx", line 1516, in _dynet.ParameterCollection.populate_from_textfile
RuntimeError: Number of parameter/lookup parameter objects loaded from file (20/4) did not match number to be populated (20/5)


Here's the log before the error:


[dynet] random seed: 1798024527
[dynet] allocating memory: 512MB
[dynet] memory allocation done.
DATA_DIRECTORY: data/
DEBUG_MODE: False
EMBEDDINGS_FILE: data/glove.6B.100d.txt
VERSION: 1.7


COMMAND: /nas/home/thawani/MCS/open-sesame/sesame/targetid.py --mode predict --model_name fn1.7-pretrained-targetid --raw_input raw.txt
MODEL FOR TEST / PREDICTION: logs/fn1.7-pretrained-targetid/best-targetid-1.7-model
PARSING MODE: predict


Reading data/neural/fn1.7/fn1.7.fulltext.train.syntaxnet.conll ...
#examples in data/neural/fn1.7/fn1.7.fulltext.train.syntaxnet.conll : 19391 in 3413 sents
#examples with missing arguments : 526
Combined 19391 instances in data into 3413 instances.

Reading the lexical unit index file: data/fndata-1.7/luIndex.xml
#unique targets = 9421
#total targets = 13572
#targets with multiple LUs = 4151
#max LUs per target = 5

Reading pretrained embeddings from data/glove.6B.100d.txt ...

PARSER SETTINGS (see logs/fn1.7-pretrained-targetid/configuration.json)


DEV_EVAL_EPOCH_FREQUENCY: 3
DROPOUT_RATE: 0.01
EVAL_AFTER_EVERY_EPOCHS: 100
HIDDEN_DIM: 100
LEMMA_DIM: 100
LSTM_DEPTH: 2
LSTM_DIM: 100
LSTM_INPUT_DIM: 100
NUM_EPOCHS: 100
PATIENCE: 25
POS_DIM: 100
PRETRAINED_EMBEDDING_DIM: 100
TOKEN_DIM: 100
TRAIN: data/neural/fn1.7/fn1.7.fulltext.train.syntaxnet.conll
UNK_PROB: 0.1
USE_DROPOUT: True

#Tokens = 400574
Unseen in dev/test = 0
Unlearnt in dev/test = 390524
#POS tags = 45
Unseen in dev/test = 0
Unlearnt in dev/test = 1
#Lemmas = 9349
Unseen in dev/test = 2
Unlearnt in dev/test = 3


Command:
python -m sesame.targetid --mode predict --model_name fn1.7-pretrained-targetid --raw_input raw.txt


Online Demo?

Hello everyone!
Can you provide an online demo for this project? It will be of great help!
Thanks!

FN 1.7 pre-trained targetid model throws AttributeError: 'Sentence' object has no attribute 'tokens'

I have an error running the targetid model that was pre-trained on FN 1.7. I am trying to do prediction on unannotated data. The error appears when loading the pretrained Glove embeddings. Here is the console output:

Reading pretrained embeddings from data/glove.6B.100d.txt ...
Traceback (most recent call last):
  File "{$HOME}/anaconda3/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "{$HOME}/anaconda3/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "{$HOME}/open-sesame/sesame/targetid.py", line 87, in <module>
    instances = [make_data_instance(line, i) for i,line in enumerate(fin)]
  File "sesame/raw_data.py", line 25, in make_data_instance
    instance = CoNLL09Example(sentence, elements)
  File "sesame/conll09.py", line 94, in __init__
    FrameSemParse.__init__(self, sentence)
  File "sesame/frame_semantic_graph.py", line 69, in __init__
    self.tokens = sentence.tokens
AttributeError: 'Sentence' object has no attribute 'tokens'

Why do you block overlapping frame elements?

Hi @swabhs,

My question is not exactly an issue; it is more related to a modification I am performing on FrameNet data and how it fits in your code:

In those lines you raise an exception if there are any overlapping frame element annotations:

raise Exception("\t\tIssue: duplicate FE at ", idx, self.fe)

and
raise Exception("duplicate FE at ", idx, offset, arglabel)

I am working on a data augmentation strategy that would add some overlapping frame element annotations (from different frames). What happens if I lift this restriction and allow overlapping frame elements? Is there a way I can add those new annotations to the training data without running into any issues?

Thank you for your time :)

Error when Dynet is saving the model after training

I have this line in globalconfig.py:20 VERSION="1.7" and I'm working with fn1.7 data.
I ran the preprocess steps.

Now I'm trying to run the training with this command: python segrnn-argid.py

I'm getting this error:

Traceback (most recent call last):
  File "segrnn-argid.py", line 98, in <module>
    wvs = get_wvec_map()
  File "/Users/pinouchon/code/huggingface/open-sesame/src/dataio.py", line 276, in get_wvec_map
    raise Exception("word vector file not found!", FILTERED_WVECS_FILE)
Exception: ('word vector file not found!', '../data/glove.6B.100d.framenet.txt')

The error goes away if I replace
FILTERED_WVECS_FILE = DATADIR + "glove.6B.100d.framenet.txt" with
FILTERED_WVECS_FILE = DATADIR + "glove.6B.100d.txt"
in globalconfig.py:85.

Now the training starts without errors (python segrnn-argid.py).

But after about 40min of training, I get this new error:

[dev epoch=0 after=2001] lprec = 0.40382 lrec = 0.14518 lf1 = 0.21358 -- savinglibc++abi.dylib: terminating with uncaught exception of type std::runtime_error: Could not write model to tmp/1.7model.sra-1527520332.05

This looks like a low-level error inside dynet. I cannot find what is causing it with google/stackoverflow.
I installed dynet with pip install dynet. I'm running OSX High Sierra and python 2.7.10 inside a virtualenv.
I ran the training twice with both times the same error (and a different tmp file name in each case)
The error doesn't look related to my fix in globalconfig.py:85.
Any pointers?

Full output of the training:

[dynet] random seed: 1594657864
[dynet] allocating memory: 512MB
[dynet] memory allocation done.

COMMAND: segrnn-argid.py

PARSER SETTINGS
_____________________
PARSING MODE:   	train
USING EXEMPLAR? 	False
USING SPAN CLIP?	True
LOSS TYPE:      	softmaxm
COST TYPE:      	recall
R-O COST VALUE: 	2
USING DROPOUT?  	True
USING WORDVECS? 	True
USING HIERARCHY?	False
USING D-SYNTAX? 	False
USING C-SYNTAX? 	False
USING PTB-CLOSS?	False
MODEL WILL BE SAVED TO	tmp/1.7model.sra-1527520332.05
_____________________
reading ../data/neural/fn1.7/fn1.7.fulltext.train.syntaxnet.conll...
# examples in ../data/neural/fn1.7/fn1.7.fulltext.train.syntaxnet.conll : 19391 in 3413 sents
# examples with missing arguments : 526

reading the frame-element - frame map from ../data/fndata-1.7/frame/...
# max FEs for frame: 32 in Frame(Traversing)

reading the word vectors file from ../data/glove.6B.100d.txt...
using pretrained embeddings of dimension 100
# words in vocab:       400575
# POS tags:             45
# lexical units:        9441
# LU POS tags:          14
# frames:               1223
# FEs:                  1287
# dependency relations: 1
# constituency labels:  1

clipping spans longer than 20...
longest span size: 102
longest FE span size: 89
# train examples before filter: 19391
# train examples after filter: 19391

reading ../data/neural/fn1.7/fn1.7.dev.syntaxnet.conll...
# examples in ../data/neural/fn1.7/fn1.7.dev.syntaxnet.conll : 2272 in 326 sents
# examples with missing arguments : 73

unknowns in dev

_____________________
# unseen, unlearnt test words in vocab: (45, 390570)
# unseen, unlearnt test POS tags:       (0, 1)
# unseen, unlearnt test lexical units:  (0, 6444)
# unseen, unlearnt test LU pos tags:    (0, 3)
# unseen, unlearnt test frames:         (0, 469)
# unseen, unlearnt test FEs:            (0, 521)
# unseen, unlearnt test deprels:        (0, 1)
# unseen, unlearnt test constit labels: (0, 1)

[lr=0.0005 clips=99 updates=100] 100 loss = 38.650128 [took 46.383 s]
[lr=0.0005 clips=100 updates=100] 200 loss = 20.779121 [took 50.716 s]
[lr=0.0005 clips=100 updates=100] 300 loss = 17.716823 [took 46.031 s]
[lr=0.0005 clips=99 updates=100] 400 loss = 18.769036 [took 41.463 s]
[lr=0.0005 clips=100 updates=100] 500 loss = 18.951144 [took 49.424 s]
[lr=0.0005 clips=100 updates=100] 600 loss = 20.763794 [took 51.008 s]
[lr=0.0005 clips=100 updates=100] 700 loss = 17.897359 [took 45.175 s]
[lr=0.0005 clips=100 updates=100] 800 loss = 17.369590 [took 42.235 s]
[lr=0.0005 clips=98 updates=100] 900 loss = 16.837128 [took 49.753 s]
[lr=0.0005 clips=100 updates=100] 1000 loss = 17.795842 [took 51.235 s]
[dev epoch=0 after=1001] wprec = 0.00000 wrec = 0.00000 wf1 = 0.00000
[dev epoch=0 after=1001] uprec = 0.00000 urec = 0.00000 uf1 = 0.00000
[dev epoch=0 after=1001] lprec = 0.00000 lrec = 0.00000 lf1 = 0.00000 [took 621.073 s]
[lr=0.0005 clips=100 updates=100] 1100 loss = 16.862659 [took 50.687 s]
[lr=0.0005 clips=100 updates=100] 1200 loss = 14.759756 [took 40.827 s]
[lr=0.0005 clips=100 updates=100] 1300 loss = 14.575772 [took 39.446 s]
[lr=0.0005 clips=100 updates=100] 1400 loss = 14.491017 [took 42.966 s]
[lr=0.0005 clips=100 updates=100] 1500 loss = 15.175744 [took 55.345 s]
[lr=0.0005 clips=100 updates=100] 1600 loss = 14.648142 [took 42.464 s]
[lr=0.0005 clips=100 updates=100] 1700 loss = 13.749653 [took 50.359 s]
[lr=0.0005 clips=100 updates=100] 1800 loss = 13.874129 [took 46.874 s]
[lr=0.0005 clips=100 updates=100] 1900 loss = 14.471691 [took 42.907 s]
[lr=0.0005 clips=100 updates=100] 2000 loss = 13.668519 [took 49.962 s]
[dev epoch=0 after=2001] wprec = 0.41848 wrec = 0.06883 wf1 = 0.11822
[dev epoch=0 after=2001] uprec = 0.55100 urec = 0.18472 uf1 = 0.27668
[dev epoch=0 after=2001] lprec = 0.40382 lrec = 0.14518 lf1 = 0.21358 -- savinglibc++abi.dylib: terminating with uncaught exception of type std::runtime_error: Could not write model to tmp/1.7model.sra-1527520332.05
[1]    54727 abort      python segrnn-argid.py

Can the annotations in the LU files be used for training?

Looks like only the fulltext annotations are used for training. There are many partial sentence annotations done as part of the Lexical Units (present in the lu directory). These are partial in that only the part relevant to the LU/frame is annotated, and not all the frames in the sentence.

Can these partial annotations be used in training the models to get better results? Or will the partial annotations hamper the learning process?
