
pyfn's Introduction

pyfn


Welcome to pyfn, a Python module to process FrameNet annotation.

pyfn can be used to:

  1. convert data to and from FRAMENET XML, SEMEVAL XML, SEMAFOR CoNLL, BIOS and CoNLL-X
  2. preprocess FrameNet data using a standardized state-of-the-art pipeline
  3. run the SEMAFOR, OPEN-SESAME and SIMPLEFRAMEID frame semantic parsers for frame and/or argument identification on the FrameNet 1.5, 1.6 and 1.7 datasets
  4. build your own frame semantic parser using a standard set of Python models to marshall/unmarshall FrameNet XML data

This repository also accompanies the (Kabbach et al., 2018) paper:

@InProceedings{C18-1267,
  author = 	"Kabbach, Alexandre
		and Ribeyre, Corentin
		and Herbelot, Aur{\'e}lie",
  title = 	"Butterfly Effects in Frame Semantic Parsing: impact of data processing on model ranking",
  booktitle = 	"Proceedings of the 27th International Conference on Computational Linguistics",
  year = 	"2018",
  publisher = 	"Association for Computational Linguistics",
  pages = 	"3158--3169",
  location = 	"Santa Fe, New Mexico, USA",
  url = 	"http://aclweb.org/anthology/C18-1267"
}

Dependencies

On Unix, you may need to install the following packages:

libxml2 libxml2-dev libxslt1-dev python-3.x-dev

Install

pip3 install pyfn
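To check that the install succeeded, you can import the package from a Python shell. This is a minimal sanity check; it only verifies that the module is importable and shows where it was installed:

import pyfn

# __file__ points at the installed package's __init__.py
print(pyfn.__file__)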

Use

When using pyfn, your FrameNet splits directory should be structured as follows:

.
|-- fndata-1.x-with-dev
|   |-- train
|   |   |-- fulltext
|   |   |-- lu
|   |-- dev
|   |   |-- fulltext
|   |   |-- lu
|   |-- test
|   |   |-- fulltext
|   |   |-- lu
|   |-- frame
|   |-- frRelation.xml
|   |-- semTypes.xml

Conversion

pyfn can be used to convert data to and from:

  • FRAMENET XML: the format of the released FrameNet XML data
  • SEMEVAL XML: the format of the SEMEVAL 2007 shared task 19 on frame semantic structure extraction
  • SEMAFOR CoNLL: the format used by the SEMAFOR parser
  • BIOS: the format used by the OPEN-SESAME parser
  • CoNLL-X: the format used by various state-of-the-art POS taggers and dependency parsers (see preprocessing considerations for frame semantic parsing below)

pyfn can also be used to generate the .csv hierarchy files used by both the SEMAFOR and OPEN-SESAME parsers to integrate the hierarchy feature (see (Kshirsagar et al., 2015) for details).

For an exhaustive description of all formats, check out FORMAT.md.

HowTo

The following sections provide examples of commands to convert FN data to and from different formats. All commands can make use of the following options:

  1. --splits: specify which splits should be converted. --splits train generates the train, dev and test splits from the data found under the fndata-1.x/{train,dev,test} directories. --splits dev generates the dev and test splits from the data found under the fndata-1.x/{dev,test} directories; it skips the train splits but produces the same dev/test splits as --splits train. --splits test generates only the test splits from the data found under the fndata-1.x/test directory; these test splits are identical to those produced with --splits train or --splits dev. Defaults to --splits test.
  2. --output_sentences: if specified, will output a .sentences file in the process, containing all raw annotated sentences, one sentence per line.
  3. --with_exemplars: if specified, will process the exemplars (data under the lu directory) in addition to fulltext.
  4. --filter: specify data filtering options (see details below).

For details on pyfn usage, do:

pyfn --help
pyfn generate --help
pyfn convert --help

From FN XML to BIOS

To convert data from FrameNet XML format to BIOS format, do:

pyfn convert \
  --from fnxml \
  --to bios \
  --source /abs/path/to/fndata-1.x \
  --target /abs/path/to/xp/data/output/dir \
  --splits train \
  --output_sentences \
  --filter overlap_fes

Using --filter overlap_fes will skip all annotationsets with overlapping frame elements, as those cases are not supported by the BIOS format.

From FN XML to SEMAFOR CoNLL

To generate the train.frame.elements file used to train SEMAFOR, and the {dev,test}.frames files used for decoding, do:

pyfn convert \
  --from fnxml \
  --to semafor \
  --source /abs/path/to/fndata-1.x \
  --target /abs/path/to/xp/data/output/dir \
  --splits train \
  --output_sentences

From FN XML to SEMEVAL XML

To generate the {dev,test}.gold.xml gold files in SEMEVAL format for scoring, do:

pyfn convert \
  --from fnxml \
  --to semeval \
  --source /abs/path/to/fndata-1.x \
  --target /abs/path/to/xp/data/output/dir \
  --splits {dev,test}

From BIOS to SEMEVAL XML

To convert OPEN-SESAME's decoded BIOS files {dev,test}.bios.semeval.decoded to SEMEVAL XML format for scoring, do:

pyfn convert \
  --from bios \
  --to semeval \
  --source /abs/path/to/{dev,test}.bios.semeval.decoded \
  --target /abs/path/to/output/{dev,test}.predicted.xml \
  --sent /abs/path/to/{dev,test}.sentences

From SEMAFOR CoNLL to SEMEVAL XML

To convert SEMAFOR's decoded {dev,test}.frame.elements files to SEMEVAL XML format for scoring, do:

pyfn convert \
  --from semafor \
  --to semeval \
  --source /abs/path/to/{dev,test}.frame.elements \
  --target /abs/path/to/output/{dev,test}.predicted.xml \
  --sent /abs/path/to/{dev,test}.sentences

Generate the hierarchy .csv files

To generate the hierarchy .csv files used by SEMAFOR and OPEN-SESAME, do:

pyfn generate \
  --source /abs/path/to/fndata-1.x \
  --target /abs/path/to/xp/data/output/dir

To also process exemplars, add the --with_exemplars option.

Preprocessing and Frame Semantic Parsing

pyfn ships with a set of bash scripts to preprocess FrameNet data with various POS taggers and dependency parsers, and to perform frame semantic parsing with a variety of open-source parsers.

Currently supported POS taggers include:

  • MXPOST (Ratnaparkhi, 1996)
  • NLP4J (Choi, 2016)

Currently supported dependency parsers include:

  • MST (McDonald et al., 2006)
  • BIST BARCH (Kiperwasser and Goldberg, 2016)
  • BIST BMST (Kiperwasser and Goldberg, 2016)

Currently supported frame semantic parsers include:

  • SIMPLEFRAMEID (Hartmann et al., 2017) for frame identification
  • SEMAFOR (Kshirsagar et al., 2015) for argument identification
  • OPEN-SESAME (Swayamdipta et al., 2017) for argument identification

To request support for a POS tagger, a dependency parser or a frame semantic parser, please create an issue on GitHub/GitLab.

Download

To run the preprocessing and frame semantic parsing scripts, first download:

  • data.7z containing all the FrameNet splits for FN 1.5 and FN 1.7
wget backup.3azouz.net/pyfn/data.7z
  • lib.7z containing all the different external software (taggers, parsers, etc.)
wget backup.3azouz.net/pyfn/lib.7z
  • resources.7z containing all the required resources
wget backup.3azouz.net/pyfn/resources.7z
  • scripts.7z containing the set of bash scripts to call the different parsers and preprocessing toolkits
wget backup.3azouz.net/pyfn/scripts.7z

Extract the content of all the archives under a directory named pyfn. Your pyfn folder structure should look like:

.
|-- pyfn
|   |-- data
|   |   |-- fndata-1.5-with-dev
|   |   |-- fndata-1.7-with-dev
|   |-- lib
|   |   |-- bistparser
|   |   |-- jmx
|   |   |-- mstparser
|   |   |-- nlp4j
|   |   |-- open-sesame
|   |   |-- semafor
|   |   |-- semeval
|   |-- resources
|   |   |-- bestarchybrid.model
|   |   |-- bestarchybrid.params
|   |   |-- bestfirstorder.model
|   |   |-- bestfirstorder.params
|   |   |-- config-decode-pos.xml
|   |   |-- nlp4j.plemma.model.all.xz
|   |   |-- sskip.100.vectors
|   |   |-- wsj.model
|   |-- scripts
|   |   |-- CoNLLizer.py
|   |   |-- deparse.sh
|   |   |-- flatten.sh
|   |   |-- ...

Please strictly follow this directory structure to avoid unexpected errors. pyfn relies on many relative path resolutions to keep script calls short, and changing this directory structure can break everything.

Setup NLP4J for POS tagging

To use NLP4J for POS tagging, modify the resources/config-decode-pos.xml file by replacing the models.pos path with the absolute path to your resources/nlp4j.plemma.model.all.xz file:

<configuration>
	...
	<models>
		<pos>/absolute/path/to/pyfn/resources/nlp4j.plemma.model.all.xz</pos>
	</models>
</configuration>

Setup DyNET for BIST or OPEN-SESAME

If you intend to use the BIST parser for dependency parsing or OPEN-SESAME for frame semantic parsing, you will need to install DyNET 2.0.2 via:

pip install dynet==2.0.2

If you experience problems installing DyNET via pip, follow:

https://dynet.readthedocs.io/en/2.0.2/python.html

Setup SEMAFOR

To use the SEMAFOR frame semantic parser, modify the scripts/setup.sh file:

# SEMAFOR options to be changed according to your env
export JAVA_HOME_BIN="/abs/path/to/java/jdk/bin"
export num_threads=2 # number of threads to use
export min_ram=4g # min RAM allocated to the JVM in GB. Corresponds to the -Xms argument
export max_ram=8g # max RAM allocated to the JVM in GB. Corresponds to the -Xmx argument

# SEMAFOR hyperparameters
export kbest=1 # keep k-best parse
export lambda=0.000001 # hyperparameter for argument identification. Refer to Kshirsagar et al. (2015) for details.
export batch_size=4000 # number of batches processed at once for argument identification.
export save_every_k_batches=400 # for argument identification
export num_models_to_save=60 # for argument identification

Setup SIMPLEFRAMEID

If you intend to use SIMPLEFRAMEID for frame identification, you will need to install the following packages (on Python 2.7):

pip install keras==2.0.6 lightfm==1.13 sklearn numpy==1.13.1 networkx==1.11 tensorflow==1.3.0

Using the SEMEVAL PERL evaluation scripts

If you intend to use the SEMEVAL perl evaluation scripts, make sure to have the App::cpanminus and XML::Parser modules installed:

cpan App::cpanminus
cpanm XML::Parser

Using bash scripts

Each script comes with a helper: check it out with --help!

Careful! Most scripts expect data output by pyfn convert ... to be located under pyfn/experiments/xp_XYZ/data, where XYZ stands for the experiment number specified with the -x XYZ argument, and where the experiments directory sits at the same level as the scripts directory. This opinionated choice has proven extremely useful for launching scripts in batch over a large set of experiments, as it avoids having to type the full path each time.

Make sure to use

pyfn convert \
  --from ... \
  --to ... \
  --source ... \
  --target /abs/path/to/pyfn/experiments/xp_XYZ/data \
  --splits ...

BEFORE calling preprocess.sh, prepare.sh, semafor.sh or open-sesame.sh.

preprocess.sh

Use preprocess.sh to POS-tag and dependency-parse FrameNet splits generated with pyfn convert .... The helper should display:

Usage: ${0##*/} [-h] -x XP_NUM -t {mxpost,nlp4j} -p {semafor,open-sesame} [-d {mst,bmst,barch}] [-v]
Preprocess FrameNet train/dev/test splits.

  -h, --help                           display this help and exit
  -x, --xp      XP_NUM                 xp number written as 3 digits (e.g. 001)
  -t, --tagger  {mxpost,nlp4j}         pos tagger to be used: 'mxpost' or 'nlp4j'
  -p, --parser  {semafor,open-sesame}  frame semantic parser to be used: 'semafor' or 'open-sesame'
  -d, --dep     {mst,bmst,barch}       dependency parser to be used: 'mst', 'bmst' or 'barch'
  -v, --dev                            if set, script will also preprocess dev splits

Suppose you generated FrameNet splits for SEMAFOR using:

pyfn convert \
  --from fnxml \
  --to semafor \
  --source /path/to/fndata-1.7-with-dev \
  --target /path/to/experiments/xp_001/data \
  --splits train \
  --output_sentences

You can preprocess those splits with NLP4J and BMST using:

./preprocess.sh -x 001 -t nlp4j -d bmst -p semafor

prepare.sh

Use prepare.sh to automatically generate misc. data required by the frame semantic parsing pipeline, such as gold SEMEVAL XML files for scoring, the framenet.frame.element.map and the hierarchy .csv files used by SEMAFOR, or the frames.xml and frRelations.xml files used by both SEMAFOR and OPEN-SESAME. The helper should display:

Usage: ${0##*/} [-h] -x XP_NUM -p {semafor,open-sesame} -s {dev,test} -f FN_DATA_DIR [-u] [-e]
Prepare misc. data for frame semantic parsing.

  -h, --help                                   display this help and exit
  -x, --xp              XP_NUM                 xp number written as 3 digits (e.g. 001)
  -p, --parser          {semafor,open-sesame}  frame semantic parser to be used: 'semafor' or 'open-sesame'
  -s, --splits          {dev,test}             which splits to score: dev or test
  -f, --fn              FN_DATA_DIR            absolute path to FrameNet data directory
  -u, --with_hierarchy                         if specified, will use the hierarchy feature
  -e, --with_exemplars                         if specified, will use the exemplars

Suppose you generated FrameNet splits for SEMAFOR using:

pyfn convert \
  --from fnxml \
  --to semafor \
  --source /path/to/fndata-1.7-with-dev \
  --target /path/to/experiments/xp_001/data \
  --splits train \
  --output_sentences

You can prepare SEMAFOR data using:

./prepare.sh -x 001 -p semafor -s test -f /path/to/fndata-1.7-with-dev

frameid.sh

Use frameid.sh to perform frame identification using SIMPLEFRAMEID. The helper should display:

Usage: ${0##*/} [-h] -m {train,decode} -x XP_NUM [-p {semafor,open-sesame}]
Perform frame identification.

  -h, --help                            display this help and exit
  -m, --mode                            train on all models or decode using a single model
  -x, --xp       XP_NUM                 xp number written as 3 digits (e.g. 001)
  -p, --parser   {semafor,open-sesame}  formalize decoded frames for specified parser

Suppose you generated FrameNet splits for SEMAFOR using:

pyfn convert \
  --from fnxml \
  --to semafor \
  --source /path/to/fndata-1.7-with-dev \
  --target /path/to/experiments/xp_101/data \
  --splits train \
  --output_sentences

After preprocessing, you can train the SIMPLEFRAMEID parser using:

./frameid.sh -m train -x 101

and decode (before running argument identification decoding) using:

./frameid.sh -m decode -x 101 -p semafor

semafor.sh

Use semafor.sh to train the SEMAFOR parser or decode the test/dev splits. The helper should display:

Usage: ${0##*/} [-h] -m {train,decode} -x XP_NUM [-s {dev,test}] [-u]
Train or decode with the SEMAFOR parser.

  -h, --help                             display this help and exit
  -m, --mode            {train,decode}   semafor mode to use: train or decode
  -x, --xp              XP_NUM           xp number written as 3 digits (e.g. 001)
  -s, --splits          {dev,test}       which splits to use in decode mode: dev or test
  -u, --with_hierarchy                   if specified, parser will use the hierarchy feature

Suppose you generated FrameNet splits for SEMAFOR using:

pyfn convert \
  --from fnxml \
  --to semafor \
  --source /path/to/fndata-1.7-with-dev \
  --target /path/to/experiments/xp_001/data \
  --splits train \
  --output_sentences

After preprocessing and preparation, you can train the SEMAFOR parser using:

./semafor.sh -m train -x 001

and decode the test splits using:

./semafor.sh -m decode -x 001 -s test

open-sesame.sh

Use open-sesame.sh to train the OPEN-SESAME parser or decode the test/dev splits. The helper should display:

Usage: ${0##*/} [-h] -m {train,decode} -x XP_NUM [-s {dev,test}] [-d] [-u]
Train or decode with the OPEN-SESAME parser.

  -h, --help                              display this help and exit
  -m, --mode              {train,decode}  open-sesame mode to use: train or decode
  -x, --xp                XP_NUM          xp number written as 3 digits (e.g. 001)
  -s, --splits            {dev,test}      which splits to use in decode mode: dev or test
  -d, --with_dep_parses                   if specified, parser will use dependency parses
  -u, --with_hierarchy                    if specified, parser will use the hierarchy feature

Suppose you generated FrameNet splits for OPEN-SESAME using:

pyfn convert \
  --from fnxml \
  --to bios \
  --source /path/to/fndata-1.7-with-dev \
  --target /path/to/experiments/xp_002/data \
  --splits train \
  --output_sentences \
  --filter overlap_fes

After preprocessing and preparation, you can train the OPEN-SESAME parser using:

./open-sesame.sh -m train -x 002

and decode the test splits using:

./open-sesame.sh -m decode -x 002 -s test

score.sh

Use score.sh to obtain P/R/F1 scores for frame semantic parsing on dev/test splits with the SEMEVAL scoring script, using gold or predicted frames. The helper should display:

Usage: ${0##*/} [-h] -x XP_NUM -p {semafor,open-sesame} -s {dev,test} -f {gold,predicted}
Score frame semantic parsing with a modified version of the SEMEVAL scoring script.

  -h, --help                           display this help and exit
  -x, --xp      XP_NUM                 xp number written as 3 digits (e.g. 001)
  -p, --parser  {semafor,open-sesame}  frame semantic parser to be used: 'semafor' or 'open-sesame'
  -s, --splits  {dev,test}             which splits to score: dev or test
  -f, --frames  {gold,predicted}       score with gold or predicted frames

Note that scoring is done with an updated version of the SEMEVAL perl script, in order to obtain more robust scores across setups. For a full account of the modifications, refer to (Kabbach et al., 2018) and to the perl scripts located under lib/semeval/.

To obtain scores for SEMAFOR using gold frames on test splits, use:

./score.sh -x XYZ -p semafor -s test -f gold

To obtain scores for SEMAFOR using predicted frames on test splits, use:

./score.sh -x XYZ -p semafor -s test -f predicted

Replication

The experiments directory provides a detailed set of instructions to replicate all results reported in (Kabbach et al., 2018) on experimental butterfly effects in frame semantic parsing. Those instructions can be used to compare the performance of different frame semantic parsers in various experimental setups.

Marshalling and Unmarshalling FrameNet XML data

pyfn provides a set of Python models to process FrameNet XML data. Those can be used to help you build your own frame semantic parser.

The core of the pyfn models is the AnnotationSet, corresponding to an XML <annotationSet> tag. It stores various information about a given set of FrameNet annotations for a given target in a given sentence. The notable innovations are the labelstore and the valenceunitstore, which store FrameNet labels (FE/PT/GF) both in their original formats and in custom formats which may prove useful for frame semantic parsing.

Explore the various models under the pyfn.models directory of the pyfn package.

Unmarshalling FrameNet XML data

To convert a list of fulltext.xml files and/or lu.xml files to a generator over pyfn.AnnotationSet objects, with no overlap between train/dev/test splits, use:

import pyfn.marshalling.unmarshallers.framenet as fn_unmarshaller

if __name__ == '__main__':
  splits_dirpath = '/abs/path/to/framenet-1.x-with-dev/'
  splits = 'train'
  with_exemplars = False
  annosets_dict = fn_unmarshaller.get_annosets_dict(splits_dirpath,
                                                    splits, with_exemplars)

splits_dirpath should point to the directory containing the train/dev/test splits directories (see the detailed structure above).

get_annosets_dict returns a dict mapping split names (strings) to AnnotationSet generators, and ensures no overlap between train/dev/test splits.

Calling get_annosets_dict with splits='test' returns a dictionary with a single 'test' key. Calling it with splits='dev' returns a dictionary with two keys: 'dev' and 'test'. Calling it with splits='train' returns a dictionary with three keys: 'train', 'dev' and 'test'.
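For instance, a quick sanity check of the returned keys, reusing the splits = 'train' call from the example above (with a placeholder path), would be:

import pyfn.marshalling.unmarshallers.framenet as fn_unmarshaller

if __name__ == '__main__':
  # same call as in the example above, with a placeholder path
  annosets_dict = fn_unmarshaller.get_annosets_dict(
    '/abs/path/to/framenet-1.x-with-dev/', 'train', False)
  print(sorted(annosets_dict.keys()))  # expected: ['dev', 'test', 'train']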

To iterate over the AnnotationSet objects under each key, you can then do:

for (splits, annosets) in annosets_dict.items():
  print('Iterating over annotationsets for splits: {}'.format(splits))
  for annoset in annosets:
    print('annoset with #id = {}'.format(annoset._id))

Or, to iterate over the values of a specific key (such as the train annosets):

for annoset in annosets_dict['train']:
    print('annoset with #id = {}'.format(annoset._id))

Note that, for performance reasons, annosets is a generator, not a list.
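If you need a list (e.g. for len() or repeated iteration), materialize the generator explicitly. The sketch below reuses the annosets_dict from the example above; the dir() call is just generic Python introspection to peek at the attributes an AnnotationSet exposes (such as the labelstore and valenceunitstore mentioned earlier), not a documented pyfn API:

# materializing the train generator into a list may be memory-hungry
# on the full FrameNet train split
train_annosets = list(annosets_dict['train'])
print('number of train annosets: {}'.format(len(train_annosets)))

# generic introspection: list the non-dunder attributes of one AnnotationSet
if train_annosets:
  print([attr for attr in dir(train_annosets[0]) if not attr.startswith('__')])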

Unmarshalling OPEN-SESAME BIOS data

To convert a .bios file with its corresponding .sentences file to a generator over pyfn.AnnotationSet objects, use:

import pyfn.marshalling.unmarshallers.bios as bios_unmarshaller

if __name__ == '__main__':
  bios_filepath = '/abs/path/to/.bios'
  sent_filepath = '/abs/path/to/.sentences'
  annosets = bios_unmarshaller.unmarshall_annosets(bios_filepath,
                                                   sent_filepath)
  for annoset in annosets:
    print('annoset with #id = {}'.format(annoset._id))

Important! The .bios and .sentences files must have been generated with pyfn convert ... --to bios ... using the --filter overlap_fes option.

Unmarshalling SEMAFOR CoNLL data

To convert a .frame.elements file with its corresponding .sentences file to a generator over pyfn.AnnotationSet objects, use:

import pyfn.marshalling.unmarshallers.semafor as semafor_unmarshaller

if __name__ == '__main__':
  semafor_filepath = '/abs/path/to/.frame.elements'
  sent_filepath = '/abs/path/to/.sentences'
  annosets = semafor_unmarshaller.unmarshall_annosets(semafor_filepath,
                                                      sent_filepath)
  for annoset in annosets:
    print('annoset with #id = {}'.format(annoset._id))

Unmarshalling SEMEVAL XML data

To convert a SEMEVAL .xml file with its corresponding .sentences file to a generator over pyfn.AnnotationSet objects, use:

import pyfn.marshalling.unmarshallers.semeval as semeval_unmarshaller

if __name__ == '__main__':
  xml_filepath = '/abs/path/to/semeval/.xml'
  annosetss = semeval_unmarshaller.unmarshall_annosets(xml_filepath)

By default, unmarshall_annosets for SEMEVAL returns a generator over groups of embedded annotationsets. To iterate over each individual annotationset, use:

for annosets in annosetss:
  for annoset in annosets:
    print('annoset with #id = {}'.format(annoset._id))

To return a 'flat' list of annosets, pass in the flatten=True parameter:

import pyfn.marshalling.unmarshallers.semeval as semeval_unmarshaller

if __name__ == '__main__':
  xml_filepath = '/abs/path/to/semeval/.xml'
  annosets = semeval_unmarshaller.unmarshall_annosets(xml_filepath, flatten=True)
  for annoset in annosets:
    print('annoset with #id = {}'.format(annoset._id))

Marshalling to OPEN-SESAME BIOS

To convert a dict mapping splits to pyfn.AnnotationSet objects to OPEN-SESAME-style .bios files, refer to pyfn.marshalling.marshallers.bios.marshall_annosets_dict.
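A minimal sketch of how this could be wired together with the FrameNet unmarshaller above. The exact parameters of marshall_annosets_dict are not documented here, so the positional arguments below (the annosets dict and an output directory) are assumptions to be checked against the function's signature or docstring. The same pattern should apply to the SEMAFOR and SEMEVAL marshallers below.

import pyfn.marshalling.unmarshallers.framenet as fn_unmarshaller
import pyfn.marshalling.marshallers.bios as bios_marshaller

if __name__ == '__main__':
  annosets_dict = fn_unmarshaller.get_annosets_dict(
    '/abs/path/to/framenet-1.x-with-dev/', 'train', False)
  # ASSUMPTION: the argument layout below is illustrative only; check the
  # marshall_annosets_dict signature for the actual parameters
  bios_marshaller.marshall_annosets_dict(annosets_dict,
                                         '/abs/path/to/output/dir')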

Marshalling to SEMAFOR CoNLL

To convert a dict mapping splits to pyfn.AnnotationSet objects to SEMAFOR-style .frame.elements files, refer to pyfn.marshalling.marshallers.semafor.marshall_annosets_dict.

Marshalling to SEMEVAL XML

To convert a list of pyfn.AnnotationSet objects to a SEMEVAL-style .xml file, refer to pyfn.marshalling.marshallers.semeval.marshall_annosets.

Marshalling to .csv hierarchy

To convert a list of relations to a .csv file, refer to pyfn.marshalling.marshallers.hierarchy.marshall_relations.

Citation

If you use pyfn, please cite:

@InProceedings{C18-1267,
  author = 	"Kabbach, Alexandre
		and Ribeyre, Corentin
		and Herbelot, Aur{\'e}lie",
  title = 	"Butterfly Effects in Frame Semantic Parsing: impact of data processing on model ranking",
  booktitle = 	"Proceedings of the 27th International Conference on Computational Linguistics",
  year = 	"2018",
  publisher = 	"Association for Computational Linguistics",
  pages = 	"3158--3169",
  location = 	"Santa Fe, New Mexico, USA",
  url = 	"http://aclweb.org/anthology/C18-1267"
}

pyfn's People

Contributors

akb89, cocophotos


pyfn's Issues

Missing `test.predicted.xml` after running decoding with SEMAFOR

May I know what I should expect after running decoding with SEMAFOR? I thought I would get a test.predicted.xml file, or a file storing the predicted arguments (as I gathered from running score.sh), but I couldn't find it in /home/zxy485/zxy485gallinahome/week1/pyfn/experiments/xp_001/data.

Commands that I ran:

$ ./semafor.sh -m decode -x 001 -s test
ROFAMES TRAIN MODE OPTIONS
  JAVA_HOME_BIN = /usr/bin/
  CLASSPATH = /home/zxy485/zxy485gallinahome/week1/pyfn/scripts/../lib/semafor/bin/../rofames-1.0.0.jar
  XP_DIR = /home/zxy485/zxy485gallinahome/week1/pyfn/scripts/../experiments/xp_001
  splits = test
  kbest = 1
  max_ram = 8g
  with_hierarchy = FALSE
  LOGS_DIR = /home/zxy485/zxy485gallinahome/week1/pyfn/scripts/../log

Decoding with ROFAMES...
[INFO] ScoreWithGoldFrames:44  - Initializing parser for scoring...
[INFO] ScoreWithGoldFrames:45  - Extracting dependency-parsed testing sentences...
[INFO] ScoreWithGoldFrames:46  - 	from: /home/zxy485/zxy485gallinahome/week1/pyfn/scripts/../experiments/xp_001/data/test.sentences.conllx
[INFO] ScoreWithGoldFrames:49  - Done extracting dependency-parsed testing sentences
[INFO] ScoreWithGoldFrames:50  - Extracting frames...
[INFO] ScoreWithGoldFrames:51  - 	from: /home/zxy485/zxy485gallinahome/week1/pyfn/scripts/../experiments/xp_001/data/test.frames
[INFO] ScoreWithGoldFrames:55  - Done extracting frames
[INFO] ScoreWithGoldFrames:56  - Extracting argument identification alphabet...
[INFO] ScoreWithGoldFrames:57  - 	from: /home/zxy485/zxy485gallinahome/week1/pyfn/scripts/../experiments/xp_001/model/parser.conf
0 100000 200000 300000 400000 500000 600000 700000 800000 900000 1000000 1100000 1200000 1300000 1400000 1500000 1600000 1700000 1800000 1900000 2000000 2100000 2200000 2300000 2400000 2500000 2600000 2700000
[INFO] ScoreWithGoldFrames:60  - Done extracting argument identification alphabet
[INFO] ScoreWithGoldFrames:61  - Extracting Frame2FrameElement dictionary...
[INFO] ScoreWithGoldFrames:62  - 	from: /home/zxy485/zxy485gallinahome/week1/pyfn/scripts/../experiments/xp_001/data/framenet.frame.element.map
[INFO] ScoreWithGoldFrames:64  - Done extracting Frame2FrameElement dictionary
[INFO] ScoreWithGoldFrames:65  - Initializing decoder...
[INFO] ScoreWithGoldFrames:66  - 	from: /home/zxy485/zxy485gallinahome/week1/pyfn/scripts/../experiments/xp_001/model/argmodel.dat
[INFO] ScoreWithGoldFrames:67  - 	and from: /home/zxy485/zxy485gallinahome/week1/pyfn/scripts/../experiments/xp_001/model/parser.conf
[INFO] ScoreWithGoldFrames:70  - Done initializing decoder
[INFO] ScoreWithGoldFrames:74  - Done initializing parser
[INFO] ScoreWithGoldFrames:86  - Scoring with gold frames...
[INFO] ScoreWithGoldFrames:87  - Predicting arguments...
[INFO] StaticSemafor:111 - sentences.size = 1247
[INFO] StaticSemafor:112 - frameSplitsMap.size = 1247
[INFO] StaticSemafor:129 - There are 0 sentences without frame annotation
[INFO] ScoreWithGoldFrames:91  - Done predicting arguments

Directories and files in /home/zxy485/zxy485gallinahome/week1/pyfn/experiments/xp_001:

# data/
dev.frames
dev.sentences
framenet.frame.element.map
frames.xml
frRelations.xml
test.frame.elements
test.frames
test.gold.xml
test.sentences
test.sentences.conllx
train.frame.elements
train.sentences
train.sentences.conllx
train.sentences.conllx.flattened

# model/
argmodel.dat
featurecache.jobj
parser.conf
train.events.bin
train.sentences.frame.elements.spans

Which formats can be converted to?

Hello,

The readme says:

"pyfn can be used to:

convert data to and from FRAMENET XML, SEMEVAL XML, SEMAFOR CoNLL, BIOS and CoNLL-X"

But after this, CONLL-X is not mentioned. I do not see an option for this in the code, either.

I need to convert to a standard CONLL format so that I can match it to another dataset in another CONLL format. CONLL-X would work for this purpose.* Is there actually a way to convert to this format, or is the readme in error?

  • I don't think I can use SEMAFOR CONLL because I need a standard CONLL to use a standard CONLL converter to match them. So, unless SEMAFOR CONLL conforms exactly to a standard CONLL format (e.g., CONLL-05, CONLL-12, CONLL-X, CONLL-U), it will not work. If "SEMAFOR CONLL" is just another name for one of these, can someone kindly inform me which one?

Thank you!
Alan

Additional:

After trying a conversion to SEMAFOR CoNLL format, I get this:

dev.frames dev.sentences test.frames test.sentences train.frame.elements train.sentences

The x.sentences files are just the raw text sentences, like so:

" 'The true voodoo-worshipper attempts nothing of importance without certain sacrifices which are intended to propitiate his unclean gods .
" A chaotic case , my dear Watson , " said Holmes over an evening pipe .
" A lamb , I should say , or a kid . "

The x.frames files contain only frame information, no frame element info:

1       0.0     1       Stimulus_focus  nice.a  13      nice    1232
1       0.0     1       Buildings       pub.n   14      pubs    1232
1       0.0     1       Education_teaching      teach.v 3       taught  1233

For the dev and test sets, this is all the information contained in these files. IOW, where is the frame element data for the test and dev splits? Am I confused? Is this info somewhere else?

Thanks again,
Alan

frameid.sh - ValueError: need at least one array to concatenate

Thank you so much for the reply in #13!
I now run into a new error, but I am unsure of its main cause. Is it because my data file is corrupted, given that the error is ValueError: need at least one array to concatenate?

$ module load python/3.6.6
$ pyfn convert \
  --from fnxml \
  --to semafor \
  --source /path/to/fndata-1.7-with-dev \
  --target /path/to/experiments/xp_101/data \
  --splits train \
  --output_sentences
$ ./preprocess.sh -x 101 -t nlp4j -d bmst -p semafor  # no error
$ ./prepare.sh -x 101 -p semafor -s test -f /home/zxy485/zxy485gallinahome/week1-4/pyfn/data/fndata-1.7-with-dev  # no error
$ module load python2/2.7.13
$ ./frameid.sh -m train -x 101
Preparing files for frame identification...
Converting to .flattened format for the SEMAFOR parser...
Processing file: /home/zxy485/zxy485gallinahome/week1-4/pyfn/scripts/../experiments/xp_101/frameid/data/corpora/test.sentences.conllx
Done
Training frame identification on all models...
Using TensorFlow backend.
train
Starting resource manager
Initializing reporters
Running the experiments!
12 configurations,  1  train-test pairs ->  12  runs
Malformed parse data in sentence 0
Malformed parse data in sentence 1
...
Malformed parse data in sentence 74
train.sentences.conllx.flattened train.frame.elements labeled: 3362 parsed: 75 graphs: 0
Traceback (most recent call last):
  File "/home/zxy485/zxy485gallinahome/week1-4/pyfn/scripts/../lib/eacl2017-oodFrameNetSRL/simpleFrameId/main.py", line 188, in <module>
    _train_all(HOME, EMBEDDINGS_NAME)
  File "/home/zxy485/zxy485gallinahome/week1-4/pyfn/scripts/../lib/eacl2017-oodFrameNetSRL/simpleFrameId/main.py", line 138, in _train_all
    X_train, y_train, lemmapos_train, gid_train = mapper.get_matrix(g_train)
  File "/mnt/rds/redhen/gallina/home/zxy485/week1-4/pyfn/lib/eacl2017-oodFrameNetSRL/simpleFrameId/representation.py", line 29, in get_matrix
    X = np.vstack(X)
  File "/home/zxy485/.local/lib/python2.7/site-packages/numpy/core/shape_base.py", line 237, in vstack
    return _nx.concatenate([atleast_2d(_m) for _m in tup], 0)
ValueError: need at least one array to concatenate
Done

Is there some way to test/predict on new data?

Thanks for your hard work!

I have a question about how to test or predict on new data. If I have some new labeled data (other than official framenet) or just want to do frame semantic parsing on unlabeled sentences using trained semafor and open-sesame, should I prepare the data in the same style as fulltext.xml and then do unmarshalling?

Error with Embeddings

When I run ./frameid.sh -m train -x 101 after conversion and preprocessing with
./preprocess.sh -x 101 -t nlp4j -d bmst -p semafor
I encounter this error:

Preparing files for frame identification...
Converting to .flattened format for the SEMAFOR parser...
Processing file: /home/zxy485/zxy485gallinahome/week1/pyfn/scripts/../experiments/xp_101/frameid/data/corpora/test.sentences.conllx
Done
Training frame identification on all models...
Using TensorFlow backend.
['/home/zxy485/zxy485gallinahome/week1/pyfn/scripts/../lib/eacl2017-oodFrameNetSRL/simpleFrameId/main.py', 'train', '/home/zxy485/zxy485gallinahome/week1/pyfn/scripts/../experiments/xp_101/frameid']
Traceback (most recent call last):
  File "/home/zxy485/zxy485gallinahome/week1/pyfn/scripts/../lib/eacl2017-oodFrameNetSRL/simpleFrameId/main.py", line 186, in <module>
    EMBEDDINGS_NAME = sys.argv[3]
IndexError: list index out of range
Done

It seems like the embedding file is missing. May I know where to find the embedding file? Thank you!

pyfn convert crashes when target folder does not exist

When I run convert with a target folder that does not exist, pyfn crashes with an unspecific error message.

I see two solutions:

  1. Convert should create the target folder if it does not exist
  2. Log an error

I would like to have 1.

pyfn convert --from fnxml --to semeval --source /home/jck/git/fn/data/fndata-1.7-with-dev --target /home/jck/git/fn/data/fndata-1.7-converted --splits train
INFO - Marshalling pyfn.AnnotationSet objects to SEMEVAL XML...
INFO - Marshalling pyfn.AnnotationSet objects to SEMEVAL XML...
INFO - Marshalling pyfn.AnnotationSet objects to SEMEVAL XML...
INFO - Saving output to /home/jck/git/fn/data/fndata-1.7-converted/train.gold.xml
INFO - Saving output to /home/jck/git/fn/data/fndata-1.7-converted/train.gold.xml
INFO - Saving output to /home/jck/git/fn/data/fndata-1.7-converted/train.gold.xml
Traceback (most recent call last):
  File "/home/jck/git/fn/venv/bin/pyfn", line 11, in <module>
    sys.exit(main())
  File "/home/jck/git/fn/venv/lib/python3.6/site-packages/pyfn/main.py", line 197, in main
    args.func(args)
  File "/home/jck/git/fn/venv/lib/python3.6/site-packages/pyfn/main.py", line 90, in _convert
    args.excluded_annosets)
  File "/home/jck/git/fn/venv/lib/python3.6/site-packages/pyfn/marshalling/marshallers/semeval.py", line 121, in marshall_annosets
    excluded_sentences, excluded_annosets)
  File "/home/jck/git/fn/venv/lib/python3.6/site-packages/pyfn/marshalling/marshallers/semeval.py", line 110, in _marshall_annosets
    pretty_print=True)
  File "src/lxml/etree.pyx", line 2039, in lxml.etree._ElementTree.write
  File "src/lxml/serializer.pxi", line 721, in lxml.etree._tofilelike
  File "src/lxml/serializer.pxi", line 780, in lxml.etree._create_output_buffer
  File "src/lxml/serializer.pxi", line 770, in lxml.etree._create_output_buffer
FileNotFoundError: [Errno 2] No such file or directory
Makefile:4: recipe for target 'convert' failed
make: *** [convert] Error 1

Creating Directory instead of Converting from Semafor to Semeval

When I run the pyfn convert command to convert from SEMAFOR CoNLL format into SEMEVAL XML, I run into an IsADirectoryError: [Errno 21] Is a directory error.

I found that the command instead creates a directory named after the target filename: /home/zxy485/zxy485gallinahome/week1/pyfn/experiments/xp_001/output/test.predicted.xml/

$ pyfn convert \
>   --from semafor \
>   --to semeval \
>   --source /home/zxy485/zxy485gallinahome/week1/pyfn/experiments/xp_001/data/test.frame.elements \
>   --target /home/zxy485/zxy485gallinahome/week1/pyfn/experiments/xp_001/output/test.predicted.xml \
>   --sent /home/zxy485/zxy485gallinahome/week1/pyfn/experiments/xp_001/data/test.sentences
INFO - Marshalling pyfn.AnnotationSet objects to SEMEVAL XML...
INFO - Marshalling pyfn.AnnotationSet objects to SEMEVAL XML...
INFO - Marshalling pyfn.AnnotationSet objects to SEMEVAL XML...
INFO - Saving output to /home/zxy485/zxy485gallinahome/week1/pyfn/experiments/xp_001/output/test.predicted.xml
INFO - Saving output to /home/zxy485/zxy485gallinahome/week1/pyfn/experiments/xp_001/output/test.predicted.xml
INFO - Saving output to /home/zxy485/zxy485gallinahome/week1/pyfn/experiments/xp_001/output/test.predicted.xml
Traceback (most recent call last):
  File "/home/zxy485/.local/bin/pyfn", line 10, in <module>
    sys.exit(main())
  File "/home/zxy485/.local/lib/python3.6/site-packages/pyfn/main.py", line 198, in main
    args.func(args)
  File "/home/zxy485/.local/lib/python3.6/site-packages/pyfn/main.py", line 91, in _convert
    args.excluded_annosets)
  File "/home/zxy485/.local/lib/python3.6/site-packages/pyfn/marshalling/marshallers/semeval.py", line 128, in marshall_annosets
    excluded_sentences, excluded_annosets)
  File "/home/zxy485/.local/lib/python3.6/site-packages/pyfn/marshalling/marshallers/semeval.py", line 110, in _marshall_annosets
    pretty_print=True)
  File "src/lxml/etree.pyx", line 2048, in lxml.etree._ElementTree.write
  File "src/lxml/serializer.pxi", line 721, in lxml.etree._tofilelike
  File "src/lxml/serializer.pxi", line 780, in lxml.etree._create_output_buffer
  File "src/lxml/serializer.pxi", line 770, in lxml.etree._create_output_buffer
IsADirectoryError: [Errno 21] Is a directory

Unclear explanation on `splits` in README

--splits: specify which splits should be converted. Use --splits dev to only process dev and test splits and guarantee no overlap between dev and test. Use --splits train to process train dev and test splits and guarantee no overlap across splits. Default to --splits test.

This is really confusing: you say it defaults to test, which you do not explain; the example uses more than one split ({test,dev}) while this explanation uses just one. Is it using just part of the data, or splitting all the data into two splits instead of three when I say {test,dev}?
The example is:

pyfn convert \
  --from fnxml \
  --to bios \
  --source /abs/path/to/fndata-1.x \
  --target /abs/path/to/xp/data/output/dir \
  --splits train \
  --output_sentences \
  --filter overlap_fes

Document unmarshaller

In my application, I typically need to write parsing code to load FrameNet data. It would be nice if I could just use the data structures and code already in pyfn to load FrameNet into my application. As I assume the code for that already exists, it would just need documentation telling users how to load FrameNet data into Python objects.

pyfn convert crashes because it cannot find logging.yml

Steps to reproduce:

1. Create a virtual environment (I used 3.6 with no site packages)
2. Install pyfn via pip
3. Run pyfn convert; I think the command line arguments do not matter, it crashes before doing anything.

pyfn convert --from fnxml --to semeval --source XXX --target XXX --splits {train,dev,test}
Traceback (most recent call last):
  File "/home/jck/git/XXXY/venv/bin/pyfn", line 7, in <module>
    from pyfn.main import main
  File "/home/jck/git/XXXY/venv/lib/python3.6/site-packages/pyfn/main.py", line 27, in <module>
    os.path.join(os.path.dirname(__file__), 'logging', 'logging.yml')))
  File "/home/jck/git/XXXY/venv/lib/python3.6/site-packages/pyfn/utils/config.py", line 19, in load
    with open(config_file, 'r') as config_stream:
FileNotFoundError: [Errno 2] No such file or directory: '/home/jck/git/XXXY/venv/lib/python3.6/site-packages/pyfn/logging/logging.yml'
