pku-tangent / neuraleduseg

A toolkit for discourse segmentation (EDU segmentation).

Home Page: https://arxiv.org/abs/1808.09147

License: Apache License 2.0

Python 100.00%

edu-segmentation discourse sentence-segmentation neural-network discourse-segmentation

neuraleduseg's Introduction

Neural-EDU-Segmentation

A toolkit for segmenting text into Elementary Discourse Units (clauses). We implement it as described in our EMNLP 2018 paper: Toward Fast and Accurate Neural Discourse Segmentation

Requirements

Python 3.5
Tensorflow>=1.5.0
allennlp>=0.4.2
See requirements.txt for the full list of packages

Data

We cannot provide the complete RST-DT corpus due to the LDC copyright, so we only include several samples in ./data/rst/ to test our code and show the data structure.

If you want to train or evaluate our model on RST-DT, you need to download the data manually and put it in the same folder. Then run the following command to preprocess the data and create the vocabulary:

python run.py --prepare

Evaluate the model on RST-DT:

We provide the vocabulary and a well-trained model in the ./data/ folder. You can evaluate the performance of this model after preparing the RST-DT data as mentioned above:

python run.py --evaluate --test_files ../data/rst/preprocessed/test/*.preprocessed

The performance of the current model should be as follows:

{'precision': 0.9176470588235294, 'recall': 0.975, 'f1': 0.9454545454545454}

Note that this is slightly better than the results we reported in the paper, since we re-trained the model and there is some randomness here.
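As a quick sanity check on the numbers above, F1 is the harmonic mean of precision and recall. The following is a minimal sketch; the helper name `f1_score` is introduced here for illustration and is not part of the toolkit:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the evaluation output above.
precision = 0.9176470588235294
recall = 0.975
f1 = f1_score(precision, recall)
print(round(f1, 4))
```

The result should agree with the reported 'f1' value up to floating-point rounding.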

Train a new model

You can use the following command to train the model from scratch:

python run.py --train

Hyper-parameters and other training settings can be modified in config.py.

Segmenting raw text into EDUs

You can segment files with raw text into EDUs:

python run.py --segment --input_files ../data/rst/TRAINING/wsj_110*.out --result_dir ../data/results/

The segmented result for each file will be saved to the --result_dir folder under the same file name. Each EDU is written on its own line.
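Since each result file holds one EDU per line, the output can be loaded back with a few lines of Python. This is a minimal sketch; the function name `load_edus` is hypothetical, and it assumes the result files are UTF-8 text:

```python
from pathlib import Path


def load_edus(result_dir):
    """Map each result file name to its list of EDUs (one EDU per line)."""
    edus_by_file = {}
    for path in sorted(Path(result_dir).iterdir()):
        if not path.is_file():
            continue
        with path.open(encoding="utf-8") as f:
            # Skip blank lines; every remaining line is treated as one EDU.
            edus_by_file[path.name] = [line.strip() for line in f if line.strip()]
    return edus_by_file
```

For example, `load_edus("../data/results/")` would return a dictionary keyed by file name after the segmentation command above has been run.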

Citation

Please cite the following paper if you use this toolkit in your work:

@inproceedings{wang2018edu,
  title={Toward Fast and Accurate Neural Discourse Segmentation},
  author={Wang, Yizhong and Li, Sujian and Yang, Jingfeng},
  booktitle={Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing},
  pages={962--967},
  year={2018}
}

neuraleduseg's Issues

Exception

Hi, do you know what might be causing the below exception? I'm parsing the raw data in Penn Treebank.
Thanks

`
Traceback (most recent call last):
File "/home/grigorii/anaconda3/envs/tfenv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
return fn(*args)
File "/home/grigorii/anaconda3/envs/tfenv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/home/grigorii/anaconda3/envs/tfenv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: num_lower must be negative or less or equal to number of rows (4) got: 5
[[{{node encoding/bilinear_attention/MatrixBandPart}} = MatrixBandPart[T=DT_FLOAT, Tindex=DT_INT64, _device="/job:localhost/replica:0/task:0/device:GPU:0"](encoding/bilinear_attention/ones_like, encoding/bilinear_attention/MatrixBandPart/num_lower/_271, encoding/bilinear_attention/MatrixBandPart/num_lower/_271)]]
[[{{node encoding/rnn_2/bidirectional_rnn/bw/bw/stack/_287}} = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_578_encoding/rnn_2/bidirectional_rnn/bw/bw/stack", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "run.py", line 48, in <module>
segment(args)
File "/home/grigorii/homework/cpsc503/project/datasets/discourse_parsers/NeuralEDUSeg/src/api.py", line 158, in segment
batch_pred_segs = model.segment(batch)
File "/home/grigorii/homework/cpsc503/project/datasets/discourse_parsers/NeuralEDUSeg/src/elmo_crf_seg.py", line 78, in segment
scores, trans_params = self.sess.run([self.scores, self.trans_params], feed_dict)
File "/home/grigorii/anaconda3/envs/tfenv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
run_metadata_ptr)
File "/home/grigorii/anaconda3/envs/tfenv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _run
feed_dict_tensor, options, run_metadata)
File "/home/grigorii/anaconda3/envs/tfenv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
run_metadata)
File "/home/grigorii/anaconda3/envs/tfenv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: num_lower must be negative or less or equal to number of rows (4) got: 5
[[node encoding/bilinear_attention/MatrixBandPart (defined at /home/grigorii/homework/cpsc503/project/datasets/discourse_parsers/NeuralEDUSeg/src/layers.py:40) = MatrixBandPart[T=DT_FLOAT, Tindex=DT_INT64, _device="/job:localhost/replica:0/task:0/device:GPU:0"](encoding/bilinear_attention/ones_like, encoding/bilinear_attention/MatrixBandPart/num_lower/_271, encoding/bilinear_attention/MatrixBandPart/num_lower/_271)]]
[[{{node encoding/rnn_2/bidirectional_rnn/bw/bw/stack/_287}} = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_578_encoding/rnn_2/bidirectional_rnn/bw/bw/stack", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

Caused by op 'encoding/bilinear_attention/MatrixBandPart', defined at:
File "run.py", line 48, in <module>
segment(args)
File "/home/grigorii/homework/cpsc503/project/datasets/discourse_parsers/NeuralEDUSeg/src/api.py", line 134, in segment
model = AttnSegModel(args, word_vocab)
File "/home/grigorii/homework/cpsc503/project/datasets/discourse_parsers/NeuralEDUSeg/src/elmo_crf_seg.py", line 14, in __init__
super().__init__(args, word_vocab)
File "/home/grigorii/homework/cpsc503/project/datasets/discourse_parsers/NeuralEDUSeg/src/base_seg.py", line 38, in __init__
self._build_graph()
File "/home/grigorii/homework/cpsc503/project/datasets/discourse_parsers/NeuralEDUSeg/src/lstm_seg.py", line 18, in _build_graph
self._encode()
File "/home/grigorii/homework/cpsc503/project/datasets/discourse_parsers/NeuralEDUSeg/src/atten_seg.py", line 20, in _encode
self.encoded_sent, self.placeholders['input_length'], self.window_size)
File "/home/grigorii/homework/cpsc503/project/datasets/discourse_parsers/NeuralEDUSeg/src/layers.py", line 40, in self_attention
restricted_mask = tf.matrix_band_part(tf.ones_like(logits, dtype=tf.float32), window_size, window_size)
File "/home/grigorii/anaconda3/envs/tfenv/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 4142, in matrix_band_part
num_upper=num_upper, name=name)
File "/home/grigorii/anaconda3/envs/tfenv/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/grigorii/anaconda3/envs/tfenv/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
return func(*args, **kwargs)
File "/home/grigorii/anaconda3/envs/tfenv/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
op_def=op_def)
File "/home/grigorii/anaconda3/envs/tfenv/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): num_lower must be negative or less or equal to number of rows (4) got: 5
[[node encoding/bilinear_attention/MatrixBandPart (defined at /home/grigorii/homework/cpsc503/project/datasets/discourse_parsers/NeuralEDUSeg/src/layers.py:40) = MatrixBandPart[T=DT_FLOAT, Tindex=DT_INT64, _device="/job:localhost/replica:0/task:0/device:GPU:0"](encoding/bilinear_attention/ones_like, encoding/bilinear_attention/MatrixBandPart/num_lower/_271, encoding/bilinear_attention/MatrixBandPart/num_lower/_271)]]
[[{{node encoding/rnn_2/bidirectional_rnn/bw/bw/stack/_287}} = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_578_encoding/rnn_2/bidirectional_rnn/bw/bw/stack", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
`
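The message says the requested band width (5) exceeds the number of rows (4) of the attention matrix built at layers.py line 40, which suggests an input sequence shorter than the attention window. One possible workaround, not confirmed by the authors, is to clamp the window to the sequence length before building the mask. The sketch below illustrates the idea in plain Python rather than TensorFlow; `band_mask` is a hypothetical helper that mimics the band-part semantics (keep entries with |i - j| <= window):

```python
def band_mask(n_rows, window_size):
    """Return an n_rows x n_rows 0/1 mask that keeps entries with |i - j| <= window."""
    # Clamp the window so it never exceeds the matrix size (the assumed fix).
    window = min(window_size, n_rows)
    return [
        [1 if abs(i - j) <= window else 0 for j in range(n_rows)]
        for i in range(n_rows)
    ]


# A 4-token input with window_size=5: the clamped mask simply keeps everything.
mask = band_mask(4, 5)
```

In the TensorFlow graph, the analogous change would be to pass a value clamped to the sequence length instead of the raw window_size; that variant is untested here.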

where is the run.py

dependency conflicts

Hi!

Thank you for this wonderful paper and repo. I ran into several dependency conflicts when trying to build the virtual environment.

pip install -r requirements.txt

It returns the following error:

ERROR: After October 2020 you may experience errors when installing or updating packages. This is because pip will change the way that it resolves dependency conflicts.

We recommend you use --use-feature=2020-resolver to test your packages with the new resolver before it becomes the default.

tensorboard 1.14.0 requires setuptools>=41.0.0, but you'll have setuptools 39.0.1 which is incompatible.
tensorflow-gpu 1.14.0 requires numpy<2.0,>=1.14.5, but you'll have numpy 1.14.2 which is incompatible.
thinc 6.10.3 requires wrapt<1.11.0,>=1.10.0, but you'll have wrapt 1.12.1 which is incompatible.
scipy 1.5.2 requires numpy>=1.14.5, but you'll have numpy 1.14.2 which is incompatible.
allennlp 1.0.0 requires spacy<2.3,>=2.1.0, but you'll have spacy 2.0.11 which is incompatible.

If I try to run the code directly:

cd src
python run.py --segment --input_files ../data/rst/TRAINING/wsj_110*.out --result_dir ../data/results/

It gives the following error message:

ModuleNotFoundError: No module named 'allennlp.commands.elmo'

I believe this message is caused by the version of allennlp (based on https://stackoverflow.com/questions/62884591/modulenotfounderror-no-module-named-allennlp-commands-elmo).

Could you please give me more instructions on how to set up the virtual environment successfully? Thank you!
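To confirm whether the installed allennlp still ships the module named in the error, a small check with the standard library can help before running the toolkit. This is a minimal sketch; `module_available` is a hypothetical helper, and it only reports whether the module can be located, not which allennlp version would provide it:

```python
import importlib.util


def module_available(dotted_name):
    """Return True if the given module can be located in the current environment."""
    try:
        return importlib.util.find_spec(dotted_name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package (e.g. allennlp itself) is missing or broken.
        return False


if not module_available("allennlp.commands.elmo"):
    print("allennlp.commands.elmo not found; the installed allennlp may be too new or missing.")
```

If the check fails, installing an allennlp release that still contains `allennlp.commands.elmo` into a fresh virtual environment is the likely direction, as the linked Stack Overflow thread suggests.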

tensorflow.python.framework.errors_impl.InvalidArgumentError: num_lower must be negative or less or equal to number of rows (3) got: 5

When segmenting very short input sentences, this happens:

$ cat /tmp/bad_input.txt
good food .

$ python run.py --segment --input_files /tmp/bad_input.txt --result_dir /tmp/results/
 2021-02-24 22:00:12,457 - SegEDU - INFO - Running with args : Namespace(batch_size=32, dev_files=None, dropout_keep_prob=0.9, ema_decay=0.9999, epochs=50, evaluate=False, gpu=None, hidden_size=200, input_files=['/tmp/bad_input.txt'], learning_rate=0.001, log_path=None, max_grad_norm=5.0, model_dir='/usr/local/lib/python3.6/site-packages/neuralseg-0.1.0a0-py3.6.egg/data/models', optim='adam', prepare=False, result_dir='/tmp/results/', rst_dir='../data/rst/', seed=123, segment=True, test_files=None, train=False, train_files=None, weight_decay=0.0001, window_size=5, word_embed_path='../data/embeddings/glove.840B.300d.txt', word_embed_size=300, word_vocab_path='../data/vocab/word.vocab')
 2021-02-24 22:00:12,457 - SegEDU - INFO - Loading vocab...
 2021-02-24 22:00:12,522 - SegEDU - INFO - Word vocab size: 17243
 2021-02-24 22:00:12,522 - SegEDU - INFO - Loading the model...
2021-02-24 22:00:12.522775: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
/usr/local/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py:97: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
 2021-02-24 22:00:17,873 - SegEDU - INFO - There are 4043611 parameters in the model
 2021-02-24 22:00:17,873 - SegEDU - INFO - Using Exp Moving Average to train the model with decay 0.9999.
 2021-02-24 22:00:19,063 - SegEDU - INFO - Time to build graph: 6.5401506423950195 s
/usr/local/lib/python3.6/site-packages/sklearn/utils/linear_assignment_.py:22: FutureWarning: The linear_assignment_ module is deprecated in 0.21 and will be removed from 0.23. Use scipy.optimize.linear_sum_assignment instead.
  FutureWarning)
 2021-02-24 22:00:32,036 - SegEDU - INFO - Model restored from /usr/local/lib/python3.6/site-packages/neuralseg-0.1.0a0-py3.6.egg/data/models/best
 2021-02-24 22:00:32,947 - SegEDU - INFO - Segmenting /tmp/bad_input.txt...
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1350, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1329, in _run_fn
    status, run_metadata)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: num_lower must be negative or less or equal to number of rows (3) got: 5
         [[Node: encoding/bilinear_attention/MatrixBandPart = MatrixBandPart[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](encoding/bilinear_attention/ones_like, encoding/bilinear_attention/MatrixBandPart/num_lower, encoding/bilinear_attention/MatrixBandPart/num_lower)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "run.py", line 50, in <module>
    segment(args)
  File "/usr/local/lib/python3.6/site-packages/neuralseg-0.1.0a0-py3.6.egg/neuralseg/api.py", line 157, in segment
    batch_pred_segs = model.segment(batch)
  File "/usr/local/lib/python3.6/site-packages/neuralseg-0.1.0a0-py3.6.egg/neuralseg/elmo_crf_seg.py", line 125, in segment
    feed_dict)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 895, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1128, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1344, in _do_run
    options, run_metadata)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1363, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: num_lower must be negative or less or equal to number of rows (3) got: 5
         [[Node: encoding/bilinear_attention/MatrixBandPart = MatrixBandPart[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](encoding/bilinear_attention/ones_like, encoding/bilinear_attention/MatrixBandPart/num_lower, encoding/bilinear_attention/MatrixBandPart/num_lower)]]

Caused by op 'encoding/bilinear_attention/MatrixBandPart', defined at:
  File "run.py", line 50, in <module>
    segment(args)
  File "/usr/local/lib/python3.6/site-packages/neuralseg-0.1.0a0-py3.6.egg/neuralseg/api.py", line 133, in segment
    model = AttnSegModel(args, word_vocab)
  File "/usr/local/lib/python3.6/site-packages/neuralseg-0.1.0a0-py3.6.egg/neuralseg/elmo_crf_seg.py", line 17, in __init__
    super().__init__(args, word_vocab)
  File "/usr/local/lib/python3.6/site-packages/neuralseg-0.1.0a0-py3.6.egg/neuralseg/base_seg.py", line 37, in __init__
    self._build_graph()
  File "/usr/local/lib/python3.6/site-packages/neuralseg-0.1.0a0-py3.6.egg/neuralseg/lstm_seg.py", line 20, in _build_graph
    self._encode()
  File "/usr/local/lib/python3.6/site-packages/neuralseg-0.1.0a0-py3.6.egg/neuralseg/atten_seg.py", line 24, in _encode
    self.window_size)
  File "/usr/local/lib/python3.6/site-packages/neuralseg-0.1.0a0-py3.6.egg/neuralseg/layers.py", line 51, in self_attention
    window_size, window_size)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 2409, in matrix_band_part
    num_upper=num_upper, name=name)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3160, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1625, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): num_lower must be negative or less or equal to number of rows (3) got: 5
         [[Node: encoding/bilinear_attention/MatrixBandPart = MatrixBandPart[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](encoding/bilinear_attention/ones_like, encoding/bilinear_attention/MatrixBandPart/num_lower, encoding/bilinear_attention/MatrixBandPart/num_lower)]]
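In this run the input "good food ." has 3 tokens while the log shows window_size=5, so the failure appears whenever a sentence is shorter than the window. Until the model itself handles this, one option is to scan input files for such lines beforehand. This is a minimal sketch under a stated assumption: it counts whitespace-separated tokens, which may differ from the toolkit's own tokenizer, and `find_short_lines` is a hypothetical helper:

```python
def find_short_lines(path, window_size=5):
    """Return (line_number, token_count) for non-empty lines shorter than window_size."""
    short = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            tokens = line.split()  # assumption: whitespace tokenization
            if tokens and len(tokens) < window_size:
                short.append((line_no, len(tokens)))
    return short
```

Calling `find_short_lines("/tmp/bad_input.txt")` on the file above would flag its only line, which could then be merged with a neighbour or skipped before segmentation.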

TypeError: ArrayField.empty_field: return type `None` is not a `allennlp.data.fields.field.Field`.

I ran the segmentation example python run.py --segment --input_files ../data/rst/TRAINING/wsj_110*.out --result_dir ../data/results/, but always got this error. Any ideas? Thanks.