neuraleduseg's Introduction

Neural-EDU-Segmentation

A toolkit for segmenting text into Elementary Discourse Units (clauses). It implements the model described in our EMNLP paper: Toward Fast and Accurate Neural Discourse Segmentation

Requirements

  • Python 3.5
  • Tensorflow>=1.5.0
  • allennlp>=0.4.2
  • See requirements.txt for the full list of packages

Data

We cannot provide the complete RST-DT corpus because of the LDC copyright, so we include only a few samples in ./data/rst/ to test our code and show the data structure.

If you want to train or evaluate our model on RST-DT, you need to download the data manually and put it in the same folder. Then run the following command to preprocess the data and create the vocabulary:

python run.py --prepare

Evaluate the model on RST-DT:

We provide the vocabulary and a well-trained model in the ./data/ folder. You can evaluate the performance of this model after preparing the RST-DT data as mentioned above:

python run.py --evaluate --test_files ../data/rst/preprocessed/test/*.preprocessed

The performance of the current model should be as follows:

{'precision': 0.9176470588235294, 'recall': 0.975, 'f1': 0.9454545454545454}

Note that this is slightly better than the results we reported in the paper, because we re-trained the model and training involves some randomness.
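As a quick sanity check on the numbers above, the F1 value can be recomputed from the reported precision and recall. This is only a sketch: `f1_score` is a helper written for this example, not a function from the toolkit.

```python
# Reported scores for the provided model (copied from the output above).
precision = 0.9176470588235294
recall = 0.975


def f1_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall (hypothetical helper)."""
    if p + r == 0:
        return 0.0
    return 2 * p * r / (p + r)


f1 = f1_score(precision, recall)
print(round(f1, 4))  # 0.9455
```

If your own evaluation run prints a precision and recall whose harmonic mean differs from the printed f1, the output was probably copied or parsed incorrectly.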

Train a new model

You can use the following command to train the model from scratch:

python run.py --train

Hyper-parameters and other training settings can be modified in config.py.

Segmenting raw text into EDUs

You can segment files with raw text into EDUs:

python run.py --segment --input_files ../data/rst/TRAINING/wsj_110*.out --result_dir ../data/results/

The segmented result for each input file will be saved in the --result_dir folder under the same file name, with one EDU per line.
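Since each result file holds one EDU per line, loading it back for downstream processing can be sketched as follows. `read_edus` is a hypothetical helper, and skipping blank lines and reading as UTF-8 are assumptions about the output files rather than documented behaviour.

```python
from pathlib import Path
from typing import List


def read_edus(result_file: str) -> List[str]:
    """Return the EDUs in a segmented result file, one per non-empty line.

    Assumptions: the file is UTF-8 text and blank lines carry no EDU.
    """
    text = Path(result_file).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


# Example call (the file name is illustrative only):
# edus = read_edus("../data/results/some_file.out")
```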

Citation

Please cite the following paper if you use this toolkit in your work:

@inproceedings{wang2018edu,
  title={Toward Fast and Accurate Neural Discourse Segmentation},
  author={Wang, Yizhong and Li, Sujian and Yang, Jingfeng},
  booktitle={Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing},
  pages={962--967},
  year={2018}
}

neuraleduseg's People

Contributors

yizhongw

neuraleduseg's Issues

Exception

Hi, do you know what might be causing the exception below? I'm parsing the raw data from the Penn Treebank.
Thanks

`
Traceback (most recent call last):
File "/home/grigorii/anaconda3/envs/tfenv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
return fn(*args)
File "/home/grigorii/anaconda3/envs/tfenv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/home/grigorii/anaconda3/envs/tfenv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: num_lower must be negative or less or equal to number of rows (4) got: 5
[[{{node encoding/bilinear_attention/MatrixBandPart}} = MatrixBandPart[T=DT_FLOAT, Tindex=DT_INT64, _device="/job:localhost/replica:0/task:0/device:GPU:0"](encoding/bilinear_attention/ones_like, encoding/bilinear_attention/MatrixBandPart/num_lower/_271, encoding/bilinear_attention/MatrixBandPart/num_lower/_271)]]
[[{{node encoding/rnn_2/bidirectional_rnn/bw/bw/stack/_287}} = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_578_encoding/rnn_2/bidirectional_rnn/bw/bw/stack", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "run.py", line 48, in <module>
segment(args)
File "/home/grigorii/homework/cpsc503/project/datasets/discourse_parsers/NeuralEDUSeg/src/api.py", line 158, in segment
batch_pred_segs = model.segment(batch)
File "/home/grigorii/homework/cpsc503/project/datasets/discourse_parsers/NeuralEDUSeg/src/elmo_crf_seg.py", line 78, in segment
scores, trans_params = self.sess.run([self.scores, self.trans_params], feed_dict)
File "/home/grigorii/anaconda3/envs/tfenv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
run_metadata_ptr)
File "/home/grigorii/anaconda3/envs/tfenv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _run
feed_dict_tensor, options, run_metadata)
File "/home/grigorii/anaconda3/envs/tfenv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
run_metadata)
File "/home/grigorii/anaconda3/envs/tfenv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: num_lower must be negative or less or equal to number of rows (4) got: 5
[[node encoding/bilinear_attention/MatrixBandPart (defined at /home/grigorii/homework/cpsc503/project/datasets/discourse_parsers/NeuralEDUSeg/src/layers.py:40) = MatrixBandPart[T=DT_FLOAT, Tindex=DT_INT64, _device="/job:localhost/replica:0/task:0/device:GPU:0"](encoding/bilinear_attention/ones_like, encoding/bilinear_attention/MatrixBandPart/num_lower/_271, encoding/bilinear_attention/MatrixBandPart/num_lower/_271)]]
[[{{node encoding/rnn_2/bidirectional_rnn/bw/bw/stack/_287}} = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_578_encoding/rnn_2/bidirectional_rnn/bw/bw/stack", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

Caused by op 'encoding/bilinear_attention/MatrixBandPart', defined at:
File "run.py", line 48, in <module>
segment(args)
File "/home/grigorii/homework/cpsc503/project/datasets/discourse_parsers/NeuralEDUSeg/src/api.py", line 134, in segment
model = AttnSegModel(args, word_vocab)
File "/home/grigorii/homework/cpsc503/project/datasets/discourse_parsers/NeuralEDUSeg/src/elmo_crf_seg.py", line 14, in __init__
super().__init__(args, word_vocab)
File "/home/grigorii/homework/cpsc503/project/datasets/discourse_parsers/NeuralEDUSeg/src/base_seg.py", line 38, in __init__
self._build_graph()
File "/home/grigorii/homework/cpsc503/project/datasets/discourse_parsers/NeuralEDUSeg/src/lstm_seg.py", line 18, in _build_graph
self._encode()
File "/home/grigorii/homework/cpsc503/project/datasets/discourse_parsers/NeuralEDUSeg/src/atten_seg.py", line 20, in _encode
self.encoded_sent, self.placeholders['input_length'], self.window_size)
File "/home/grigorii/homework/cpsc503/project/datasets/discourse_parsers/NeuralEDUSeg/src/layers.py", line 40, in self_attention
restricted_mask = tf.matrix_band_part(tf.ones_like(logits, dtype=tf.float32), window_size, window_size)
File "/home/grigorii/anaconda3/envs/tfenv/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 4142, in matrix_band_part
num_upper=num_upper, name=name)
File "/home/grigorii/anaconda3/envs/tfenv/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/grigorii/anaconda3/envs/tfenv/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
return func(*args, **kwargs)
File "/home/grigorii/anaconda3/envs/tfenv/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
op_def=op_def)
File "/home/grigorii/anaconda3/envs/tfenv/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): num_lower must be negative or less or equal to number of rows (4) got: 5
[[node encoding/bilinear_attention/MatrixBandPart (defined at /home/grigorii/homework/cpsc503/project/datasets/discourse_parsers/NeuralEDUSeg/src/layers.py:40) = MatrixBandPart[T=DT_FLOAT, Tindex=DT_INT64, _device="/job:localhost/replica:0/task:0/device:GPU:0"](encoding/bilinear_attention/ones_like, encoding/bilinear_attention/MatrixBandPart/num_lower/_271, encoding/bilinear_attention/MatrixBandPart/num_lower/_271)]]
[[{{node encoding/rnn_2/bidirectional_rnn/bw/bw/stack/_287}} = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_578_encoding/rnn_2/bidirectional_rnn/bw/bw/stack", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
`

dependency conflicts

Hi!

Thank you for this wonderful paper and repo. I ran into several dependency conflicts when trying to build the virtual environment.

pip install -r requirements.txt

It returns the following error:

ERROR: After October 2020 you may experience errors when installing or updating packages. This is because pip will change the way that it resolves dependency conflicts.

We recommend you use --use-feature=2020-resolver to test your packages with the new resolver before it becomes the default.

tensorboard 1.14.0 requires setuptools>=41.0.0, but you'll have setuptools 39.0.1 which is incompatible.
tensorflow-gpu 1.14.0 requires numpy<2.0,>=1.14.5, but you'll have numpy 1.14.2 which is incompatible.
thinc 6.10.3 requires wrapt<1.11.0,>=1.10.0, but you'll have wrapt 1.12.1 which is incompatible.
scipy 1.5.2 requires numpy>=1.14.5, but you'll have numpy 1.14.2 which is incompatible.
allennlp 1.0.0 requires spacy<2.3,>=2.1.0, but you'll have spacy 2.0.11 which is incompatible.

If I try to run the segmenter directly anyway:

cd src
python run.py --segment --input_files ../data/rst/TRAINING/wsj_110*.out --result_dir ../data/results/

It gives the following error message:

ModuleNotFoundError: No module named 'allennlp.commands.elmo'

I believe this message is caused by the version of allennlp (based on https://stackoverflow.com/questions/62884591/modulenotfounderror-no-module-named-allennlp-commands-elmo).

Could you please give me more instructions on how to set up the virtual environment successfully? Thank you!
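One possible reading of the messages above: pip installed allennlp 1.0.0, while the code imports `allennlp.commands.elmo`. Assuming, as the linked Stack Overflow question suggests, that this module only exists in pre-1.0 releases, a small check like the one below can flag an incompatible version before running the toolkit. `supports_elmo_command` is a hypothetical helper and the version cut-off is an assumption, not something confirmed by the maintainers.

```python
def supports_elmo_command(allennlp_version: str) -> bool:
    """Guess whether `allennlp.commands.elmo` is importable for this version.

    Assumption: the module is only shipped in releases before 1.0.
    """
    major = int(allennlp_version.split(".")[0])
    return major < 1


print(supports_elmo_command("0.4.2"))  # True
print(supports_elmo_command("1.0.0"))  # False
```

If the check fails, pinning allennlp to a pre-1.0 release in a fresh environment may help, though this has not been verified against the other pins in requirements.txt.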

tensorflow.python.framework.errors_impl.InvalidArgumentError: num_lower must be negative or less or equal to number of rows (3) got: 5

When segmenting very short input sentences, this happens:

$ cat /tmp/bad_input.txt
good food .

$ python run.py --segment --input_files /tmp/bad_input.txt --result_dir /tmp/results/
 2021-02-24 22:00:12,457 - SegEDU - INFO - Running with args : Namespace(batch_size=32, dev_files=None, dropout_keep_prob=0.9, ema_decay=0.9999, epochs=50, evaluate=False, gpu=None, hidden_size=200, input_files=['/tmp/bad_input.txt'], learning_rate=0.001, log_path=None, max_grad_norm=5.0, model_dir='/usr/local/lib/python3.6/site-packages/neuralseg-0.1.0a0-py3.6.egg/data/models', optim='adam', prepare=False, result_dir='/tmp/results/', rst_dir='../data/rst/', seed=123, segment=True, test_files=None, train=False, train_files=None, weight_decay=0.0001, window_size=5, word_embed_path='../data/embeddings/glove.840B.300d.txt', word_embed_size=300, word_vocab_path='../data/vocab/word.vocab')
 2021-02-24 22:00:12,457 - SegEDU - INFO - Loading vocab...
 2021-02-24 22:00:12,522 - SegEDU - INFO - Word vocab size: 17243
 2021-02-24 22:00:12,522 - SegEDU - INFO - Loading the model...
2021-02-24 22:00:12.522775: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
/usr/local/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py:97: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
 2021-02-24 22:00:17,873 - SegEDU - INFO - There are 4043611 parameters in the model
 2021-02-24 22:00:17,873 - SegEDU - INFO - Using Exp Moving Average to train the model with decay 0.9999.
 2021-02-24 22:00:19,063 - SegEDU - INFO - Time to build graph: 6.5401506423950195 s
/usr/local/lib/python3.6/site-packages/sklearn/utils/linear_assignment_.py:22: FutureWarning: The linear_assignment_ module is deprecated in 0.21 and will be removed from 0.23. Use scipy.optimize.linear_sum_assignment instead.
  FutureWarning)
 2021-02-24 22:00:32,036 - SegEDU - INFO - Model restored from /usr/local/lib/python3.6/site-packages/neuralseg-0.1.0a0-py3.6.egg/data/models/best
 2021-02-24 22:00:32,947 - SegEDU - INFO - Segmenting /tmp/bad_input.txt...
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1350, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1329, in _run_fn
    status, run_metadata)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: num_lower must be negative or less or equal to number of rows (3) got: 5
         [[Node: encoding/bilinear_attention/MatrixBandPart = MatrixBandPart[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](encoding/bilinear_attention/ones_like, encoding/bilinear_attention/MatrixBandPart/num_lower, encoding/bilinear_attention/MatrixBandPart/num_lower)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "run.py", line 50, in <module>
    segment(args)
  File "/usr/local/lib/python3.6/site-packages/neuralseg-0.1.0a0-py3.6.egg/neuralseg/api.py", line 157, in segment
    batch_pred_segs = model.segment(batch)
  File "/usr/local/lib/python3.6/site-packages/neuralseg-0.1.0a0-py3.6.egg/neuralseg/elmo_crf_seg.py", line 125, in segment
    feed_dict)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 895, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1128, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1344, in _do_run
    options, run_metadata)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1363, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: num_lower must be negative or less or equal to number of rows (3) got: 5
         [[Node: encoding/bilinear_attention/MatrixBandPart = MatrixBandPart[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](encoding/bilinear_attention/ones_like, encoding/bilinear_attention/MatrixBandPart/num_lower, encoding/bilinear_attention/MatrixBandPart/num_lower)]]

Caused by op 'encoding/bilinear_attention/MatrixBandPart', defined at:
  File "run.py", line 50, in <module>
    segment(args)
  File "/usr/local/lib/python3.6/site-packages/neuralseg-0.1.0a0-py3.6.egg/neuralseg/api.py", line 133, in segment
    model = AttnSegModel(args, word_vocab)
  File "/usr/local/lib/python3.6/site-packages/neuralseg-0.1.0a0-py3.6.egg/neuralseg/elmo_crf_seg.py", line 17, in __init__
    super().__init__(args, word_vocab)
  File "/usr/local/lib/python3.6/site-packages/neuralseg-0.1.0a0-py3.6.egg/neuralseg/base_seg.py", line 37, in __init__
    self._build_graph()
  File "/usr/local/lib/python3.6/site-packages/neuralseg-0.1.0a0-py3.6.egg/neuralseg/lstm_seg.py", line 20, in _build_graph
    self._encode()
  File "/usr/local/lib/python3.6/site-packages/neuralseg-0.1.0a0-py3.6.egg/neuralseg/atten_seg.py", line 24, in _encode
    self.window_size)
  File "/usr/local/lib/python3.6/site-packages/neuralseg-0.1.0a0-py3.6.egg/neuralseg/layers.py", line 51, in self_attention
    window_size, window_size)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 2409, in matrix_band_part
    num_upper=num_upper, name=name)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3160, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1625, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): num_lower must be negative or less or equal to number of rows (3) got: 5
         [[Node: encoding/bilinear_attention/MatrixBandPart = MatrixBandPart[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](encoding/bilinear_attention/ones_like, encoding/bilinear_attention/MatrixBandPart/num_lower, encoding/bilinear_attention/MatrixBandPart/num_lower)]]
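
The traceback points at the `matrix_band_part` call in `self_attention` (layers.py), which receives `window_size` (5 in the logged args) for an input of only 3 tokens, and the op rejects a band wider than the matrix. A plain-Python sketch of the same kind of band mask, with the window clamped to the sequence length, illustrates one possible fix. `band_mask` is a hypothetical helper for illustration, not code from the toolkit, and an equivalent clamp inside the TensorFlow graph has not been tested here.

```python
from typing import List


def band_mask(seq_len: int, window_size: int) -> List[List[int]]:
    """Return a seq_len x seq_len 0/1 mask keeping positions within the window.

    The window is clamped to seq_len so that short inputs never request a
    band wider than the matrix, which is what the error above complains about.
    """
    width = min(window_size, seq_len)
    return [
        [1 if abs(i - j) <= width else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]


# "good food ." has 3 tokens; window_size is 5 in the logged args.
mask = band_mask(3, 5)
print(mask)  # [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
```

For inputs at least as long as the window the clamp has no effect, so the mask is unchanged in the common case.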
