
hunvec's People

Contributors

makrai, pajkossy, zseder

hunvec's Issues

evaluation of short sentences

currently, very short sentences are dropped while preparing the datasets.
even if training with them is not possible, the test data should still contain them so that the test results are reliable

pylearn incompatibility

right now we need small changes in pylearn2/utils/iteration.py and pylearn2/space/__init__.py to be able to run. Why are they necessary? Should our changes go upstream to pylearn2? Can we avoid them altogether?

continue training

  • if a model is given to the trainer, continue training it and ignore the training parameters
  • later we may also allow changing the training parameters this way

tagging script

prepare an easy-to-use tagging script; the trainer script should be renamed to training and split into train and NN parts, etc.

prepare.py options

- option for automatic replacement of numerals
- option for giving a closed vocabulary (so that words outside it are mapped to unknown, even if they appear in the training data)
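
Both options are simple token-level mappings; a minimal sketch of what they might do (the UNKNOWN/NUM placeholders and function names are made up for illustration, the real symbols in hunvec may differ):

```python
import re

UNKNOWN = "<unk>"  # hypothetical unknown-word sentinel
NUM = "<num>"      # hypothetical numeral placeholder

def normalize_numerals(tokens):
    """Replace numeral tokens with a shared placeholder."""
    return [NUM if re.fullmatch(r"\d+([.,]\d+)?", t) else t for t in tokens]

def map_to_vocab(tokens, closed_vocab):
    """Map every token outside the closed vocabulary to the unknown
    symbol, even if the token occurs in the training data."""
    return [t if t in closed_vocab else UNKNOWN for t in tokens]
```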

fake_feats error

if fake_feats is present together with other features, convergence is much slower. Maybe pylearn2's SGD gets confused by features that are always active

Dropout

dropout should be easy to add with pylearn2
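
The technique itself is framework-independent: zero units at random during training and rescale the survivors. A minimal inverted-dropout sketch in plain Python, only to show the mechanism (pylearn2 provides its own dropout cost, which is what the issue refers to):

```python
import random

def dropout(activations, p_drop=0.5, rng=random):
    """Inverted dropout: drop each unit with probability p_drop and scale
    the survivors by 1/(1 - p_drop) so the expected activation is unchanged."""
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]
```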

SequenceDataSpace

  • would it be a better choice than IndexSequenceSpace?
  • can it be used to create batches of size 2 or more?

F1 monitoring

viterbi is done by theano, but it would be easier to run theano-independent tagging + viterbi + F1 computation with plain numpy and add the result to the monitor somehow; we don't need this information during the training step
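
The scoring half of this is easy to keep outside theano. A token-level micro-F1 sketch in plain Python (the function name is illustrative, and the real evaluation may well use CoNLL phrase-level F1 instead):

```python
def f1_score(gold, predicted, ignore="O"):
    """Token-level micro F1 over all tags except `ignore`.
    (CoNLL-style phrase F1 would first group tokens into chunks.)"""
    tp = sum(1 for g, p in zip(gold, predicted) if g == p != ignore)
    pred_n = sum(1 for p in predicted if p != ignore)
    gold_n = sum(1 for g in gold if g != ignore)
    precision = float(tp) / pred_n if pred_n else 0.0
    recall = float(tp) / gold_n if gold_n else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```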

ProjectionLayer fix parameters

When we use a bigger vocabulary than what is in the training data (e.g. external embeddings at evaluation time), there are a lot of invariant parameters in ProjectionLayer.
We should create our own ProjectionLayer (inheriting from the pylearn2 one) that knows which parameters are constant and which are changeable, and then change ProjectionLayer.get_params() to return only the part of W that is changeable (if this slicing operation is permitted in theano)
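
Outside theano, the effect that get_params() slicing is after can be illustrated with a plain-Python SGD step that skips the frozen rows (a sketch only; sgd_update_trainable and n_fixed are hypothetical names, and here the invariant rows are assumed to be grouped at the front of W):

```python
def sgd_update_trainable(W, grads, n_fixed, lr=0.1):
    """Apply an SGD step to the embedding matrix W (a list of rows),
    leaving the first n_fixed rows (the invariant ones) untouched."""
    for i in range(n_fixed, len(W)):
        W[i] = [w - lr * g for w, g in zip(W[i], grads[i])]
    return W
```

In the real layer the same effect would come from returning only the trainable slice of W from get_params(), so theano never builds updates for the frozen part.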

implement new get() method for dataset

iteration.py:784: UserWarning: dataset is using the old iterator interface which is deprecated and will become officially unsupported as of July 28, 2015. The dataset should implement a get method respecting the new interface.

when using --dropout

Traceback (most recent call last):
File "hunvec/seqtag/trainer.py", line 123, in <module>
main()
File "hunvec/seqtag/trainer.py", line 119, in main
wt.train()
File "/home/pajkossy/git/hunvec/hunvec/seqtag/sequence_tagger.py", line 157, in train
self.algorithm.train(dataset=self.dataset['train'])
File "/home/pajkossy/pylearn2/pylearn2/training_algorithms/sgd.py", line 453, in train
self.sgd_update(*batch)
File "/home/pajkossy/theano/theano_env/local/lib/python2.7/site-packages/theano/compile/function_module.py", line 588, in __call__
self.fn.thunks[self.fn.position_of_error])
File "/home/pajkossy/theano/theano_env/local/lib/python2.7/site-packages/theano/compile/function_module.py", line 579, in __call__
outputs = self.fn()
IndexError: index 9360 is out of bounds for size 9354
Apply node that caused the error: AdvancedSubtensor1(feats_W, Elemwise{Cast{int64}}.0)
Inputs shapes: [(9354, 100), (168,)]
Inputs strides: [(400, 4), (8,)]
Inputs types: [TensorType(float32, matrix), TensorType(int64, vector)]
Use the Theano flag 'exception_verbosity=high' for a debugprint of this apply node.

save datasets

save them to a file to avoid reading and featurizing the same data over and over from 3 different files when there are train/test/devel splits

tagger.py bug

Traceback (most recent call last):
File "hunvec/seqtag/tagger.py", line 61, in <module>
main()
File "hunvec/seqtag/tagger.py", line 57, in main
tag(args)
File "hunvec/seqtag/tagger.py", line 48, in tag
tags = wt.tag_sen(words, feats)
File "/home/pajkossy/hunvec/hunvec/seqtag/sequence_tagger.py", line 224, in tag_sen
y = self.f(words, feats)
File "/home/pajkossy/pylearn_env/local/lib/python2.7/site-packages/theano/compile/function_module.py", line 608, in __call__
storage_map=self.fn.storage_map)
File "/home/pajkossy/pylearn_env/local/lib/python2.7/site-packages/theano/compile/function_module.py", line 597, in __call__
outputs = self.fn()
IndexError: index 30 is out of bounds for size 28
Apply node that caused the error: AdvancedSubtensor1(feats_W, Flatten{1}.0)
Inputs types: [TensorType(float64, matrix), TensorType(int64, vector)]
Inputs shapes: [(28, 5), (52,)]
Inputs strides: [(40, 8), (8,)]
Inputs values: ['not shown', 'not shown']

maxout

experiment with maxout as hidden layers
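
For reference, maxout keeps the maximum of each group of linear units instead of applying a fixed nonlinearity (pylearn2 ships a Maxout layer for this). A minimal sketch of the activation itself:

```python
def maxout(z, pool_size):
    """Maxout activation: split the linear outputs z into groups of
    pool_size units and keep the maximum of each group."""
    assert len(z) % pool_size == 0
    return [max(z[i:i + pool_size]) for i in range(0, len(z), pool_size)]
```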

save corpus output

after training, if we use the model for tagging, we need index->word and index->tag resolution
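
The resolution tables are just the inverses of the word and tag vocabularies, so it may be enough to persist something like this alongside the model (a sketch; the function name is hypothetical):

```python
def invert(vocab):
    """Turn a word->index dict into an index->word list for decoding."""
    i2w = [None] * len(vocab)
    for word, idx in vocab.items():
        i2w[idx] = word
    return i2w
```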

external embeddings

for ProjectionLayer, it should be possible to set the parameters from embeddings trained outside this library

error message when using regularization

Traceback (most recent call last):
File "hunvec/seqtag/trainer.py", line 123, in <module>
main()
File "hunvec/seqtag/trainer.py", line 118, in main
wt.create_algorithm(d, args.model_path)
File "/home/pajkossy/git/hunvec/hunvec/seqtag/sequence_tagger.py", line 151, in create_algorithm
self.algorithm.setup(self, self.dataset['train'])
File "/home/pajkossy/pylearn2/pylearn2/training_algorithms/sgd.py", line 316, in setup
**fixed_var_descr.fixed_vars)
File "/home/pajkossy/git/hunvec/hunvec/cost/seq_tagger_cost.py", line 22, in expr
sc += model.tagger.get_weight_decay(self.reg[0])
File "/home/pajkossy/pylearn2/pylearn2/models/mlp.py", line 695, in get_weight_decay
for layer, coeff in safe_izip(self.layers, coeffs):
File "/home/pajkossy/pylearn2/pylearn2/utils/__init__.py", line 277, in safe_izip
assert all([len(arg) == len(args[0]) for arg in args])
AssertionError

optparse

it would be easier to run experiments with command-line options instead of always editing parameters in the code (and it is more git-friendly)

configurable hidden layers

  • should be able to run with n hidden layers with different numbers of units, for experimenting
  • watch out for regularization: coeff will be a variable-length tuple depending on the number of hidden layers

TaggedCorpus refactoring

TaggedCorpus should be split into RawCorpus and TaggedCorpus

  • only tag-related content goes into TaggedCorpus
  • read() should be implemented this way:
    • RawCorpus.read() could take a needed_fields argument with a default value of [0] (index 0 = the word), and TaggedCorpus.read() should then only contain a call to RawCorpus.read(needed_fields=[0, 1]), where index 1 holds the tag
    • the only problem is that from now on words will be 1-length lists, to be compatible with tags, which will be 2-length lists (NOT tuples, because lists can be changed in place, so it's easier to turn them into integers later)
  • read() should keep the pre flag, so when the featurizer is preprocessing the data, it will only return words
    • if we call featurizer preprocessing in RawCorpus.__init__(), the needed_fields flag for read() will suffice and only words will be returned, so hopefully no change there

config file

argparse is okay for now, but there are already a lot of options and there will be many more later, so we need a config infrastructure

  • all current arguments should be available in the config
  • plus new configs describing the available datasets (eng+ner, hun+pos, etc.)
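
With the stdlib configparser (ConfigParser under Python 2) this could look like the following; the section and option names are hypothetical:

```python
import configparser  # ConfigParser under Python 2

# hypothetical layout: training options plus one section per dataset
CONFIG = """
[train]
epochs = 50
hidden = 200,200
dropout = yes

[dataset:eng_ner]
train_file = data/eng.train
test_file = data/eng.testb
"""

cfg = configparser.ConfigParser()
cfg.read_string(CONFIG)
hidden = [int(n) for n in cfg["train"]["hidden"].split(",")]
use_dropout = cfg.getboolean("train", "dropout")
```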

unknown word's vector

possibly the vector belonging to the unknown word (index -1) is the same as the one for the last word, because of Python's negative indexing. The new vocab size should be vocab+1; then they won't collide
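
The collision is easy to demonstrate with Python's negative indexing:

```python
embeddings = [[0.1], [0.2], [0.3]]  # vocab size 3, valid indices 0..2
UNK = -1
# index -1 silently aliases the unknown word with the last real word
assert embeddings[UNK] is embeddings[2]

# fix: reserve a real slot for the unknown word -> vocab+1 rows
embeddings.append([0.0])
UNK = 3
assert embeddings[UNK] == [0.0]
```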
