
hunvec's People

Contributors

makrai, pajkossy, zseder

hunvec's Issues

evaluation of short sentences

currently, very short sentences are dropped while preparing the datasets.
even if training with them is not possible, the test data should still contain them so that the test results are reliable

pylearn incompatibility

right now we need small changes in pylearn2/utils/iteration.py and pylearn2/space/__init__.py to be able to run. Why are they necessary? Should our changes go upstream to pylearn2? Can we avoid them altogether?

continue training

  • if a model is given to the trainer, continue training it and ignore the training parameters
  • later we may also allow changing the training parameters this way

tagging script

prepare an easy-to-use tagging script; the trainer script should be renamed to training and split into train and NN parts, etc.

prepare.py options

- option for automatic replacement of numerals
- option for giving a closed vocabulary (so that words outside it are mapped to unknown, even if they appear in the training data)
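
Both options are simple token-level mappings; a minimal sketch of what they might do (the UNKNOWN/NUM placeholders and function names are made up for illustration, the real symbols in hunvec may differ):

```python
import re

UNKNOWN = "<unk>"  # hypothetical unknown-word sentinel
NUM = "<num>"      # hypothetical numeral placeholder

def normalize_numerals(tokens):
    """Replace numeral tokens with a shared placeholder."""
    return [NUM if re.fullmatch(r"\d+([.,]\d+)?", t) else t for t in tokens]

def map_to_vocab(tokens, closed_vocab):
    """Map every token outside the closed vocabulary to the unknown
    symbol, even if the token occurs in the training data."""
    return [t if t in closed_vocab else UNKNOWN for t in tokens]
```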

fake_feats error

if fake_feats is present together with other features, convergence is much slower. Maybe pylearn2's SGD gets confused by features that are always active

Dropout

dropout should be easy to add with pylearn2
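
The technique itself is framework-independent: zero units at random during training and rescale the survivors. A minimal inverted-dropout sketch in plain Python, only to show the mechanism (pylearn2 provides its own dropout cost, which is what the issue refers to):

```python
import random

def dropout(activations, p_drop=0.5, rng=random):
    """Inverted dropout: drop each unit with probability p_drop and scale
    the survivors by 1/(1 - p_drop) so the expected activation is unchanged."""
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]
```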

SequenceDataSpace

  • would it be a better choice than IndexSequenceSpace?
  • can it be used to create batches of size 2 or more?

F1 monitoring

viterbi is done by theano, but it would be easier to run theano-independent tagging + viterbi + F1 computation with plain numpy and add the result to the monitor somehow; we don't need this information during the training step
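
The scoring half of this is easy to keep outside theano. A token-level micro-F1 sketch in plain Python (the function name is illustrative, and the real evaluation may well use CoNLL phrase-level F1 instead):

```python
def f1_score(gold, predicted, ignore="O"):
    """Token-level micro F1 over all tags except `ignore`.
    (CoNLL-style phrase F1 would first group tokens into chunks.)"""
    tp = sum(1 for g, p in zip(gold, predicted) if g == p != ignore)
    pred_n = sum(1 for p in predicted if p != ignore)
    gold_n = sum(1 for g in gold if g != ignore)
    precision = float(tp) / pred_n if pred_n else 0.0
    recall = float(tp) / gold_n if gold_n else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```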

ProjectionLayer fix parameters

When we use a bigger vocabulary than what is in the training data (e.g. external embeddings at evaluation time), there are a lot of invariant parameters in ProjectionLayer.
We should create our own ProjectionLayer (inheriting from the pylearn2 one) that knows which parameters are constant and which are changeable, and then change ProjectionLayer.get_params() to return only the part of W that is changeable (if this slicing operation is permitted in theano)
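
Outside theano, the effect that get_params() slicing is after can be illustrated with a plain-Python SGD step that skips the frozen rows (a sketch only; sgd_update_trainable and n_fixed are hypothetical names, and here the invariant rows are assumed to be grouped at the front of W):

```python
def sgd_update_trainable(W, grads, n_fixed, lr=0.1):
    """Apply an SGD step to the embedding matrix W (a list of rows),
    leaving the first n_fixed rows (the invariant ones) untouched."""
    for i in range(n_fixed, len(W)):
        W[i] = [w - lr * g for w, g in zip(W[i], grads[i])]
    return W
```

In the real layer the same effect would come from returning only the trainable slice of W from get_params(), so theano never builds updates for the frozen part.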

implement new get() method for dataset

iteration.py:784: UserWarning: dataset is using the old iterator interface which is deprecated and will become officially unsupported as of July 28, 2015. The dataset should implement a get method respecting the new interface.

when using --dropout

Traceback (most recent call last):
File "hunvec/seqtag/trainer.py", line 123, in <module>
main()
File "hunvec/seqtag/trainer.py", line 119, in main
wt.train()
File "/home/pajkossy/git/hunvec/hunvec/seqtag/sequence_tagger.py", line 157, in train
self.algorithm.train(dataset=self.dataset['train'])
File "/home/pajkossy/pylearn2/pylearn2/training_algorithms/sgd.py", line 453, in train
self.sgd_update(*batch)
File "/home/pajkossy/theano/theano_env/local/lib/python2.7/site-packages/theano/compile/function_module.py", line 588, in __call__
self.fn.thunks[self.fn.position_of_error])
File "/home/pajkossy/theano/theano_env/local/lib/python2.7/site-packages/theano/compile/function_module.py", line 579, in __call__
outputs = self.fn()
IndexError: index 9360 is out of bounds for size 9354
Apply node that caused the error: AdvancedSubtensor1(feats_W, Elemwise{Cast{int64}}.0)
Inputs shapes: [(9354, 100), (168,)]
Inputs strides: [(400, 4), (8,)]
Inputs types: [TensorType(float32, matrix), TensorType(int64, vector)]
Use the Theano flag 'exception_verbosity=high' for a debugprint of this apply node.

save datasets

save them to a file to avoid reading and featurizing the same data over and over from 3 different files when there are train/test/devel splits

tagger.py bug

Traceback (most recent call last):
File "hunvec/seqtag/tagger.py", line 61, in <module>
main()
File "hunvec/seqtag/tagger.py", line 57, in main
tag(args)
File "hunvec/seqtag/tagger.py", line 48, in tag
tags = wt.tag_sen(words, feats)
File "/home/pajkossy/hunvec/hunvec/seqtag/sequence_tagger.py", line 224, in tag_sen
y = self.f(words, feats)
File "/home/pajkossy/pylearn_env/local/lib/python2.7/site-packages/theano/compile/function_module.py", line 608, in __call__
storage_map=self.fn.storage_map)
File "/home/pajkossy/pylearn_env/local/lib/python2.7/site-packages/theano/compile/function_module.py", line 597, in __call__
outputs = self.fn()
IndexError: index 30 is out of bounds for size 28
Apply node that caused the error: AdvancedSubtensor1(feats_W, Flatten{1}.0)
Inputs types: [TensorType(float64, matrix), TensorType(int64, vector)]
Inputs shapes: [(28, 5), (52,)]
Inputs strides: [(40, 8), (8,)]
Inputs values: ['not shown', 'not shown']

maxout

experiment with maxout as hidden layers
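
For reference, maxout keeps the maximum of each group of linear units instead of applying a fixed nonlinearity (pylearn2 ships a Maxout layer for this). A minimal sketch of the activation itself:

```python
def maxout(z, pool_size):
    """Maxout activation: split the linear outputs z into groups of
    pool_size units and keep the maximum of each group."""
    assert len(z) % pool_size == 0
    return [max(z[i:i + pool_size]) for i in range(0, len(z), pool_size)]
```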

save corpus output

after training, if we use the model for tagging, we need index->word and index->tag resolution
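
The resolution tables are just the inverses of the word and tag vocabularies, so it may be enough to persist something like this alongside the model (a sketch; the function name is hypothetical):

```python
def invert(vocab):
    """Turn a word->index dict into an index->word list for decoding."""
    i2w = [None] * len(vocab)
    for word, idx in vocab.items():
        i2w[idx] = word
    return i2w
```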

external embeddings

for ProjectionLayer, it should be possible to set the parameters from embeddings trained outside this library

error message when using regularization

Traceback (most recent call last):
File "hunvec/seqtag/trainer.py", line 123, in <module>
main()
File "hunvec/seqtag/trainer.py", line 118, in main
wt.create_algorithm(d, args.model_path)
File "/home/pajkossy/git/hunvec/hunvec/seqtag/sequence_tagger.py", line 151, in create_algorithm
self.algorithm.setup(self, self.dataset['train'])
File "/home/pajkossy/pylearn2/pylearn2/training_algorithms/sgd.py", line 316, in setup
**fixed_var_descr.fixed_vars)
File "/home/pajkossy/git/hunvec/hunvec/cost/seq_tagger_cost.py", line 22, in expr
sc += model.tagger.get_weight_decay(self.reg[0])
File "/home/pajkossy/pylearn2/pylearn2/models/mlp.py", line 695, in get_weight_decay
for layer, coeff in safe_izip(self.layers, coeffs):
File "/home/pajkossy/pylearn2/pylearn2/utils/__init__.py", line 277, in safe_izip
assert all([len(arg) == len(args[0]) for arg in args])
AssertionError

optparse

it would be easier to run experiments with command-line options instead of always editing parameters in the code (and it is more git-friendly)

configurable hidden layers

  • should be able to run with n hidden layers with different numbers of units, for experimenting
  • watch out for regularization: coeff will be a variable-length tuple depending on the number of hidden layers

TaggedCorpus refactoring

TaggedCorpus should be split into RawCorpus and TaggedCorpus

  • only tag-related content goes into TaggedCorpus
  • read() should be implemented this way:
    • RawCorpus.read() could take a needed_fields argument with a default value of [0] (index 0 = the word), and TaggedCorpus.read() should then only contain a call to RawCorpus.read(needed_fields=[0, 1]), where index 1 holds the tag
    • the only problem is that from now on words will be 1-length lists, to be compatible with tags, which will be 2-length lists (NOT tuples, because lists can be changed in place, so it's easier to turn them into integers later)
  • read() should keep the pre flag, so when the featurizer is preprocessing the data, it will only return words
    • if we call featurizer preprocessing in RawCorpus.__init__(), the needed_fields flag for read() will suffice and only words will be returned, so hopefully no change there

config file

argparse is okay for now, but there are already a lot of options and there will be many more later, so we need a config infrastructure

  • all current arguments should be available in the config
  • plus new configs describing the available datasets (eng+ner, hun+pos, etc.)
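
With the stdlib configparser (ConfigParser under Python 2) this could look like the following; the section and option names are hypothetical:

```python
import configparser  # ConfigParser under Python 2

# hypothetical layout: training options plus one section per dataset
CONFIG = """
[train]
epochs = 50
hidden = 200,200
dropout = yes

[dataset:eng_ner]
train_file = data/eng.train
test_file = data/eng.testb
"""

cfg = configparser.ConfigParser()
cfg.read_string(CONFIG)
hidden = [int(n) for n in cfg["train"]["hidden"].split(",")]
use_dropout = cfg.getboolean("train", "dropout")
```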

unknown word's vector

possibly the vector belonging to the unknown word (index -1) is the same as the one for the last word, because of Python's negative indexing. The new vocab size should be vocab+1; then they won't collide
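
The collision is easy to demonstrate with Python's negative indexing:

```python
embeddings = [[0.1], [0.2], [0.3]]  # vocab size 3, valid indices 0..2
UNK = -1
# index -1 silently aliases the unknown word with the last real word
assert embeddings[UNK] is embeddings[2]

# fix: reserve a real slot for the unknown word -> vocab+1 rows
embeddings.append([0.0])
UNK = 3
assert embeddings[UNK] == [0.0]
```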
