ufal / neuralmonkey

An open-source tool for sequence learning in NLP built on TensorFlow.

License: BSD 3-Clause "New" or "Revised" License
The package should provide an executable. Right now, training can be executed with `python -u -m neuralmonkey.train whatever.ini`; we should document this somewhere until we find a better way to do it. One of us will have to learn how to manage a proper Python package.
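For the record, a minimal sketch of what the entry point could look like in `setup.py`; the `neuralmonkey-train` name and the assumption that `neuralmonkey.train` exposes a `main()` function are mine:

```python
# setup.py -- minimal sketch; assumes neuralmonkey.train has a main() function
from setuptools import setup, find_packages

setup(
    name="neuralmonkey",
    version="0.1",
    packages=find_packages(),
    entry_points={
        "console_scripts": [
            # installs a `neuralmonkey-train` executable that calls main()
            "neuralmonkey-train = neuralmonkey.train:main",
        ],
    },
)
```

With that, `pip install .` would put a `neuralmonkey-train` executable on the PATH instead of the `python -u -m` incantation.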
This should have been done in the `tf9` branch before the merge.
When you catch general exceptions there, every exception (e.g. an import error) from the module that you want something from gets caught, and the error message about the non-existence of something that is clearly there is quite confusing. By the way, you should never just catch general exceptions. Fix this after I finish #4.
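To illustrate the point, a sketch of catching only the specific exception (the function and its names are hypothetical, not the actual code):

```python
import importlib

def get_object(module_name, attr_name):
    # An ImportError raised here (or inside the imported module) propagates
    # unchanged instead of being masked as "this thing does not exist".
    module = importlib.import_module(module_name)
    try:
        return getattr(module, attr_name)
    except AttributeError:
        # Only report a missing attribute when it is really missing.
        raise AttributeError(
            "{} has no attribute {}".format(module_name, attr_name))
```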
If you attempt to create more encoders and do not provide their names, it will crash later on a collision of the variable scopes. I would suggest a mechanism (probably in `utils.py`) that would always be asked for a name and would append a number if there were a collision, as sketched below.
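Something along these lines (just a sketch; the function name is made up):

```python
# Sketch for utils.py: hand out scope names, appending a number on collision.
_used_names = set()

def get_unique_name(name):
    """Return `name`, or `name_2`, `name_3`, ... if `name` is already taken."""
    if name not in _used_names:
        _used_names.add(name)
        return name
    index = 2
    while "{}_{}".format(name, index) in _used_names:
        index += 1
    unique = "{}_{}".format(name, index)
    _used_names.add(unique)
    return unique
```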
Variables are saved only if they are the best so far. They should be saved whenever the score makes it into the top-n scores.
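A sketch of how the top-n bookkeeping could work (the class is hypothetical), using a min-heap so the worst kept score is always on top:

```python
import heapq

class TopNKeeper(object):
    """Keep the n best (score, path) pairs; save when a score makes the cut."""

    def __init__(self, n):
        self.n = n
        self.heap = []  # min-heap: the worst of the kept scores is at heap[0]

    def should_save(self, score):
        return len(self.heap) < self.n or score > self.heap[0][0]

    def add(self, score, path):
        if len(self.heap) < self.n:
            heapq.heappush(self.heap, (score, path))
        elif score > self.heap[0][0]:
            heapq.heapreplace(self.heap, (score, path))
```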
Why is this `logbook` thing in master? Is it done, or is it work-in-progress? If it's done, Flask should be a dependency. It is not used anywhere; is it a stand-alone tool? Then it should be documented somewhere.
Test scripts should be moved away from the root directory. Also, why are there two files, `tests_run.sh` and `run_tests.sh`, and what is their purpose? This should all be done in the `tests` directory, along with `unit-tests_run.sh`, `mypy_run.sh`, `lint_run.sh` and others.

Also, one of the test-running scripts (`tests_run`, I think) should use the `-P` (or `--directory-prefix`) option of wget instead of `cd`-ing there and back again. The `test-output` directory should be generated somewhere other than the root of the repository, preferably in a temporary location: either in `/tmp` or in a `tmp` subdirectory of `tests`.
While I try to make #6 happen, I keep finding issues that I am not capable of or willing to address. I will maintain a checklist of what needs to be done here. By the way, you would not believe how much code that clearly is not working (`random.random > 0.5`, unused variables, ...) I've encountered so far.
- `processors/bpe.py`: classes are questionable here
- `processors/german.py`: classes are questionable
- `utils.py`: this needs to be a class!
- `logging.py`: various
- `decoding_function.py`: there are lots of arguments, some unused; this needs a complete refactor
- `mlp.py`: this really should not be a class; also, aren't there any implementations of the multilayer perceptron that we can use?
- `readers/plain_text_reader.py`: this should not be a class
- `config/config_loader.py`: general exceptions, see #12
- `config/configuration.py`: general exceptions, see #12
- `config/config_generator.py`: I think this should be abandoned, see #17
- `encoders/sentence_encoder.py`: crazy big object, 13 parameters...
- `encoders/image_encoder.py`: ditto
- `encoders/cnn_encoder.py`: ditto
- `bidirectional_rnn_layer.py`: class is questionable
- `tokenize_data.py`: oh, the horror
- `decoders/sequence_classifier.py`: too many instance attributes
- `decoders/decoder.py`: big and ugly object
- `image_utils.py`: questionable class
- `prepare_str_images.py`: general exception
- `precompute_image_features.py`: too long, break it up
- `trainers/copynet.py`: undefined variable!
- `trainers/cross_entropy_trainer.py`: questionable class
- `trainers/mixer.py`: various errors
- `decompound_truecase.py`: javabridge
- `runners/runner.py`: questionable class
- `runners/beamsearch.py`: questionable class
- `runners/copynet_runner.py`: undefined variable!
- `runners/perplexity.py`: questionable class
- `logbook/logbook.py`: dependencies
- `cells/noisy_gru_cells.py`: various
- `learning_utils.py`: pure evil, half-screen levels of indentation
- `caffe_image_features.py`: imports and other things
- `lazy_dataset.py`: argument numbers, non-existent members
- `reformat_downloaded_image_features.py`: imports

Since we are hosting this publicly, it should have a license. I personally like MIT or BSD3.
Why does batch size in ini files appear both in `[main]` and `[runner]`? Which one is used?
Once we have the run.py script, it should be extremely easy using Flask (which is already a dependency). It will receive a dictionary of dataset series (the same way we have it right now) as JSON and send back a JSON with outputs and some statistics.
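A minimal sketch of the endpoint, assuming some `run_on_dataset` helper wraps the model (all names here are made up):

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

def run_on_dataset(series):
    # Stand-in for the actual model invocation.
    return {"target": []}, {"time": 0.0}

@app.route("/run", methods=["POST"])
def run():
    series = request.get_json()  # e.g. {"source": ["a sentence", ...]}
    outputs, stats = run_on_dataset(series)
    return jsonify({"outputs": outputs, "statistics": stats})

if __name__ == "__main__":
    app.run(port=5000)
```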
Can we define one random_seed in the top level of the configuration, that will be used everywhere?
With the webservice ready, it is time to set up something like this:
http://quest.ms.mff.cuni.cz/moses/demo.php
Here's the checklist
Also, we should create a new label for issues related to the web service.
Commit add9bdc (fix saving variables) introduced a bug: it creates a symlink with the wrong relative address. For example, when I set my output to the directory `test-out`, it creates a link to `test-out/data.whatever` in that folder, so the script looks for `test-out/test-out/data.whatever` instead of `test-out/data.whatever`.
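The fix should be to pass only the basename as the symlink target, since a relative target is resolved against the directory containing the link (file names below are illustrative):

```python
import os

link_path = os.path.join("test-out", "variables.data")  # the link to create
target = os.path.join("test-out", "data.whatever")      # the real file

# Wrong: os.symlink(target, link_path) resolves to test-out/test-out/data.whatever.
# Right: only the basename, resolved relative to the link's own directory.
os.symlink(os.path.basename(target), link_path)
```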
My preferred solution would be to put the commit into a separate branch (rewinding current master by one commit) and merge #13. That will enable running `tests/small.ini` on Travis. The new branch can be merged when the bug is fixed. What do you think, @jindrahelcl?
Implement the minimum risk trainer as described in http://arxiv.org/abs/1512.02433.
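For reference, my reading of the paper's objective as a plain-numpy sketch: renormalize sharpened probabilities over a sampled subset of candidates and minimize the expected cost.

```python
import numpy as np

def minimum_risk_loss(log_probs, costs, alpha=0.005):
    """Expected risk over sampled candidates.

    log_probs: model log-probabilities of the sampled translations
    costs: their costs, e.g. 1 - sentence-level BLEU
    alpha: the sharpness hyperparameter from the paper
    """
    scaled = alpha * np.asarray(log_probs)
    q = np.exp(scaled - np.max(scaled))  # max-subtraction for stability
    q /= q.sum()                         # Q(y|x; theta, alpha) over the sample
    return float(np.dot(q, costs))       # expected risk to be minimized
```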
Since 5a4498, the original (meaning not post-processed) decoded output and the pre-processed reference are shown in the validation log. This makes the validation output twice as large and ultimately more hideous, but it's useful when debugging pre- and postprocessing.
`neuralmonkey/estimate_scheduled_sampling.py` depends on scipy. Scipy is quite a big dependency; I'd hate to install it just for the one function. Can we do something about this?
Often, data are split across multiple files. The lazy dataset, which is designed for loading bigger datasets, should be able to take a list of files / wildcards specifying the files it will read, as in the sketch below.
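Roughly like this (a sketch; the function name is made up):

```python
import glob
import itertools

def iterate_files(patterns):
    """Lazily iterate over lines of all files matching the given patterns."""
    paths = sorted(itertools.chain.from_iterable(
        glob.glob(pattern) for pattern in patterns))
    for path in paths:
        with open(path) as f:
            for line in f:
                yield line.rstrip("\n")
```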
Evaluation functions should be refactored into callable and comparable objects to simplify the training loop function. They can also define their own name, so the output in the log need not be the name of the function.
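Something like this (the interface is just a proposal):

```python
class Evaluator(object):
    """Wrap a metric function with a display name and a comparison rule."""

    def __init__(self, func, name, higher_is_better=True):
        self.func = func
        self.name = name
        self.higher_is_better = higher_is_better

    def __call__(self, decoded, references):
        return self.func(decoded, references)

    def is_better(self, score, baseline):
        return score > baseline if self.higher_is_better else score < baseline
```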
IP addresses, queries, results... all of this should be stored in some files.
This should mainly be rewriting `tensorflow.python.ops.rnn*` to `tf.nn.rnn*`.
If we do not pin concrete versions of dependencies in requirements, things like the error in #54 might happen from time to time. On the other hand, if we freeze the dependencies, we should check for updates from time to time, which means more work. I'm leaning towards automatic updates and letting the build fail from time to time. We test things fairly regularly now, so we should be able to catch and repair breaking changes. What do you think?
The lazy dataset should have a building function similar to the standard (in-memory) dataset in the `config` module. Moreover, its `__init__` method should be refactored the same way.
I'm not quite sure what we are trying to achieve here. What is the goal of this package? How does it differ from tflearn and similar frameworks? Are we writing something that is already done somewhere? If not, what is new here?
These questions should be clearly answered in the README, if we want anybody to use this.
... and get rid of NLTK, of which I don't trust a single line of code.
What is `imagenet_synset_words.txt` doing in the repo?
When I ran my `tests/small.ini` configuration, it failed with an error about a lambda wanting two arguments when just one was given. I solved this in my branch by removing the second (unused) argument of the lambda on line 36 in learning_utils, but I'm not sure whether this breaks anything else. Can you have a look at this, @jindrahelcl?

Edit: The correct lambda was on line 155, but maybe they should be the same?
Write support for ensemble models. The idea is, in the end, to give the running script multiple *.ini files (or one with links to other experiments).
Right now we can specify a random seed in the configuration, but it does not work.
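A sketch of what honoring the seed might involve (the helper is hypothetical; note that the TF graph-level seed must be set before any ops are created):

```python
import random

import numpy as np
import tensorflow as tf

def set_random_seed(seed):
    """Seed every RNG in play."""
    random.seed(seed)
    np.random.seed(seed)
    tf.set_random_seed(seed)
```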
Now, the encoders are listed multiple times in the configuration: in the main section and as arguments of a decoder. Duplication is a frequent source of errors, so they should be listed only in the decoder.
The implementation of beam search relies on placeholders which are fed nothing if the ground truth sequence is not provided.
@jlibovicky, in #15 you mentioned that it would be hard to generate an ini file. Why do we need to do that?
I'm not quite sure what the design (if any) of the configuration manipulation is. I thought that there is an ini file that gets parsed into an abstract representation (some terrible Python object); then we build a computation graph according to the representation and run it. Is there anything else happening?
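If that is all, the skeleton is basically this (a sketch of my understanding, not the actual code):

```python
import configparser

parser = configparser.ConfigParser()
parser.read("experiment.ini")

# The "terrible Python object": plain dicts, one per ini section.
config = {section: dict(parser[section]) for section in parser.sections()}

# ...then encoders/decoders get built from `config` and the TF graph is run.
```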
It is probably an encoding issue connected to the transfer to Python 3.5. Values in <> get ignored when the logbook serves the ini files.
We should be able to create models that have more decoders at the same time, e.g. one that would classify a sentence and output a sequence at the same time.
Should `subword_nmt` be a submodule? Are we doing the imports right?
Cannot import 'Levenshtein' when trying to run pylint on `evaluation.py`. This means that a package is missing in requirements.txt.
We need to:

- put an `__init__.py` file in every directory
- use absolute imports (`import neuralmonkey.vocabulary` instead of `import vocabulary`)
- update `pylintrc`

I've already done this for the `tests/python` directory; I hope it does not break anything.
When I run `bin/neuralmonkey-logbook --port 5050 --logdir tests/tmp-test-output`, I get a screen with "click experiment on the left", but there is no experiment on the left. Why is that?
There are many code review tools integrated with GitHub (e.g. Reviewable). Should we use one of them in our workflow?
We should run pylint on everything. For easy automatic checking, every file should have a 10/10 score. To achieve this, you may have to locally disable some warnings (`# pylint: disable=...`); use this only if it is really necessary. After you eliminate all errors and warnings, add this line to the file:

`# tests: lint`

All files containing this line are checked with pylint by `lint_run.sh`, which you should always run before you commit anything and which is automatically run on Travis CI after you push to GitHub.

You can see the list of files that have not been checked yet with `test_status.sh`. This issue will be closed when that list is empty.