
faster-rnnlm's Introduction

Faster RNNLM (HS/NCE) toolkit

In a nutshell, the goal of this project is to create an rnnlm implementation that can be trained on huge datasets (several billion words) and very large vocabularies (several hundred thousand words) and used in real-world ASR and MT problems. In addition, to achieve better results, the implementation supports well-established techniques such as ReLU+DiagonalInitialization [1], GRU [2], NCE [3], and RMSProp [4].

How fast is it? Well, on the One Billion Word Benchmark [8] with a 3.3GHz CPU, the program with standard parameters (sigmoid hidden layer of size 256 and hierarchical softmax) processes more than 250k words per second using 8 threads, i.e. 15 million words per minute. As a result, an epoch takes less than one hour. See the Experiments section for more numbers and figures.

The distribution includes a ./run_benchmark.sh script to compare training speed on your machine across several implementations. The script downloads the Penn Tree Bank corpus and trains four models: Mikolov's rnnlm with class-based softmax, Edrenkin's rnnlm with HS from the Kaldi project, faster-rnnlm with hierarchical softmax, and faster-rnnlm with noise contrastive estimation. Note that while models with class-based softmax can achieve slightly lower entropy than models with hierarchical softmax, their training is infeasible for large vocabularies. On the other hand, NCE speed doesn't depend on the size of the vocabulary. What's more, models trained with NCE are comparable to class-based models in terms of resulting entropy.

Quick start

Run ./build.sh to download Eigen library and build faster-rnnlm.

To train a simple model with a GRU hidden layer and Noise Contrastive Estimation, use the following command:

./rnnlm -rnnlm model_name -train train.txt -valid validation.txt -hidden 128 -hidden-type gru -nce 20 -alpha 0.01

The files train.txt and validation.txt must contain one sentence per line. All distinct words found in the training file will be used for the nnet vocabulary; their counts determine the Huffman tree structure and remain fixed for this nnet. If you prefer to use a limited vocabulary (say, the top 1 million words), you should map all other words to <unk> or another token of your choice. A limited vocabulary is usually a good idea if it helps you to have enough training examples for each word.
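
For instance, a minimal preprocessing sketch (not part of the toolkit; the threshold of 5 occurrences and the file names are arbitrary) that replaces rare words with <unk> using awk:

# Count word frequencies on the first pass over train.txt, then rewrite it,
# replacing words seen fewer than 5 times with <unk>.
awk 'NR == FNR { for (i = 1; i <= NF; i++) count[$i]++; next }
     { line = ""
       for (i = 1; i <= NF; i++)
         line = line (i > 1 ? " " : "") (count[$i] >= 5 ? $i : "<unk>")
       print line }' train.txt train.txt > train.unk.txt

The same mapping should then be applied to the validation and test files.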

To apply the model, use the following command:

./rnnlm -rnnlm model_name -test train.txt

Logprobs (log10) of each sentence are printed to stdout. The entropy of the corpus in bits is printed to stderr.

Model architecture

The neural network has an input embedding layer, a few hidden layers, an output layer, and optional direct input-output connections.

Hidden layer

At the moment the following hidden layer types are supported: sigmoid, tanh, relu, gru, gru-bias, gru-insyn, gru-full. The first three types are quite standard. The last four stand for different modifications of the Gated Recurrent Unit. Namely, gru-insyn follows the formulas from [2]; gru-full adds bias terms for the reset and update gates; gru uses identity matrices for the input transformation without bias; gru-bias is gru with bias terms. The fastest layer is relu, the slowest one is gru-full.
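
For example, to stack two GRU layers of size 256 (all values here are illustrative, not recommended settings):

./rnnlm -rnnlm model_name -train train.txt -valid validation.txt -hidden 256 -hidden-type gru -hidden-count 2 -nce 20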

Output layer

The standard output layer for classification problems is softmax. However, as softmax outputs must be normalized, i.e. sum to one over all classes, its calculation is infeasible for a very large vocabulary. To overcome this problem one can use either softmax factorization or implicit normalization. By default, we approximate softmax via Hierarchical Softmax over a Huffman tree [6]. This makes the softmax computation logarithmic in the vocabulary size, but reduces the quality of the model. Implicit normalization means that the next-word probability is calculated as in the full softmax case, but without explicit normalization over all the words. Of course, it is not guaranteed that such probabilities will sum up to one. In practice, however, the sum is quite close to one due to the custom loss function. Check out [3] for more details.

Maximum entropy model

As was noted in [0], training a neural network together with a maximum entropy model can lead to significant improvement. In a nutshell, the maxent model tries to approximate the probability of the target word as a linear combination of its history features. E.g., to estimate the probability of the word "d" in the sentence "a b c d", the model sums the following features: f("d") + f("c d") + f("b c d") + f("a b c d"). You can use maxent with both HS and NCE output layers.
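
For example, a hypothetical run that combines an NCE output layer with a maxent layer (a hash of 1000 million cells and features up to order 3; see the Experiments section for the values actually used there):

./rnnlm -rnnlm model_name -train train.txt -valid validation.txt -hidden 128 -nce 20 -direct 1000 -direct-order 3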

Experiments

We provide results of model evaluation on two popular datasets: PTB and the One Billion Word Benchmark. Check out doc/RESULTS.md for reasonable parameters.

Penn Treebank Benchmark

The most popular corpus for LM benchmarks is the English Penn Treebank. Its training part contains a little less than one million words, and the vocabulary size is 10k words. In other words, it's akin to the Iris flower dataset. The small vocabulary allows one to use even the less efficient softmax approximations. We compare faster-rnnlm with the latest version of Mikolov's rnnlm toolkit. As expected, class-based softmax works a little better than hierarchical softmax, but it is much slower. On the other hand, perplexity for NCE and class-based softmax is comparable, while training time differs significantly. What's more, training speed for class-based softmax decreases as the vocabulary grows, while NCE is insensitive to vocabulary size. (At least in theory; in practice, a bigger vocabulary will probably increase the cache miss frequency.) For a fair speed comparison we use only one thread for faster-rnnlm.

Note. We use the following setting: learning_rate = 0.1, noise_samples=30 (for nce), bptt=32+8, threads=1 (for faster-rnnlm).

Figure: Time and perplexity for different implementations and softmax types
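
Assuming that bptt=32+8 denotes -bptt 32 combined with -bptt-skip 8, the NCE run from the note above corresponds roughly to a command like the following (corpus file and model names are placeholders; add -hidden/-hidden-type as needed):

./rnnlm -rnnlm ptb_model -train ptb.train.txt -valid ptb.valid.txt -nce 30 -alpha 0.1 -bptt 32 -bptt-skip 8 -threads 1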

It has been shown (e.g. in [3]) that RNN models with sigmoid activation functions trained with the NCE criterion outperform ones trained with the CE criterion over an approximated softmax. We tried to reproduce these improvements using other popular architectures, namely truncated ReLU, the Structurally Constrained Recurrent Network [9] with 40 context units, and the Gated Recurrent Unit [2]. Surprisingly, not all types of hidden units benefit from NCE. Truncated ReLU achieves the lowest perplexity of all the units during CE training, and the highest during NCE training. We used truncated ReLU because standard ReLU works even worse. "Smart" units (SCRN and GRU) demonstrate superior results.

Note. We report the best perplexity after grid search using the following parameters: learning_rate = {0.01, 0.03, 0.1, 0.3, 1.0}, noise_samples = {10, 20, 60} (for nce only), bptt = {32+8, 1+8}, diagonal_initialization = {None, 0.1, 0.5, 0.9, 1.0}, L2 = {1e-5, 1e-6, 0}.

Figure: Hierarchical Softmax versus Noise Contrastive Estimation

The following figure shows the dependency between the number of noise samples and the final perplexity for different types of units. Dashed lines indicate perplexity for models with Hierarchical Softmax. It's easy to see that the more samples are used, the lower the final perplexity is. However, even 5 samples are enough for NCE to work better than HS. The exception is relu-trunc, which couldn't be trained with NCE for any number of noise samples.

Note. We report the best perplexity after grid search. The size of the hidden layer is 200.

Figure: Noise Contrastive Estimation with different count of noise samples

One Billion Word Benchmark

For the One Billion Word Benchmark we use the setup described in [8] with the official scripts. There are around 0.8 billion words in the training corpus and 793,471 words in the vocabulary (including the <s> and </s> tokens). We use heldout-00000 for validation and heldout-00001 for testing.

Hierarchical softmax versus Noise Contrastive Estimation. In a nutshell, the drawbacks of HS become more significant for bigger vocabularies. As a result, NCE training yields much lower perplexity. The performance of Truncated ReLU on this dataset agrees with the experiments on PTB: an RNN with Truncated ReLU units can be trained more efficiently with CE if the layer size is small. However, the relative performance of the other unit types has changed. In contrast to the PTB experiments, on the One Billion Word corpus the simplest unit achieves the best quality.

Note. We report the best perplexity on heldout-00001 after grid search over the learning_rate, bptt, and diagonal_initialization. We use 50 noise samples for NCE training.

Figure: Hierarchical Softmax versus Noise Contrastive Estimation

The following graph demonstrates the dependency between the number of noise samples and the final perplexity. Just as in the case of PTB, 5 samples are enough for NCE to significantly outperform HS.

Figure: Noise Contrastive Estimation with different count of noise samples

One important property of RNNLM models is that they are complementary to standard N-gram LMs. One way to achieve this is to train a maxent model as part of the neural network model; this is enabled by the --direct and --direct-order options. Another way to achieve the same effect is to use an external language model. We use the interpolated KN 5-gram model that is shipped with the benchmark.
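
The command-line options listed below do not include an interpolation mode, so the combination with the KN model is typically done outside the program. A rough sentence-level sketch (assuming hypothetical files rnnlm.scores and kn.scores hold per-sentence log10 probabilities, one per line, and an arbitrary weight of 0.5; the published mixture numbers presumably interpolate per-word probabilities):

# Mix per-sentence probabilities: p = lambda * 10^lp_rnn + (1 - lambda) * 10^lp_kn,
# printing the mixed log10 probability for each sentence.
paste rnnlm.scores kn.scores | awk -v lambda=0.5 \
    '{ p = lambda * exp($1 * log(10)) + (1 - lambda) * exp($2 * log(10))
       print log(p) / log(10) }' > mixture.scores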

The maxent model significantly decreases perplexity for all hidden layer types and sizes. Moreover, it diminishes the impact of layer size. As expected, the combination of RNNLM-ME and KN works better than either of them alone (the perplexity of the KN model is 73).

Note. We took the best-performing models from the previous experiment and added a maxent layer of size 1000 and order 3.

Figure: Mixture of models

Command line options

We opted to use command line options that are compatible with Mikolov's rnnlm. As a result, one can simply swap the binary to switch between implementations.

The program has three modes, i.e. training, evaluation, and sampling.

All modes require model name:

    --rnnlm <file>
      Path to model file

The tool will create <file> and <file>.nnet files (the former stores the vocabulary/counts in text form, the latter the net itself in binary form). If <file> and <file>.nnet already exist, the tool will attempt to load them instead of starting new training. If <file> exists and <file>.nnet doesn't, the tool will use the existing vocabulary and new weights.

To run the program in test mode, you must provide a test file. If you use NCE and would like to calculate entropy, you must use the --nce-accurate-test flag. All other options are ignored in apply mode.

    --test <file>
      Test file
    --nce-accurate-test (0 | 1)
      Explicitly normalize output probabilities; use this option
      to compute actual entropy (default: 0)
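
For example, to compute the test entropy of an NCE-trained model with explicit normalization (test.txt is a placeholder; per-sentence log10 probabilities go to stdout, the entropy to stderr):

./rnnlm -rnnlm model_name -test test.txt -nce-accurate-test 1 > sentence_logprobs.txt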

To run the program in sampling mode, you must select a positive number of sentences to sample.

  --generate-samples <int>
    Number of sentences to generate in sampling mode (default: 0)
  --generate-temperature <float>
    Softmax temperature (use lower values to get more robust results) (default: 1)
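
For example, to sample 100 sentences from an existing model (the temperature value 0.8 is arbitrary):

./rnnlm -rnnlm model_name --generate-samples 100 --generate-temperature 0.8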

To train a model, you must provide train and validation files:

  --train <file>
    Train file
  --valid <file>
    Validation file (used for early stopping)

Model structure options

  --hidden <int>
    Size of embedding and hidden layers (default: 100)
  --hidden-type <string>
    Hidden layer activation (sigmoid, tanh, relu, gru, gru-bias, gru-insyn, gru-full)
    (default: sigmoid)
  --hidden-count <int>
    Count of hidden layers; all hidden layers have the same type and size (default: 1)
  --arity <int>
    Arity of the HS tree; for HS mode only (default: 2)
  --direct <int>
    Size of maxent layer in millions (default: 0)
  --direct-order <int>
    Maximum order of ngram features (default: 0)

Learning a reverse model, i.e. a model that predicts words from the last one to the first one, could be useful for model mixtures.

  --reverse-sentence (0 | 1)
    Predict sentence words in reversed order (default: 0)
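
For example, to train a second, reversed model on the same data for later mixing (the model name is a placeholder):

./rnnlm -rnnlm reverse_model -train train.txt -valid validation.txt -hidden 128 -reverse-sentence 1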

The performance does not scale linearly with the number of threads (it is sub-linear due to cache misses, false HogWild assumptions, etc.). Testing, validation and sampling are always performed by a single thread regardless of this setting. Also check out the "Performance notes" section.

  --threads <int>
    Number of threads to use

By default, recurrent weights are initialized using a uniform distribution. In [1] another initialization method was suggested, namely the identity matrix multiplied by some positive constant. The option below corresponds to this constant.

  --diagonal-initialization <float>
    Initialize recurrent matrix with x * I (x is the value and I is identity matrix)
    Must be greater than zero to have any effect (default: 0)
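
For example, a ReLU network initialized as in [1] (the value 0.9 is just one of the grid values tried in the Experiments section):

./rnnlm -rnnlm relu_model -train train.txt -valid validation.txt -hidden 128 -hidden-type relu -diagonal-initialization 0.9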

Optimization options

  --rmsprop <float>
    RMSprop coefficient; rmsprop=1 disables RMSprop and rmsprop=0 is equivalent to RMS
    (default: 1)
  --gradient-clipping <float>
    Clip updates above the value (default: 1)
  --learn-recurrent (0 | 1)
    Learn hidden layer weights (default: 1)
  --learn-embeddings (0 | 1)
    Learn embedding weights (default: 1)
  --alpha <float>
    Learning rate for recurrent and embedding weights (default: 0.1)
  --maxent-alpha <float>
    Learning rate for maxent layer (default: 0.1)
  --beta <float>
    Weight decay for recurrent and embedding weights, i.e. L2-regularization
    (default: 1e-06)
  --maxent-beta <float>
    Weight decay for maxent layer, i.e. L2-regularization (default: 1e-06)

The program supports truncated backpropagation through time. Gradients from hidden to input are backpropagated at each time step. However, gradients from hidden to previous hidden are propagated for bptt steps within each bptt-period block. This trick can speed up training and mitigate gradient explosion. See [7] for details. To disable any truncation, set bptt to zero.

  --bptt <int>
    Length of truncated BPTT unfolding
    Set to zero to back-propagate through entire sentence (default: 3)
  --bptt-skip <int>
    Number of steps without BPTT;
    Doesn't have any effect if bptt is 0 (default: 10)

Early stopping options (see [0]). Let `ratio' be the ratio of the previous epoch's validation entropy to the new one.

  --stop <float>
    If `ratio' is less than `stop', then start learning rate decay (default: 1.003)
  --lr-decay-factor <float>
    Learning rate decay factor (default: 2)
  --reject-threshold <float>
    If, moreover, `ratio' is less than `reject-threshold', then purge the epoch
    (default: 0.997)
  --retry <int>
    Stop training once `ratio' has hit `stop' at least `retry' times (default: 2)

Noise Contrastive Estimation is used iff the number of noise samples (--nce option) is greater than zero; otherwise HS is used. A reasonable value for nce is 20 (see the example after the option list below).

  --nce <int>
    Number of noise samples; if nce is positive then NCE is used instead of HS
    (default: 0)
  --use-cuda (0 | 1)
    Use CUDA to compute validation entropy and test entropy in accurate mode,
    i.e. if nce-accurate-test is true (default: 0)
  --use-cuda-memory-efficient (0 | 1)
    Do not copy the whole maxent layer on GPU. Slower, but could be useful to deal with huge
    maxent layers (default: 0)
  --nce-unigram-power <float>
    Discount power for unigram frequency (default: 1)
  --nce-lnz <float>
    Ln of normalization constant (default: 9)
  --nce-unigram-min-cells <float>
    Minimum number of cells for each word in unigram table (works
    akin to Laplacian smoothing) (default: 5)
  --nce-maxent-model <string>
    Use the given model as a noise generator
    The model must be a pure maxent model trained by the program (default: )
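
For example, a hypothetical NCE training run that also uses a GPU to compute accurate validation entropy:

./rnnlm -rnnlm nce_model -train train.txt -valid validation.txt -hidden 128 -hidden-type gru -nce 20 -nce-lnz 9 -use-cuda 1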

Other options

  --epoch-per-file <int>
    Treat one pass over the train file as given number of epochs (default: 1)
  --seed <int>
    Random seed for weight initialization and sampling (default: 0)
  --show-progress (0 | 1)
    Show training progress (default: 1)
  --show-train-entropy (0 | 1)
    Show average entropy on train set for the first thread (default: 0)
    Train entropy calculation doesn't work for NCE

Performance notes

To speed up matrix operations we use Eigen (a C++ template library for linear algebra). Besides, we use data parallelism with sentence-batch HogWild [5]. The best performance is achieved when all the threads are bound to the same CPU (one thread per core). This can be done by means of the taskset tool (available by default in most Linux distros). E.g., if you have 2 CPUs and each CPU has 8 physical cores plus 8 hyper-threading cores, you should use the following command:

taskset -c 0,1,2,3,4,5,6,7 ./rnnlm -threads 8 ...

In NCE mode, CUDA is used to accelerate the validation entropy calculation. Of course, if you don't have a GPU, you can use the CPU to calculate entropy, but it will take a lot of time.

Usage advice

  • You don't need to repeat structural parameters (hidden, hidden-type, reverse, direct, direct-order) when using an existing model. They will be ignored. The vocabulary saved in the model will be reused.
  • The vocabulary is built from the training file on the first run of the tool for a particular model. The program will ignore sentences with OOVs at train time (or report them at test time).
  • Vocabulary size plays a very small role in performance (it is logarithmic in the size of the vocabulary due to the Huffman tree decomposition). The hidden layer size and the amount of training data are the main factors.
  • Usually NCE works better than HS in terms of both PPL and WER.
  • Direct connections can dramatically improve model quality, especially in the case of HS. Reasonable values to start from are -direct 1000 -direct-order 4.
  • The model will be written to file after a training epoch if and only if its validation entropy improved compared to the previous epoch.
  • It is a good idea to shuffle sentences before splitting them into training and validation sets (GNU shuf & split are among the possible tools for this; a minimal sketch follows this list). For huge datasets use the --epoch-per-file option.
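
A minimal sketch, assuming a 5000-sentence validation set is enough (file names are placeholders):

# Shuffle the corpus, then take the first 5000 sentences for validation
# and the rest for training.
shuf corpus.txt > corpus.shuffled.txt
head -n 5000 corpus.shuffled.txt > validation.txt
tail -n +5001 corpus.shuffled.txt > train.txt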

References

[0] Mikolov, T. (2012). Statistical language models based on neural networks. Presentation at Google, Mountain View, 2nd April.

[1] Le, Q. V., Jaitly, N., & Hinton, G. E. (2015). A Simple Way to Initialize Recurrent Networks of Rectified Linear Units. arXiv preprint arXiv:1504.00941.

[2] Chung, J., Gulcehre, C., Cho, K., & Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.

[3] Chen, X., Liu, X., Gales, M. J. F., & Woodland, P. C. (2015). Recurrent neural network language model training with noise contrastive estimation for speech recognition.

[4] T. Tieleman and G. Hinton, “Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude,” COURSERA: Neural Networks for Machine Learning, vol. 4, 2012

[5] Recht, B., Re, C., Wright, S., & Niu, F. (2011). Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems (pp. 693-701).

[6] Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

[7] Sutskever, I. (2013). Training recurrent neural networks (Doctoral dissertation, University of Toronto).

[8] Chelba, C., Mikolov, T., Schuster, M., Ge, Q., Brants, T., Koehn, P., & Robinson, T. (2013). One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005.

[9] Mikolov, T., Joulin, A., Chopra, S., Mathieu, M., & Ranzato, M. A. (2014). Learning longer memory in recurrent neural networks. arXiv preprint arXiv:1412.7753.

faster-rnnlm's People

Contributors: akhti, pltrdy, prayagverma


faster-rnnlm's Issues

Fine-Tune functionality

Will faster-rnnlm support fine-tune functionality, or is it easy to modify it to support such functionality? Thanks.

error by training data

dear all,
I am trying to train using the dataset from rnnlm.org, but get the following errors:
./rnnlm -rnnlm binh_model -train /home/binh/DeepLearning/simple-examples/data/ptb.char.train.txt -valid /home/binh/DeepLearning/simple-examples/data/ptb.char.valid.txt -hidden 128 -hidden-type gru -nce 20 -alpha 0.01
Read the vocabulary: 8 words
Restoring existing nnet
Constructing RNN: layer_size=128, layer_type=gru, layer_count=1, maxent_hash_size=0, maxent_order=0, vocab_size=8, use_nce=1
Constructing NCE: layer_size=128, maxent_hash_size=0, cuda=1, ln(Z)=9.000000
Constructed UnigramNoiseGenerator: power=1.000, mincells(param)=5.000, mincells(real)=9090909
Initial entropy (bits) valid: -nan
Epoch 1 lr: 1.00e-02/1.00e-01 progress: 99.79% 49385.93 Kwords/sec entropy (bits) valid: -nan elapsed: 0.1s+0.0s Awful: Nnet rejected
Epoch 2 lr: 5.00e-03/5.00e-02 progress: 99.79% 56978.07 Kwords/sec entropy (bits) valid: -nan elapsed: 0.1s+0.0s Awful: Nnet rejected
Epoch 3 lr: 2.50e-03/2.50e-02 progress: 99.79% 54694.62 Kwords/sec entropy (bits) valid: -nan elapsed: 0.1s+0.0s Awful: Nnet rejected

Could anyone help me?
Best thanks

One Billion Word Benchmark precision

Hello,

First thanks for this nice benchmark!

I have a question regarding the last figure. For the "gru-insyn+direct+KN" results, how do you combine KN with gru-insyn+direct? Is it during training? Do you combine log probabilities of the two models during evaluation with some kind of interpolation?

Many thanks!

Alexandre Nanchen

Training time for one billion word benchmark

Hi, thanks for your RNN work. Very useful so far.

We are thinking of training the model for the one billion word dataset which comes to about 35 million sentences. I was wondering how long the training took for the configuration shown in the docs. It would be great to have a ballpark figure (a week? month? couple days?) before we embark on training on a motherlode of data.

Thanks!

Question about get_maxent_index in Nce.cc

Hi, may I ask a question? I was wondering how you can guarantee the value returned by get_maxent_index function does not exceed the max hash size. Thanks in advance.

Training with several hidden layers

Hi! I have some questions about faster-rnnlm. It is possible to use several hidden layers during training. My questions are:

  1. Which of them is used for the recurrent part?
  2. Does it use those hidden layers during decoding or when computing entropy?

Thanks!

Code bug (memory leak)

~MaybeStaticArray() (hierarchical_softmax.cc:316)
A mismatched free() / delete / delete[] is detected by Valgrind.

    delete dynamic_array;

should be replaced by

    delete [] dynamic_array;

since new T[dynamic_size] is used in the constructor. This may cause undefined behavior.

Build in ubuntu 16.04

dear all,

after calling ./build.sh I got the following errors:

rnnlm.cc:448:22: error: ‘isnan’ was not declared in this scope
if (isnan(entropy) || isinf(entropy) || !(ratio >= bad_ratio)) {
^
rnnlm.cc:448:22: note: suggested alternative:
In file included from /usr/include/c++/4.8/complex:44:0,
from ../eigen3/Eigen/Core:28,
from ../eigen3/Eigen/Dense:1,
from ../faster-rnnlm/util.h:7,
from ../faster-rnnlm/hierarchical_softmax.h:9,
from rnnlm.cc:17:
/usr/include/c++/4.8/cmath:632:5: note: ‘std::isnan’
isnan(_Tp __x)
^
rnnlm.cc:448:40: error: ‘isinf’ was not declared in this scope
if (isnan(entropy) || isinf(entropy) || !(ratio >= bad_ratio)) {
^
rnnlm.cc:448:40: note: suggested alternative:
In file included from /usr/include/c++/4.8/complex:44:0,
from ../eigen3/Eigen/Core:28,
from ../eigen3/Eigen/Dense:1,
from ../faster-rnnlm/util.h:7,
from ../faster-rnnlm/hierarchical_softmax.h:9,
from rnnlm.cc:17:
/usr/include/c++/4.8/cmath:614:5: note: ‘std::isinf’
isinf(_Tp __x)
^
Makefile:48: recipe for target 'rnnlm.o' failed
make: *** [rnnlm.o] Error 1
make: *** Waiting for unfinished jobs....
/usr/include/string.h: In function ‘void* __mempcpy_inline(void*, const void*, size_t)’:
/usr/include/string.h:652:42: error: ‘memcpy’ was not declared in this scope
return (char *) memcpy (__dest, __src, __n) + __n;
^
Makefile:60: recipe for target 'cuda_softmax.o' failed
make: *** [cuda_softmax.o] Error 1

Could you pls help me?
Thanks

make clean: add rnnlm.o

This is too trivial for a pull request:

Currently, the "clean" target only deletes rnnlm and $(OBJ_FILES). However, rnnlm.o is not contained in OBJ_FILES.

Therefore, it would be great to add rnnlm.o:

clean:
    rm -f rnnlm rnnlm.o $(OBJ_FILES)

Thanks for providing faster-rnnlm!

NCE Error: Compiled without CUDA support

Hi,

While running the below command:

./rnnlm -rnnlm model_name -train train.txt -valid validation.txt -hidden 128 -hidden-type gru -nce 20 -alpha 0.01 -use-cuda 1

I get the following error:

Constructing NCE: layer_size=128, maxent_hash_size=0, cuda=1, ln(Z)=9.000000
NCE error: Compiled without CUDA support!

How to compile with CUDA support?

I have CUDA 7.5 on the machine and have been running other programs on GPU successfully.

$LD_LIBRARY_PATH
bash: :/usr/local/cuda/lib64

$CUDA_HOME
bash: /usr/local/cuda: Is a directory

Entropy is nan when a vocabulary larger than 2 million words is used.

Hi,
When I am using a vocabulary that is larger than 2 million words (e.g., 2.2 million) the validation entropy is always nan.
However, on the exact same data if I use a slightly smaller vocabulary (1937725 words) then entropy is calculated normally. The vocabulary is being limited by rare words from the vocabulary file.

Best regards,
Rafael

Get stuck in text generation mode

I'd like to generate text using the tool.

The network has 500 hidden units, trained on about 300K tokens encoded in UTF-8 (27.3K vocab). Tried both NCE and HS.
Training shows no issues, but when running with "-rnnlm $rnnmodel --generate-samples 1000000 --generate-temperature 1.0" the program gives the following output and seems to freeze (memory and CPU are almost not occupied). Same behaviour with and without CUDA compilation and execution.

Read the vocabulary: 27360 words
Restoring existing nnet
Constructing RNN: layer_size=500, layer_type=sigmoid, layer_count=1, maxent_hash_size=999980640, maxent_order=3, vocab_size=27360, use_nce=0
Contructed HS: arity=2, height=19

I am trying to find the cause in the code, but no success so far.

Using the library as a binary classification

Hello, can the library be used for binary classification?
For example, as input I would insert variable-length sequences, each with an associated target (1 or 0):

seq 1: 0 1 0 0 1 0 1 target->1
seq 2: 1 1 0 1 target->0
seq 3: 0 1 0 target->0

Thanks a lot in advance and I hope to get answers from someone of you.

Information about interpolation between RNNLM and KN

I am trying to obtain models resulting from the interpolation of RNNLM and KN as it is mentioned in README.md.

With Mikolov's rnnlm, there are options to do this (lm-prob and lambda), however, I am not able to find them in faster-rnnlm, and I cannot find help in the documentation.

Is there any explanation about how to combine RNNLM and KN when using faster-rnnlm?

Thanks in advance

Returning the hidden layers of the RNN

Hi, thanks for the great package.

I am curious if it is possible to output the unrolled hidden layers of the RNN for each sentence? Specifically, for each new sentence, RNNLM can calculate a score (perplexity / log prob) and if it can also output the values for the hidden layer for each word, that would create a nice visualization.

Learning Rate Decay

Is it the expected behavior to continuously decay learning rate every epoch after one bad epoch? Doesn't that exponentially decrease the learning rate? Wouldn't it be better to decay the learning rate only once, and then decay again only when validation still does not decrease?

In your experience, did you find the status quo better?

perplexity results

Anton,

I am running the toolkit on the text from the Cantab-Tedlium recipe in Kaldi.
It is a text file of about 900 MB, with a vocab size of 150K.
Anyway my question is:
do you think it's normal to get a perplexity of 114 in HS mode and 124 in NCE mode?
I would have expected the NCE results to be better than the HS ones, according to your home page.

(parameters are the one from the WSJ recipe for rnnlm)

thanks
Vincent

ld: cannot find -lcuda

Hi,

I tried installing faster-rnnlm and while running ./build.sh I got this error:

/usr/bin/ld: cannot find -lcuda
collect2: error: ld returned 1 exit status
Makefile:39: recipe for target 'rnnlm' failed
make: *** [rnnlm] Error 1

I have installed cuda 7.5 but I don't know which library do I need to link as lcuda.

Best regards,
Angel

any example for testing...

Hi,
after successfully building this SW, I want to test it: training and test with a dataset.
Could you pls tell me,

  1. which data set makes sense,
  2. how to prepare everything for training and testing?

Thanks.

NCE training (with help GPU)

Hi, Everyone. I have some questions about faster-rnnlm.
First question: I want to use this toolkit with the option -direct 1000, but the following error appears:
CUDA ERROR:
Failed to allocate cuda memory for maxent
out of memory
I know it is due to -direct 1000, because when I use -direct 400 this error doesn't appear. Our GPU has 3GB of memory. Is there any way to use -direct 1000 without this error?
Second question: you use the GPU only when computing validation or test entropy. Why don't you use the GPU during training? What is the reason? Don't you think it would be faster to use the GPU during training?

Why got some OOV for testing?

I trained the model on millions of sentences. When testing, I found that the whole sentence is reported as OOV, even though the sentence is as simple as "you 're not .". Does anyone know why?

Error building faster-rnnlm

Hello,

I am having some issues installing faster-rnnlm in Ubuntu 16.04. As the instructions suggest, I ran the script build.sh but the output says:

Installing Faster RNNLM
Already up-to-date.
Folder 'eigen3' already exists. Exiting
nvcc cuda_softmax.cu -c -Xcompiler "-O3 -march=native -funroll-loops" -o cuda_softmax.o
/usr/include/string.h: In function ‘void* __mempcpy_inline(void*, const void*, size_t)’:
/usr/include/string.h:652:42: error: ‘memcpy’ was not declared in this scope
return (char *) memcpy (__dest, __src, __n) + __n;
^
Makefile:60: recipe for target 'cuda_softmax.o' failed
make: *** [cuda_softmax.o] Error 1

I've been reading something about the compiler, so my version is:

gcc --version

gcc (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

I've installed CUDA toolkit 9.0 (with the deb local option because the .run was giving me a lot of problems).
Could you please help me?

Thanks!

Edit: I've tried to compile with gcc-4.8 and g++-4.8 but the same error appears on screen.

build in Ubuntu 16.04

hi,
after calling ./build.sh I got the following errors:
In file included from ../faster-rnnlm/nce.h:12:0,
from nce.cc:1:
../faster-rnnlm/util.h:29:100: error: default template arguments may not be used in function templates without -std=c++11 or -std=gnu++11
inline void Dump(const Eigen::Matrix<Real, rows, Eigen::Dynamic, Eigen::RowMajor>& matrix, FILE* fo) {
^
../faster-rnnlm/util.h:35:114: error: default template arguments may not be used in function templates without -std=c++11 or -std=gnu++11
inline void Load(Eigen::Matrix<Real, rows, Eigen::Dynamic, Eigen::RowMajor, _MaxRows, _MaxCols>* matrix, FILE* fo) {

Could you pls help me?
Thanks

Nnet rejected problem - Can you recommend a better training corpus?

Hi:

I've been experimenting with faster-rnnlm using karpathy's text data from char-rnn and other public-domain texts, but I can't move beyond "Awful: Nnet rejected".

I've tried changing gru etc. and lowering values as per your previous reply, but the results are always the same. The program exits after 3 passes and the resulting .nnet file is, I assume, of poor quality based on the sampled output.

Is there a known training corpus that works well with faster-rnnlm? Any links and additional information is appreciated.

Cheers

Benchmark on 1 Billion

You report training times in the 1 Billion dataset, but not test perplexity. It would be helpful to compare this implementation with others if you could post the test perplexity and the hyperparameters that could be used to reproduce the result.

Also, thanks for putting this on Github. :)

building error

build failing on centos7 with gcc 5.4
Installing Faster RNNLM
Already up-to-date.
Folder 'eigen3' already exists. Exiting
g++ rnnlm.o hierarchical_softmax.o nce.o words.o maxent.o nnet.o layers/simple_layer.o layers/gru_layer.o layers/scrn_layer.o layers/layer_stack.o layers/interface.o cuda_softmax.o -o rnnlm -Wall -march=native -funroll-loops -g -D__STDC_FORMAT_MACROS --std=gnu++11 -I../ -DEIGEN_DONT_PARALLELIZE -Ofast -pthread -lm -lstdc++ -L/usr/local/cuda/lib64 -lcuda -lcudart -lcublas -lrt
cuda_softmax.o: In function `add_maxent(int, int, unsigned int const*, float const*, unsigned long const*, int const*, float*)':
tmpxft_0001a70a_00000000-5_cuda_softmax.cudafe1.cpp:(.text+0xda): undefined reference to `__cudaPopCallConfiguration'
cuda_softmax.o: In function `add_prepared_maxent(int, int, unsigned int const*, float const*, int const*, int, float*)':
tmpxft_0001a70a_00000000-5_cuda_softmax.cudafe1.cpp:(.text+0x200): undefined reference to `__cudaPopCallConfiguration'
cuda_softmax.o: In function `pick_target_scores(float const*, unsigned int const*, unsigned long, float*)':
tmpxft_0001a70a_00000000-5_cuda_softmax.cudafe1.cpp:(.text+0x2d8): undefined reference to `__cudaPopCallConfiguration'
cuda_softmax.o: In function `initialize_matrix(float*, int, int, float)':
tmpxft_0001a70a_00000000-5_cuda_softmax.cudafe1.cpp:(.text+0x3bd): undefined reference to `__cudaPopCallConfiguration'
cuda_softmax.o: In function `take_exp(float*, int, int)':
tmpxft_0001a70a_00000000-5_cuda_softmax.cudafe1.cpp:(.text+0x481): undefined reference to `__cudaPopCallConfiguration'
cuda_softmax.o: In function `CalculateSoftMax(CudaStorage*, float const*, unsigned long const*, int const*, unsigned long, unsigned int const*, float*)':
tmpxft_0001a70a_00000000-5_cuda_softmax.cudafe1.cpp:(.text+0x1a4b): undefined reference to `__cudaPushCallConfiguration'
(further undefined references to `__cudaPopCallConfiguration' and `__cudaPushCallConfiguration' omitted)
collect2: error: ld returned 1 exit status
make: *** [rnnlm] Error 1

Could you supply optimal configurations for 1B benchmark?

I'm trying to use faster-rnnlm on a 3B word dataset and would like to use the optimal hyperparameters you obtained in the One Billion Word benchmark.

In particular, the end of this section https://github.com/yandex/faster-rnnlm#one-billion-word-benchmark contains the sentence:

Note. We took the best performing models from the previous and added maxent layer of size 1000 and order 3.

Is there any way you can provide what those hyperparameters are for each of the three models graphed?

Thanks!

LSTMs

Thanks for the great software! I don't think there is something comparable that is able to deal with huge vocabularies and amount of data!

LSTMs have shown superior performance for language models in many recent papers. Would it make sense to add LSTMs as an available hidden unit or is there a particular reason this isn't implemented? Would implementing it be prone to any significant hurdles/effort in the current code?

Different results on each execution.

Hey, I have two questions.
I realized this tool produces different results whenever I execute it.
The differences were not huge though.
It seems to be because of using threads; is that correct?
This is my first question.

And the other question is whether there is a way to get the same results regardless of the number of threads used.

Thanks.

Probably bug in valid/text entropies in doc/RESULTS.md

doc/RESULTS.md contains "Validation entropy" and "Test entropy" for a range of experiments on the "One Billion Word Benchmark"

The "Validation entropy" number is exactly the same as the "Test entropy" number, so it looks as though the same dataset was used to compute both (they should be news.en.heldout-00000-of-00050 and news.en.heldout-00001-of-00050 respectfully).

Issue in Windows

I am trying to compile this in Windows. I fixed all the dependency-related issues, but I am stuck at one place.
The Dump and Load functions in the utils class are showing errors and I am not able to compile them. Any help will be really appreciated.

Error 101 error C2784: 'void Dump(const Eigen::Matrix<Real,rows,-1,1> &,FILE *)' : could not deduce template argument for 'const Eigen::Matrix<Real,rows,-1,1> &' from 'const RowMatrix' c:\users\users\documents\visual studio 2012\projects\rnnlm\faster-rnnlm\hierarchical_softmax.cc 251
Error 102 error C2784: 'void Load(Eigen::Matrix<Real,rows,-1,1> *,FILE *)' : could not deduce template argument for 'Eigen::Matrix<Real,rows,-1,1> *' from 'RowMatrix *' c:\users\users\documents\visual studio 2012\projects\rnnlm\faster-rnnlm\hierarchical_softmax.cc 255
Error 126 error C2784: 'void Dump(const Eigen::Matrix<Real,rows,-1,1> &,FILE *)' : could not deduce template argument for 'const Eigen::Matrix<Real,rows,-1,1> &' from 'const RowMatrix' c:\users\users\documents\visual studio 2012\projects\rnnlm\faster-rnnlm\nce.cc 196
Error 127 error C2784: 'void Load(Eigen::Matrix<Real,rows,-1,1> *,FILE *)' : could not deduce template argument for 'Eigen::Matrix<Real,rows,-1,1> *' from 'RowMatrix *' c:\users\users\documents\visual studio 2012\projects\rnnlm\faster-rnnlm\nce.cc 200

what is the meaning of "-nan & Nnet rejected"

my cmd is:
./rnnlm -rnnlm keji.model.1 -train keji.all.gb.seg -valid keji.random.valid -hidden 128 -hidden-type relu -nce 20 -alpha 0.01 -threads 8 -direct 500 -direct-order 4 -nce-accurate-test 1 -use-cuda 1

training log as follows:
Epoch 1 lr: 1.00e-02/1.00e-01 progress: 88.58% 380.07 Kwords/sec entropy (bits) valid: -nan elapsed: 461.8s+1263.8s Awful: Nnet rejected
Epoch 2 lr: 5.00e-03/5.00e-02 progress: 90.16% 378.44 Kwords/sec entropy (bits) valid: -nan elapsed: 460.7s+1263.8s Awful: Nnet rejected

thx.

the binary format of output

Hi, Everyone. I want to ask you some questions.
How can I convert the binary output format into a text format? Are there corresponding parameter settings? Thank you.

Learning rate for 1B corpus

Hi, I am training a wikipedia corpus with 1B tokens, using sigmoid/gru with hidden count 1/2/3. The initial learning rate of 0.01 gave me pretty good results when I was working with 100M wikipedia, but for the 1B corpus after training a couple epochs both sigmoid/gru are starting to give me NaN entropy. Just curious, what are the learning rate that you used for the 1B benchmark corpus? I am now setting it to 0.001 and hopefully the gradients won't explode.

big-endian machine

I want to be able to train a model on a Linux box, but then ship it to an IBM or Sun box (or any other) for execution.

The model cannot be loaded on a big-endian architecture when it was generated on a little-endian machine:

Restoring existing nnet
Bad model version: -1834810029

But other than that, I can get the program to work fine on IBM/Sun.

OSX compilation error; ld: library not found for -lrt

This occurs from running make -j

g++ rnnlm.o hierarchical_softmax.o nce.o words.o maxent.o nnet.o recurrent.o -o rnnlm -Wall -march=native -funroll-loops -g -D__STDC_FORMAT_MACROS -I../ -DEIGEN_DONT_PARALLELIZE  -O3 -pthread -DNOCUDA -lm -lrt
clang: warning: argument unused during compilation: '-pthread'
ld: library not found for -lrt

If -lrt is removed, this error occurs:

Undefined symbols for architecture x86_64:
  "Vocabulary::HashImpl::kMinSize", referenced from:
      Vocabulary::HashImpl::Rebuild() in words.o
  "Vocabulary::kWordOOV", referenced from:
      Vocabulary::HashImpl::Rebuild() in words.o
ld: symbol(s) not found for architecture x86_64

Awful: Nnet rejected

I tried running the command
./rnnlm -rnnlm model_name -train train.txt -valid valid.txt -hidden 256 -hidden-type gru -nce 20 -alpha 0.01
and it produced the error:
entropy (bits) valid: -nan elapsed: 29.3s+0.1s Awful: Nnet rejected

how can I solve the problem?
