
awslabs / keras-apache-mxnet

This project is a fork of keras-team/keras

289 stars · 27 watchers · 65 forks · 12.83 MB

[DEPRECATED] Amazon Deep Learning's Keras with Apache MXNet support

Home Page: https://github.com/awslabs/keras-apache-mxnet/wiki

License: Other

Languages: Python 99.76%, Shell 0.13%, Dockerfile 0.07%, Makefile 0.04%
Topics: deep-learning, mxnet, keras, python, apache-mxnet, keras-mxnet, keras-tutorials, keras-neural-networks

keras-apache-mxnet's People

Contributors

abhaikollara, ahundt, carlthome, edersantana, farizrahman4u, fchollet, fuzzythecat, gabrieldemarmiesse, gvtulder, jfsantos, jihobak, kalyc, lukedeo, matsuyamax, maxpumperla, myutwo150, nzw0301, olegsinavski, ozabluda, phreeza, roywei, sandeep-krishnamurthy, staticskies, taehoonlee, tdhd, the-moliver, tleeuwenburg, wxs, yanboliang, yaringal


keras-apache-mxnet's Issues

installation process

It would be nice to have this as a single installation step (a version-check sketch follows the list):

  • If I already have MXNet: keep my version if it is recent enough.
  • If I don't have MXNet: install the latest MXNet version along with this package.
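
A minimal sketch of what such an install-time check might look like (the helper name and the minimum version 1.2.0 are illustrative assumptions, not part of this package):

from distutils.version import LooseVersion

MIN_MXNET_VERSION = '1.2.0'  # hypothetical floor, for illustration only

def mxnet_requirement():
    """Return a pip requirement string, or None to keep the user's existing MXNet."""
    try:
        import mxnet
    except ImportError:
        # No MXNet present: install the latest along with this package.
        return 'mxnet>={}'.format(MIN_MXNET_VERSION)
    if LooseVersion(mxnet.__version__) >= LooseVersion(MIN_MXNET_VERSION):
        return None  # existing version is high enough; keep it
    return 'mxnet>={}'.format(MIN_MXNET_VERSION)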

multi-gpu tutorial: batch size and optimization

When running the tutorial with the default settings, the batch size does not appear to be tuned for the number of GPUs (the learning rate should be adjusted as well), so the tutorial should probably discuss and set both; a scaling sketch follows the nvidia-smi output below. When I tried to tinker with these settings myself, I was not able to raise GPU utilization much.

ubuntu@ip-172-31-9-178:~$ watch -n0.1 nvidia-smi

Every 0.1s: nvidia-smi                                                                                       Fri Feb 16 23:00:41 2018

Fri Feb 16 23:00:41 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111                Driver Version: 384.111                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:1B.0 Off |                    0 |
| N/A   44C    P0    53W / 300W |    786MiB / 16152MiB |     15%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:00:1C.0 Off |                    0 |
| N/A   41C    P0    51W / 300W |    786MiB / 16152MiB |     17%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:00:1D.0 Off |                    0 |
| N/A   41C    P0    53W / 300W |    786MiB / 16152MiB |     18%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   43C    P0    55W / 300W |    786MiB / 16152MiB |     19%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      2774      C   python                                       776MiB |
|    1      2774      C   python                                       776MiB |
|    2      2774      C   python                                       776MiB |
|    3      2774      C   python                                       776MiB |
+-----------------------------------------------------------------------------+
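
A common heuristic, sketched below on the assumption that model, x_train and y_train come from the tutorial script, is to scale the global batch size and the learning rate linearly with the number of GPUs (the linear scaling rule is a rule of thumb, not something the tutorial prescribes):

import keras

num_gpus = 4                    # matches the 4 V100s shown above
base_batch_size = 32            # per-GPU batch size (illustrative)
base_lr = 1e-3                  # single-GPU learning rate (illustrative)

batch_size = base_batch_size * num_gpus   # larger global batch keeps each GPU busy
lr = base_lr * num_gpus                   # linear learning-rate scaling heuristic

model.compile(loss='categorical_crossentropy',
              optimizer=keras.optimizers.SGD(lr=lr),
              context=['gpu(%d)' % i for i in range(num_gpus)],
              metrics=['accuracy'])
model.fit(x_train, y_train, batch_size=batch_size, epochs=10)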

mxnet not using the GPU with mxnet-cu90 on Windows

I was using the latest Keras with the TensorFlow GPU backend. I installed MXNet as follows:

pip install keras-mxnet
pip install mxnet-cu90

I changed "backend": "mxnet" in the keras config file. In jupyter I see "Using MXNet backend", but when training only CPU is utilized and not the GPU.

Any advice?
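
Two things worth checking (a sketch, assuming a model object from your training script): whether MXNet itself can see the GPU, and whether the GPU context is passed at compile time, as the keras-mxnet examples elsewhere on this page do.

import mxnet as mx

# 1) Confirm MXNet can allocate on the GPU at all.
a = mx.nd.ones((2, 2), ctx=mx.gpu(0))
print(a.asnumpy())  # raises an MXNetError if mxnet-cu90 / CUDA is not set up correctly

# 2) Pass the GPU context explicitly when compiling with the MXNet backend.
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              context=['gpu(0)'],
              metrics=['accuracy'])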

multi-gpu tutorial: warnings at the start of training

/home/ubuntu/.local/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters

And six of these repeated:

/home/ubuntu/.local/lib/python3.6/site-packages/Keras-2.1.3-py3.6.egg/keras/backend/mxnet_backend.py:4194: SyntaxWarning: assertion is always true, perhaps remove parentheses?

After the model's layers are printed, it then gives these warnings:

[22:54:01] src/operator/././cudnn_algoreg-inl.h:107: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/module/bucketing_module.py:402: UserWarning: Optimizer created manually outside Module but rescale_grad is not normalized to 1.0/batch_size/num_workers (1.0 vs. 0.03125). Is this intended?
  force_init=force_init)

Input is getting overwritten after for loop mxnet_backend.py:ones_like

The following code:

mx_shape = tuple([0 if x is None else x for x in x.shape])

produces the following error under python2.7:

AttributeError: 'long' object has no attribute 'shape'

The problem is that the list comprehension's loop variable shadows the outer x: in Python 2 the loop variable leaks out of the comprehension, so x ends up rebound to one of its own dimension values (an integer), which has no .shape attribute.

A simple fix that worked for my case was to rename the comprehension variable to y:
mx_shape = tuple([0 if y is None else y for y in x.shape])

Skipped conv3d backend test

In tests/keras/backend/backend_test.py, a few cases of test_conv3d are skipped for the MXNet backend.
@roywei is working on a fix to enable these edge cases.

Fail to install keras with Mxnet

I am trying to install Keras with MXNet, however it fails (a diagnostic sketch follows the environment details below).

pip install git+https://github.com/awslabs/keras-apache-mxnet

  File "/tmp/RtmpIpKgN0/chunk-code-26bb3d563203.txt", line 1, in <module>
    import keras
  File "/home/gpu-server-1/.virtualenvs/r-tensorflow/local/lib/python2.7/site-packages/keras/__init__.py", line 3, in <module>
    from . import utils
  File "/home/gpu-server-1/.virtualenvs/r-tensorflow/local/lib/python2.7/site-packages/keras/utils/__init__.py", line 6, in <module>
    from . import conv_utils
  File "/home/gpu-server-1/.virtualenvs/r-tensorflow/local/lib/python2.7/site-packages/keras/utils/conv_utils.py", line 9, in <module>
    from .. import backend as K
  File "/home/gpu-server-1/.virtualenvs/r-tensorflow/local/lib/python2.7/site-packages/keras/backend/__init__.py", line 36, in <module>
    assert _backend in {'theano', 'tensorflow', 'cntk'}
AssertionError

My environment:

Ubuntu 16.04.4 LTS
python 2.7.12
cuda9.1
mxnet-cu91
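
One possible cause (an assumption, not confirmed in the issue): the stock keras package is still the one being imported, and its backend check only allows theano/tensorflow/cntk. A quick way to see which distributions are actually installed:

import pkg_resources

for pkg in ('keras', 'keras-mxnet', 'mxnet-cu91'):
    try:
        print(pkg, pkg_resources.get_distribution(pkg).version)
    except pkg_resources.DistributionNotFound:
        print(pkg, 'not installed')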

errors when installing keras-mxnet in python 2.7 mxnet conda env

Using a DLAMI Ubuntu v7 in the mxnet_p27 Conda environment, I get the following errors after running pip install --upgrade h5py numpy keras-mxnet:

mkl-random 1.0.1 requires cython, which is not installed.
mkl-fft 1.0.0 requires cython, which is not installed.
botocore 1.10.2 has requirement python-dateutil<2.7.0,>=2.1, but you'll have python-dateutil 2.7.2 which is incompatible.
mxnet-cu90mkl 1.1.0 has requirement numpy<=1.13.3, but you'll have numpy 1.14.3 which is incompatible.

The mxnet_p36 env works fine.

Training acc does not increase when using mxnet mkldnn

I was profiling performance on CPU and found that with the latest mxnet-mkl (mxnet-mkl-1.2.0b20180507, installed via pip install mxnet-mkl --pre), the accuracy does not go up. It works fine with mxnet-mkl 1.1.0 (pip install mxnet-mkl).
However, with a pure MXNet implementation, both packages work fine.

Instance type: C5.18xlarge, Ubuntu

Steps to reproduce:
Script from keras example:
https://github.com/awslabs/keras-apache-mxnet/blob/master/examples/cifar10_cnn.py

Using the latest mxnet-mkl from master:

pip install mxnet-mkl --pre
pip install keras-mxnet --pre
python cifar10_cnn.py

Built from source from release branch v1.2.0 with MKL:

  1. Check out branch v1.2.0
  2. Install MKL following option 2 of the guide: https://software.intel.com/en-us/articles/installing-and-building-mxnet-with-intel-mkl
  3. $ make -j $(nproc) USE_OPENCV=1 USE_MKLDNN=1 USE_BLAS=atlas

Training Logs:
using keras mxnet-mkl 1.2.0b

Using MXNet backend
/usr/local/lib/python2.7/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
x_train shape: (50000, 3, 32, 32)
50000 train samples
10000 test samples
Not using data augmentation.
Train on 50000 samples, validate on 10000 samples
Epoch 1/10
/usr/local/lib/python2.7/dist-packages/mxnet/module/bucketing_module.py:408: UserWarning: Optimizer created manually outside Module but rescale_grad is not normalized to 1.0/batch_size/num_workers (1.0 vs. 0.0078125). Is this intended?
  force_init=force_init)
[18:29:26] src/operator/nn/mkldnn/mkldnn_base.cc:60: Allocate 3456 bytes with malloc directly
[18:29:26] src/operator/nn/mkldnn/mkldnn_base.cc:60: Allocate 36864 bytes with malloc directly
[18:29:26] src/operator/nn/mkldnn/mkldnn_base.cc:60: Allocate 3211264 bytes with malloc directly
  128/50000 [..............................] - ETA: 24s - loss: 2.3026 - acc: 0.0859[18:29:26] src/operator/nn/mkldnn/mkldnn_base.cc:60: Allocate 14745600 bytes with malloc directly
50000/50000 [==============================] - 18s 359us/step - loss: 2.3029 - acc: 0.0993 - val_loss: 2.3562 - val_acc: 0.1000
Epoch 2/10
50000/50000 [==============================] - 16s 312us/step - loss: 2.3077 - acc: 0.0994 - val_loss: 2.3122 - val_acc: 0.1003
Epoch 3/10
50000/50000 [==============================] - 17s 330us/step - loss: 2.3029 - acc: 0.0998 - val_loss: 2.3026 - val_acc: 0.1000
Epoch 4/10
50000/50000 [==============================] - 15s 306us/step - loss: 2.3028 - acc: 0.0973 - val_loss: 2.3026 - val_acc: 0.1000
Epoch 5/10
50000/50000 [==============================] - 17s 332us/step - loss: 2.3027 - acc: 0.0989 - val_loss: 2.3026 - val_acc: 0.1000
Epoch 6/10
50000/50000 [==============================] - 16s 315us/step - loss: 2.3027 - acc: 0.0987 - val_loss: 2.3026 - val_acc: 0.1000
Epoch 7/10
50000/50000 [==============================] - 16s 321us/step - loss: 2.3027 - acc: 0.0967 - val_loss: 2.3026 - val_acc: 0.1000
Epoch 8/10
50000/50000 [==============================] - 16s 330us/step - loss: 2.3027 - acc: 0.0978 - val_loss: 2.3026 - val_acc: 0.1000
Epoch 9/10
50000/50000 [==============================] - 16s 324us/step - loss: 2.3027 - acc: 0.0982 - val_loss: 2.3026 - val_acc: 0.1000
Epoch 10/10
50000/50000 [==============================] - 16s 326us/step - loss: 2.3027 - acc: 0.0970 - val_loss: 2.3026 - val_acc: 0.1000
Saved trained model at /home/ubuntu/examples/saved_models/keras_cifar10_trained_model.h5 
10000/10000 [==============================] - 1s 122us/step
Test loss: 2.3025997383117676

using keras mxnet-mkl 1.1.0

Using MXNet backend
x_train shape: (50000, 3, 32, 32)
50000 train samples
10000 test samples
Not using data augmentation.
Train on 50000 samples, validate on 10000 samples
Epoch 1/10
/usr/local/lib/python2.7/dist-packages/mxnet/module/bucketing_module.py:403: UserWarning: Optimizer created manually outside Module but rescale_grad is not normalized to 1.0/batch_size/num_workers (1.0 vs. 0.0078125). Is this intended?
  force_init=force_init)
MKL Build:20171227
50000/50000 [==============================] - 21s 414us/step - loss: 1.9739 - acc: 0.2748 - val_loss: 1.6251 - val_acc: 0.4083
Epoch 2/10
50000/50000 [==============================] - 18s 364us/step - loss: 1.6319 - acc: 0.4060 - val_loss: 1.4385 - val_acc: 0.4752
Epoch 3/10
50000/50000 [==============================] - 18s 356us/step - loss: 1.4564 - acc: 0.4748 - val_loss: 1.3948 - val_acc: 0.5152
Epoch 4/10
50000/50000 [==============================] - 18s 368us/step - loss: 1.3202 - acc: 0.5272 - val_loss: 1.2079 - val_acc: 0.5705
Epoch 5/10
50000/50000 [==============================] - 20s 390us/step - loss: 1.2088 - acc: 0.5703 - val_loss: 1.1262 - val_acc: 0.6073
Epoch 6/10
50000/50000 [==============================] - 20s 406us/step - loss: 1.1309 - acc: 0.5998 - val_loss: 1.0989 - val_acc: 0.6047
Epoch 7/10
50000/50000 [==============================] - 18s 370us/step - loss: 1.0664 - acc: 0.6226 - val_loss: 1.0102 - val_acc: 0.6443
Epoch 8/10
50000/50000 [==============================] - 20s 393us/step - loss: 1.0096 - acc: 0.6437 - val_loss: 0.8982 - val_acc: 0.6853
Epoch 9/10
50000/50000 [==============================] - 20s 395us/step - loss: 0.9638 - acc: 0.6629 - val_loss: 0.9069 - val_acc: 0.6795
Epoch 10/10
50000/50000 [==============================] - 18s 368us/step - loss: 0.9268 - acc: 0.6768 - val_loss: 0.8839 - val_acc: 0.6861
Saved trained model at /home/ubuntu/examples/saved_models/keras_cifar10_trained_model.h5 
10000/10000 [==============================] - 2s 202us/step
Test loss: 0.883913392448
Test accuracy: 0.6861

using native mxnet-mkl 1.2.0b

INFO:root:Epoch[0] Batch [128]  Speed: 5276.20 samples/sec      accuracy=0.118883
INFO:root:Epoch[0] Batch [256]  Speed: 6193.33 samples/sec      accuracy=0.194824
INFO:root:Epoch[0] Batch [384]  Speed: 5896.97 samples/sec      accuracy=0.261047
INFO:root:Epoch[0] Train-accuracy=0.290365
INFO:root:Epoch[0] Time cost=8.733
INFO:root:Epoch[0] Validation-accuracy=0.345827
INFO:root:Epoch[1] Batch [128]  Speed: 6422.00 samples/sec      accuracy=0.314377
INFO:root:Epoch[1] Batch [256]  Speed: 6050.25 samples/sec      accuracy=0.357117
INFO:root:Epoch[1] Batch [384]  Speed: 6316.05 samples/sec      accuracy=0.381226
INFO:root:Epoch[1] Train-accuracy=0.395833
INFO:root:Epoch[1] Time cost=8.000
INFO:root:Epoch[1] Validation-accuracy=0.435225
INFO:root:Epoch[2] Batch [128]  Speed: 6226.67 samples/sec      accuracy=0.408127
INFO:root:Epoch[2] Batch [256]  Speed: 5576.82 samples/sec      accuracy=0.429443
INFO:root:Epoch[2] Batch [384]  Speed: 5928.68 samples/sec      accuracy=0.447205
INFO:root:Epoch[2] Train-accuracy=0.451823
INFO:root:Epoch[2] Time cost=8.462
INFO:root:Epoch[2] Validation-accuracy=0.498418
INFO:root:Epoch[3] Batch [128]  Speed: 6598.50 samples/sec      accuracy=0.475896
INFO:root:Epoch[3] Batch [256]  Speed: 6662.20 samples/sec      accuracy=0.498230
INFO:root:Epoch[3] Batch [384]  Speed: 7016.46 samples/sec      accuracy=0.502808
INFO:root:Epoch[3] Train-accuracy=0.526042
INFO:root:Epoch[3] Time cost=7.391
INFO:root:Epoch[3] Validation-accuracy=0.561511
INFO:root:Epoch[4] Batch [128]  Speed: 6485.46 samples/sec      accuracy=0.527132
INFO:root:Epoch[4] Batch [256]  Speed: 6002.80 samples/sec      accuracy=0.541931
INFO:root:Epoch[4] Batch [384]  Speed: 6302.73 samples/sec      accuracy=0.544739
INFO:root:Epoch[4] Train-accuracy=0.540365
INFO:root:Epoch[4] Time cost=7.992
INFO:root:Epoch[4] Validation-accuracy=0.571796
INFO:root:Epoch[5] Batch [128]  Speed: 7212.56 samples/sec      accuracy=0.560441
INFO:root:Epoch[5] Batch [256]  Speed: 7343.56 samples/sec      accuracy=0.574829
INFO:root:Epoch[5] Batch [384]  Speed: 7326.81 samples/sec      accuracy=0.580383
INFO:root:Epoch[5] Train-accuracy=0.591146
INFO:root:Epoch[5] Time cost=6.850
INFO:root:Epoch[5] Validation-accuracy=0.627670
INFO:root:Epoch[6] Batch [128]  Speed: 7072.83 samples/sec      accuracy=0.591206
INFO:root:Epoch[6] Batch [256]  Speed: 6632.22 samples/sec      accuracy=0.600403
INFO:root:Epoch[6] Batch [384]  Speed: 6356.41 samples/sec      accuracy=0.606262
INFO:root:Epoch[6] Train-accuracy=0.579427
INFO:root:Epoch[6] Time cost=7.475
INFO:root:Epoch[6] Validation-accuracy=0.640032
INFO:root:Epoch[7] Batch [128]  Speed: 6647.45 samples/sec      accuracy=0.613069
INFO:root:Epoch[7] Batch [256]  Speed: 7264.74 samples/sec      accuracy=0.624451
INFO:root:Epoch[7] Batch [384]  Speed: 7258.86 samples/sec      accuracy=0.626160
INFO:root:Epoch[7] Train-accuracy=0.608073
INFO:root:Epoch[7] Time cost=7.089
INFO:root:Epoch[7] Validation-accuracy=0.652690
INFO:root:Epoch[8] Batch [128]  Speed: 6471.52 samples/sec      accuracy=0.632510
INFO:root:Epoch[8] Batch [256]  Speed: 7245.21 samples/sec      accuracy=0.638000
INFO:root:Epoch[8] Batch [384]  Speed: 7245.78 samples/sec      accuracy=0.638794
INFO:root:Epoch[8] Train-accuracy=0.643229
INFO:root:Epoch[8] Time cost=7.166
INFO:root:Epoch[8] Validation-accuracy=0.674644
INFO:root:Epoch[9] Batch [128]  Speed: 6291.10 samples/sec      accuracy=0.652374
INFO:root:Epoch[9] Batch [256]  Speed: 6623.54 samples/sec      accuracy=0.654236
INFO:root:Epoch[9] Batch [384]  Speed: 6954.76 samples/sec      accuracy=0.651367
INFO:root:Epoch[9] Train-accuracy=0.664062
INFO:root:Epoch[9] Time cost=7.549
INFO:root:Epoch[9] Validation-accuracy=0.681171

Broken Unit Tests and Integration Tests

The following unit tests and integration tests are broken with the MXNet backend. These need to be fixed before completion of phase 1.

  1. tests/integration_tests/test_image_data_tasks => avg_pooling
  2. tests/integration_tests/test_temporal_data_tasks => categorical_crossentropy, embedding layer.
  3. tests/keras/applications/imagenet_utils_test (test_preprocess_input_symbolic) => getitem of KerasSymbol is broken.
  4. tests/keras/applications/applications_test => InceptionV3
  5. tests/keras/applications/applications_test => DenseNet
  6. tests/test_topology/test_recursion_with_bn_and_loss => MXNet batchnorm should perform updates.
  7. tests/test_topology/test_shared_layer_depth_is_correct => MXNet does not fully support the Embedding layer.
  8. tests/test_topology => the MXNet Model.predict() API has issues; predict() cannot be called without calling compile().
  9. tests/test_training => MXNet does not support compiling multi-input networks.
  10. tests/layers/convolutional_test => MXNet does not support pooling with "SAME" mode.
  11. tests/layers/convolutional_test => MXNet does not support Cropping.
  12. tests/layers/embeddings_test => MXNet does not support Embedding layers.
  13. tests/layers/noise_test => MXNet does not support GaussianNoise, GaussianDropout, AlphaDropout.
  14. tests/layers/normalization => MXNet uses the native batchnorm operator.
  15. tests/layers/wrappers => MXNet does not support TimeDistributed.
  16. tests/wrappers/scikit_learn_test => MXNet does not support linear regression.
  17. tests/model_saving => MXNet does not support multi-metrics output.
  18. tests/model_saving => MXNet tolerance on batch_retrain is 1e-2 rather than 1e-5.
  19. tests/optimizer => the MXNet backend does not support NAdam, Adam_Amsgrad, Adamax.
  20. tests/optimizer => MXNet backend accuracy on train_on_batch does not meet the required tolerance.
  21. tests/test_model_saving => test_loading_weights_by_name_and_reshape

Performance Improvement - Keras with MXNet backend

  1. Avoid the transpose operation.
    Keras passes the conv kernel in channels_last format.
    We can avoid transposing the kernel by making a change in the Keras conv layer, something like:

        if self.data_format == 'channels_first':
            kernel_shape = (self.filters, input_dim) + self.kernel_size
        else:
            kernel_shape = self.kernel_size + (input_dim, self.filters)

Not supported keras/examples with MXNet backend

Checked examples have been tested and work with the MXNet backend.
Examples marked not supported raise a clear error message specifying the exact functionality MXNet does not support yet.

  • addition_rnn.py
  • antirectifier.py
  • babi_memnn.py
  • babi_rnn.py
  • cifar10_cnn.py
  • cifar10_cnn_capsule.py [Custom CNN infer shape Not supported]
  • cifar10_cnn_tfaugment2d.py [TF specific Not supported]
  • cifar10_resnet.py
  • conv_filter_visualization.py [Gradient Not supported]
  • conv_lstm.py [Conv2DLSTM Not supported]
  • deep_dream.py [Gradient Not supported]
  • image_ocr.py [CTC Not supported]
  • imdb_bidirectional_lstm.py
  • imdb_cnn.py
  • imdb_cnn_lstm.py
  • imdb_fasttext.py
  • imdb_lstm.py
  • lstm_seq2seq.py
  • lstm_seq2seq_restore.py
  • lstm_stateful.py
  • lstm_text_generation.py
  • mnist_acgan.py
  • mnist_cnn.py
  • mnist_dataset_api.py [TF Specific Not supported]
  • mnist_denoising_autoencoder.py
  • mnist_hierarchical_rnn.py
  • mnist_irnn.py
  • mnist_mlp.py
  • mnist_net2net.py
  • mnist_siamese.py
  • mnist_sklearn_wrapper.py
  • mnist_swwae.py [Gradient Not supported]
  • mnist_tfrecord.py [TF Specific Not supported]
  • mnist_transfer_cnn.py
  • neural_doodle.py [Gradient Not supported]
  • neural_style_transfer.py [Gradient Not supported]
  • pretrained_word_embeddings.py
  • reuters_mlp.py
  • reuters_mlp_relu_vs_selu.py
  • variational_autoencoder.py [Custom Loss Not supported]
  • variational_autoencoder_deconv.py [Custom Loss Not supported]

MXNet backend uses native batch_norm operator

The MXNet backend uses the MXNet batchnorm operator directly, without going through Keras' batch normalization layer.
Reason: MXNet does not support

  • the moving_average_update() API
  • batch_update
  • add_update

Changes made on Keras code

Using this issue to track all the changes made to Keras code, to avoid confusion for Keras developers and in future work.

  1. Override the Keras model with an MXNet model.
  2. Provide the kernel shape in Conv layers in both channels_last and channels_first format. This needs to be implemented for every new Conv layer, in build() and compute_output_shape().
  3. Change multi_gpu_model to pass the MXNet context.
  4. Batch norm.
  5. Embedding layers.

Unable to use Linear Regression with MXNet backend

tests/keras/wrappers/scikit_learn_test.py is broken for the linear-regression use cases.

The following code should work (a wrapper-level usage sketch follows the snippet).

from keras.models import Sequential
from keras.layers import Dense, Activation

input_dim, hidden_dims = 5, 10  # example dimensions

model = Sequential()
model.add(Dense(input_dim, input_shape=(input_dim,)))
model.add(Activation('relu'))
model.add(Dense(hidden_dims))
model.add(Activation('relu'))
model.add(Dense(1))
model.add(Activation('linear'))
model.compile(optimizer='sgd', loss='mean_absolute_error',
              metrics=['accuracy'])
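
For context, the scikit-learn wrapper tests drive such a model through KerasRegressor; a minimal usage sketch under the same assumptions (the dimensions and random data are illustrative only):

import numpy as np
from keras.wrappers.scikit_learn import KerasRegressor

def build_model(input_dim=5, hidden_dims=10):
    from keras.models import Sequential
    from keras.layers import Dense, Activation
    model = Sequential()
    model.add(Dense(input_dim, input_shape=(input_dim,)))
    model.add(Activation('relu'))
    model.add(Dense(hidden_dims))
    model.add(Activation('relu'))
    model.add(Dense(1))
    model.add(Activation('linear'))
    model.compile(optimizer='sgd', loss='mean_absolute_error', metrics=['accuracy'])
    return model

x = np.random.rand(100, 5).astype('float32')   # illustrative regression inputs
y = np.random.rand(100).astype('float32')      # illustrative targets
reg = KerasRegressor(build_fn=build_model, epochs=2, batch_size=16, verbose=0)
reg.fit(x, y)
print(reg.predict(x[:3]))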

[Feature Requests] Keras2 with MXNet backend

  • Allow users to choose the bucket in the save_mxnet_model() API, e.g. save the 'train' bucket symbol rather than always saving the 'pred' bucket symbol. This lets users save checkpoints and continue retraining. Contact @dmadeka for more details.
  • A keras.models.get_mxnet_model(model) API that returns the symbols and params rather than storing them on disk, so MXNet can be used for inference in the same session.
  • MXNet provides an easy-to-use interface for running large-scale model training jobs across multiple machines - http://mxnet.incubator.apache.org/faq/multi_devices.html?highlight=distributed%20training. We should provide such functionality for Keras users. We can start with the MXNet backend and extend it to other backends.

Does not utilize all the GPUs on a multi-gpu machine

Hello,

I have noticed that while running keras/examples/cifar10_cnn.py (uses the Sequential API) and keras/examples/lstm_text_generation.py (uses the Sequential API) on a multi-GPU machine such as an Amazon AWS p3.8xlarge instance, the code does not utilize all the GPUs; it uses only a single GPU.

I used the keras-apache-mxnet/benchmark template scripts and modified cifar10_cnn.py and lstm_text_generation.py based on that template.

Command:

sh run_mxnet_backend.sh 4_gpu_config

Here, the test only uses 1 GPU instead of 4 GPUs.

However, when I ran the benchmark_resnet.py (uses the functional API) test with the 4_gpu_config option, it utilizes all the GPUs.

I think the problem is with setting the model context when the test uses the Keras Sequential API.

More information:

(Pdb) self
<keras.models.Sequential object at 0x7f3002ad9630>

(Pdb) self._context
[gpu(0), gpu(1), gpu(2), gpu(3)]

(Pdb) self.model
<keras.backend.mxnet_backend.get_model.<locals>.Model object at 0x7f2ffb4ec9b0>

(Pdb) self.model._context
[gpu(0)]

Thank-You.

[Checklist] Phase 1 - Support CNN on Keras with MXNet

This issue is to track all the pending tasks to support CNN on Keras with MXNet backend.
(Note this is a running list)

  • bias_add
  • batch_norm
  • batch_dot
  • Conv2D
  • Conv2D_Transpose
  • Pooling2D
  • Able to create and stack layers sequentially
  • Able to compile the model
  • Able to fit
  • Supports CPU
  • Supports one GPU
  • Supports multi-GPU
  • Able to train Resnet50 on CIFAR10
  • SGD Optimizer
  • RMSProp Optimizer
  • Adam Optimizer
  • AdaDelta Optimizer
  • Able to train VGGNet on CIFAR10
  • Able to train InceptionV3 on CIFAR10
  • Able to save the model
  • Able to load the saved model (pre-trained with MXNet)
  • Able to run inference from pre-trained model
  • Keras tests are passing
  • Keras lint checks are passing
  • Able to load the saved model (from other backend?)

Better messaging for handling channels_first and channels_last

  • There is a need to show code samples on how to do this for your own dataset, not just the standard built-in ones, for the to_channels_first API (see the sketch after this list).

  • I'd still like the CLI output for the examples we highlight to show the number of GPUs, the current channel-order config, and the images/sec stats. Could be something to bake in as well.
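
A sketch of the kind of sample that would help, using a plain NumPy transpose on a hypothetical user dataset (to_channels_first mentioned above is the API the docs refer to; the transpose below shows the equivalent operation for a 4-D image batch):

import numpy as np

# x: your own images in channels_last layout, (num_samples, height, width, channels)
x = np.random.rand(8, 28, 28, 1).astype('float32')

# Rearrange to channels_first, (num_samples, channels, height, width),
# which is the layout the MXNet backend prefers.
x_cf = np.transpose(x, (0, 3, 1, 2))
print(x_cf.shape)  # (8, 1, 28, 28)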

Operators to be implemented/fixed in MXNet Backend

Below is the checklist of operators to be implemented or revisited for fixing issues in MXNet backend for Keras:

  • Batchnorm
  • BatchDot
  • Binary CrossEntropy
  • Categorical CrossEntropy
  • logsumexp
  • Sparse operators
  • bias_add
  • temporal_padding - MXNet does not support 3D tensor padding
  • conv1d/2d/3d
  • rnn

Loading CNN trained with TF backend produces redefinition error

Please make sure that the boxes below are checked before you submit your issue. If your issue is an implementation question, please ask your question on StackOverflow or join the Keras Slack channel and ask there instead of filing a GitHub issue.

Thank you!

  • Check that you are up-to-date with the master branch of Keras. You can update with:
    pip install git+git://github.com/keras-team/keras.git --upgrade --no-deps

  • If running on TensorFlow, check that you are up-to-date with the latest version. The installation instructions can be found here.

  • If running on Theano, check that you are up-to-date with the master branch of Theano. You can update with:
    pip install git+git://github.com/Theano/Theano.git --upgrade --no-deps

  • Provide a link to a GitHub Gist of a Python script that can reproduce your issue (or just copy the script here if it is short).

Trying to load a pretrained ResNet50, I get:

AssertionError: Redefinition of variable conv1/kernel1

Cannot slice axis on tensor with MXNet backend

Minimum reproducible code:

import numpy as np

import keras
from keras import backend as K

# In NumPy
data = np.array([[[1, 2, 3], [4, 5, 6]]])
print(data.shape)            # (1, 2, 3)
print(data[:, -1, :].shape)  # (1, 3)

# In Keras with the MXNet backend
var1 = K.variable(data)
print(var1.shape)            # (1, 2, 3)
K.eval(var1[:, -1, :])       # <<ERROR>>

Check failed: b < e (1 vs. 0) slicing with begin=[1]=1, end[1]=0, and step[1]=1 is invalid

Reason:

MXNet does not support slicing this axis with the mx.sym.slice operator.
In KerasSymbol's __getitem__ method, we need to special-case this kind of slicing and use mx.sym.slice_axis() (see the sketch below).
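
A sketch of the suggested workaround on the NDArray side; the symbolic fix inside __getitem__ would follow the same pattern with mx.sym.slice_axis:

import numpy as np
import mxnet as mx

data = mx.nd.array(np.array([[[1, 2, 3], [4, 5, 6]]]))     # shape (1, 2, 3)
last = mx.nd.slice_axis(data, axis=1, begin=-1, end=None)  # shape (1, 1, 3)
print(last.reshape((1, 3)).asnumpy())                      # [[4. 5. 6.]]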

Can't converge when switching to MXnet backend

My Keras-TensorFlow training pipeline works fine, but when switching to keras-mxnet I get the warning:
Optimizer created manually outside Module but rescale_grad is not normalized to 1.0/batch_size/num_workers (1.0 vs. 0.125). Is this intended? force_init=force_init).
Could this be the reason my network can't converge? And what are the possible mistakes that produce the rescale_grad warning?
I'm not using multi-GPU.

multi-gpu tutorial: model parameter (n and version)

The comments say that if we're running ResNet110 - which I think we are - then n should be set to 18 or 12, depending on which version we're running. We're still running n=3, which is ResNet20 on version 1 (the example's depth formula is recapped below). Also, should we be running version 2 instead of 1?
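
For reference, the depth formula from the cifar10_resnet example's comments, so the relationship between n, version and depth is explicit:

# depth = n * 6 + 2 for version 1, depth = n * 9 + 2 for version 2
n, version = 18, 1                                # ResNet110 v1 (use n=12, version=2 for v2)
depth = n * 6 + 2 if version == 1 else n * 9 + 2
print(depth)                                      # 110; with n=3, version=1 this is ResNet20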

save mxnet model not working for channels last format and on GPU

Following this tutorial for saving MXNet native models works fine if data_format is channels_first in ~/.keras/keras.json:
https://github.com/awslabs/keras-apache-mxnet/blob/master/docs/mxnet_backend/save_mxnet_model.md

If I change to channels_last,

data_names, data_shapes = save_mxnet_model(model=model, prefix='mnist_cnn', epoch=0)
print(data_names)
print(data_shapes)

gives the following output:

['/conv2d_1_input1'] 
[DataDesc[/conv2d_1_input1,(128L, 28L, 28L, 1L),float32,NCHW]]

In DataDesc, the layout is still NCHW (channels_first).
This leads to an error when loading back into MXNet code (note: the data shape was changed to channels_last):

import numpy as np
import mxnet as mx

# Step1: Load the model in MXNet

# Use the same prefix and epoch parameters we used in save_mxnet_model API.
sym, arg_params, aux_params = mx.model.load_checkpoint(prefix='mnist_cnn', epoch=0)

# We use the data_names and data_shapes returned by save_mxnet_model API.
mod = mx.mod.Module(symbol=sym, 
                    data_names=['/conv2d_1_input1'], 
                    context=mx.cpu(), 
                    label_names=None)
mod.bind(for_training=False, 
         data_shapes=[('/conv2d_1_input1', (1, 28, 28, 1))], 
         label_shapes=mod._label_shapes)

Error Message:

 from ._conv import register_converters as _register_converters
infer_shape error. Arguments:
  /conv2d_1_input1: (1, 1L, 28L, 28L)
Traceback (most recent call last):
  File "test_mnist.py", line 27, in <module>
    result = mod.predict(data_iter)
  File "/usr/local/lib/python2.7/dist-packages/mxnet/module/base_module.py", line 371, in predict
    self.forward(eval_batch, is_train=False)
  File "/usr/local/lib/python2.7/dist-packages/mxnet/module/module.py", line 610, in forward
    self.reshape(new_dshape, new_lshape)
  File "/usr/local/lib/python2.7/dist-packages/mxnet/module/module.py", line 471, in reshape
    self._exec_group.reshape(self._data_shapes, self._label_shapes)
  File "/usr/local/lib/python2.7/dist-packages/mxnet/module/executor_group.py", line 382, in reshape
    self.bind_exec(data_shapes, label_shapes, reshape=True)
  File "/usr/local/lib/python2.7/dist-packages/mxnet/module/executor_group.py", line 358, in bind_exec
    allow_up_sizing=True, **dict(data_shapes_i + label_shapes_i))
  File "/usr/local/lib/python2.7/dist-packages/mxnet/executor.py", line 402, in reshape
    arg_shapes, _, aux_shapes = self._symbol.infer_shape(**kwargs)
  File "/usr/local/lib/python2.7/dist-packages/mxnet/symbol/symbol.py", line 990, in infer_shape
    res = self._infer_shape_impl(False, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/mxnet/symbol/symbol.py", line 1120, in _infer_shape_impl
    ctypes.byref(complete)))
  File "/usr/local/lib/python2.7/dist-packages/mxnet/base.py", line 149, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: Error in operator conv2d_1/conv2d2: Shape inconsistent, Provided = [32,1,3,3], inferred shape=(32,28,3,3)

Operators missing for MXNet backend

Variables and Placeholders

  • Support for constraints in Keras variables and Placeholders

Update Operators

  • update
  • update_add
  • update_sub

Graph Manipulations

  • gradients

Layers

  • Embedding
  • noise.GaussianNoise
  • noise.GaussianDropout
  • noise.AlphaDropout
  • ConvLSTM2D

RNNs

  • rnn

CNNs

  • conv1d
  • conv2d_transpose
  • separable_conv2d
  • depthwise_conv2d
  • conv3d
  • conv3d_transpose
  • local_conv1d
  • local_conv2d
  • pooling with SAME mode
  • conv1d with CAUSAL mode
  • separable_conv1D

Higher Order Functions

  • map_fn
  • foldl
  • foldr

Sparse Tensors

  • Sparse tensors are supported
  • sparse sum
  • sparse mean
  • sparse concat
  • sparse dot
  • sparse embedding

NN Operators

  • sparse_categorical_crossentropy

Optimizers

  • Adam_AMSGrad

Others

  • truncated_normal
  • cumsum
  • cumprod
  • logsumexp
  • stack
  • slice
  • ctc
  • module - #37
  • gather operator does not work with Embedding Layer - https://github.com/awslabs/keras-apache-mxnet/issues/6300000000000
  • Pool2D with SAME mode.
  • Depthwise and Separable Conv2D with multiplier != 1 and stride != 1
  • Partial Loss is not supported. See here for more details - pytest tests/keras/engine/test_training.py -k "test_model_with_partial_loss"
  • External Loss is not supported. See here for more details - pytest tests/keras/engine/test_training.py -k "test_model_with_external_loss"
  • Does not support clone model

Dropout not working for LSTM layer

The LSTM layer does not work if dropout parameters are specified:

model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2, input_shape=input_shape))

Removing the dropout parameters, i.e.

model.add(LSTM(128, input_shape=input_shape))

works.

Move skip tests list to a nosetest config file

Currently, for the MXNet backend, we skip around 190 of ~650 tests. We skip tests with the @skip_if decorator, which requires changes in the code, and pulling the latest from Keras and merging into this repo then creates merge conflicts. It would be very useful if we could remove the MXNet-specific skip-test code from the test files and put it in a config, or have a test runner that manages the list of tests to run or skip (a sketch follows below).

The closest reference is the ONNX test runner - https://github.com/onnx/onnx/blob/master/onnx/backend/test/runner/__init__.py
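
A rough sketch of what a config-driven runner could look like (the file name skip_tests_mxnet.txt and the use of pytest's --deselect flag are assumptions, not an existing part of this repo):

import subprocess

# skip_tests_mxnet.txt: one pytest node id per line, e.g.
# tests/keras/layers/embeddings_test.py::test_embedding
with open('skip_tests_mxnet.txt') as f:
    skips = [line.strip() for line in f if line.strip() and not line.startswith('#')]

cmd = ['pytest', 'tests/'] + ['--deselect=%s' % node_id for node_id in skips]
raise SystemExit(subprocess.call(cmd))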

load_weights() is not loading weights if a pre-trained model is used

I am using keras-mxnet 2.1.6.1.

I trained a model on CIFAR-10 data using DenseNet121 (code below). There is no issue in training. However, if I load the weights to continue training or call predict, the weights do not appear to be loaded: training starts again from the same loss/accuracy as the first run, and predict results are all NaN. (The reload path is sketched after the training script below.)

from __future__ import print_function
import keras
from keras.applications.densenet import DenseNet121
from keras.layers.pooling import GlobalAveragePooling2D
from keras.layers.core import Dense
from keras.layers import Input
from keras.models import Model
from keras.regularizers import *


def get_model():
        aliases = {}
        Input_1 = Input(shape=(3, 221, 221), name='Input_1')
        DenseNet121_1_model = DenseNet121(include_top= False, input_tensor = Input_1)
        DenseNet121_1 = DenseNet121_1_model(Input_1)
        aliases['DenseNet121_1'] = DenseNet121_1_model.name
        num_layers = len(DenseNet121_1_model.layers)
        for i, layer in enumerate(DenseNet121_1_model.layers):
                if ((i * 100) / (num_layers - 1)) <= (100 - 10):
                        layer.trainable = False
        GlobalAveragePooling2D_1 = GlobalAveragePooling2D(name='GlobalAveragePooling2D_1')(DenseNet121_1)
        Dense_1 = Dense(name='Dense_1',units= 10,activation= 'softmax' )(GlobalAveragePooling2D_1)

        model = Model([Input_1],[Dense_1])
        return model


from keras.datasets import cifar10
from keras.preprocessing.image import ImageDataGenerator
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.layers import Conv2D, MaxPooling2D
import os
from skimage.transform import resize
import numpy as np

batch_size = 16
num_classes = 10
epochs = 2

# The data, split between train and test sets:
(x_train1, y_train1), (x_test1, y_test1) = cifar10.load_data()

y_train = y_train1[:x_train1.shape[0]//5]
y_test = y_test1[:x_test1.shape[0]//5]  

x_train = np.ndarray((x_train1.shape[0]//5, 3,221,221), dtype=np.float32)
x_test = np.ndarray((x_test1.shape[0]//5, 3,221,221), dtype=np.float32)

for i in range(x_train.shape[0]):
    x_train[i] = resize(x_train1[i], (3,221,221), anti_aliasing=True)
   
for i in range(x_test.shape[0]):
    x_test[i] = resize(x_test1[i], (3,221,221), anti_aliasing=True)


# Convert class vectors to binary class matrices.
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

# initiate RMSprop optimizer
opt = keras.optimizers.rmsprop(lr=0.0001, decay=1e-6)

model=get_model()
#model = keras.models.load_model("cifar.h5")
# Let's train the model using RMSprop
model.compile(loss='categorical_crossentropy',
              optimizer=opt, context=["gpu(0)"],
              metrics=['accuracy'])

x_train /= 255
x_test /= 255

model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          validation_data=(x_test, y_test),
          shuffle=True)
model.save("cifar.h5")

Conv2d_transpose and conv3d_transpose test cases failing

A change in the constant() operator causes failures in the conv2d_transpose() and conv3d_transpose() test cases with the following error:

    def _sync_weights(self):
        if self._weights_dirty:
            args, auxs = self._module.get_params()
            for name in self._arg_names:
>               self._args[name][:] = args[name]
E               KeyError: 'conv2d_transpose_2/kernel1'

K.gather not working for Embedding Layer using MXNet backend

The Keras Embedding layer uses the K.gather operator (the MXNet backend implements it with mx.sym.take; the TensorFlow backend uses tf.gather). However, K.gather raises an error with the MXNet backend, and the issue occurs only in the Embedding layer use case, not in other use cases.
Test case:
tests/keras/layers/embeddings_test.py
Error message:
Error in operator broadcast_mul0: [13:05:25] src/operator/tensor/./elemwise_binary_broadcast_op.h:67: Check failed: l == 1 || r == 1 operands could not be broadcast together with shapes [3,2] [3]
This PR fixes it by using mx.sym.Embedding directly instead of K.gather to implement the Keras Embedding layer. Note that this fix diverges from the original Keras code and will conflict in future merges.
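
For reference, a minimal NDArray illustration of the gather/take semantics involved (the shapes are illustrative only; this is not the failing Keras test itself):

import mxnet as mx

table = mx.nd.array([[0.0, 0.1], [1.0, 1.1], [2.0, 2.1]])  # (3, 2) embedding table
idx = mx.nd.array([2, 0])                                   # indices to gather
print(mx.nd.take(table, idx).asnumpy())                     # rows 2 and 0 -> shape (2, 2)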

Keras-MXNet CNN for training on CPU is relatively slower

Hi,

I am running keras/examples/mnist_cnn.py and keras/examples/cifar10_cnn.py, and the training time per epoch on Keras with the MXNet backend is higher than on Keras with the TensorFlow backend when run on CPU.

The image data is already channels_first when using the MXNet backend and channels_last when using the TensorFlow backend, which means there is no transpose overhead.

Below are the pieces of information:

Machine: MacBook Pro(2.5GHz intel core i7 and 16 GB 2133 MHz RAM)

Python version: 2.7.14

MXNet version: 1.1.0

Tensorflow version: 1.5.0

Keras version: 2.1.4

Results:

Backend                                  mnist_cnn.py     cifar10_cnn.py
Keras + MXNet training performance       272 sec/epoch    388 sec/epoch
Keras + TensorFlow training performance  150 sec/epoch    239 sec/epoch

Thank-You!

Keras with MxNet running only on CPU

@sandeep-krishnamurthy Great work! I used MXNet as the backend for Keras 1.2 in your previous repository, https://github.com/dmlc/keras, and I really loved it! Congratulations!

I came across your new forked repository for Keras 2.x, but I am having trouble using the GPU. More specifically, I followed the instructions from https://github.com/deep-learning-tools/keras/wiki/Installation-Guide---Keras-with-MXNet-backend and installed mxnet-cu80 since I have CUDA 8. MXNet loads and works fine, but when I run a Keras example, https://github.com/deep-learning-tools/keras/blob/master/examples/cifar10_cnn.py, the experiment runs on the CPU and not the GPU. What am I doing wrong? Thanks
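
One thing to verify (grounded in how other issues on this page compile their models; whether it applies to your setup is an assumption): with the MXNet backend the GPU context can be passed explicitly at compile time, otherwise training may default to the CPU. Assuming the model object from cifar10_cnn.py:

model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              context=['gpu(0)'],   # or a list of several GPUs
              metrics=['accuracy'])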
