
gradient-checkpointing's Introduction

Status: Maintenance (expect bug fixes and minor updates)

Saving memory using gradient-checkpointing

Training very deep neural networks requires a lot of memory. Using the tools in this package, developed jointly by Tim Salimans and Yaroslav Bulatov, you can trade some of this memory usage for extra computation to make your model fit into memory more easily. For feed-forward models we were able to fit more than 10x larger models onto our GPU, at only a 20% increase in computation time.

The memory intensive part of training deep neural networks is computing the gradient of the loss by backpropagation. By checkpointing nodes in the computation graph defined by your model, and recomputing the parts of the graph in between those nodes during backpropagation, it is possible to calculate this gradient at reduced memory cost. When training deep feed-forward neural networks consisting of n layers, we can reduce the memory consumption to O(sqrt(n)) in this way, at the cost of performing one additional forward pass (see e.g. Training Deep Nets with Sublinear Memory Cost, by Chen et al. (2016)). This repository provides an implementation of this functionality in Tensorflow, using the Tensorflow graph editor to automatically rewrite the computation graph of the backward pass.

Memory used while training a ResNet model with large batch size, using the regular tf.gradients function and using our memory-optimized gradient implementation

How it works

For a simple feed-forward neural network with n layers, the computation graph for obtaining gradients looks as follows:

The activations of the neural network layers correspond to the nodes marked with an f. During the forward pass all these nodes are evaluated in order. The gradient of the loss with respect to the activations and parameters of these layers is indicated by the nodes marked with b. During the backward pass, all these nodes are evaluated in the reversed order. The results obtained for the f nodes are needed to compute the b nodes, and hence all f nodes are kept in memory after the forward pass. Only when backpropagation has progressed far enough to have computed all dependencies, or children, of an f node, can it be erased from memory. This means that the memory required by simple backprop grows linearly with the number of neural net layers n. Below we show the order in which these nodes are computed. The purple shaded circles indicate which of the nodes need to be held in memory at any given time.

Graph 1. Vanilla backprop

Simple backpropagation as described above is optimal in terms of computation: it only computes each node once. However, if we are willing to recompute nodes we can potentially save a lot of memory. We might for instance simply recompute every node from the forward pass each time we need it. The order of execution, and the memory used, then look as follows:

Graph 2. Memory poor backprop

Using this strategy, the memory required to compute gradients in our graph is constant in the number of neural network layers n, which is optimal in terms of memory. However, note that the number of node evaluations now scales with n^2, whereas it previously scaled as n: Each of the n nodes is recomputed on the order of n times. The computation graph thus becomes much slower to evaluate for deep networks, which makes this method impractical for use in deep learning.
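The quadratic cost can be made concrete with a small counting sketch. It is a toy model, not code from this package: it assumes a plain chain of n layers in which computing b_i requires f_i, and the function names are hypothetical.

```python
def forward_evaluations_vanilla(n):
    """Vanilla backprop: every f node is evaluated once and kept in memory."""
    return n


def forward_evaluations_memory_poor(n):
    """Memory-poor backprop: nothing is stored between b-node evaluations.

    Toy assumption: computing b_i needs f_i, so f_1..f_i are recomputed
    from the network input each time, which costs i forward evaluations.
    """
    evaluations = 0
    for i in range(n, 0, -1):  # the backward pass visits b_n, ..., b_1
        evaluations += i       # recompute f_1, ..., f_i from scratch
    return evaluations


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, forward_evaluations_vanilla(n), forward_evaluations_memory_poor(n))
```

Under these assumptions the memory-poor count is n(n+1)/2, which is where the n^2 scaling comes from.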

To strike a balance between memory and computation we need to come up with a strategy that allows nodes to be recomputed, but not too often. The strategy we use here is to mark a subset of the neural net activations as checkpoint nodes.

Our chosen checkpoint node

These checkpoint nodes are kept in memory after the forward pass, while the remaining nodes are recomputed at most once. After being recomputed, the non-checkpoint nodes are kept in memory until they are no longer required. For the case of a simple feed-forward neural net, all neuron activation nodes are graph separators or articulation points of the graph defined by the forward pass. This means that we only need to recompute the nodes between a b node and the last checkpoint preceding it when computing that b node during backprop. When backprop has progressed far enough to reach the checkpoint node, all nodes that were recomputed from it can be erased from memory. The resulting order of computation and memory usage then look as follows:

Graph 3. Checkpointed backprop

For the simple feed-forward network in our example, the optimal choice is to mark every sqrt(n)-th node as a checkpoint. This way, both the number of checkpoint nodes and the number of nodes in between checkpoints are on the order of sqrt(n), which means that the required memory now also scales with the square root of the number of layers in our network. Since every node is recomputed at most once, the additional computation required by this strategy is equivalent to a single forward pass through the network.
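The sqrt(n) trade-off can be checked with a rough cost model. This is only a sketch under simplifying assumptions: the network is a plain chain of n activations, every ceil(sqrt(n))-th activation is a checkpoint, and one segment of recomputed activations is held in memory at a time. The function name is hypothetical and the code does not reflect how the package rewrites graphs.

```python
import math


def checkpointed_costs(n):
    """Return (peak_stored_activations, recomputed_activations) for n layers."""
    stride = math.ceil(math.sqrt(n))
    # 1-indexed positions of the checkpointed activations
    checkpoints = list(range(stride, n + 1, stride))
    # Sentinels 0 (the input) and n + 1 (the loss) delimit the outer segments
    boundaries = [0] + checkpoints + [n + 1]
    # Number of non-checkpoint activations strictly inside each segment
    segment_sizes = [hi - lo - 1 for lo, hi in zip(boundaries, boundaries[1:])]
    # All checkpoints stay in memory, plus the largest recomputed segment
    peak_memory = len(checkpoints) + max(segment_sizes)
    # Each non-checkpoint activation is recomputed once in this model
    recomputed = sum(segment_sizes)
    return peak_memory, recomputed


if __name__ == "__main__":
    for n in (16, 100, 10000):
        print(n, checkpointed_costs(n))
```

In this model the peak stays close to 2*sqrt(n) while the recomputed work never exceeds n, which matches the "one additional forward pass" estimate.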

Our package implements checkpointed backprop as shown in Graph 3 above. This is implemented by taking the graph for standard backprop (Graph 1 above) and automatically rewriting it using the Tensorflow graph editor. For graphs that contain articulation points (single node graph dividers) we automatically select checkpoints using the sqrt(n) strategy, giving sqrt(n) memory usage for feed-forward networks. For more general graphs that only contain multi-node graph separators our implementation of checkpointed backprop still works, but we currently require the user to manually select the checkpoints.

Additional explanation of computation graphs, memory usage, and gradient computation strategies can be found in the blog post accompanying our package.

Setup requirements

pip install tf-nightly-gpu
pip install toposort networkx pytest

Also, when running the tests, make sure that the CUDA Profiling Tools Interface (CUPTI) can be found, e.g. by running export LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:/usr/local/cuda/extras/CUPTI/lib64"

Usage

This repository provides a drop-in replacement for tf.gradients in base Tensorflow. Import this function using

from memory_saving_gradients import gradients

and use the gradients function like you would normally use tf.gradients to compute gradients of losses to parameters. (This assumes you are explicitly calling tf.gradients, rather than implicitly inside a tf.train.Optimizer).

In addition to the regular arguments to tf.gradients, our gradients function has one additional argument, checkpoints. The checkpoints argument tells the gradients function which nodes of the graph you want to checkpoint during the forward pass through your computation graph. The nodes in between the checkpoints are then recomputed during the backward pass. You can supply a list of tensors to checkpoint, gradients(ys,xs,checkpoints=[tensor1,tensor2]), or you can use one of several keywords:

  • 'collection' (default): This checkpoints all tensors returned by tf.get_collection('checkpoints'). You then need to make sure you add tensors to this collection using tf.add_to_collection('checkpoints', tensor) when you define your model.
  • 'memory' : This uses a heuristic to automatically select a set of nodes to checkpoint which achieves our desired O(sqrt(n)) memory usage. The heuristic works by automatically identifying articulation points in the graph, i.e. tensors which split the graph into two disconnected parts when removed, and then checkpointing a suitable number of these tensors. This currently works well for many, but not all, models.
  • 'speed' : This option tries to maximize running speed by checkpointing the outputs of all ops that are typically expensive to compute, namely convolutions and matrix multiplies.
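To illustrate what an articulation point is, here is a brute-force sketch in plain Python. It is not the heuristic implemented in this package: the graph, the node names, and the function are made up for illustration, and the forward graph is treated as undirected.

```python
def articulation_points(edges):
    """Return the nodes whose removal disconnects the undirected graph."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    def is_connected(nodes):
        # Depth-first search restricted to the given node set
        if not nodes:
            return True
        start = next(iter(nodes))
        seen = {start}
        stack = [start]
        while stack:
            current = stack.pop()
            for nxt in neighbours[current]:
                if nxt in nodes and nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen == nodes

    all_nodes = set(neighbours)
    return sorted(n for n in all_nodes if not is_connected(all_nodes - {n}))


# Hypothetical forward graph: a chain with one skip connection a -> c
edges = [("x", "a"), ("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "loss")]

if __name__ == "__main__":
    print(articulation_points(edges))
```

In this toy graph, b is bypassed by the skip connection, so removing it does not split the graph and it is not a candidate checkpoint under the articulation-point criterion.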

Overwriting tf.gradients

A useful alternative to using the new gradients function directly is to just overwrite the function that python has registered to the tf.gradients name. This can be done as follows:

import tensorflow as tf
import memory_saving_gradients
# monkey patch tf.gradients to point to our custom version, with automatic checkpoint selection
tf.__dict__["gradients"] = memory_saving_gradients.gradients_speed

Following this, all calls to tf.gradients will use the memory saving version instead.

The same can be done when using Keras:

import memory_saving_gradients as gc
from tensorflow.python.ops import gradients as tf_gradients
tf_gradients.gradients = gc.gradients_speed

Replace gradients_speed with gradients_memory or gradients_collection to use the other methods of checkpoint selection.
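The patching mechanism itself is ordinary Python attribute rebinding. The sketch below uses a stand-in namespace instead of the real tensorflow module, and all names in it are hypothetical. It shows a general property of this kind of patch: only code that looks the name up on the module at call time sees the replacement, while references taken before the patch keep pointing at the original function.

```python
import types

# Stand-in for a module object; this is NOT the real tensorflow package
fake_tf = types.SimpleNamespace()


def original_gradients(ys, xs):
    return "original"


def memory_saving_gradients(ys, xs):
    return "memory-saving"


fake_tf.gradients = original_gradients

# Some other code grabs a direct reference before the patch is applied
early_reference = fake_tf.gradients

# The monkey patch: rebind the attribute on the module-like object
fake_tf.gradients = memory_saving_gradients

if __name__ == "__main__":
    print(fake_tf.gradients(None, None))  # looked up at call time
    print(early_reference(None, None))    # bound before the patch
```

For this reason it is safest to apply the patch before building the model or importing code that may hold its own reference to the gradients function.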

Tests

The test folder contains scripts for testing the correctness of the code and to profile the memory usage for various models. After modifying the code you can run ./run_all_tests.sh from this folder to execute the tests.

Testing memory usage and running time for ResNet on CIFAR10 for different numbers of layers. Batch-size 1280, GTX1080

Limitations

The provided code does all graph manipulation in Python before running your model, which is slow for large graphs. The current algorithm for automatically selecting checkpoints is purely heuristic and is expected to fail on some models outside of the class we have tested. In such cases manual checkpoint selection should be used: add your chosen checkpoint nodes to a Tensorflow collection named "checkpoints" and use checkpoints='collection' when calling our gradients function.

gradient-checkpointing's People

Contributors

cberner, chenchuang, christopherhesse, davidbelanger, diego-, dmitrivainbrand, ethancaballero, justheuristic, thomasquintana, timsalimans, yaroslavvb, yselivonchyk

gradient-checkpointing's Issues

Limiting memory usage via GPUOptions conflicts with is_gpu_available

TF version tensorflow-gpu==1.14.0

The default behavior is that TF would allocate almost all the GPU memory.

I tried to limit the GPU memory allocated by TF via
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction = 0.5)

But it turned out to have no effect. This SO post explained that any function calling device_lib.list_local_devices() would allocate all the GPU memory on all of the devices.

After commenting out all the tf.test.is_gpu_available() calls, the GPUOptions above works. This is not a problem with gradient-checkpointing but some unexpected behavior of TF. I am just leaving an issue here in case anyone runs into the same problem as I did. :sweat_smile:

Cannot fit any extra batches into memory than normal

Hi,

I really like what you have developed. I think it will be very useful for models like DenseNet.

I tried it on a Keras model I have been working on. I just copied the "monkey patch" from the Keras test example you've made.

I was not able to see any improvement. I specifically wanted to train on a larger batch size, but I get an out of memory error at the same threshold as before applying the patch.

Have I misunderstood what gradient-checkpointing can do? If not, how do I verify that the patch is working?

Splitting model across 2 GPUs leads to OOM

I am running a UNet based model on a single GPU, using gradients_speed.
When splitting the same model across 2 GPUs training runs out with OOM before even starting.

Same model runs fine on 2 GPUs with regular gradients.

What would be a good place to start investigating this issue? What can be causing that?

AttributeError: 'NoneType' object has no attribute 'pred'

Hello, I tried using this project with Keras; code below:
`
import tqdm
import keras
import numpy as np
import tensorflow as tf
import keras.backend as k
import memory_saving_gradients
from keras.models import Model
from keras.layers import Input,Dense,Bidirectional,Activation,TimeDistributed,GRU,Dropout

k.__dict__["gradients"] = memory_saving_gradients.gradients_memory

inputs=Input((400,len(chars)))

gu1=Bidirectional(GRU(200,activation='relu',kernel_initializer='RandomUniform',
bias_initializer='RandomUniform',recurrent_dropout=0.2,return_sequences=True))(inputs)

gu2=GRU(400,activation='relu',kernel_initializer='RandomUniform',
bias_initializer='RandomUniform',recurrent_dropout=0.2,dropout=0.2,return_sequences=True)(gu1)

d=Dropout(0.3)(gu2)

logits_td=TimeDistributed(Dense(len(chars)))(d)

logits=Activation('softmax')(logits_td)

model=Model(inputs,logits)
model.compile(loss='categorical_crossentropy',optimizer='adam',metrics=['accuracy','categorical_accuracy'])
model.train_on_batch(data_x,data_y)
Note that the shapes of data_x and data_y are (32,400,74), that I cannot run import tensorflow.python.*, and that the full traceback is:
--------------------------
AttributeError Traceback (most recent call last)
in ()
1 import time
2 t1=time.time()
----> 3 model.train_on_batch(data_1[:32],data_2[:32])
4 print('Batch Training time Approx. '+str(round(time.time()-t1,1)))

/usr/local/lib/lib/python3.4/site-packages/keras/engine/training.py in train_on_batch(self, x, y, sample_weight, class_weight)
1811 else:
1812 ins = x + y + sample_weights
-> 1813 self._make_train_function()
1814 outputs = self.train_function(ins)
1815 if len(outputs) == 1:

/usr/local/lib/lib/python3.4/site-packages/keras/engine/training.py in _make_train_function(self)
988 training_updates = self.optimizer.get_updates(
989 params=self._collected_trainable_weights,
--> 990 loss=self.total_loss)
991 updates = self.updates + training_updates
992 # Gets loss and metrics. Updates weights at each call.

/usr/local/lib/lib/python3.4/site-packages/keras/legacy/interfaces.py in wrapper(*args, **kwargs)
85 warnings.warn('Update your ' + object_name +
86 ' call to the Keras 2 API: ' + signature, stacklevel=2)
---> 87 return func(*args, **kwargs)
88 wrapper._original_function = func
89 return wrapper

/usr/local/lib/lib/python3.4/site-packages/keras/optimizers.py in get_updates(self, loss, params)
413 @interfaces.legacy_get_updates_support
414 def get_updates(self, loss, params):
--> 415 grads = self.get_gradients(loss, params)
416 self.updates = [K.update_add(self.iterations, 1)]
417

/usr/local/lib/lib/python3.4/site-packages/keras/optimizers.py in get_gradients(self, loss, params)
71
72 def get_gradients(self, loss, params):
---> 73 grads = K.gradients(loss, params)
74 if hasattr(self, 'clipnorm') and self.clipnorm > 0:
75 norm = K.sqrt(sum([K.sum(K.square(g)) for g in grads]))

/var/host/media/removable/UNTITLED/seq2seq/memory_saving_gradients.py in gradients_memory(ys, xs, grad_ys, **kwargs)
25
26 def gradients_memory(ys, xs, grad_ys=None, **kwargs):
---> 27 return gradients(ys, xs, grad_ys, checkpoints='memory', **kwargs)
28
29 def gradients_collection(ys, xs, grad_ys=None, **kwargs):

/var/host/media/removable/UNTITLED/seq2seq/memory_saving_gradients.py in gradients(ys, xs, grad_ys, checkpoints, **kwargs)
256 dv = tf_gradients(boundary,
257 checkpoints_disconnected_other+xs,
--> 258 grad_ys=substitute_backprops, **kwargs)
259 debug_print("Got gradients %s", dv)
260 debug_print("for %s", boundary)

/usr/local/lib/lib/python3.4/site-packages/tensorflow/python/ops/gradients_impl.py in gradients(ys, xs, grad_ys, name, colocate_gradients_with_ops, gate_gradients, aggregation_method)
547 # issue here because of zeros.
548 if loop_state:
--> 549 out_grads[i] = loop_state.ZerosLike(op, i)
550 else:
551 out_grads[i] = control_flow_ops.ZerosLikeOutsideLoop(op, i)

/usr/local/lib/lib/python3.4/site-packages/tensorflow/python/ops/control_flow_ops.py in ZerosLike(self, op, index)
1172 if grad_state is None:
1173 # op is not in a while loop that is part of gradients().
-> 1174 return ZerosLikeOutsideLoop(op, index)
1175 op_ctxt = op._get_control_flow_context()
1176 val = ops.convert_to_tensor(op.outputs[index], name="tensor")

/usr/local/lib/lib/python3.4/site-packages/tensorflow/python/ops/control_flow_ops.py in ZerosLikeOutsideLoop(op, index)
1303 else:
1304 op_ctxt = op._get_control_flow_context()
-> 1305 pred = op_ctxt.pred
1306 branch = op_ctxt.branch
1307 switch_val = switch(op.inputs[0], pred)[1 - branch]

AttributeError: 'NoneType' object has no attribute 'pred'`
Thank you.

Rerunning from Checkpoint Gives Error

When attempting to restart learning from a checkpoint, I get the following error (please let me know what other information I can provide):

Traceback (most recent call last):
  File "C:\Program Files\Python36\lib\site-packages\tensorflow\python\client\session.py", line 1356, in _do_call
    return fn(*args)
  File "C:\Program Files\Python36\lib\site-packages\tensorflow\python\client\session.py", line 1341, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "C:\Program Files\Python36\lib\site-packages\tensorflow\python\client\session.py", line 1429, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.FailedPreconditionError: 2 root error(s) found.
  (0) Failed precondition: Attempting to use uninitialized value conv01/kernel
         [[{{node conv01/kernel}}]]
         [[conv01/kernel/_3]]
  (1) Failed precondition: Attempting to use uninitialized value conv01/kernel
         [[{{node conv01/kernel}}]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "main_run_0008.py", line 1599, in <module>
    main()
  File "main_run_0008.py", line 1519, in main
    'model_init_' + hparam_str[0] + '_run_{:04d}_'.format(run[0]) + ts + '.h5') )
  File "C:\Program Files\Python36\lib\site-packages\keras\engine\network.py", line 1090, in save
    save_model(self, filepath, overwrite, include_optimizer)
  File "C:\Program Files\Python36\lib\site-packages\keras\engine\saving.py", line 382, in save_model
    _serialize_model(model, f, include_optimizer)
  File "C:\Program Files\Python36\lib\site-packages\keras\engine\saving.py", line 97, in _serialize_model
    weight_values = K.batch_get_value(symbolic_weights)
  File "C:\Program Files\Python36\lib\site-packages\keras\backend\tensorflow_backend.py", line 2420, in batch_get_value
    return get_session().run(ops)
  File "C:\Program Files\Python36\lib\site-packages\tensorflow\python\client\session.py", line 950, in run
    run_metadata_ptr)
  File "C:\Program Files\Python36\lib\site-packages\tensorflow\python\client\session.py", line 1173, in _run
    feed_dict_tensor, options, run_metadata)
  File "C:\Program Files\Python36\lib\site-packages\tensorflow\python\client\session.py", line 1350, in _do_run
    run_metadata)
  File "C:\Program Files\Python36\lib\site-packages\tensorflow\python\client\session.py", line 1370, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.FailedPreconditionError: 2 root error(s) found.
  (0) Failed precondition: Attempting to use uninitialized value conv01/kernel
         [[node conv01/kernel (defined at C:\Program Files\Python36\lib\site-packages\keras\backend\tensorflow_backend.py:402) ]]
         [[conv01/kernel/_3]]
  (1) Failed precondition: Attempting to use uninitialized value conv01/kernel
         [[node conv01/kernel (defined at C:\Program Files\Python36\lib\site-packages\keras\backend\tensorflow_backend.py:402) ]]
0 successful operations.
0 derived errors ignored.

Original stack trace for 'conv01/kernel':
  File "main_run_0008.py", line 1599, in <module>
    main()
  File "main_run_0008.py", line 1390, in main
    train_model = load_model( args.checkpoint )
  File "C:\Program Files\Python36\lib\site-packages\keras\engine\saving.py", line 419, in load_model
    model = _deserialize_model(f, custom_objects, compile)
  File "C:\Program Files\Python36\lib\site-packages\keras\engine\saving.py", line 225, in _deserialize_model
    model = model_from_config(model_config, custom_objects=custom_objects)
  File "C:\Program Files\Python36\lib\site-packages\keras\engine\saving.py", line 458, in model_from_config
    return deserialize(config, custom_objects=custom_objects)
  File "C:\Program Files\Python36\lib\site-packages\keras\layers\__init__.py", line 55, in deserialize
    printable_module_name='layer')
  File "C:\Program Files\Python36\lib\site-packages\keras\utils\generic_utils.py", line 145, in deserialize_keras_object
    list(custom_objects.items())))
  File "C:\Program Files\Python36\lib\site-packages\keras\engine\network.py", line 1032, in from_config
    process_node(layer, node_data)
  File "C:\Program Files\Python36\lib\site-packages\keras\engine\network.py", line 991, in process_node
    layer(unpack_singleton(input_tensors), **kwargs)
  File "C:\Program Files\Python36\lib\site-packages\keras\engine\base_layer.py", line 431, in __call__
    self.build(unpack_singleton(input_shapes))
  File "C:\Program Files\Python36\lib\site-packages\keras\layers\convolutional.py", line 141, in build
    constraint=self.kernel_constraint)
  File "C:\Program Files\Python36\lib\site-packages\keras\legacy\interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "C:\Program Files\Python36\lib\site-packages\keras\engine\base_layer.py", line 252, in add_weight
    constraint=constraint)
  File "C:\Program Files\Python36\lib\site-packages\keras\backend\tensorflow_backend.py", line 402, in variable
    v = tf.Variable(value, dtype=tf.as_dtype(dtype), name=name)
  File "C:\Program Files\Python36\lib\site-packages\tensorflow\python\ops\variables.py", line 259, in __call__
    return cls._variable_v1_call(*args, **kwargs)
  File "C:\Program Files\Python36\lib\site-packages\tensorflow\python\ops\variables.py", line 220, in _variable_v1_call
    shape=shape)
  File "C:\Program Files\Python36\lib\site-packages\tensorflow\python\ops\variables.py", line 198, in <lambda>
    previous_getter = lambda **kwargs: default_variable_creator(None, **kwargs)
  File "C:\Program Files\Python36\lib\site-packages\tensorflow\python\ops\variable_scope.py", line 2511, in default_variable_creator
    shape=shape)
  File "C:\Program Files\Python36\lib\site-packages\tensorflow\python\ops\variables.py", line 263, in __call__
    return super(VariableMetaclass, cls).__call__(*args, **kwargs)
  File "C:\Program Files\Python36\lib\site-packages\tensorflow\python\ops\variables.py", line 1568, in __init__
    shape=shape)
  File "C:\Program Files\Python36\lib\site-packages\tensorflow\python\ops\variables.py", line 1728, in _init_from_args
    name=name)
  File "C:\Program Files\Python36\lib\site-packages\tensorflow\python\ops\state_ops.py", line 79, in variable_op_v2
    shared_name=shared_name)
  File "C:\Program Files\Python36\lib\site-packages\tensorflow\python\ops\gen_state_ops.py", line 2024, in variable_v2
    shared_name=shared_name, name=name)
  File "C:\Program Files\Python36\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "C:\Program Files\Python36\lib\site-packages\tensorflow\python\util\deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "C:\Program Files\Python36\lib\site-packages\tensorflow\python\framework\ops.py", line 3616, in create_op
    op_def=op_def)
  File "C:\Program Files\Python36\lib\site-packages\tensorflow\python\framework\ops.py", line 2005, in __init__
    self._traceback = tf_stack.extract_stack()

Does this package work for tensorflow 1.15?

I see that the last commit was 2 years ago, so maybe this package does not support TensorFlow 1.15? Can anyone confirm? It does not work in my code with tf1.15, and I need to know whether that is because of the TensorFlow version or my own code.

Using gradient checkpointing with an optimizer

I posted this as an issue before ( #4 ); however, neither of the suggestions appears to work. I get the same error with both methods suggested:

File "last-lstm-batchnorm.py", line 543, in <module>
main()
File "last-lstm-batchnorm.py", line 486, in main
autoenc = Autoencoder(seq_len=SEQ_LEN, num_classes=NUM_CLASSES, embedding_dim=EMBEDDED_DIM)
File "last-lstm-batchnorm.py", line 93, in __init__
self.optimize
File "/usr/local/lib/python3.5/dist-packages/lazy_property/__init__.py", line 27, in __get__
result = self.method(instance)
File "last-lstm-batchnorm.py", line 214, in optimize
grads = tf.gradients(self.objective, tf.trainable_variables())
File "/data/oscar/rxu/featurelearning/memory_saving_gradients.py", line 27, in gradients_memory
return gradients(ys, xs, grad_ys, checkpoints='memory', **kwargs)
File "/data/oscar/rxu/featurelearning/memory_saving_gradients.py", line 258, in gradients
grad_ys=substitute_backprops, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gradients_impl.py", line 516, in gradients
colocate_gradients_with_ops)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gradients_impl.py", line 192, in _PendingCount
between_op_list, between_ops, colocate_gradients_with_ops)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 1348, in MaybeCreateControlFlowState
loop_state.AddWhileContext(op, between_op_list, between_ops)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 1157, in AddWhileContext
outer_forward_ctxt = forward_ctxt.outer_context
AttributeError: 'NoneType' object has no attribute 'outer_context'

where I am doing the following to update the weights (for the second suggestion):

@lazyprop.LazyProperty
def objective(self):
    reg_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
    logits, _ = self.inference
    xentropy = tf.losses.sparse_softmax_cross_entropy(self.labels, logits)
    return xentropy + sum(reg_losses)

@lazyprop.LazyProperty
def optimize(self):
    update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
    with tf.control_dependencies(update_ops):
        optimizer = tf.train.AdamOptimizer(self.learning_rate)
        # optimizer_op = optimizer.minimize(self.objective, name='optimizer')
        # tf.gradients is monkey patched to the memory-saving version here
        grads = tf.gradients(self.objective, tf.trainable_variables())
        grads_and_vars = list(zip(grads, tf.trainable_variables()))
        train_op = optimizer.apply_gradients(grads_and_vars)
    # return the op that was actually built (optimizer_op is commented out above)
    return train_op

and the following for the first suggestion:

@lazyprop.LazyProperty
def objective(self):
    reg_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
    logits, _ = self.inference
    xentropy = tf.losses.sparse_softmax_cross_entropy(self.labels, logits)
    return xentropy + sum(reg_losses)

@lazyprop.LazyProperty
def optimize(self):    
    update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
    with tf.control_dependencies(update_ops):
        optimizer = tf.train.AdamOptimizer(self.learning_rate)
        optimizer_op = optimizer.minimize(self.objective, name='optimizer')
    return optimizer_op

Does Not Work with Keras

@yaroslavvb Would you please add keras model.fit_generator to your test cases? I notice the keras test case is a simple MNIST model that does not use convolutional layers either. As an example, on tensorflow 1.5-gpu with keras 2.1.6 and 64-bit Python 3.5 on a Windows 10 machine, I cannot get the following to work (i.e. memory used and time per epoch are the same with or without the memory_saving_gradients code):

# -*- coding: utf-8 -*-

##########
#LIBRARIES
##########

#Future
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import numpy as np
import pandas as pd

pd.set_option('chained_assignment',None) #Sets `SettingWithCopyWarning` to None. If
                                         # making a chained assignment, the outcome may
                                         # vary depending on if the data is a view of
                                         # other data or a copy of other data.

import cv2

import os
import time
import argparse
import h5py
import gc

import multiprocessing as mp

import tensorflow as tf
from tensorflow.python.keras._impl.keras import backend as K

from tensorflow.contrib.data.python.ops.shuffle_ops import shuffle_and_repeat
from tensorflow.contrib.data.python.ops.batching import map_and_batch

import memory_saving_gradients

Dataset = tf.data.Dataset

from tensorflow.python.keras.preprocessing.image import ImageDataGenerator, load_img, img_to_array
from tensorflow.python.keras.models import Sequential, Model, load_model, model_from_yaml
from tensorflow.python.keras.callbacks import LearningRateScheduler, ModelCheckpoint, EarlyStopping, History, TensorBoard
from tensorflow.python.keras import regularizers, optimizers
from tensorflow.python.keras.layers import Conv2D, Dense, Flatten, Dropout, Input, Lambda, Activation

##################
#GLOBAL VARIABLES
##################

img_shape_raw = (3, 160, 320)

batch_size = 32

num_epochs = 1

crop_top = 70
crop_btm = 25

img_format = 'channels_first'
K.set_image_data_format(img_format)

img_shape_input = (img_shape_raw[0],
                   img_shape_raw[1] - crop_top - crop_btm,
                   img_shape_raw[2]) #(3, 65, 320)

max_procs = mp.cpu_count() - 1 or 1 # 4 physical cores, 8 logical cores
max_q_size = batch_size

root = r'.'

fldr_img_raw = os.path.join( root, r'dat\raw' )
fldr_csv_raw = os.path.join( root, r'dat\raw' )

fldr_img_mod = os.path.join( root, r'dat\mod' )
fldr_csv_mod = os.path.join( root, r'dat\mod' )

train_csv = os.path.join(fldr_csv_mod, 'training_data.csv')
val_csv = os.path.join(fldr_csv_mod, 'validation_data.csv')
test_csv = os.path.join(fldr_csv_mod, 'test_data.csv')

pth_bins_fl = os.path.join( fldr_csv_mod, 'bins.txt' )

fldr_fig = os.path.join( root, r'fig' )

lr = [1e-4, ]
run = [1, ]

hparam_str = ['1e-4', ]

fldr_log = os.path.join( root, r'log', hparam_str[0], 'run_{:04d}'.format(run[0]))

fldr_arch = os.path.join( root, r'arch' )
fldr_wt = os.path.join( root, r'wt' )
fldr_ckpt = os.path.join( root, r'ckpt' )
fldr_mdl = os.path.join( root, r'mdl' )

fldr_summary = os.path.join( root, r'summary' )

fl_fmt_wt_ckpt = os.path.join( fldr_ckpt,
                               r'wt_ckpt-run_{run:04d}'.format(run=run[0]) + '_epoch_{epoch:04d}_val_mse_{val_mean_squared_error:.7f}.h5' )

################
#DATA GENERATOR
################

def get_data( keep_ptl = 75 ):
    '''This just returns the train, validation, and test dataframes
       keeping a certain percentile of the original data. I'm not
       including it here for space and since it doesn't seem pertinent.
    '''

def generator_from_df( df, batch_size, shuffle = True ):
    
    def read( img_pth, angle ):
        
        im_fl = tf.read_file( img_pth )
        im = tf.image.decode_image(im_fl, channels=3)
        im = tf.transpose( im, [2, 0, 1] ) # Make image channels first

        return Dataset.from_tensors( (im, angle) )

    img_pths = tf.convert_to_tensor( df['Image_Path'].values )
    angs = tf.convert_to_tensor( df['Angle'].values )

    ds = Dataset.from_tensor_slices( (img_pths, angs) )

    ds = ds.apply( tf.contrib.data.parallel_interleave( read, cycle_length = batch_size, sloppy = True ) )

    if shuffle:
        ds = ds.apply( shuffle_and_repeat( buffer_size = 2*batch_size, count = num_epochs ) )
    else:
        ds = ds.repeat( num_epochs )

    ds = ds.apply( map_and_batch(
        lambda img_pth, ang: (img_pth,ang),
        batch_size,
        num_parallel_batches = max_procs ) )
    
    ds = ds.prefetch( max_procs )

    iterator = ds.make_one_shot_iterator()
    sess = K.get_session()

    next_element = iterator.get_next()

    while True:

        try:
          yield sess.run(next_element)
        except tf.errors.OutOfRangeError:
          break

###########
#GET MODEL
###########

def get_model( lr ):

    keep_prob = 0.5
    rate = keep_prob
    
    l2 = regularizers.l2(0.001)

    with tf.name_scope('Input'):
        inputs = Input( shape=img_shape_input, name='input' )

        x = Lambda(lambda x: x / 255. - 0.5,
                   input_shape=img_shape_input, name = 'norm_-0.5_to_0.5')(inputs)

    with tf.name_scope('Hidden_Layers'):

        with K.name_scope('ConvLayer_01'):
        
            x = Conv2D(4, (5,5),
                       kernel_regularizer=l2,
                       bias_regularizer=l2,
                       padding='same',
                       name='conv01')(x)

        with tf.name_scope('ConvLayer_02'):
        
            x = Conv2D(12, (5,5),
                       kernel_regularizer=l2,
                       bias_regularizer=l2,
                       padding='same',
                       name='conv02')(x)

        with tf.name_scope('ConvLayer_03'):
        
            x = Conv2D(24, (5,5),
                       kernel_regularizer=l2,
                       bias_regularizer=l2,
                       padding='same',
                       name='conv03')(x)

        with tf.name_scope('ConvLayer_04'):
        
            x = Conv2D(24, (3,3),
                       kernel_regularizer=l2,
                       bias_regularizer=l2,
                       padding='same',
                       name='conv04')(x)

        with tf.name_scope('ConvLayer_05'):
        
            x = Conv2D(32, (3,3),
                       kernel_regularizer=l2,
                       bias_regularizer=l2,
                       padding='same',
                       name='conv05')(x)

        with tf.name_scope('Flatten'):
        
            x = Flatten(name='flatten')(x)

        with tf.name_scope('FullyConnectedLayer_01'):
                
            x = Dense(100,
                      kernel_regularizer=l2,
                      bias_regularizer=l2,
                      name='fc01')(x)

        with tf.name_scope('FullyConnectedLayer_02'):
        
            x = Dense(50,
                      kernel_regularizer=l2,
                      bias_regularizer=l2,
                      name='fc02')(x)

        with tf.name_scope('FullyConnectedLayer_03'):

            x = Dense(25,
                      kernel_regularizer=l2,
                      bias_regularizer=l2,
                      name='fc03')(x)

        with tf.name_scope('FullyConnectedLayer_04'):
        
            x = Dense(10,
                      kernel_regularizer=l2,
                      bias_regularizer=l2,
                      name='fc04')(x)

    with tf.name_scope('Output'):
    
        outputs = Dense(1,
                        name='output')(x)

    # Create Model
        
    model = Model( inputs = inputs, outputs = outputs )

    adam = optimizers.Adam( lr = lr, decay = 0.001 ) # Learning rate and decay set in LearningRateScheduler

    # Memory Saving Gradients

    layer_names = [ 'conv02', 'conv04', 'fc01', 'fc03' ]

    for l in layer_names:
        tf.add_to_collection('checkpoints', model.get_layer(l).get_output_at(0))
    
    K.__dict__['gradients'] = memory_saving_gradients.gradients_collection

    # Compile Model

    model.compile(loss='mean_squared_error', optimizer=adam, metrics=['mse'])

    return model

class CumulativeHistory( History ):
    '''
    History does not allow resume history, but this does.
    '''
    def on_train_begin( self, logs=None ):
        if not hasattr(self, 'epoch'):
            super(CumulativeHistory, self).on_train_begin( logs )

def main(*args, **kargs):
    """ Behavioral Cloning Project
    """

    parser = argparse.ArgumentParser(description='Behavioral Cloning Project')

    parser.add_argument('-c', '--checkpoint', type=str, help='Checkpoint (`.h5` file)')
    parser.add_argument('-e', '--epoch', type=int, help='Initial epoch')
    
    args = parser.parse_args()

    model_type = 'new'
    train_model = None
    initial_epoch = 0

    if args.checkpoint is not None:

        train_model = load_model( args.checkpoint )

        initial_epoch = args.epoch

        model_type = 'loaded'

    # Set Configuration

    config = tf.ConfigProto( intra_op_parallelism_threads = max_procs,
                             inter_op_parallelism_threads = 0) # set automatically to number of logical cores

    config.gpu_options.allow_growth = True

    # Get Data

    df_train, df_val, df_test, bins = get_data( keep_ptl = 60 )
    
    ntrain, nval, ntest = df_train.shape[0], df_val.shape[0], df_test.shape[0]

    # Training

    train_graph = tf.Graph()

    train_generator = generator_from_df( df_train, batch_size )
    val_generator   = generator_from_df( df_val,   batch_size, shuffle=False )

    nbatches_train = ntrain // batch_size
    nbatches_val   = nval // batch_size
    
    history = CumulativeHistory()
    
    early_stop = EarlyStopping( monitor='val_mean_squared_error',
                                min_delta=1e-4,
                                patience=50,
                                verbose=0,
                                mode='min')
    
    model_ckpt = ModelCheckpoint( fl_fmt_wt_ckpt,
                                  monitor='val_mean_squared_error',
                                  verbose=0,
                                  save_best_only=True,
                                  save_weights_only=True,
                                  period=1)
    
    callbacks = [history, early_stop, model_ckpt]

    for i in range(len(lr)):

        train_sess = tf.Session( config = config, graph = train_graph )
        K.set_session( train_sess )

        if model_type == 'new':
            
            with train_graph.as_default():

                # Print model summary
                summary_fl_pth = os.path.join( fldr_summary, 'model_summary_run_{:04d}_'.format(run[0]) + r'.txt' )

                train_model = get_model( lr[i] ) # get_model() only accepts the learning rate

                with open(summary_fl_pth, 'w') as summary_file:
                    train_model.summary( print_fn=lambda x: summary_file.write(x + '\n') )

        with train_graph.as_default():
            
            with train_sess.as_default():

                if K.backend() == 'tensorflow':
                    
                    board = TensorBoard( log_dir = fldr_log,
                                         histogram_freq = 0,
                                         write_graph = True,
                                         write_images = True )
                    callbacks.append( board )

                writer = tf.summary.FileWriter( fldr_log, train_graph )

                ts = time.time()
                ts = datetime.datetime.fromtimestamp(ts).strftime('%Y-%m-%d_%H-%M-%S')

                arch_yaml = train_model.to_yaml()
                arch_fl_pth = os.path.join( fldr_arch, 'arch_' + hparam_str[0] + '_run_{:04d}_'.format(run[0]) + ts + '.yaml' )

                with open(arch_fl_pth, 'w') as arch_file:
                    arch_file.write( arch_yaml )
                
                train_model.save( os.path.join( fldr_mdl,
                                                'model_init_' + hparam_str[0] + '_run_{:04d}_'.format(run[0]) + ts + '.h5') )

                train_model.save_weights( os.path.join( fldr_wt,
                                                        'weights_init_' + hparam_str[0] + '_run_{:04d}_'.format(run[0]) + ts  + '.h5' ) )

                train_model.fit_generator(
                    generator = train_generator,
                    steps_per_epoch = nbatches_train,
                    epochs = num_epochs,
                    max_queue_size = max_q_size,
                    validation_data = val_generator,
                    validation_steps = nbatches_val,
                    workers = 0,
                    callbacks = callbacks,
                    initial_epoch = initial_epoch)

                ts = time.time()
                ts = datetime.datetime.fromtimestamp(ts).strftime('%Y-%m-%d_%H-%M-%S')

                train_model.save( os.path.join( fldr_mdl,
                                                'model_final_' + hparam_str[0] + '_run_{:04d}_'.format(run[0]) + ts + '.h5') )

                train_model.save_weights( os.path.join( fldr_wt,
                                                        'weights_final_' + hparam_str[0] + '_run_{:04d}_'.format(run[0]) + ts  + '.h5' ) )
                
        if K.backend() == 'tensorflow':
            K.clear_session()

        del train_model
        gc.collect()

if __name__ == '__main__':
    """ Entry point to the program
    """

    main()

memory_test.py failed

Hey guys,

Your work is great!

I tried to run the tests, but they failed. The error message is below (I modified the code to print peak_memory):

Traceback (most recent call last):
  File "memory_test.py", line 677, in <module>
    test_chain()
  File "memory_test.py", line 119, in test_chain
    assert peak_memory > 2e6, peak_memory
AssertionError: 0

Please help me if you can. :)
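
A note for reading the traceback above: in a Python `assert condition, message` statement, the expression after the comma becomes the text of the AssertionError. So "AssertionError: 0" suggests that the measured peak_memory was exactly 0, not merely below the 2e6 threshold. A minimal illustration follows; the helper names are hypothetical and are not part of memory_test.py.

```python
def check_peak(peak_memory, threshold=2e6):
    """Mimic the shape of the failing assertion (hypothetical helper)."""
    # The expression after the comma is used as the error message.
    assert peak_memory > threshold, peak_memory
    return True


def failure_message(peak_memory):
    """Return the AssertionError text for a given measurement, or None."""
    try:
        check_peak(peak_memory)
    except AssertionError as err:
        return str(err)
    return None


print(failure_message(0))        # the message is the measured value itself
print(failure_message(3000000))  # no failure above the threshold
```

A value of 0 would point at the memory measurement itself returning nothing, which may be worth checking before looking at the checkpointing logic.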

Extremely slow when running on distributed tensorflow with horovod

We are trying to use gradient-checkpointing with distributed TensorFlow (Horovod: https://github.com/uber/horovod).

configuration as follows:
model: resnet50
input: synthetic data with 1k class number
horovod: horovod-0.12.1-py3.6-linux-x86_64 with NCCL2
tensorflow: 1.8.0
gradient-checkpointing: memory
experiment:

  1. 1GPU on P40
  2. 8GPU on P40, each GPU occupied by one MPI process.

The result on a single GPU looks promising: memory usage drops from 7820.39 MB to 3580.35 MB, while training speed drops from 152.51 examples/sec to 115.68 examples/sec. The result on multiple (8) GPUs is not good, however. Memory usage drops from 7914.49 MB to 3811.62 MB, which is as expected,
but training speed drops from 143.63 examples/sec per GPU to 6.71 examples/sec.

Could anyone give us a hint on how to solve this issue?
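
To put the reported numbers side by side, the following sketch computes the slowdown factor and the memory saving for both setups. It only uses the figures quoted in this report; the dictionary layout and the `summarize` helper are made up for illustration.

```python
# Figures copied from the report above (examples/sec per GPU and MB).
runs = {
    "1 GPU": {"speed": (152.51, 115.68), "memory": (7820.39, 3580.35)},
    "8 GPUs": {"speed": (143.63, 6.71), "memory": (7914.49, 3811.62)},
}


def summarize(baseline, checkpointed):
    """Return (baseline / checkpointed ratio, percent reduction)."""
    ratio = baseline / checkpointed
    reduction = 100.0 * (1.0 - checkpointed / baseline)
    return ratio, reduction


for name, run in runs.items():
    slowdown, speed_drop = summarize(*run["speed"])
    _, memory_saved = summarize(*run["memory"])
    print(f"{name}: {slowdown:.1f}x slower ({speed_drop:.0f}% drop), "
          f"{memory_saved:.0f}% less memory")
```

The memory saving is similar in both cases (a bit over half), while the slowdown goes from roughly 1.3x on one GPU to more than 20x on eight, so the extra cost appears only in the distributed setup.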

gradients_memory require more memory than tf.Optimizer.minimize

I would like to use the memory saving gradients to train a U-net model with bigger patches or/and increased batch size. I implemented a toy example to assess the memory usage when switching from tf.Optimizer.minimize to the memory saving gradients: https://github.com/gchlebus/gchlebus.github.io/blob/ca55f92d816ebe4659721b61e1a1f4f3b5c3e4f1/code/profiling-tf-models/u_net.py

Surprisingly, I found that the memory saving gradients require more memory than tf.Optimizer.minimize, but less memory than tf.gradients. I queried the peak memory usage using mem_util.py.
Memory usage:

  • tf.train.AdamOptimizer().minimize(loss): 75 MB
  • tf.gradients(loss, tf.trainable_variables()) + optimizer.apply_gradients(): 107 MB
  • gradients_memory(loss, tf.trainable_variables()) + optimizer.apply_gradients(): 96 MB

I have two questions:

  1. Why do the memory saving gradients require more memory than tf.train.AdamOptimizer.minimize? Am I using the memory saving gradients incorrectly?
  2. Why does the peak memory usage differ between the 1st and 2nd bullet points? I thought that the minimize function does tf.gradients + optimizer.apply_gradients().

I would greatly appreciate your feedback.

Use with static (unrolled) RNN?

Hi guys, thanks for your contribution. I wanted to give some feedback and request that you add a static (unrolled) RNN to your test suite. If/when I get a chance to spend more time on this, I'm happy to contribute this myself.

I tried using your code with a 2-layer LSTM RNN using dynamic_rnn and hit the same issue as here: #9

I converted my model to use static_rnn. This removes the while loop by statically unrolling for a fixed sequence length. At this point, your code was unable to automatically find articulation points. So, I tried adding manual checkpoints in a few intuitive places (at output of each layer, or at every unrolled loop iteration, or at every k unrolled loop iterations). In all cases, the memory usage was still higher than the baseline. I investigated the modified backprop graph. It seemed to be doing a lot of redundant computation and not working as described in your writing. I suspect I wasn't checkpointing correctly. A working static RNN test case would be a helpful reference.
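
As a rough reference for what checkpointing every k unrolled steps should achieve, here is a back-of-the-envelope model. It assumes a plain chain of n equally sized activations, which a multi-layer LSTM only approximates, and counts the checkpoints kept in memory plus the one segment that is recomputed at a time. Both helper names are hypothetical and not part of this package.

```python
import math


def peak_activations(n, k):
    """Approximate peak number of stored activations for a chain of n nodes
    with a checkpoint every k nodes (idealized model, not a measurement)."""
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    num_checkpoints = math.ceil(n / k)  # kept for the whole backward pass
    segment = k                         # recomputed one segment at a time
    return num_checkpoints + segment


def best_spacing(n):
    """Spacing k that minimizes the idealized peak, found by brute force."""
    return min(range(1, n + 1), key=lambda k: peak_activations(n, k))


n = 100
print(peak_activations(n, 1))   # checkpointing every node: about n
print(peak_activations(n, 10))  # spacing near sqrt(n)
print(best_spacing(n))
```

Under this model the peak is smallest when k is close to sqrt(n). Measured memory above the baseline, as reported here, would suggest the rewritten graph is not being segmented the way the model assumes.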

Checkpointing of VGG

Hi,
I've run a bunch of ImageNet networks with and without checkpointing. Everything seems to work pretty well everywhere except for the VGGs. I've tried different block sizes (VGG[11,13,16,19]), different batch sizes, and both automatic and manual checkpointing through 'collections'. It just doesn't work:
image

I wonder if this is somehow inherently related to the fact that in VGGs most of the memory is spent on the first few layers?
One thing I noticed when I tried to debug it is that the output of toposort() looks strange. All the max-pooling layers are at the end:
image

The marked-up tensors are what the automatic mode chooses to checkpoint.
Yet again, manual checkpointing doesn't help. Any ideas?
Thanks
Dmitri

Checkpointing with collections raising exception in Keras

Hi,

A TF newbie here. I am trying to use a pre-trained Keras ResNet-50 model on very large biological images (3000x4000 pixels) of larvae. Since the images are huge, I have to resort to gradient checkpointing.

I have done the requisite monkey patching:

import memory_saving_gradients
K.__dict__['gradients'] = memory_saving_gradients.gradients_collection

I have manually defined a collection of all the checkpoint nodes like so

[tf.add_to_collection("checkpoints",base_model.get_layer(i).get_output_at(0)) for i in ["add_4","add_8","add_12","add_16"]]

However I get the following exception

  File "/home/satish/anaconda3/envs/tensorflow/lib/python3.6/site-packages/keras/optimizers.py", line 244, in get_updates
    grads = self.get_gradients(loss, params)
  File "/home/satish/anaconda3/envs/tensorflow/lib/python3.6/site-packages/keras/optimizers.py", line 78, in get_gradients
    grads = K.gradients(loss, params)
  File "/home/satish/anaconda3/envs/tensorflow/lib/python3.6/site-packages/memory_saving_gradients.py", line 31, in gradients_collection
    return gradients(ys, xs, grad_ys, checkpoints='collection', **kwargs)
  File "/home/satish/anaconda3/envs/tensorflow/lib/python3.6/site-packages/memory_saving_gradients.py", line 185, in gradients
    raise Exception('no checkpoints nodes found or given as input! ')
Exception: no checkpoints nodes found or given as input!

Not sure why the collection is deemed empty.

Would greatly appreciate any feedback.

Thanks
Satish

Problems with custom gradient

We have run into a problem when using checkpoints and custom gradients together. We created a custom gradient for the operation tf.matrix_solve_ls in mode (fast=False), but if we include the MatrixSolveLs tensor in the list of checkpointed tensors, the gradients function in memory_saving_gradients.py tries to use the default gradient and fails, because the default gradient is not defined for mode (fast=False). We are using TF 1.9. @yaroslavvb, do you have any hints on how to make @tf.custom_gradient work with checkpointing?

TF while loop error

I'm trying to apply this awesome tool to a BERT model, but it doesn't seem to work with the TF while loop. The model code is basically the same as https://github.com/CLUEbenchmark/CLUENER2020/blob/master/tf_version/modeling.py, except that I add every sqrt(num_hidden_layers)-th hidden layer output to the collection with tf.add_to_collection('checkpoints', layer_output). When I run training, I get this error message: "ValueError: Cannot use 'loss/rnn/while/TensorArrayReadV3/Enter' as input to 'loss/rnn/while/TensorArrayReadV3_1' because 'loss/rnn/while/TensorArrayReadV3/Enter' is in a while loop. See info log for more details." Could you please help me solve this problem?
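
The selection rule described above, one checkpoint roughly every sqrt(num_hidden_layers) layers, can be sketched in plain Python. The helper name and the choice to skip the final layer are assumptions made for illustration. The returned indices would decide which layer_output tensors are passed to tf.add_to_collection('checkpoints', ...).

```python
import math


def checkpoint_layer_indices(num_hidden_layers):
    """Indices of layers to checkpoint, spaced about sqrt(n) apart.

    Hypothetical helper: skipping the last layer is a design choice
    made here, not something prescribed by the package.
    """
    if num_hidden_layers < 1:
        raise ValueError("num_hidden_layers must be positive")
    stride = max(1, round(math.sqrt(num_hidden_layers)))
    return list(range(stride - 1, num_hidden_layers - 1, stride))


print(checkpoint_layer_indices(12))  # e.g. a 12-layer encoder
```

This only illustrates how the layers are picked. The ValueError quoted above comes from an op inside a while loop, so it may be unrelated to which transformer layers are checkpointed.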

Checkpointing for FP16

Hi guys,
I have added a couple of lines of code to memory_saving_gradients.py and to the benchmarks to enable checkpointing and benchmarking of FP16 networks. It seems to work pretty well. Should I open a pull request for those changes?
Thanks
Dmitri

OOM when using gradients_memory's list of checkpointed tensors

First, thanks very much for this contribution!

  • I run my code using the gradients_memory() method, which indeed works.
  • I then try to get the bottleneck tensors it found, by taking the "Checkpoint nodes used" list. I add them manually to a "checkpoints" collection, and then use gradients_collection(). Then, I get OOM.

I don't understand why that is so - if I take the same tensors, shouldn't the behaviour be the same?

Gradient checkpointing seems to conflict with Keras batch norm

I tried this out but get an error when computing the gradients with the provided function using manually selected checkpoints. I get three different errors at the same time and am not sure which part of my graph is actually causing them, so I would appreciate some hints that would help me come up with a minimal non-working example. I currently use TF 1.13.1 with tf.keras.layers.BatchNormalization (I mention this because it shows up in the error message). Is there any hope that this would be an easy fix?

Traceback (most recent call last):                                                                                                                             
  File "/lhome/davidj2/code/sync/space_and_deformable_time/.venv/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 415, in _MaybeCompile                    
    xla_compile = op.get_attr("_XlaCompile")                                                                                                                              
  File "/lhome/davidj2/code/sync/space_and_deformable_time/.venv/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2413, in get_attr
    raise ValueError(str(e))                                                                                                                          
ValueError: Operation 'optimizer/head/convolve_batch_activate_20/batch_normalization_v1_21/cond/ReadVariableOp_1/Switch' has no attr named '_XlaCompile'.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/lhome/davidj2/code/sync/space_and_deformable_time/.venv/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 455, in _apply_op_helper
    as_ref=input_arg.is_ref)
  File "/lhome/davidj2/code/sync/space_and_deformable_time/.venv/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1240, in internal_convert_n_to_tensor
    ctx=ctx))
  File "/lhome/davidj2/code/sync/space_and_deformable_time/.venv/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1175, in internal_convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File "/lhome/davidj2/code/sync/space_and_deformable_time/.venv/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 977, in _TensorTensorConversionFunction
    (dtype.name, t.dtype.name, str(t)))
ValueError: Tensor conversion requested dtype float32 for Tensor with dtype resource: 'Tensor("optimizer/gradients/optimizer/head/convolve_batch_activate_20/batch_normalization_v1_21/cond/ReadVariableOp_1/Switch_grad/Switch_1:1", shape=(), dtype=resource)'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./src/sadt.py", line 544, in <module>
    with SpaceAndDeformableTimeNetwork(cfg, datasets) as exp:
  File "/lhome/davidj2/code/sync/space_and_deformable_time/src/xxsflow/experiments/base_experiment.py", line 42, in __enter__
    self.build_graph()
  File "./src/sadt.py", line 298, in build_graph
    self.optimizer_op = self.optimizer
  File "/lhome/davidj2/code/sync/space_and_deformable_time/src/xxsflow/utils.py", line 388, in wrapped_function
    setattr(self, attribute, function(self))
  File "./src/sadt.py", line 267, in optimizer
    grads = grads = tf.gradients(self.loss, tf.trainable_variables())
  File "/lhome/davidj2/code/sync/space_and_deformable_time/packages/gradient_checkpointing/memory_saving_gradients.py", line 40, in gradients_collection
    return gradients(ys, xs, grad_ys, checkpoints='collection', **kwargs)
  File "/lhome/davidj2/code/sync/space_and_deformable_time/packages/gradient_checkpointing/memory_saving_gradients.py", line 227, in gradients
    dv = tf_gradients(ys=copied_ys, xs=boundary+xs, grad_ys=grad_ys, **kwargs)
  File "/lhome/davidj2/code/sync/space_and_deformable_time/packages/gradient_checkpointing/memory_saving_gradients.py", line 27, in tf_gradients
    return tf_gradient_function(ys, *args, **kwargs)
  File "/lhome/davidj2/code/sync/space_and_deformable_time/.venv/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 664, in gradients
    unconnected_gradients)
  File "/lhome/davidj2/code/sync/space_and_deformable_time/.venv/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 965, in _GradientsHelper
    lambda: grad_fn(op, *out_grads))
  File "/lhome/davidj2/code/sync/space_and_deformable_time/.venv/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 420, in _MaybeCompile
    return grad_fn()  # Exit early
  File "/lhome/davidj2/code/sync/space_and_deformable_time/.venv/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 965, in <lambda>
    lambda: grad_fn(op, *out_grads))
  File "/lhome/davidj2/code/sync/space_and_deformable_time/.venv/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_grad.py", line 88, in _SwitchGrad
    return merge([false_grad, true_grad])[0], None
  File "/lhome/davidj2/code/sync/space_and_deformable_time/.venv/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 466, in merge
    return gen_control_flow_ops.merge(inputs, name)
  File "/lhome/davidj2/code/sync/space_and_deformable_time/.venv/lib/python3.6/site-packages/tensorflow/python/ops/gen_control_flow_ops.py", line 418, in merge
    "Merge", inputs=inputs, name=name)
  File "/lhome/davidj2/code/sync/space_and_deformable_time/.venv/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 483, in _apply_op_helper
    raise TypeError("%s that don't all match." % prefix)
TypeError: Tensors in list passed to 'inputs' of 'Merge' Op have types [float32, resource] that don't all match.

RNN while loop context error

Problem Description

When stopping the gradient on a node inside a while loop, a ValueError is thrown. Stack trace:

2018-01-18 16:56:14.252388: I tensorflow/core/platform/s3/aws_logging.cc:53] Initializing Curl library
Traceback (most recent call last):
  File "test.py", line 256, in <module>
    data, data_lengths, target, target_lengths, training, hparams)
  File "test.py", line 52, in __init__
    self.optimize
  File "test.py", line 21, in decorator
    setattr(self, attribute, function(self))
  File "test.py", line 189, in optimize
    self.loss, params, checkpoints="speed")
  File "/u/smithmax/Projects/_/memory_saving_gradients.py", line 197, in gradients
    grad_node = tf.stop_gradient(x, name=x.op.name+"_sg")
  File "/u/smithmax/.conda/envs/tf_nightly3/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 5220, in stop_gradient
    "StopGradient", input=input, name=name)
  File "/u/smithmax/.conda/envs/tf_nightly3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/u/smithmax/.conda/envs/tf_nightly3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3172, in create_op
    op_def=op_def)
  File "/u/smithmax/.conda/envs/tf_nightly3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1659, in __init__
    self._control_flow_post_processing()
  File "/u/smithmax/.conda/envs/tf_nightly3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1668, in _control_flow_post_processing
    control_flow_util.CheckInputFromValidContext(self, input_tensor.op)
  File "/u/smithmax/.conda/envs/tf_nightly3/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_util.py", line 260, in CheckInputFromValidContext
    raise ValueError(error_msg + " See info log for more details.")
ValueError: Cannot use 'decoder/while/BasicDecoderStep/gru_cell/MatMul/Enter' as input to 'decoder/while/BasicDecoderStep/gru_cell/MatMul/Enter_sg' because 'decoder/while/BasicDecoderStep/gru_cell/MatMul/Enter' is in a while loop. See info log for more details.

I'm not familiar with graph editing, so I don't have an idea of a direction on where to start. :(

System Information

  • Python version: 3.6.0 (64-bit)
  • OS Platform: Debian 4.9.65 x86_x64
  • TF version: tf-nightly-gpu==1.6.0.dev20180117

Code to Reproduce

This is just a seq2seq graph class, sorry that it's a little messy.

There's various lines for different potential checkpoints commented out, and only one is currently uncommented. It doesn't appear to work for any of the checkpoint candidates.

import functools
import tensorflow as tf

import memory_saving_gradients


_CHECK = "checkpoints"


def lazy_loading_property(function):
    """ Lazy loading decorator.

    Source: https://danijar.com/structuring-your-tensorflow-models/
    """
    attribute = "_cache_" + function.__name__

    @property
    @functools.wraps(function)
    def decorator(self):
        if not hasattr(self, attribute):
            setattr(self, attribute, function(self))
        return getattr(self, attribute)

    return decorator


class RNNGraph:

    def __init__(
            self, data, data_lengths, target, target_lengths,
            training, hyperparams):
        self.data = data
        self.data_lengths = data_lengths
        self.target = target
        self.target_lengths = target_lengths
        self.training = training
        self._hyperparams = hyperparams

        # Touch the lazy properties so the graph is defined in time for
        # `tf.initialize_variables()`.
        self.inference
        self.optimize
        self.loss

    @lazy_loading_property
    def encoder_cells(self):
        """ Encoding cells.

        :return: `RNNCell` object.
        """
        encoder_cells = tf.nn.rnn_cell.GRUCell(
            self._hyperparams["num_units"],
            kernel_initializer=self._hyperparams["initializer"],
            bias_initializer=self._hyperparams["initializer"])
        return encoder_cells

    @lazy_loading_property
    def decoder_cells(self):
        """ Decoding cells.

        :return: `RNNCell` object.
        """
        decoder_cells = tf.nn.rnn_cell.GRUCell(
            self._hyperparams["num_units"],
            kernel_initializer=self._hyperparams["initializer"],
            bias_initializer=self._hyperparams["initializer"])
        return decoder_cells

    @lazy_loading_property
    def inference(self, open_loop=False):
        """ Perform inference on the graph.

        :param open_loop:
        :return:
        """
        # Remove the end-of-sentence tag from targets.
        target = tf.slice(
            self.target, [0, 0, 0], [-1, tf.shape(self.target)[1]-1, -1])

        # Split the turn meta-info from the token ID.
        # [B, S, 8] --> [B, S, 7], [B, S, 1].
        data_meta, data_tokens = tf.split(self.data, [7, 1], axis=2)
        target_meta, target_tokens = tf.split(target, [7, 1], axis=2)

        # Embedding.
        self.embedding = tf.get_variable(
                "embedding",
                [self._hyperparams["vocab_size"],
                 self._hyperparams["embedding_size"]])

        # Look up embeddings.
        # Embeddings are shape: [B, S, 1, 300].
        encoder_inputs_embedded = tf.nn.embedding_lookup(
            self.embedding, tf.cast(data_tokens, tf.int32))
        decoder_inputs_embedded = tf.nn.embedding_lookup(
            self.embedding, tf.cast(target_tokens, tf.int32))

        # Remove '1' dimension from embeddings: [B, S, 1, 300] --> [B, S, 300].
        encoder_inputs_embedded = tf.squeeze(encoder_inputs_embedded, [2])
        decoder_inputs_embedded = tf.squeeze(decoder_inputs_embedded, [2])

        tf.add_to_collection(_CHECK, encoder_inputs_embedded)
        # tf.add_to_collection(_CHECK, decoder_inputs_embedded)

        # Merge the meta-info onto the token's embedding.
        # [B, S, 7] + [B, S, 300] --> [B, S, 307].
        encoder_inputs_embedded = tf.concat(
            [data_meta, encoder_inputs_embedded], axis=2)
        decoder_inputs_embedded = tf.concat(
            [target_meta, decoder_inputs_embedded], axis=2)

        # tf.add_to_collection(_CHECK, encoder_inputs_embedded)
        # tf.add_to_collection(_CHECK, decoder_inputs_embedded)

        # Run Dynamic RNN:
        #   encoder_outputs: [batch_size, max_time, num_units]
        #   encoder_states: [batch_size, num_units]
        with tf.variable_scope("gru_graph", reuse=tf.AUTO_REUSE):
            encoder_initial_state = self.encoder_cells.zero_state(
                tf.shape(self.data)[0], tf.float32)

        self.encoder_outputs, self.encoder_states = tf.nn.dynamic_rnn(
            self.encoder_cells, encoder_inputs_embedded, self.data_lengths,
            initial_state=encoder_initial_state)

        # tf.add_to_collection(_CHECK, self.encoder_outputs)
        # tf.add_to_collection(_CHECK, self.encoder_states)

        with tf.variable_scope("gru_graph", reuse=tf.AUTO_REUSE):
            # Vocabulary projection layer.
            projection_layer = tf.layers.Dense(
                self._hyperparams["vocab_size"], name="projection")

        # Decode.
        if open_loop:
            helper = tf.contrib.seq2seq.GreedyEmbeddingHelper(
                self.embedding,
                tf.fill([self.data.get_shape()[0]], "<SOS>"),
                "<EOS>")
            # Decoder.
            decoder = tf.contrib.seq2seq.BasicDecoder(
                self.decoder_cells,
                helper,
                self.decoder_state_inputs,
                output_layer=projection_layer)
            decoder_outputs, decoder_states, decoder_output_lengths = \
                tf.contrib.seq2seq.dynamic_decode(decoder)
            logits = decoder_outputs.sample_id
        else:
            # Subtract one from target lengths because we've stripped EOS.
            helper = tf.contrib.seq2seq.TrainingHelper(
                decoder_inputs_embedded, self.target_lengths-1)
            # Decoder.
            decoder = tf.contrib.seq2seq.BasicDecoder(
                self.decoder_cells,
                helper,
                self.decoder_state_inputs,
                output_layer=projection_layer)
            decoder_outputs, decoder_states, decoder_output_lengths = \
                tf.contrib.seq2seq.dynamic_decode(decoder)
            logits = decoder_outputs.rnn_output

        # tf.add_to_collection(_CHECK, decoder_outputs)
        # tf.add_to_collection(_CHECK, decoder_states)

        return logits

    @lazy_loading_property
    def optimize(self):
        """ Create an operation to perform an update step on the network.

        :return: Update step operation.
        """
        optimizer = tf.train.AdamOptimizer(self._hyperparams["learning_rate"])
        params = tf.trainable_variables()

        if self._hyperparams["gradient_checkpointing"]:
            gradients = memory_saving_gradients.gradients(
                self.loss, params, checkpoints="speed")
        else:
            gradients = tf.gradients(self.loss, params)

        # Gradient clipping.
        clipped_gradients, _ = tf.clip_by_global_norm(
            gradients, self._hyperparams["gradient_clip"])
        update_operation = optimizer.apply_gradients(
            zip(clipped_gradients, params))

        return update_operation

    @lazy_loading_property
    def loss(self):
        """ Calculate the loss from the decoded sequence.

        :return: Cross entropy loss (scalar).
        """
        # Set all valid timesteps to `1`, and padded timesteps to `0`.
        weights = tf.cast(tf.sequence_mask(self.target_lengths-1), tf.float32)

        # Create decoder output, it is shifted left once.
        _, target_tokens = tf.split(self.target, [7, 1], axis=2)
        target_tokens = tf.cast(tf.squeeze(target_tokens, [2]), tf.int32)
        target_tokens = tf.slice(target_tokens, [0, 1], [-1, -1])

        return tf.contrib.seq2seq.sequence_loss(
            self.inference, target_tokens, weights=weights)

    @lazy_loading_property
    def decoder_state_inputs(self):
        return self.encoder_states

    @lazy_loading_property
    def total_input_size(self):
        total = self._hyperparams["num_units"] + 7
        return total


if __name__ == "__main__":
    with tf.device("/gpu:0"):
        data = tf.placeholder(tf.float32, [None, None, 8], name="input")
        data_lengths = tf.placeholder(tf.int32, [None], name="input_lengths")
        target = tf.placeholder(tf.float32, [None, None, 8], name="target")
        target_lengths = tf.placeholder(
            tf.int32, [None], name="target_lengths")
        training = tf.placeholder(tf.bool, [], name="training")

    hparams = {
        "num_units": 300,
        "embedding_size": 300,
        # Optimization
        "learning_rate": 0.003,
        "gradient_clip": 5,
        "gradient_checkpointing": True,
        # Hardcoded.
        "vocab_size": 27000,
        "batch_size": 2,
        "activation": tf.nn.relu,
        "initializer": tf.contrib.layers.xavier_initializer(),
    }

    model = RNNGraph(
        data, data_lengths, target, target_lengths, training, hparams)

Publish this on PyPI

Hello,

First of all, thank you for making this work public!

I have noticed that this package is not available on PyPI, which makes distributing it harder than it needs to be.

Would you be interested in publishing it? I have published a few packages there, so if you need a hand I'd be happy to help out. It should take only a few minutes from start to finish.

Thank you and have a nice day,
Luca

Speed mode is much slower than memory mode

Settings:

  • benchmark: BERT Base
  • one 32GB V100 GPU
  • Tensorflow 1.15
  • CUDA 10.0, cuDNN 7.6.5

The measured time is the average of 10 iterations.

method                      | iteration time (ms) | memory (GB)
w/o optimization            | 557.11              | 16.42
recomputation (speed mode)  | 1457.91             | 13.32
recomputation (memory mode) | 704.9               | 7.43

The code comes from google-research/bert, with a small modification to adopt gradient checkpointing.
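To make the comparison easier to read, the measurements in the table above can be turned into relative numbers. The snippet below is only a small arithmetic sketch; the helper name `relative_cost` is made up for illustration and is not part of the package.

```python
# Measurements copied from the table above: (iteration time in ms, memory in GB).
baseline = (557.11, 16.42)
runs = {
    "recomputation (speed mode)": (1457.91, 13.32),
    "recomputation (memory mode)": (704.9, 7.43),
}


def relative_cost(run, reference):
    """Return (slowdown factor, fraction of memory saved) versus the reference run."""
    time_ms, mem_gb = run
    ref_time_ms, ref_mem_gb = reference
    return time_ms / ref_time_ms, 1.0 - mem_gb / ref_mem_gb


summary = {name: relative_cost(run, baseline) for name, run in runs.items()}
for name, (slowdown, saved) in summary.items():
    print(f"{name}: {slowdown:.2f}x time, {saved:.1%} memory saved")
```

With these numbers, speed mode is roughly 2.6 times slower than the baseline while saving about 19% of the memory, whereas memory mode is roughly 1.27 times slower while saving about 55%.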

core dump

I updated TensorFlow to 1.5, with CUDA 9 and cuDNN 7.
After installing tf-nightly-gpu, I got a core dump:

$ python -c 'import tensorflow'
Segmentation fault (core dumped)

Program terminated with signal 11, Segmentation fault.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `python -c import tensorflow '.
Program terminated with signal 11, Segmentation fault.
#0 0x00007fa181219976 in _dl_relocate_object () from /lib64/ld-linux-x86-64.so.2
Missing separate debuginfos, use: debuginfo-install python-2.7.5-58.el7.x86_64
(gdb) bt
#0 0x00007fa181219976 in _dl_relocate_object () from /lib64/ld-linux-x86-64.so.2
#1 0x00007fa181221b3c in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
#2 0x00007fa18121d1b4 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#3 0x00007fa1812211ab in _dl_open () from /lib64/ld-linux-x86-64.so.2
#4 0x00007fa180a2302b in dlopen_doit () from /lib64/libdl.so.2
#5 0x00007fa18121d1b4 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#6 0x00007fa180a2362d in dlerror_run () from /lib64/libdl.so.2
#7 0x00007fa180a230c1 in dlopen@@GLIBC_2.2.5 () from /lib64/libdl.so.2
#8 0x00007fa179a261f1 in py_dl_open () from /usr/lib64/python2.7/lib-dynload/ctypes.so
#9 0x00007fa180f26bb0 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#10 0x00007fa180f28efd in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#11 0x00007fa180eb2858 in function_call () from /lib64/libpython2.7.so.1.0
#12 0x00007fa180e8d9a3 in PyObject_Call () from /lib64/libpython2.7.so.1.0
#13 0x00007fa180e9c995 in instancemethod_call () from /lib64/libpython2.7.so.1.0
#14 0x00007fa180e8d9a3 in PyObject_Call () from /lib64/libpython2.7.so.1.0
#15 0x00007fa180ee4947 in slot_tp_init () from /lib64/libpython2.7.so.1.0
#16 0x00007fa180ee365f in type_call () from /lib64/libpython2.7.so.1.0
#17 0x00007fa180e8d9a3 in PyObject_Call () from /lib64/libpython2.7.so.1.0
#18 0x00007fa180f220f6 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#19 0x00007fa180f28efd in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#20 0x00007fa180f29002 in PyEval_EvalCode () from /lib64/libpython2.7.so.1.0
#21 0x00007fa180f38dec in PyImport_ExecCodeModuleEx () from /lib64/libpython2.7.so.1.0
#22 0x00007fa180f39068 in load_source_module () from /lib64/libpython2.7.so.1.0
#23 0x00007fa180f39d01 in import_submodule () from /lib64/libpython2.7.so.1.0
#24 0x00007fa180f39fe6 in load_next () from /lib64/libpython2.7.so.1.0
#25 0x00007fa180f3a92e in PyImport_ImportModuleLevel () from /lib64/libpython2.7.so.1.0
#26 0x00007fa180f1dbdf in builtin___import__ () from /lib64/libpython2.7.so.1.0
#27 0x00007fa180e8d9a3 in PyObject_Call () from /lib64/libpython2.7.so.1.0
#28 0x00007fa180f1f7b7 in PyEval_CallObjectWithKeywords () from /lib64/libpython2.7.so.1.0
#29 0x00007fa180f24475 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#30 0x00007fa180f28efd in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#31 0x00007fa180f29002 in PyEval_EvalCode () from /lib64/libpython2.7.so.1.0
#32 0x00007fa180f38dec in PyImport_ExecCodeModuleEx () from /lib64/libpython2.7.so.1.0
#33 0x00007fa180f39068 in load_source_module () from /lib64/libpython2.7.so.1.0
#34 0x00007fa180f39d01 in import_submodule () from /lib64/libpython2.7.so.1.0
#35 0x00007fa180f3a1ff in ensure_fromlist () from /lib64/libpython2.7.so.1.0
#36 0x00007fa180f3aa3a in PyImport_ImportModuleLevel () from /lib64/libpython2.7.so.1.0
#37 0x00007fa180f1dbdf in builtin___import__ () from /lib64/libpython2.7.so.1.0
#38 0x00007fa180e8d9a3 in PyObject_Call () from /lib64/libpython2.7.so.1.0
#39 0x00007fa180f1f7b7 in PyEval_CallObjectWithKeywords () from /lib64/libpython2.7.so.1.0
#40 0x00007fa180f24475 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#41 0x00007fa180f2657d in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#42 0x00007fa180f2657d in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#43 0x00007fa180f28efd in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#44 0x00007fa180eb2858 in function_call () from /lib64/libpython2.7.so.1.0
#45 0x00007fa180e8d9a3 in PyObject_Call () from /lib64/libpython2.7.so.1.0
#46 0x00007fa180f1f7b7 in PyEval_CallObjectWithKeywords () from /lib64/libpython2.7.so.1.0
#47 0x00007fa180f43c9c in PyErr_PrintEx () from /lib64/libpython2.7.so.1.0
#48 0x00007fa180f44c9c in PyRun_SimpleStringFlags () from /lib64/libpython2.7.so.1.0
#49 0x00007fa180f55520 in Py_Main () from /lib64/libpython2.7.so.1.0
#50 0x00007fa18017cb15 in __libc_start_main () from /lib64/libc.so.6
#51 0x000000000040071e in _start ()

A few more citations

This is a great package! Thanks for making it available.

FYI, your README should cite a few more works:

Zweig, Geoffrey and Padmanabhan, Mukund. Exact Alpha-Beta Computation in
Logarithmic Space with Application to MAP Word Graph Construction. Sixth
International Conference on Spoken Language Processing, 2000.

Lewis, Bil. Debugging Backwards in Time. arXiv preprint cs/0310016, 2003.

How does gradient checkpointing relate to reversible layers?

While studying ways to optimize GPU memory consumption, I found two approaches:

  1. Gradient checkpointing
  2. Reversible layers (from the Reformer paper)

Can you explain if there is a connection between them and which of the methods is more relevant now?
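As background for this question: both techniques trade extra computation for lower activation memory. Gradient checkpointing stores a subset of activations and recomputes the rest with an additional forward pass, while a reversible residual block (the RevNet-style construction used by the Reformer) is built so that its inputs can be reconstructed exactly from its outputs. The toy sketch below illustrates only that invertibility property on plain floats; the functions `F` and `G` are arbitrary stand-ins chosen for illustration, not anything from this package.

```python
import math


def F(x):
    # Arbitrary stand-in for the first residual function.
    return math.tanh(x)


def G(x):
    # Arbitrary stand-in for the second residual function.
    return 0.5 * x


def reversible_forward(x1, x2):
    """Forward pass of a reversible block: y1 = x1 + F(x2), y2 = x2 + G(y1)."""
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2


def reversible_inverse(y1, y2):
    """Recover the inputs from the outputs, so they need not be kept in memory."""
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2


y1, y2 = reversible_forward(0.3, -1.2)
x1, x2 = reversible_inverse(y1, y2)
print(x1, x2)
```

The inverse never needs the inverse of `F` or `G` themselves, which is why the construction works for arbitrary sub-networks. Checkpointing, by contrast, places no constraint on the architecture.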

TF 1.6rc1 Support

Is it included in 1.6rc1, or do we still need to install nightly?

'NoneType' object has no attribute 'op'

I am trying to run the code for my model which uses 3d convolution and fully connected layers.

grads = gradient_memory(train_loss, self.model_variables)
grads = list(zip(grads, self.model_variables))

This should give me the same list as

optimizer.compute_gradients(train_loss, var_list=self.model_variables)

But instead, I get:

File "gradient_checkpointing.py", line 274, in
inputs_to_do_before = [d_checkpoints[r].op for r in ts]
'NoneType' object has no attribute 'op'

Can you help me with this, please?

I have set the checkpoints equal to ts_all.
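A note that may help with debugging: `tf.gradients` returns `None` for any tensor that the loss does not depend on. One possible cause of the error above, which is an assumption and not something confirmed in this thread, is that the checkpoint set contains such a tensor. The sketch below shows one way to separate candidates by whether a gradient exists. The helper `split_by_gradient` and the tensor names are hypothetical, and the stand-in lists let it run without TensorFlow.

```python
def split_by_gradient(candidates, gradients):
    """Split candidates into (connected, unconnected) by whether the gradient is None."""
    if len(candidates) != len(gradients):
        raise ValueError("candidates and gradients must have the same length")
    connected = [c for c, g in zip(candidates, gradients) if g is not None]
    unconnected = [c for c, g in zip(candidates, gradients) if g is None]
    return connected, unconnected


# Stand-in values so the sketch runs without TensorFlow. In a TF 1.x graph the
# second list would come from tf.gradients(train_loss, candidates).
candidates = ["conv3d_1/Relu:0", "debug/summary:0", "fc_1/Relu:0"]  # hypothetical names
gradients = ["grad_a", None, "grad_b"]

connected, unconnected = split_by_gradient(candidates, gradients)
print("usable as checkpoints:", connected)
print("no gradient path to the loss:", unconnected)
```

If the second list is non-empty for the real graph, passing only the first list as `checkpoints` would be one thing to try.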

Package Maintenance for TF-2.0 with Contrib Module Sunset

I am considering using this package, which seems to be an excellent option for building models from scratch (and I would contribute to it if I need to modify it to fit my model architectures). Do you plan to maintain it for TF 2.0?

graph_editor is going to be removed from the framework, since so far there is no proposal for maintaining it.
