
framework-reproducibility's Introduction

Framework Reproducibility (fwr13y)

Repository Name Change

The name of this GitHub repository was changed to framework-reproducibility on 2023-02-14. Prior to this, it was named framework-determinism. Before that, it was named tensorflow-determinism.

"In addition to redirecting all web traffic, all git clone, git fetch, or git push operations targetting the previous location[s] will continue to function as if made to the new location. However, to reduce confusion, we strongly recommend updating any existing local clones to point to the new repository URL." -- GitHub documentation

Repository Intention

This repository is intended to:

  • provide documentation, status, patches, and tools related to determinism (bit-accurate, run-to-run reproducibility) in deep learning frameworks, with a focus on determinism when running on GPUs, and
  • provide a tool, and related guidelines, for reducing variance (Seeder) in deep learning frameworks.
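For orientation, the basic recipe that this documentation and the issues below revolve around looks roughly like the following minimal sketch. It assumes a recent TensorFlow (2.8+), where tf.config.experimental.enable_op_determinism() is available; on the older versions discussed in the issues, the TF_DETERMINISTIC_OPS=1 environment variable plays the same role.

import os
import random

import numpy as np
import tensorflow as tf

SEED = 42

# Seed every pseudorandom number generator the program touches.
os.environ["PYTHONHASHSEED"] = str(SEED)  # ideally set before Python starts
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)

# TF 2.8+: make ops run deterministically (or raise if they cannot),
# instead of silently using nondeterministic GPU kernels.
tf.config.experimental.enable_op_determinism()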

framework-reproducibility's People

Contributors

duncanriach, gavins13, grzegorz-k-karch, ned2, seonbeomkim, simsso, tomsiadev, wenscarl, zivsha


framework-reproducibility's Issues

Usage of numpy.random.Generator to deal with Data-Loader Parallelism

Hi @duncanriach
Thanks for an awesome repo, which helped to fix a number of issues in my code.
For now I can achieve reproducible results running on GPU with TF 2.4.1 and a single worker, which is good but painfully slow.

You mentioned that there is an option to define the pseudorandom number generator state for each parallel thread that runs. Do you have any code example of how to do that using TF 2.0+?
It seems that the spawning of workers is handled internally, and the only way to interact with the process is to set the num_parallel_calls value on the dataset.

Best regards,
Jamil
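A minimal sketch of the numpy.random.Generator spawning pattern the question refers to; the worker count and augmentation function are hypothetical, and wiring the per-worker generators into tf.data's internal threads (the hard part of the question) is not addressed here.

import numpy as np

NUM_WORKERS = 4  # hypothetical worker count

# One root SeedSequence; spawn() derives statistically independent child
# sequences, so each parallel worker gets its own reproducible state.
root = np.random.SeedSequence(12345)
worker_rngs = [np.random.default_rng(child) for child in root.spawn(NUM_WORKERS)]

def augment(sample, worker_id):
    # Hypothetical augmentation: each worker draws only from its own RNG.
    noise = worker_rngs[worker_id].normal(scale=0.01, size=np.shape(sample))
    return sample + noise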

More segment ops need to be patched

Following #25, I tested more segment ops, as shown in this Colab notebook.

In conclusion, for the unsorted segment ops, tf.math.unsorted_segment_(mean,prod,sqrt_n,sum) need to be patched; for the sorted segment ops, only tf.math.segment_sum needs to be patched.
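For reference, the cast-to-float64 workaround mentioned in this and a later issue on this page looks roughly like the sketch below, shown for tf.math.unsorted_segment_sum; whether the float64 kernels are deterministic on a given TF version should be verified against the notebook's tests.

import tensorflow as tf

def unsorted_segment_sum_fp64(data, segment_ids, num_segments):
    # Accumulate in float64 and cast back; per the tests referenced
    # above, this reportedly avoids the nondeterministic float32 path.
    result = tf.math.unsorted_segment_sum(
        tf.cast(data, tf.float64), segment_ids, num_segments)
    return tf.cast(result, data.dtype)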

Determinism across GPU architectures?

I manage to get deterministic results within the same GPU architecture, but not across architectures. Is this expected or is there something I can do to get the same results on all cards?

To be explicit, I get this:

  • 1080ti (Pascal): result 1
  • Titan X (Pascal): result 1
  • Titan X (Maxwell): result 1
  • 2080Ti (Turing): result 2
  • Titan RTX (Turing): result 2

This is encouraging, as I get reproducible results across different computers, but it seems like the results are still GPU architecture-specific.

I'm using Ubuntu 18.04, NVIDIA driver 440.100, CUDA 10.0, cuDNN 7.6.1, and TensorFlow 1.15.

Don't get deterministic results with NGC 20.09

Prerequisites

I am using the latest TensorFlow NGC container, nvcr.io/nvidia/tensorflow:20.09-tf2-py3

Issue

When running my deep learning model on 1 GPU (Tesla T4) within the NGC 20.09 container, I don't get reproducible results between consecutive runs (different predictions), although all the seeds are set correctly. Running the same code on CPU gives reproducible results over consecutive runs.

On a small dataset (138k samples), I managed to get reproducible results within the NGC container by setting the default float type to float64 (instead of float32).
But with a bigger dataset (2.6M samples), after about 1 million samples the predictions start to differ across runs.
The more I increase the dataset size, the bigger the difference between the predictions of consecutive runs.

My datasets are tfrecord files of hashed values, serialized in batches of 10,000 samples.
Here is the small dataset (size 13M):
small_dataset
For the big dataset (size 251M), can you send me your email so I can share it with you via Google Drive, so you can reproduce the issue?
You need to unzip them before running the script below.

Command lines:

sudo docker run --gpus all --shm-size=1g --ulimit memlock=-1 -it --rm -v /home:/workspace/home nvcr.io/nvidia/tensorflow:20.09-tf2-py3

CUDA_VISIBLE_DEVICES="0" TF_DETERMINISTIC_OPS="1" TF_CPP_MIN_LOG_LEVEL="3" python code.py data_file prediction_file

Script

import tensorflow as tf
import sys
import string
import os
from tensorflow.python.keras.initializers import RandomNormal, GlorotUniform, Zeros, glorot_normal
from tensorflow.python.keras.regularizers import l2
from tensorflow.compat.v1.keras.layers import Layer, Embedding, Input, Dense, Flatten, Add
from tensorflow.python.keras import backend as K
import numpy as np
import logging
tf.get_logger().setLevel(logging.ERROR)
K.set_floatx('float64')

prediction_output_name = "pred"

SEED = 1024
os.environ['PYTHONHASHSEED'] = str(SEED)
np.random.seed(SEED)
tf.compat.v1.set_random_seed(SEED)


class Linear(Layer):

    def __init__(self, l2_reg=0, **kwargs):
        # Accept the regularization strength passed at the call site
        # below; it is unused in this simple layer.
        self.l2_reg = l2_reg
        super(Linear, self).__init__(**kwargs)

    def call(self, x):
        return tf.reduce_sum(x, axis=[1])

    def compute_output_shape(self):
        return None, 1


class FM(Layer):

    def __init__(self, **kwargs):

        super(FM, self).__init__(**kwargs)

    def build(self, input_shape):
        if len(input_shape) != 3:
            raise ValueError("Unexpected inputs dimensions % d,\
                              expect to be 3 dimensions" % (len(input_shape)))
        super(FM, self).build(input_shape)

    def call(self, inputs):
        if K.ndim(inputs) != 3:
            raise ValueError(
                "Unexpected inputs dimensions %d, expect to be 3 dimensions"
                % (K.ndim(inputs)))

        concated_embeds_value = inputs
        square_of_sum = tf.square(tf.compat.v1.reduce_sum(
            concated_embeds_value, axis=1, keep_dims=True))
        sum_of_square = tf.compat.v1.reduce_sum(
            concated_embeds_value * concated_embeds_value, axis=1, keep_dims=True)
        cross_term = square_of_sum - sum_of_square
        cross_term = 0.5 * tf.compat.v1.reduce_sum(cross_term, axis=2, keep_dims=False)

        return cross_term

    def compute_output_shape(self):
        return None, 1


class DNN(Layer):

    def __init__(self, hidden_units, activation='relu', l2_reg=0, dropout_rate=0, use_bn=False, seed=1024, **kwargs):
        self.hidden_units = hidden_units
        self.activation = activation
        self.dropout_rate = dropout_rate
        self.seed = seed
        self.l2_reg = l2_reg
        self.use_bn = use_bn
        super(DNN, self).__init__(**kwargs)

    def build(self, input_shape):
        input_size = input_shape[-1]
        hidden_units = [int(input_size)] + list(self.hidden_units)
        self.kernels = [self.add_weight(name='kernel' + str(i),
                                        shape=(
                                            hidden_units[i], hidden_units[i + 1]),
                                        initializer=glorot_normal(
                                            seed=self.seed),
                                        # initializer=Constant(0.4),
                                        regularizer=l2(self.l2_reg),
                                        trainable=True) for i in range(len(self.hidden_units))]
        self.bias = [self.add_weight(name='bias' + str(i),
                                     shape=(self.hidden_units[i],),
                                     initializer=Zeros(),
                                     trainable=True) for i in range(len(self.hidden_units))]
        if self.use_bn:
            self.bn_layers = [tf.keras.layers.BatchNormalization() for _ in range(len(self.hidden_units))]

        self.dropout_layers = [tf.keras.layers.Dropout(self.dropout_rate, seed=self.seed + i) for i in
                               range(len(self.hidden_units))]

        self.activation_layers = [tf.keras.layers.Activation(self.activation) for _ in range(len(self.hidden_units))]
        super(DNN, self).build(input_shape)

    def call(self, inputs, training=None):

        deep_input = inputs

        for i in range(len(self.hidden_units)):
            fc = tf.math.add(tf.tensordot(
                deep_input, self.kernels[i], axes=(-1, 0)), self.bias[i])

            if self.use_bn:
                fc = self.bn_layers[i](fc, training=training)

            fc = self.activation_layers[i](fc)

            fc = self.dropout_layers[i](fc, training=training)
            deep_input = fc

        return deep_input

    def compute_output_shape(self, input_shape):
        if len(self.hidden_units) > 0:
            shape = input_shape[:-1] + (self.hidden_units[-1],)
        else:
            shape = input_shape

        return tuple(shape)

    def get_config(self, ):
        config = {'activation': self.activation, 'hidden_units': self.hidden_units,
                  'l2_reg': self.l2_reg, 'use_bn': self.use_bn, 'dropout_rate': self.dropout_rate, 'seed': self.seed}
        base_config = super(DNN, self).get_config()
        return dict(list(base_config.items()) + list(config.items()))


class PredictionLayer(Layer):

    def __init__(self, task='binary', use_bias=True, **kwargs):
        if task not in ["binary", "multiclass", "regression"]:
            raise ValueError("task must be binary,multiclass or regression")
        self.task = task
        self.use_bias = use_bias
        super(PredictionLayer, self).__init__(**kwargs)

    def build(self, input_shape):

        if self.use_bias:
            self.global_bias = self.add_weight(
                shape=(1,), initializer=Zeros(), name="global_bias")

        super(PredictionLayer, self).build(input_shape)

    def call(self, inputs):
        x = inputs
        if self.use_bias:
            x = tf.nn.bias_add(x, self.global_bias, data_format='NHWC')
        if self.task == "binary":
            x = tf.sigmoid(x)

        output = tf.cast(tf.reshape(x, (-1, 1)), tf.float32)

        return output

    def compute_output_shape(self):
        return None, 1

    def get_config(self, ):
        config = {'task': self.task, 'use_bias': self.use_bias}
        base_config = super(PredictionLayer, self).get_config()
        return dict(list(base_config.items()) + list(config.items()))


def input_pipeline(dataset_file, serialized_features, epochs):
    dataset = create_dataset(dataset_file, serialized_features, epochs)
    dataset = dataset.prefetch(1)  # prefetch returns a new dataset; keep it
    iterator = tf.compat.v1.data.make_initializable_iterator(dataset)
    training_init_op = iterator.initializer
    return iterator, training_init_op


def create_dataset(dataset_file, serialized_features, epochs):
    decode_func = make_decoder(serialized_features)
    return tf.data.TFRecordDataset(dataset_file).map(decode_func).batch(1).repeat(epochs)


def make_decoder(serialized_features):
    features_dict = {f: tf.io.VarLenFeature(dtype=tf.int64) for f in serialized_features}
    features_dict['label'] = tf.io.VarLenFeature(dtype=tf.int64)

    def decode(serialized_example):
        features = tf.io.parse_example(
            [serialized_example],
            features_dict
        )
        return features
    return decode


def select_features(features, hash_size, model_features):
    selected_features = []
    for feature_name in model_features:
        feature_tensor = features.get(feature_name[0])
        dense_feature_tensor = tf.transpose(tf.sparse.to_dense(tf.cast(feature_tensor, dtype=tf.int32)))
        dense_feature_tensor = tf.math.mod(dense_feature_tensor, hash_size)
        selected_features.append(dense_feature_tensor)
    return selected_features


def build_optimizer(learning_rate, optimizer):
    if optimizer == 'RMSProp':
        opt = tf.compat.v1.train.RMSPropOptimizer(learning_rate=learning_rate)
    elif optimizer == 'adagrad':
        opt = tf.compat.v1.train.AdagradOptimizer(learning_rate=learning_rate)
    elif optimizer == 'adam':
        opt = tf.compat.v1.train.AdamOptimizer(learning_rate=learning_rate)
    else:
        opt = tf.compat.v1.train.GradientDescentOptimizer(learning_rate=learning_rate)
    return opt


def predictions_saver(prediction_path, predictions):
    with open(prediction_path, "a") as output_file:
        output_file.write('\n'.join([str(x) for x in predictions]) + '\n')


def summarize_weights(session):
  if hasattr(session, 'raw_session'): session = session.raw_session()
  weights = session.run(tf.compat.v1.trainable_variables())
  summary = sum(map(lambda x: x.sum(), weights))
  print("Summary of weights: %.20f" % summary)


if __name__ == '__main__':
    dataset_file = sys.argv[1]
    prediction_path = sys.argv[2]
    header = list(string.ascii_uppercase) + list(string.ascii_lowercase) + [str(i) for i in range(10)]
    epochs = 1
    hash_size = 2 ** 25
    model_features = "A-B-C-D".split('-')
    init_std = 0.01
    embedding_size = 4
    l2_reg_lr, l2_reg_nn, l2_reg_emb = 0, 0, 0
    hidden_units_num = 200
    hidden_layers_num = 4
    dnn_activation_function = "relu"
    dnn_dropout = 0
    learning_rate = 0.0004
    optimizer = "adam"

    graph = tf.Graph()
    with graph.as_default():
        # Input pipeline
        iterator, training_init_op = input_pipeline(dataset_file, header, epochs)
        features = iterator.get_next()
        selected_features = select_features(features, hash_size, model_features)
        stacked_features = tf.stack(selected_features, axis=1)
        inputs = tf.reshape(stacked_features, [-1, len(model_features)])
        label = tf.reshape(tf.sparse.to_dense(features['label']), [-1])

        # Linear part
        emb_linear = tf.compat.v1.keras.layers.Embedding(
            hash_size,
            1,
            embeddings_initializer=RandomNormal(mean=0.0, stddev=init_std, seed=SEED),
            embeddings_regularizer=l2(l2_reg_emb),
            embeddings_constraint=None,
            mask_zero=False,
            input_length=len(model_features)
        )(inputs)

        linear_logit = Linear(l2_reg_lr)(emb_linear)

        # Embeddings for FM / DNN
        emb_fm = tf.compat.v1.keras.layers.Embedding(
            hash_size,
            embedding_size,
            embeddings_initializer=RandomNormal(mean=0.0, stddev=init_std, seed=SEED),
            embeddings_regularizer=l2(l2_reg_emb),
            embeddings_constraint=None,
            mask_zero=False,
            input_length=len(model_features)
        )(inputs)

        # FM part
        fm_logit = FM()(emb_fm)

        # DNN part
        dnn_input = Flatten()(emb_fm)
        dnn_output = DNN([hidden_units_num] * hidden_layers_num, activation=dnn_activation_function, l2_reg=l2_reg_nn,
                    dropout_rate=dnn_dropout, seed=SEED)(dnn_input)
        # dnn_logit = Dense(1, use_bias=False, activation=None, kernel_initializer="ones")(dnn_output)
        dnn_logit = Dense(1, use_bias=False, activation=None, kernel_initializer=GlorotUniform(seed=SEED))(dnn_output)

        # Add together
        final_logit = Add()([linear_logit, fm_logit, dnn_logit])

        # Prediction layer
        pred = tf.reshape(PredictionLayer('binary')(final_logit), [-1], name=prediction_output_name)
        loss = tf.identity(tf.compat.v1.losses.log_loss(label, pred), name='LOSS')
        opt = build_optimizer(learning_rate, optimizer)
        train_op = opt.minimize(loss)  # avoid shadowing the 'optimizer' string above

        init_all_vars = tf.compat.v1.global_variables_initializer()
        saver = tf.compat.v1.train.Saver()

        config = tf.compat.v1.ConfigProto()
        config.gpu_options.allow_growth = True
        session = tf.compat.v1.Session(graph=graph, config=config)

        session.run(init_all_vars)
        session.run(training_init_op)
        summarize_weights(session)

        while True:
            try:
                inp, predictions, _ = session.run([inputs, pred, train_op])
                summarize_weights(session)

                with open(prediction_path, "a") as output_file:
                    output_file.write('\n'.join([str(x) for x in predictions]) + '\n')
            except tf.errors.OutOfRangeError:
                break

        session.close()

Could you help me understand why I don't get exactly the same predictions across consecutive runs?

Support for TF 2.1.0

First of all, thank you so much for this package! I'm using Databricks, and they've upgraded some of their runtimes to TF 2.1.0, so I was wondering if I might request support for that. It isn't a blocker for me (I just downgrade to 2.0.0), but it would be nice to have moving forward. Thanks for your work on this!
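(For stock TF 2.1+, the deterministic-op functionality is built into TensorFlow itself, so no separate patch is needed; the minimal recipe, consistent with the later issues on this page, is:)

import os

# Must be set before TensorFlow executes any ops.
os.environ["TF_DETERMINISTIC_OPS"] = "1"

import tensorflow as tf  # noqa: E402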

Message passing neural network determinism thwarted by tf.math.segment_sum and tf.gather

Hi @duncanriach,

First of all, thank you for the in depth overview of randomness on GPU and your work in improving them.

I am unfortunately unable to provide code for my issue but I was hoping to see what ideas you could possibly have.

Setup: Running tensorflow 2.2.0 with TF estimators on g4dn AWS instances. Not using Keras models (although we do use tf.keras.regularizers.l2), but mostly tf.compat.v1.layers.dense, tf.matmul, and tf.math.segment_sum. I am using an MPNN model, if that helps haha. I set os.environ['TF_DETERMINISTIC_OPS'] = '1' as you suggest. I also have np.random.seed and tf.estimator.RunConfig(tf_random_seed) set.

What I've tried: Running it 3 times each: on CPU for 5 epochs with a 0.001 learning rate, and on GPU for 5 epochs with 0.001, 0.0001, and 0.00001 learning rates.

What I've found: All 3 runs on CPU are the same. All 3 runs on GPU are different, except that the 0.001 learning rate diverges much faster than 0.0001, which diverges much faster than 0.00001. You can see the results here:
[image: training results for the three learning rates]

I don't expect GPU to be fully deterministic, but in order to compare changes to my models/featurizations, the amount of variation for the 0.001 learning rate on GPU is far too much. The variance for the 0.00001 learning rate is quite acceptable, and I was wondering if you have any ideas on what is causing this and how to mitigate it. Since CPU is fully deterministic, I would expect this to be purely a GPU issue. Does that seem correct?
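One stopgap sometimes used while segment reductions lack deterministic GPU kernels is to express the reduction as a matmul, whose GPU kernels are deterministic for a fixed configuration. A sketch (the dense one-hot matrix makes this practical only for modest segment counts):

import tensorflow as tf

def segment_sum_via_matmul(data, segment_ids, num_segments):
    # data: [N, D]; one_hot: [N, S]; result: [S, D].
    # Memory cost is O(N * S), so this is a stopgap, not a general fix.
    one_hot = tf.one_hot(segment_ids, num_segments, dtype=data.dtype)
    return tf.linalg.matmul(one_hot, data, transpose_a=True)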

Model not deterministic/reproducible with all seeds set and os.environ['TF_DETERMINISTIC_OPS'] = '1'

In reference to this Kaggle notebook, I've written this notebook to try transfer learning through multiple datasets.

Through multiple soft runs of the notebook, I found out that I can't get a deterministic/reproducible model out of it.

The weird thing is, with the same virtual environment, I can get deterministic/reproducible results from a plain CNN-MNIST code.

I've spent a week researching and can't find the solution. Any guidance or suggestions are much appreciated.

colab code for reproducing the errors (dataset download and extractions included within)
my virtual environment

P.S. Link duplicated from issue 44414, as I think this issue should be mentioned here.

Tensorflow-determinism not working with conda tensorflow-gpu

I have tested tensorflow-determinism on my Windows computer with tensorflow-gpu 1.14. tensorflow-determinism was installed via pip from the Anaconda prompt.

However, the output of the program is different on each run.

I have tested everything on another computer with identical code; the only difference is that tensorflow-determinism was installed via pip from the Windows cmd prompt.

For reasons unknown to me, I cannot install tensorflow via pip from the Windows cmd prompt on the former machine, and I think this is why it is not working.

If possible, I hope you can address this issue. Thanks

Nondeterminism from tf.image.crop_and_resize

Hi, first of all, congrats on your repo and your talk as well. I'm running into a situation where I cannot reproduce the result each time I train the model on a custom dataset.

System information

  • tensorflow-gpu==1.14

  • keras==2.2.4

  • CUDA==v10.0

  • GPU: GeForce GTX 1050

I'm using repo Mask-RCNN of matterport : https://github.com/matterport/Mask_RCNN

As far as I'm aware, I have locked everything with a fixed seed at the beginning of my modified code:

import os
import random

import numpy as np  # needed for np.random.seed below
import tensorflow as tf
from tfdeterminism import patch

patch()
random.seed(42)
np.random.seed(42)
tf.set_random_seed(42)

and also a seed in my DataGenerator as well, so on every run with the same number of epochs, every iteration of every epoch will use the same images.

But I have noticed something in mrcnn/model.py:

https://github.com/matterport/Mask_RCNN/blob/master/mrcnn/model.py#L421

the function
tf.image.crop_and_resize()

I think it is involved in the non-determinism issue, so what did I miss? Please help.

Will this package work on tf.keras model?

Regarding "stock tensorflow 2.1", I installed tensorflow by using pip install tensorflow-gpu==2.1.0, shall I need strictly follow the pip installation command in your README file?

Also, if using tf.keras, shall I expect to get deterministic experiment results?

Non-reproducible model training results with TensorFlow 2.5

Having issues attaining reproducible results after running model training multiple times with the same dataset, model, training loop, and seed.

Environment

  • python: 3.8.5
  • cuda: 11.2
  • tensorflow: 2.5.0

Code

Function to set random seed.

import os
import random

import numpy as np
import tensorflow as tf


def set_seeds(seed=0):
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)

    os.environ["TF_DETERMINISTIC_OPS"] = "1"
    os.environ["TF_CUDNN_DETERMINISTIC"] = "1"

    # called once at beginning of program
    tf.config.threading.set_inter_op_parallelism_threads(1)
    tf.config.threading.set_intra_op_parallelism_threads(1)

def sum_of_weights(model):
    return sum(map(lambda x: x.sum(), model.get_weights()))

Model training loop; here the exact same model is repeatedly trained over the same dataset and seed.

data = {}
models = {}
pre_weights = {}
post_weights = {}

# Loop over model hyperparameters
for i, hyperparams in enumerate(all_hyperparams):

    # Load dataset generators
    client = Dataset(hyperparams).get_client(0)
    gen_train, gen_valid = client.create_generators()

    # Create model
    model = Model(client).create()
    pre_weights[i] = reproducibility.sum_of_weights(model) # store sum of initial weights
    
    # Create a trainer
    trainer = Trainer(hyperparams)
    
    # Train model (under the hood, same as model.fit)
    history = trainer.fit(
            model,
            gen_train,
            gen_valid,
            iters=500,
            steps_per_epoch=100,
            validation_freq=5,
            callbacks=None,
    )
    post_weights[i] = reproducibility.sum_of_weights(history.model) # store sum of trained weights
    
    models[i] = model # store model
    
    x, y = next(gen_train)
    data[i] = np.sum(x['dat']) # store sum of a batch of data

The set_seeds function is called (in __init__) every time one of the classes (Dataset, Model, Trainer) is instantiated. The loss function being used is tf.keras.losses.SparseCategoricalCrossentropy.


Output

# sum of initial model weights
{0: 1377.537437170744, 1: 1377.537437170744, 2: 1377.537437170744, 3: 1377.537437170744, 4: 1377.537437170744}

# sum of trained model weights
{0: 1167.8577085053264, 1: 1167.3095767943782, 2: 1165.1663435424903, 3: 1165.0273841904536, 4: 1167.2645659474476}

# sum of a batch of data
{0: -3.637978807091713e-12, 1: -3.637978807091713e-12, 2: -3.637978807091713e-12, 3: -3.637978807091713e-12, 4: -3.637978807091713e-12}

The initial weights seem to be seeded correctly, but the weights after training are not; I wonder what the issue could be...

Am I missing something? This was all trained on the same GPU and kernel session.

I recall that the reduce_* operations were non-deterministic at one point, but I assume they have been fixed. Are there any recommended resources that shed light on the potentially non-deterministic operations, with suggestions to ensure reproducibility?

Thanks in advance 😊. It is of tremendous value that we are able to reproduce results, whether for general model comparison or even life-impacting applications.
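One way to localize where two nominally identical runs diverge is to compare variable by variable rather than with a single checksum; a hypothetical helper in the spirit of sum_of_weights above:

import numpy as np

def first_divergent_variable(model_a, model_b):
    # Walk both models' variables in order and report the first mismatch.
    for wa, wb in zip(model_a.weights, model_b.weights):
        if not np.array_equal(wa.numpy(), wb.numpy()):
            print("First divergent variable:", wa.name)
            return wa.name
    print("All variables are bit-identical.")
    return None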

Reproducibility issue with transformers (BERT) and tf2.2

Dear @duncanriach,
Thank you for your contributions, work and guidance towards making tensorflow deterministic in the recent releases.
Unfortunately, for popular keras NLP models (BERT) some problems seem to remain (see also related issue in this repository #14).

In spite of combining learnings from:

... I am still arriving at the following short, non-deterministic colab notebook example.

My results for the sum of model weights (as computed with a function you had suggested) after training for only 5 steps are below; the two GPU runs diverge in the digits following -641237.442:

Device      Before training          After training
Run 1 GPU   -641227.5609667897224    -641237.4425159916282
Run 2 GPU   -641227.5609667897224    -641237.4423093758523
Run 1 CPU   -641227.5609667301178    -641238.1506845243275
Run 2 CPU   -641227.5609667301178    -641238.1506845243275

This variance gets increasingly more pronounced when the model is trained for longer periods of time.

Could you please help identify the source of non-determinism and provide guidance on how we can resolve it?

As transformers is a very popular package (29.1k GitHub stars), I expect that many other people are silently impacted by this phenomenon.

Note: As shown above, I have observed that the same code becomes fully deterministic when running on the colab CPU runtime.

Running on stock TensorFlow version >= 2.1

I am trying to run the framework on a nightly version of TensorFlow (2.2.0-dev20200428)

"Exception: tfdeterminism: No patch available for version 2.2.0-dev20200428 of TensorFlow"

Is there a workaround to run on this TensorFlow version?

Make tfdeterminism.patch() handle tf >= 2.1 as well

Hi @duncanriach,
I was wondering if it would not make sense for tfdeterminism.patch() to handle tensorflow >= 2.1 as well.

My viewpoint is this: maybe the easiest interface to grasp for this package would be that it renders the current version of tensorflow deterministic (or as deterministic as possible) on GPUs, if there is any hope of doing so (tensorflow >= 1.14).

Following that logic, for recent tf versions (>= 2.1), the following implementation of patch would make sense:

import os

def patch():
    ...
    os.environ["TF_DETERMINISTIC_OPS"] = "1"

Otherwise, it forces users who want to support multiple versions of tf2 to write code of the sort:

import os

import tensorflow
from tfdeterminism import patch

try:
    patch()
except TypeError:  # tf >= 2.2
    os.environ["TF_DETERMINISTIC_OPS"] = "1"

vs

import tensorflow
from tfdeterminism import patch

# only raises for tf <= 1.13, for which GPU determinism is not possible
patch()

which covers tensorflow >= 1.14 fine. I understand the need to complain in certain scenarios (tensorflow < 1.14), but I feel that supporting more recent versions would be worthwhile and easier on users.

What do you think?
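A sketch of the proposed behavior, dispatching on the installed version (the legacy-path helper is hypothetical):

import os

import tensorflow as tf

def patch():
    # Hypothetical version-dispatching patch(), per the proposal above.
    major, minor = (int(part) for part in tf.__version__.split(".")[:2])
    if (major, minor) >= (2, 1):
        # Determinism support is built into stock TF >= 2.1.
        os.environ["TF_DETERMINISTIC_OPS"] = "1"
    elif (major, minor) >= (1, 14):
        _apply_legacy_patch()  # hypothetical: the existing 1.14-2.0 path
    else:
        raise TypeError(
            "No GPU determinism support for TensorFlow " + tf.__version__)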

non-determinism on tf.keras.layers.UpSampling2D(..., interpolation='nearest')

I believe there is a source of non-determinism in tf.keras.layers.UpSampling2D with interpolation='nearest', as you suggest in Other Possible GPU-Specific Sources of Non-Determinism.

In my case, I was building an auto-encoder, and in the decoder there were some procedures I was not satisfied with; I was just waiting for the right time to change them. It turns out this part was responsible for the non-deterministic behavior, and as in #12, I replaced:

tf.keras.layers.UpSampling2D(..., interpolation='nearest')
ReflectionPadding2D()  # that is nothing more than a tf.pad(..., mode='REFLECT')
tf.keras.layers.Conv2D(..., padding='valid')

with:

tf.keras.layers.Conv2DTranspose(...,padding='same')

and now I have reproducible results on GPU and on CPU.

Thanks for the effort in building this package, it is very useful.

Best Regards.

How can I check whether I'm in an NGC TensorFlow container?

As it is today, tensorflow-determinism is activated in two different ways, depending on whether we're in an NGC TensorFlow Docker container (> 19.06) or running stock TensorFlow. This raises the question: how do I check which of the two environments my code is running in? I need to write code that is deterministic no matter whether the user is using Docker or not. Thus, I need to be able to check at runtime whether I'm using the TensorFlow version that comes with NGC, or the stock one. How do I do that?
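There is no official API for this, as far as I know; one heuristic is to inspect environment variables set by the NGC images. The sketch below assumes NGC TensorFlow containers export NVIDIA_TENSORFLOW_VERSION, which should be verified against the specific image before relying on it.

import os

def in_ngc_container():
    # Heuristic, not an official check: NGC TensorFlow images (assumed
    # here) define NVIDIA_TENSORFLOW_VERSION; stock installs do not.
    return "NVIDIA_TENSORFLOW_VERSION" in os.environ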

patch tf.image.resize with bilinear by casting image to tf.float64

@duncanriach

For tensorflow < 2.4, we can patch tf.image.resize by casting the image to tf.float64, as with patching the segment ops. See some tests here.

It might be better to patch the ops located in tensorflow.python.ops.gen_image_ops, which needs further testing.

In addition, it seems that tf.image.resize with NEAREST_NEIGHBOR on GPU does not introduce non-determinism during backprop.
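A sketch of the cast-based wrapper described above, for the bilinear path only:

import tensorflow as tf

def resize_bilinear_fp64(images, size):
    # Per the tests linked above (tensorflow < 2.4): casting to float64
    # reportedly avoids the nondeterministic bilinear-resize backprop.
    resized = tf.image.resize(tf.cast(images, tf.float64), size,
                              method=tf.image.ResizeMethod.BILINEAR)
    return tf.cast(resized, images.dtype)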

Exception thrown with "No algorithm worked!" message on NGC 20.09

Prerequisites

  • I am using the latest TensorFlow NGC container, nvcr.io/nvidia/tensorflow:20.09-tf2-py3, with nvidia-container-toolkit installed on the host

Used script

import os
import tensorflow as tf
from tensorflow.keras.preprocessing import image_dataset_from_directory
from tensorflow.keras.callbacks import ModelCheckpoint, TensorBoard


PATH = "/home/kw/DATASETS/liveness"
train_dir = os.path.join(PATH, "train")
validation_dir = os.path.join(PATH, "validation")
model_path = os.path.join(f"{PATH}/models", 'cp_weights_test-{epoch:04d}.h5')


BATCH_SIZE = 32
IMG_SIZE = (300, 300)

train_dataset = image_dataset_from_directory(train_dir,
                                             shuffle=True,
                                             batch_size=BATCH_SIZE,
                                             image_size=IMG_SIZE)
validation_dataset = image_dataset_from_directory(validation_dir,
                                                  shuffle=True,
                                                  batch_size=BATCH_SIZE,
                                                  image_size=IMG_SIZE)
class_names = train_dataset.class_names
val_batches = tf.data.experimental.cardinality(validation_dataset)
test_dataset = validation_dataset.take(val_batches // 5)
validation_dataset = validation_dataset.skip(val_batches // 5)

AUTOTUNE = tf.data.experimental.AUTOTUNE

train_dataset = train_dataset.prefetch(buffer_size=AUTOTUNE)
validation_dataset = validation_dataset.prefetch(buffer_size=AUTOTUNE)
test_dataset = test_dataset.prefetch(buffer_size=AUTOTUNE)

data_augmentation = tf.keras.Sequential([
    tf.keras.layers.experimental.preprocessing.RandomFlip('horizontal'),
    tf.keras.layers.experimental.preprocessing.RandomRotation(0.2),
])

preprocess_input = tf.keras.applications.resnet50.preprocess_input
IMG_SHAPE = IMG_SIZE + (3,)

base_model = tf.keras.applications.ResNet50(input_shape=IMG_SHAPE,
                                            include_top=False,
                                            weights='imagenet')

base_model.trainable = False
global_average_layer = tf.keras.layers.GlobalAveragePooling2D()

prediction_layer = tf.keras.layers.Dense(1, activation="sigmoid")

inputs = tf.keras.Input(shape=IMG_SHAPE)
x = data_augmentation(inputs)
x = preprocess_input(x)
x = base_model(x, training=False)
x = global_average_layer(x)
x = tf.keras.layers.Dropout(0.2)(x)
outputs = prediction_layer(x)
model = tf.keras.Model(inputs, outputs)

base_learning_rate = 0.0001
# note: from_logits=True although the output layer already applies a
# sigmoid, so the sigmoid is effectively applied twice (unrelated to the crash)
model.compile(optimizer=tf.keras.optimizers.Adam(lr=base_learning_rate),
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=["accuracy"])

cb_checkpointer = ModelCheckpoint(
    filepath=model_path,
    verbose=1,
    monitor='val_loss',
    save_best_only=True,
    save_weights_only=False,
    mode='auto'
)

tensorboard_callback = TensorBoard(log_dir=f"{PATH}/training_process")

initial_epochs = 20

loss0, accuracy0 = model.evaluate(validation_dataset)

print("initial loss: {:.2f}".format(loss0))
print("initial accuracy: {:.2f}".format(accuracy0))

history = model.fit(train_dataset,
                    epochs=initial_epochs,
                    validation_data=validation_dataset,
                    callbacks=[tensorboard_callback, cb_checkpointer])

Run command

docker run --gpus all --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 --rm -it -v /home/kw/DATASETS/liveness:/home/kw/DATASETS/liveness -v /home/kw/PycharmProjects/liveness_resnet/:/home/kw/PycharmProjects/liveness_resnet --name tf_2009 nvcr.io/nvidia/tensorflow:20.09-tf2-py3

Bug description

After running the above script, an exception with a "No algorithm worked!" message is thrown.

Traceback (most recent call last):
  File "resnet50_test.py", line 80, in <module>
    loss0, accuracy0 = model.evaluate(validation_dataset)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py", line 108, in _method_wrapper
    return method(self, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py", line 1379, in evaluate
    tmp_logs = test_function(iterator)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 780, in __call__
    result = self._call(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 840, in _call
    return self._stateless_fn(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 2829, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1848, in _filtered_call
    cancellation_manager=cancellation_manager)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1924, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 550, in call
    ctx=ctx)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.NotFoundError:  No algorithm worked!
	 [[node functional_1/resnet50/conv1_conv/Conv2D (defined at resnet50_test.py:80) ]] [Op:__inference_test_function_9650]

Function call stack:
test_function

Full output

2020-10-02 14:56:25.284418: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
Found 3745 files belonging to 2 classes.
2020-10-02 14:56:26.080110: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2020-10-02 14:56:26.114436: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-02 14:56:26.114830: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: 
pciBusID: 0000:26:00.0 name: GeForce RTX 2070 SUPER computeCapability: 7.5
coreClock: 1.77GHz coreCount: 40 deviceMemorySize: 7.79GiB deviceMemoryBandwidth: 417.29GiB/s
2020-10-02 14:56:26.114847: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2020-10-02 14:56:26.121483: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2020-10-02 14:56:26.125960: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2020-10-02 14:56:26.127291: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2020-10-02 14:56:26.134645: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2020-10-02 14:56:26.136239: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2020-10-02 14:56:26.136378: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2020-10-02 14:56:26.136497: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-02 14:56:26.137051: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-02 14:56:26.137504: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2020-10-02 14:56:26.143549: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 3600005000 Hz
2020-10-02 14:56:26.143911: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5e00e40 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-10-02 14:56:26.143922: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-10-02 14:56:26.211870: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-02 14:56:26.212328: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x576b780 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-10-02 14:56:26.212351: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): GeForce RTX 2070 SUPER, Compute Capability 7.5
2020-10-02 14:56:26.212605: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-02 14:56:26.213213: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: 
pciBusID: 0000:26:00.0 name: GeForce RTX 2070 SUPER computeCapability: 7.5
coreClock: 1.77GHz coreCount: 40 deviceMemorySize: 7.79GiB deviceMemoryBandwidth: 417.29GiB/s
2020-10-02 14:56:26.213243: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2020-10-02 14:56:26.213268: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2020-10-02 14:56:26.213281: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2020-10-02 14:56:26.213297: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2020-10-02 14:56:26.213318: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2020-10-02 14:56:26.213337: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2020-10-02 14:56:26.213354: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2020-10-02 14:56:26.213457: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-02 14:56:26.214100: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-02 14:56:26.214679: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2020-10-02 14:56:26.214710: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2020-10-02 14:56:26.450418: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-10-02 14:56:26.450448: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263]      0 
2020-10-02 14:56:26.450465: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0:   N 
2020-10-02 14:56:26.450641: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-02 14:56:26.451012: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-02 14:56:26.451334: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7019 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070 SUPER, pci bus id: 0000:26:00.0, compute capability: 7.5)
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
Found 2402 files belonging to 2 classes.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/resnet/resnet50_weights_tf_dim_ordering_tf_kernels_notop.h5
94773248/94765736 [==============================] - 4s 0us/step
2020-10-02 14:56:32.404900: I tensorflow/core/profiler/lib/profiler_session.cc:164] Profiler session started.
2020-10-02 14:56:32.404939: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1391] Profiler found 1 GPUs
2020-10-02 14:56:32.409414: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcupti.so.11.0
2020-10-02 14:56:32.509387: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1513] CUPTI activity buffer flushed
2020-10-02 14:56:33.290028: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2020-10-02 14:56:34.436539: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2020-10-02 14:56:36.117721: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at conv_ops.cc:1115 : Not found: No algorithm worked!
Traceback (most recent call last):
  File "resnet50_test.py", line 80, in <module>
    loss0, accuracy0 = model.evaluate(validation_dataset)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py", line 108, in _method_wrapper
    return method(self, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py", line 1379, in evaluate
    tmp_logs = test_function(iterator)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 780, in __call__
    result = self._call(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 840, in _call
    return self._stateless_fn(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 2829, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1848, in _filtered_call
    cancellation_manager=cancellation_manager)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1924, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 550, in call
    ctx=ctx)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.NotFoundError:  No algorithm worked!
	 [[node functional_1/resnet50/conv1_conv/Conv2D (defined at resnet50_test.py:80) ]] [Op:__inference_test_function_9650]

Function call stack:
test_function

System information

  • OS Platform and Distribution: Linux Ubuntu 20.04.1 (NGC Tensorflow 20.09 Docker image)
  • TensorFlow installed from (source or binary): source (Nvidia NGC Tensorflow 20.09 Docker image)
  • TensorFlow version: 2.3.0
  • CUDA/cuDNN version: CUDA 11.0 / cuDNN 8.0.4
  • Docker version: 19.03.12, build 48a66213fe
  • GPU model and memory: GeForce RTX2070 SUPER 8GB
  • OS driver version: 450.51.06

Additional information

The script provided above works perfectly on a twin configuration (same OS, Docker image, Docker version, and driver version) but with a GTX 1060.
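"No algorithm worked!" from the cuDNN convolution path often reflects cuDNN failing to initialize or allocate workspace on RTX-class (compute capability 7.5) GPUs, which would be consistent with it appearing on the RTX 2070 SUPER but not the GTX 1060 twin. A commonly suggested mitigation (not a guaranteed fix) is to enable memory growth before building the model:

import tensorflow as tf

# Ask TensorFlow not to reserve all GPU memory up front; an exhausted
# pool can leave cuDNN with no workable convolution algorithm.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)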

Getting OpenNMT-tf to train reproducibly

Hi,

I'm trying to use this patch with TensorFlow 2.0, but training is still non-deterministic. I guess it is due to XLA optimization. How can I disable XLA?

Bests,
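Two general TensorFlow controls for ruling out XLA (not specific to OpenNMT-tf):

import os

# Disable XLA auto-clustering; must be set before TensorFlow initializes.
os.environ["TF_XLA_FLAGS"] = "--tf_xla_auto_jit=0"

import tensorflow as tf  # noqa: E402

# And/or make sure graph-level JIT compilation is off.
tf.config.optimizer.set_jit(False)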

How can I get deterministic, back-propagatable upsampling?

I found that when performing upsampling, neither tf.keras.layers.UpSampling2D nor tf.image.resize can ensure determinism in my code (tensorflow 2.0, ubuntu 16.04, python 3.7, cuda 10.0, cudnn 7.6). Is there any setting or alternative method for performing upsampling that ensures determinism? Thank you!
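Echoing the UpSampling2D issue above, one alternative that reportedly restores determinism is to learn the upsampling with a transposed convolution, whose cuDNN backprop has a deterministic path when TF_DETERMINISTIC_OPS=1 (or TF_CUDNN_DETERMINISTIC=1 on TF 1.14-2.0) is set. A minimal sketch:

import tensorflow as tf

# Learned 2x upsampling as a drop-in replacement for resize-based ops.
upsample = tf.keras.layers.Conv2DTranspose(
    filters=64, kernel_size=3, strides=2, padding="same")

x = tf.random.normal([1, 16, 16, 64])
y = upsample(x)  # shape [1, 32, 32, 64]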

tf.data.experimental.sample_from_datasets non-deterministic in multi-gpu.

Problem Overview

I train my model on the same dataset in two different setups: A) single-GPU, B) multi-GPU. The former leads to deterministic results; the latter leads to non-deterministic results. The moment I replace the tf.data.experimental.sample_from_datasets API call with a direct call to tf.data.Dataset, B also becomes deterministic.

Environment

Python: 3.7
Cuda: 11.2
Tensorflow: 2.4.1

Code

Relevant API: https://www.tensorflow.org/versions/r2.4/api_docs/python/tf/data/experimental/sample_from_datasets

def load_resampled_data(dataset_dir: str, split: str, batch_size: int, prepare_example: Callable,
                        distribution=Distribution.DEFAULT):
    """Load the samples in dataset_dir in a shuffled order
    at per-class sampling rates determined by distribution.

    :param dataset_dir: Path to the dataset directory.
    :param split: One of the values in constants.VALID_SPLITS.
    :param batch_size: Number of samples per batch.
    :param prepare_example: Function to apply to every sample.
    :param distribution: Distribution enum indicating the
        distribution over label classes for each epoch.
    :return: A (dataset, cardinality) tuple.
    """
    assert split in constants.VALID_SPLITS
    class_datasets = []
    tf_records_dir = os.path.join(dataset_dir, split)

    # Load class cardinality information.
    class_cardinality_json = os.path.join(tf_records_dir,
        constants.CLASS_CARDINALITY_JSON)
    with file_py.open(class_cardinality_json, 'r') as f:
        class_cardinality = json.load(f)

    # Determine train-time distribution.
    class_distribution = _get_class_distribution(class_cardinality,
        distribution)
    assert round(sum(class_distribution.values()), 2) == 1.0
    print("Train-time class distribution:", class_distribution)

    # Load class-based tf records with re-sampling.
    resampled_distribution = []
    for class_name, class_weight in class_distribution.items():
        tf_record = os.path.join(tf_records_dir, f"{class_name}.tf_record")
        class_dataset = tf.data.TFRecordDataset(tf_record)
        assert class_cardinality[class_name] > 0, class_cardinality
        class_dataset = class_dataset.shuffle(
            min(class_cardinality[class_name], MAX_SHUFFLE_BUFFER_SIZE),
            seed=constants.SEED,
            reshuffle_each_iteration=False)
        class_datasets.append(class_dataset.repeat())
        resampled_distribution.append(class_weight)
    dataset_cardinality = int(class_cardinality[REFERENCE_CLASS] /
        class_distribution[REFERENCE_CLASS])
    dataset = tf.data.experimental.sample_from_datasets(
        class_datasets, resampled_distribution, seed=constants.SEED)

    # Elements cannot be processed in parallel because
    # of the stateful non-determinism in the data augmentations.
    dataset = dataset.map(
        prepare_example, num_parallel_calls=1, deterministic=True)
    dataset = dataset.batch(batch_size, drop_remainder=True)

    return dataset.prefetch(1), dataset_cardinality

I cannot provide the full code I use because it is proprietary, but here is the data-loading portion. If more information is needed to root-cause this, let me know and I will see what I can do to provide it. FYI, the main code sets all the seeds correctly and disables Horovod fusion, as suggested in the repo README.

Thanks a lot for the great work on making Tensorflow deterministic. It, along with the documentation provided, has been incredibly useful in my day-to-day work.
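As a workaround sketch rather than a root-cause analysis: the class sampling can be made explicitly deterministic by precomputing a seeded choice sequence and feeding it to choose_from_datasets. The helper below is hypothetical and mirrors the names in the snippet above.

import numpy as np
import tensorflow as tf

def deterministic_sample(class_datasets, weights, num_examples, seed):
    # Precompute the per-example class choices with a seeded NumPy RNG,
    # so every run (and every replica) interleaves classes identically.
    rng = np.random.default_rng(seed)
    choices = rng.choice(len(class_datasets), size=num_examples, p=weights)
    choice_dataset = tf.data.Dataset.from_tensor_slices(choices.astype(np.int64))
    return tf.data.experimental.choose_from_datasets(class_datasets, choice_dataset)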

Changing class name/structure changes program functionality (using TensorFlow)

For example:

In my project (https://github.com/edwardyehuang/CAR/blob/master/carnet.py), line 163 instantiates SegManaged.

If I create a wrapper class that inherits from SegManaged and place it on line 163, e.g.

class gbb(SegManaged):

    pass

the performance (e.g. loss) will differ from the original.
However, I found that it depends on the first letter of the wrapper class's name: if it starts with a-g (e.g. cbb, cbx, cxxx), the performance will be different, but it will be the same if it starts with h-z. Note that upper/lowercase has no effect.
