Comments (10)

ajratner avatar ajratner commented on August 29, 2024

@jason-fries thanks for catching this! I'll take a look later today

from metal.

ajratner avatar ajratner commented on August 29, 2024

Hey @jason-fries I'm assuming the example above is with the GPU, and the LSTM? Do you know if it occurs without the GPU? And could you let me know how you had the model configured?

I'm a bit behind, but going to try to isolate and correct as suggested on this branch, starting by adding basic tests for model and scoring determinism on CPU.

Thanks,
Alex

jason-fries avatar jason-fries commented on August 29, 2024

Just checked -- it's also non-deterministic on CPU.

ajratner avatar ajratner commented on August 29, 2024

Thanks! And could you share the model class (LSTM?) and kwargs / config used?

jason-fries avatar jason-fries commented on August 29, 2024

Perhaps I don't follow how the RandomSearchTuner is configured, but there are a couple of things that seem weird/buggy to me.

For random seeding, I'd expect:

  • A user-provided seed for model weight initialization that's fixed during hyperparameter search.
  • A fixed, separate seed for sampling hyperparameters.

If I pass a single seed into the model search grid (see below) and set a different fixed seed for the RandomSearchTuner initialization, then the same hyperparameter settings are sampled for each trained model. If you remove the seed from the search space, every model config gets a new seed by default (which seems weird). Is that seed separate from the model seed?
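The two-seed scheme described above can be sketched as follows. This is a minimal illustration, not MeTaL's actual tuner; `sample_hyperparams` and `init_model` are hypothetical helpers standing in for the tuner's sampling and the model constructor:

```python
import math
import random

import torch
import torch.nn as nn


def sample_hyperparams(search_space, rng):
    """Draw one hyperparameter setting using a dedicated search RNG."""
    config = {}
    for name, spec in search_space.items():
        if isinstance(spec, list):
            # fixed choices, e.g. 'seed': [args.seed]
            config[name] = rng.choice(spec)
        elif spec.get("scale") == "log":
            # log-uniform sample over spec['range']
            lo, hi = spec["range"]
            config[name] = math.exp(rng.uniform(math.log(lo), math.log(hi)))
        else:
            config[name] = rng.uniform(*spec["range"])
    return config


def init_model(model_seed):
    """Initialize model weights from a fixed seed the search never touches."""
    torch.manual_seed(model_seed)
    return nn.Linear(4, 2)


search_rng = random.Random(124)  # tuner seed, e.g. args.seed + 1
space = {"lr": {"range": [0.001, 0.01], "scale": "log"}}

# each trial draws fresh hyperparameters from the search RNG,
# while weight initialization is identical across trials
configs = [sample_hyperparams(space, search_rng) for _ in range(3)]
models = [init_model(123) for _ in configs]
```

With this split, re-running the whole search with the same two seeds reproduces both the sampled configs and the initial weights, and changing only the tuner seed changes which hyperparameters are tried without perturbing weight initialization.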

The best model also seems to be chosen using accuracy, no matter what is set as validation_metric. How do you specify that the best model should be selected based on validation_metric?

Model config dictionary

{'input_module': lstm,
 'lstm_reduction': 'attention',
 'middle_modules': [Dropout(p=0.5)],
 'seed': 123,
 'show_plots': False,
 'train_config': {'batch_size': 64,
                  'checkpoint': True,
                  'checkpoint_config': {'checkpoint_min': -1,
                                        'checkpoint_runway': 0},
                  'data_loader_config': {'batch_size': 16, 'num_workers': 8},
                  'disable_prog_bar': True,
                  'l2': 0.00166,
                  'n_epochs': 5,
                  'optimizer_config': {'adam_config': {'betas': (0.9,
                                                                 0.999)},
                                       'optimizer': 'adam',
                                       'optimizer_common': {'lr': 0.01},
                                       'sgd_config': {'momentum': 0.9}},
                  'print_every': 1,
                  'scheduler_config': {'exponential_config': {'gamma': 0.9},
                                       'lr_freeze': 0,
                                       'plateau_config': {'factor': 0.5,
                                                          'min_lr': 1e-05,
                                                          'patience': 1,
                                                          'threshold': 0.0001},
                                       'scheduler': 'reduce_on_plateau'},
                  'validation_freq': 1,
                  'validation_metric': 'f1'},
 'use_cuda': False,
 'verbose': True}

and

search_space = {
    'lr': {'range': [0.001, 0.01], 'scale': 'log'},
    'seed': [args.seed],
}

searcher = RandomSearchTuner(EndModel, seed=args.seed + 1,
                             log_writer_class=TensorBoardWriter, **log_config)
end_model = searcher.search(search_space,
                            dev_loader,
                            train_args=[train_loader],
                            init_args=init_args,
                            init_kwargs=init_kwargs,
                            train_kwargs=model_config['train_config'],
                            max_search=args.n_model_search)

ajratner avatar ajratner commented on August 29, 2024

@jason-fries looking into this now; one more question, has the issue only occurred when using the tuner, or also when just training the model with dev set checkpointing?

ajratner avatar ajratner commented on August 29, 2024

Okay, well either way, I now have a basic model determinism test that is failing with just the LSTM on CPU, so I probably need to fix the seeding in that module; doing this first.
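A determinism test of that shape might look like the following sketch, with a toy model standing in for the real EndModel (this is not the actual test on the branch):

```python
import torch
import torch.nn as nn


def build_and_score(seed):
    """Build a small model under a fixed seed and score fixed inputs."""
    torch.manual_seed(seed)  # seed *before* any weight initialization
    model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
    model.eval()
    x = torch.ones(4, 8)
    with torch.no_grad():
        return model(x)


# two independent builds with the same seed must score identically
out1 = build_and_score(123)
out2 = build_and_score(123)
assert torch.equal(out1, out2)
```

If any module draws from an unseeded RNG during construction (as the LSTM embeddings apparently did), the two outputs diverge and the assertion fails.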

ajratner avatar ajratner commented on August 29, 2024

@jason-fries LSTM with randomly initialized embeddings just needs to be seeded for determinism; passing tests now on model-determinism branch. Moving on to random tuner

jason-fries avatar jason-fries commented on August 29, 2024

Some quick issues (all on CPU for now):

  • validation_metric doesn't look like it's being respected. I'm using F1, which is printed during each epoch update but definitely not printed in the "best model" summary at the end of each loop. Validation scores also still change:
[E:9]	Train Loss: 0.033	Dev score: 0.522
Restoring best model from iteration 5 with score 0.551
Finished Training
Accuracy: 0.809
        y=1    y=2   
 l=1    36     51    
 l=2    26     290   
Accuracy: 0.824
        y=1    y=2   
 l=1    38     47    
 l=2    24     294   
  • Validation scoring during the param search should disable gradient history (I can't tell whether this is currently being done), e.g., something like:
end_model.eval()
with torch.set_grad_enabled(False):
    y_pred, y_true, y_proba = end_model._get_predictions(test, return_probs=True)

This gives deterministic scoring on CPU for me.
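The effect is easy to see with the Dropout middle module from the config above. A toy sketch (not MeTaL's _get_predictions):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5))
x = torch.ones(2, 4)

# train mode: Dropout resamples its mask on every forward pass,
# so repeated scoring of the same inputs generally differs
model.train()
train_a, train_b = model(x), model(x)

# eval mode: Dropout becomes the identity, and disabling grad
# avoids building a graph during scoring -> deterministic on CPU
model.eval()
with torch.no_grad():
    eval_a, eval_b = model(x), model(x)

assert torch.equal(eval_a, eval_b)
```

`with torch.set_grad_enabled(False):` as used above is equivalent to `with torch.no_grad():` here; the key point is calling `eval()` so stochastic layers stop resampling.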

ajratner avatar ajratner commented on August 29, 2024

Fixed by PR #105
