mlcommons / training_results_v0.7 Goto Github PK

This repository contains the results and code for the MLPerf™ Training v0.7 benchmark.

Home Page: https://mlcommons.org/en/training-normal-07/

License: Apache License 2.0

Dockerfile 0.23% Shell 1.98% Python 55.62% Jupyter Notebook 16.71% C++ 19.58% Cuda 2.97% Awk 0.02% HTML 0.19% Lua 0.03% CMake 0.17% Makefile 0.33% Java 0.29% C 0.31% Groovy 0.24% PowerShell 0.02% Clojure 0.94% JavaScript 0.15% CSS 0.01% Ruby 0.01% R 0.20%

training_results_v0.7's Introduction

training_results_v0.7

MLPerf™ Training v0.7 results

training_results_v0.7's People

Contributors

Stargazers

Watchers

training_results_v0.7's Issues

Transfomers benchmark reproducibility

Hi,

I'm trying to reproduce any of the configurations under training_results_v0.7/Google/benchmarks/transformer/ and it seems that some things are missing.
If running run_and_time.sh (from any of the subfolders), it needs a pointer to 'traner.py'. The setup.sh is empty so no environment variables are set.
Are the environment variables are set automatically when using TPU machine? Can the code run on GPU?

MaskRCNN training samples are different in different submissions

118287 vs 80000

which one is correct?

13:45:05|~/mlperf/training_results_v0.7/Google/results/tpu-v4-128-TF/maskrcnn$ grep samples *
result_0.txt::::MLLOG {"namespace": "main", "time_ms": 1593181858730, "key": "train_samples", "value": 118287, "metadata": {"lineno": 111, "file": "${INTERNAL_PATH}/third_party/tensorflow_models/mlperf/models/rough/mask_rcnn/mask_rcnn_main.py"}}

13:45:37|~/mlperf/training_results_v0.7/NVIDIA/results/dgx1_ngc20.06_pytorch/maskrcnn$ grep train_samples *
result_0.txt::::MLLOG {"namespace": "", "time_ms": 1592786644352, "event_type": "POINT_IN_TIME", "key": "train_samples", "value": 80000, "metadata": {"file": "tools/train_mlperf.py", "lineno": 217}}

@jonathan-cohen-nvidia please forward to your colleague.

import SSD._C as C ImportError: /usr/local/lib/python3.6/dist-packages/SSD/_C.cpython-36m-x86_64-linux-gnu.so: undefined symbol: cudnnFindConvolutionForwardAlgorithmEx

I try to compile apex from source based on official pytorch1.8.1 whl: torch-1.8.1-cp36-cp36m-manylinux1_x86_64.whl,

python setup.py build --cpp_ext --cuda_ext
python setup.py bdist_wheel --cpp_ext --cuda_ext -d build_pip
pip3 install build_pip/apex-0.1-cp36-cp36m-linux_x86_64.whl

root@5e359b5d8031:training_results_v0.7/NVIDIA/benchmarks/ssd/implementations/pytorch# python -u -m bind_launch --nproc_per_node 2 --nsockets_per_node 2 --ncores_per_socket 8 train.py --no-dali --dali-cache 0 --delay-allreduce --batch-size 16 --data=/home/datasets/coco/

import SSD._C as C
ImportError: /usr/local/lib/python3.6/dist-packages/SSD/_C.cpython-36m-x86_64-linux-gnu.so: undefined symbol: cudnnFindConvolutionForwardAlgorithmEx

Multiple vulnerabilities in this repo, including log4j

Please fix / update dependencies / delete the offending code.

https://github.com/mlcommons/training_results_v0.7/security/dependabot

NVIDIA MiniGo CUDA driver version is insufficient for CUDA runtime version

Running nVidia's MiniGo benchmark using run_with_docker.sh produces the following error:

Traceback (most recent call last):
File "./ml_perf/train_loop.py", line 989, in
app.run(main)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "./ml_perf/train_loop.py", line 898, in main
tstate = minigo_train.init_train(rank, tcomm, FLAGS.model_dir)
File "./train.py", line 247, in init_train
sess = dual_net._get_session()
File "./dual_net.py", line 647, in _get_session
return tf.Session(config=session_config)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1585, in init
super(Session, self).init(target, graph, config=config)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 699, in init
self._session = tf_session.TF_NewSessionRef(self._graph._c_graph, opts)
tensorflow.python.framework.errors_impl.InternalError: cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version

Getting run-time from NVIDIA gnmt and transformer logs

Trying to use the end-of-file RESULT statements in logs on training_results_v0.7/NVIDIA/results/dgxa100_ngc20.06_pytorch/gnmt/ and training_results_v0.7/NVIDIA/results/dgxa100_ngc20.06_pytorch/transformer/.

For gnmt:

$ for i in `ls NVIDIA/results/dgxa100_ngc20.06_pytorch/gnmt/result_*` ; do grep -m1 "^RESULT" $i ; done
RESULT,RNN_TRANSLATOR,,618,nvidia,2020-06-17 07:13:26 PM
RESULT,RNN_TRANSLATOR,,499,nvidia,2020-06-17 07:13:26 PM
RESULT,RNN_TRANSLATOR,,500,nvidia,2020-06-17 07:13:27 PM
RESULT,RNN_TRANSLATOR,,501,nvidia,2020-06-17 07:13:23 PM
RESULT,RNN_TRANSLATOR,,500,nvidia,2020-06-17 07:13:28 PM
RESULT,RNN_TRANSLATOR,,500,nvidia,2020-06-17 07:13:29 PM
RESULT,RNN_TRANSLATOR,,500,nvidia,2020-06-17 09:16:36 PM
RESULT,RNN_TRANSLATOR,,502,nvidia,2020-06-17 09:16:37 PM
RESULT,RNN_TRANSLATOR,,502,nvidia,2020-06-17 09:16:39 PM
RESULT,RNN_TRANSLATOR,,501,nvidia,2020-06-17 09:17:25 PM

Here, the average, after ignoring the fastest and slowest run-times, is 8.35 minutes.

(500+501+500+500+500+502+502+501)/(8*60)
8.34583333333333333333

Similarly for transformer:

for i in `ls NVIDIA/results/dgxa100_ngc20.06_pytorch/transformer/result_*` ; do grep -m1 "^RESULT" $i ; done
RESULT,transformer,22836,505,root,2020-06-23 02:24:41 PM
RESULT,transformer,24009,502,root,2020-06-23 02:24:40 PM
RESULT,transformer,2723,504,root,2020-06-23 02:24:38 PM
RESULT,transformer,26020,502,root,2020-06-23 02:24:39 PM
RESULT,transformer,22438,504,root,2020-06-23 02:24:39 PM
RESULT,transformer,4462,502,root,2020-06-23 02:24:39 PM
RESULT,transformer,16552,657,root,2020-06-23 02:24:39 PM
RESULT,transformer,21684,503,root,2020-06-23 02:24:46 PM
RESULT,transformer,29290,502,root,2020-06-23 02:24:39 PM
RESULT,transformer,13023,502,root,2020-06-23 02:24:39 PM

Here, the average, after ignoring the fastest and slowest run-times, is 8.38 minutes.

(505+502+504+502+504+502+503+502)/(8*60)
8.38333333333333333333

But the timings reported on MLCommons results page shows 7.81 minutes for gnmt and 7.84 minutes for transformer.

Is there a different way of calculating run-times from these logfiles?

Google BERT-large in float16 mode reports TypeError

Trying to reproduce Google BERT-large result.

Devices
V100-SXM2-16GB, 4 cores
Actual flags passed to run_pretraining.py:

   --all_reduce_alg=nccl \
   --bert_config_file=/path/to/bert_config.json \
   --device_warmup \
   --do_eval \
   --dtype=fp16 \
   --eval_batch_size=48 \
   --init_checkpoint=/path/to/tf2_ckpt/model.ckpt-28252 \
   --input_files='/path/to/tf_records/part-*' \
   --learning_rate=0.0004 \
   --loss_scale=dynamic \
   --max_predictions_per_seq=76 \
   --max_seq_length=512 \
   --model_dir=/path/to/benmark_1_gpu \
   --num_accumulation_steps=11 \
   --num_gpus=1 \
   --num_steps_per_epoch=32412 \
   --num_train_epochs=1 \
   --optimizer_type=lamb \
   --scale_loss \
   --steps_before_eval_start=100 \
   --steps_between_eval=100 \
   --steps_per_loop=100 \
   --stop_steps=32412 \
   --train_batch_size=22 \
   --verbosity=0 \
   --warmup_steps=0 \
   --enable_xla=False \
   --eval_input_files='/path/to/tf_record_eval/eval.txt'

Reported error

Traceback (most recent call last):
  File "bert/run_pretraining.py", line 337, in <module>
    app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "bert/run_pretraining.py", line 333, in main
    run_bert_pretrain(strategy)
  File "bert/run_pretraining.py", line 274, in run_bert_pretrain
    eval_input_files=FLAGS.eval_input_files)
  File "bert/run_pretraining.py", line 227, in run_customized_training
    eval_batch_size=eval_batch_size)
  File "/path/to/mlperf_v0.7_google_bert_tf2/implementations/tf2_common/modeling/model_training_utils.py", line 262, in run_customized_training_loop
    model, sub_model, sub_pretrain_model = model_fn()
  File "bert/run_pretraining.py", line 187, in _get_pretrain_model
    bert_config, max_seq_length, max_predictions_per_seq)
  File "/path/to/mlperf_v0.7_google_bert_tf2/implementations/bert/bert_models.py", line 164, in pretrain_model
    transformer_encoder = get_transformer_encoder(bert_config, seq_length)
  File "/path/to/mlperf_v0.7_google_bert_tf2/implementations/bert/bert_models.py", line 125, in get_transformer_encoder
    return networks.TransformerEncoder(**kwargs)
  File "/path/to/mlperf_v0.7_google_bert_tf2/implementations/modeling/networks/transformer_encoder.py", line 129, in __init__
    name='type_embeddings')(type_ids))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 952, in __call__
    input_list)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 1091, in _functional_construction_call
    inputs, input_masks, args, kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 822, in _keras_tensor_symbolic_call
    return self._infer_output_signature(inputs, args, kwargs, input_masks)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 863, in _infer_output_signature
    outputs = call_fn(inputs, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/autograph/impl/api.py", line 670, in wrapper
    raise e.ag_error_metadata.to_exception(e)
TypeError: in user code:

    /path/to/mlperf_v0.7_google_bert_tf2/implementations/modeling/layers/on_device_embedding.py:82 call  *
        embeddings = tf.matmul(one_hot_data, self.embeddings)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/util/dispatch.py:201 wrapper  **
        return target(*args, **kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/math_ops.py:3315 matmul
        a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gen_math_ops.py:5550 mat_mul
        name=name)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py:558 _apply_op_helper
        inferred_from[input_arg.type_attr]))

    TypeError: Input 'b' of 'MatMul' Op has type float16 that does not match type float32 of argument 'a'.

It seems the following line of code causes the error:

training_results_v0.7/Google/benchmarks/bert/implementations/bert-cloud-TF2.0-gpu-v100-8/modeling/layers/on_device_embedding.py

Line 84 in e1c4b96

embeddings = tf.matmul(one_hot_data, self.embeddings)

So I modified it to:

one_hot_data = tf.cast(one_hot_data, self.compute_type)
embeddings = tf.matmul(one_hot_data, self.embeddings)

Now the code runs normally. Unfortunately the accuracy keeps low (about 0.4) and the loss keeps still after a few thousand steps.

But according to the README, fp16 is passed to the training script:

...
--dtype=fp16 \
...

Can you help me pinpoint the problem?

BERT-large unpadded workload "NameError: name 'InitMHACUDAExtension' is not defined"

Trying to run the training for the BERT-large topology, unpadded. We set up an nvidia-docker to run the training workload. However, we run into an error for the unpadded run. Here's an excerpt from the terminal output. The padded workload successfully runs to completion. Padded workload terminal output is below in the comment.

"Namespace(allreduce_post_accumulation=True, allreduce_post_accumulation_fp16=True, bert_config_path='data/uncased_L-24_H-1024_A-16/bert_config.json', bert_model='bert-large-uncased', cache_eval_data=False, checkpoint_activations=False, dense_seq_output=True, disable_apex_softmax=False, disable_fuse_mask=False, disable_fuse_qkv=False, disable_fuse_scale=False, do_train=True, enable_fuse_dropout=True, enable_stream=False, eval_batch_size=128, eval_dir=None, eval_iter_samples=-1, eval_iter_start_samples=3000000, fp16=True, fused_gelu_bias=True, fused_mha=True, gradient_accumulation_steps=1, init_checkpoint='bert_large.pt', init_tf_checkpoint=None, input_dir='./data/hdf5/', keep_n_most_recent_checkpoints=20, learning_rate=0.0004, local_rank=-1, log_freq=1.0, loss_scale=0.0, max_predictions_per_seq=76, max_samples_termination=4500000.0, max_seq_length=512, max_steps=300.0, min_samples_to_start_checkpoints=3000000, n_gpu=1, num_epochs_to_generate_seeds_for=2, num_eval_examples=10000, num_samples_per_checkpoint=500000, opt_lamb_beta_1=0.9, opt_lamb_beta_2=0.999, output_dir='/results', pad=False, phase2=True, resume_from_checkpoint=False, seed=10483, skip_checkpoint=True, target_mlm_accuracy=0.712, train_batch_size=1, train_mlm_accuracy_window_size=0, unpad=True, use_env=False, warmup_proportion=0.0) :::MLLOG {"namespace": "", "time_ms": 1594948688327, "event_type": "POINT_IN_TIME", "key": "opt_base_learning_rate", "value": 0.0004, "metadata": {"file": "run_pretraining.py", "lineno": 524}} :::MLLOG {"namespace": "", "time_ms": 1594948688329, "event_type": "POINT_IN_TIME", "key": "opt_epsilon", "value": 1e-06, "metadata": {"file": "run_pretraining.py", "lineno": 529}} :::MLLOG {"namespace": "", "time_ms": 1594948688329, "event_type": "POINT_IN_TIME", "key": "opt_lamb_beta_1", "value": 0.9, "metadata": {"file": "run_pretraining.py", "lineno": 531}} :::MLLOG {"namespace": "", "time_ms": 1594948688329, "event_type": "POINT_IN_TIME", "key": "opt_lamb_beta_2", "value": 0.999, "metadata": {"file": "run_pretraining.py", "lineno": 532}} :::MLLOG {"namespace": "", "time_ms": 1594948688329, "event_type": "POINT_IN_TIME", "key": "opt_lamb_weight_decay_rate", "value": 0.01, "metadata": {"file": "run_pretraining.py", "lineno": 535}} :::MLLOG {"namespace": "", "time_ms": 1594948688330, "event_type": "POINT_IN_TIME", "key": "opt_learning_rate_warmup_steps", "value": 0, "metadata": {"file": ".../benchmarks/bert/implementations/pytorch/schedulers.py", "lineno": 85}} :::MLLOG {"namespace": "", "time_ms": 1594948688330, "event_type": "POINT_IN_TIME", "key": "opt_lamb_learning_rate_decay_poly_power", "value": 1.0, "metadata": {"file": ".../benchmarks/bert/implementations/pytorch/schedulers.py", "lineno": 86}} :::MLLOG {"namespace": "", "time_ms": 1594948688330, "event_type": "POINT_IN_TIME", "key": "start_warmup_step", "value": 0, "metadata": {"file": "run_pretraining.py", "lineno": 543}} Selected optimization level O2: FP16 training with FP32 batchnorm and FP32 master weights.

Defaults for this optimization level are: enabled : True opt_level : O2 cast_model_type : torch.float16 patch_torch_functions : False keep_batchnorm_fp32 : True master_weights : True loss_scale : dynamic Processing user overrides (additional kwargs that are not None)... After processing overrides, optimization options are: enabled : True opt_level : O2 cast_model_type : torch.float16 patch_torch_functions : False keep_batchnorm_fp32 : True master_weights : True loss_scale : dynamic

Traceback (most recent call last):

File "run_pretraining.py", line 995, in args, final_loss, train_time_raw = main()

File "run_pretraining.py", line 712, in main InitMHACUDAExtension()

NameError: name 'InitMHACUDAExtension' is not defined"

libnuma: Warning: cpu argument A-B,C-D is out of range when using Slurm

I am able to launch the training on single-node using the alternative launch with nvidia-docker method.
I was trying to launch the training on multi-node using the Slurm method, but failed with CPU binding issue.
I also tried the Slurm method on single-node, and got the same issue. The following details are based on single-node using Slurm.

I have 2 sockets per node, 16 cores per socket, 2 threads per core. So 64 CPUs in one node.
lscpu.txt
numactl_hardware.txt

The numactl --physcpubind= arguments in the log file look fine to me, but the program doesn't like any of them.
slurm-18.txt
line# 173: <8-15,40-47> is invalid
line# 175: libnuma: Warning: cpu argument 8-15,40-47 is out of range
line# 205: libnuma: Warning: cpu argument 24-31,56-63 is out of range
line# 233: <24-31,56-63> is invalid
line# 234: <0-7,32-39> is invalid
line# 235: libnuma: Warning: cpu argument 7,32-39 out of range
line# 263: <16-23,48-55> is invalid
line# 264: libnuma: Warning: cpu argument 16-23,48-55 is out of range

MASKRCNN: RuntimeError: ModuleAttributeError: 'RecursiveScriptModule' object has no attribute '_conv_forward'

Hello! I was trying to run the maskrcnn, and during the training, the following error occurred. Here is the traceback:

Traceback (most recent call last):
  File "tools/train_mlperf.py", line 367, in <module>
    main()
  File "tools/train_mlperf.py", line 356, in main
    model, success = train(cfg, args.local_rank, args.distributed, random_number_generator)
  File "tools/train_mlperf.py", line 248, in train
    per_iter_end_callback_fn=per_iter_callback_fn,
  File "/workspace/object_detection/maskrcnn_benchmark/engine/trainer.py", line 109, in do_train
    loss_dict = model(images, targets)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 577, in __call__
    result = self.forward(*input, **kwargs)
  File "/workspace/object_detection/maskrcnn_benchmark/modeling/detector/generalized_rcnn.py", line 50, in forward
    features = self.backbone(images.tensors)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 577, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/container.py", line 100, in forward
    input = module(input)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 577, in __call__
    result = self.forward(*input, **kwargs)
  File "/workspace/object_detection/maskrcnn_benchmark/modeling/backbone/resnet.py", line 152, in forward
    x = getattr(self, stage_name)(x)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 577, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/container.py", line 100, in forward
    input = module(input)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 577, in __call__
    result = self.forward(*input, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
  File "/workspace/object_detection/maskrcnn_benchmark/modeling/backbone/resnet.py", line 317, in forward

        out = self.conv1(x)
        out = self.bn1(out)
              ~~~~~~~~ <--- HERE
        out = F.relu(out)
  File "/workspace/object_detection/maskrcnn_benchmark/modeling/backbone/resnet.py", line 316, in forward
        identity = x

        out = self.conv1(x)
              ~~~~~~~~~~ <--- HERE
        out = self.bn1(out)
        out = F.relu(out)
RuntimeError: ModuleAttributeError: 'RecursiveScriptModule' object has no attribute '_conv_forward'
At:
  /opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py(621): __getattr__
  /opt/conda/lib/python3.6/site-packages/torch/jit/__init__.py(1617): __getattr__
  /opt/conda/lib/python3.6/site-packages/torch/jit/__init__.py(1836): __getattr__
  /opt/conda/lib/python3.6/site-packages/torch/nn/modules/conv.py(377): forward
  /workspace/object_detection/maskrcnn_benchmark/layers/misc.py(34): forward
  /opt/conda/lib/python3.6/site-packages/torch/jit/_recursive.py(624): lazy_binding_method
  /opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py(577): __call__
  /opt/conda/lib/python3.6/site-packages/torch/nn/modules/container.py(100): forward
  /opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py(577): __call__
  /workspace/object_detection/maskrcnn_benchmark/modeling/backbone/resnet.py(152): forward
  /opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py(577): __call__
  /opt/conda/lib/python3.6/site-packages/torch/nn/modules/container.py(100): forward
  /opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py(577): __call__
  /workspace/object_detection/maskrcnn_benchmark/modeling/detector/generalized_rcnn.py(50): forward
  /opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py(577): __call__
  /workspace/object_detection/maskrcnn_benchmark/engine/trainer.py(109): do_train
  tools/train_mlperf.py(248): train
  tools/train_mlperf.py(356): main
  tools/train_mlperf.py(367): <module>

I was running on NVIDIA-RTX, Ubuntu 16.04, CUDA 10.2.

Any idea on how to fix this?

need help to make resnet50 training converged

Hi,

I am using AMD server with four A100 PCIe.

I run resnet50 with mxnet framework.

container mlperf-nvidia:image_classification

typical speed and accuracy are as below.
Speed: 11328.82 samples/sec accuracy=0.001103

My questions is what are some of the things I can do to make the accuracy higher to become converged?

BERT dataset download link not available

The wikidumps download link referenced in BERT's README.md does not exist anymore. Is there an alternative one?

mlcommons / training_results_v0.7 Goto Github PK

training_results_v0.7's Introduction

training_results_v0.7

training_results_v0.7's People

Contributors

Stargazers

Watchers

Forkers

training_results_v0.7's Issues

Recommend Projects

Recommend Topics

Recommend Org