Comments (11)

rllin-fathom avatar rllin-fathom commented on August 26, 2024

I did more runs with the large model, which the benchmarks imply was used.

(8x V100, large, batch size 2): 6.4 examples/sec

(8x V100, large, batch size 3): 7.95 examples/sec

(1x V100, large, batch size 3): 8.9 examples/sec

None of these match the claimed speeds. Am I missing something obvious?

from deeplearningexamples.

swethmandava avatar swethmandava commented on August 26, 2024

The numbers from run_squad.py are not per GPU, so it is strange that your 1x V100 run is faster than your 8x V100 run. Can you tell me your launch command? I want to make sure you're using all your GPUs and AMP, because your 1x V100 number is close to our FP32 performance.

rllin-fathom avatar rllin-fathom commented on August 26, 2024

mpirun -np 8 -H localhost:8 \
--allow-run-as-root --bind-to none -map-by slot \
-x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
-mca pml ob1 -mca btl ^openib \
python3 /usr/src/bert/run_squad.py \
--bert_config_file="PATH/uncased_L-24_H-1024_A-16/bert_config.json" \
--do_predict="True" \
--do_train="True" \
--doc_stride="128" \
--init_checkpoint="PATH/uncased_L-24_H-1024_A-16/bert_model.ckpt" \
--learning_rate="5e-06" \
--max_seq_length="384" \
--num_train_steps="2" \
--output_dir="PATH" \
--predict_file="PATH/squadv1.1/dev-v1.1.json" \
--save_checkpoints_steps="1000" \
--train_batch_size="3" \
--train_file="PATH/squadv1.1/train-v1.1.json" \
--use_fp16="True" \
--use_xla="True" \
--horovod="True" \
--vocab_file="PATH/uncased_L-24_H-1024_A-16/vocab.txt"

That is my command.

I'm actually a little confused by the AMP vs. fp16 flags. Could you clarify when both or either need to be used? I'm only using the fp16 flag.

rllin-fathom avatar rllin-fathom commented on August 26, 2024

@swethmandava how certain are you that they aren't per GPU? For the 8x run, my logs print examples/sec much more frequently than in the 1x run, which is why I assumed it was per GPU.

rllin-fathom avatar rllin-fathom commented on August 26, 2024

Specifically, in the 8x case I get 9 sets of 8 pairs of examples/sec and global_step/sec printouts before the first checkpoint at step 1000.

e.g.

190522t022500-4y91fhzsmtnk-usr-src-bert-run-squad-pysquad:46:595 [0] NCCL INFO Launch mode Parallel

[2019-05-22 23:01:10,777] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:01:20,830] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:01:30,890] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:01:40,963] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:01:51,021] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:02:01,079] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:02:11,154] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:02:21,219] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:02:31,280] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:02:41,335] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:02:51,394] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:03:01,456] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:03:11,520] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:03:21,581] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:03:31,642] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:03:41,703] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:03:51,791] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:04:01,844] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:04:11,899] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:04:21,964] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:04:32,024] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:04:42,084] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:04:52,171] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:05:02,263] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:05:12,320] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:05:22,387] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:05:32,459] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:05:42,516] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:05:52,578] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:06:02,646] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:06:12,706] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:06:22,767] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:06:32,823] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:06:42,909] {kubernetes_operator.py:709} INFO - INFO:tensorflow:global_step/sec: 0.353741
INFO:tensorflow:examples/sec: 1.06122
INFO:tensorflow:global_step/sec: 0.353368
INFO:tensorflow:examples/sec: 1.0601
INFO:tensorflow:global_step/sec: 0.345642
INFO:tensorflow:global_step/sec: 0.356712
INFO:tensorflow:examples/sec: 1.03692
INFO:tensorflow:examples/sec: 1.07014
INFO:tensorflow:global_step/sec: 0.350498
INFO:tensorflow:examples/sec: 1.05149
INFO:tensorflow:global_step/sec: 0.35857
INFO:tensorflow:examples/sec: 1.07571
INFO:tensorflow:global_step/sec: 0.363175
INFO:tensorflow:examples/sec: 1.08953
INFO:tensorflow:global_step/sec: 0.350679
INFO:tensorflow:examples/sec: 1.05204

[2019-05-22 23:06:52,974] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:07:03,033] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:07:13,086] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:07:23,153] {kubernetes_operator.py:709} INFO - INFO:tensorflow:global_step/sec: 2.64614
INFO:tensorflow:examples/sec: 7.93842
INFO:tensorflow:global_step/sec: 2.54988
INFO:tensorflow:examples/sec: 7.64965
INFO:tensorflow:global_step/sec: 2.54973
INFO:tensorflow:examples/sec: 7.64918
INFO:tensorflow:global_step/sec: 2.55028
INFO:tensorflow:examples/sec: 7.65085
INFO:tensorflow:global_step/sec: 2.54975
INFO:tensorflow:examples/sec: 7.64924
INFO:tensorflow:global_step/sec: 2.54979
INFO:tensorflow:examples/sec: 7.64937
INFO:tensorflow:global_step/sec: 2.5495
INFO:tensorflow:examples/sec: 7.64851
INFO:tensorflow:global_step/sec: 2.54897
INFO:tensorflow:examples/sec: 7.64692

[2019-05-22 23:07:33,220] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:07:43,292] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:07:53,348] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:08:03,421] {kubernetes_operator.py:709} INFO - INFO:tensorflow:global_step/sec: 2.64979
INFO:tensorflow:examples/sec: 7.94938
INFO:tensorflow:global_step/sec: 2.64977
INFO:tensorflow:examples/sec: 7.94931
INFO:tensorflow:global_step/sec: 2.65008
INFO:tensorflow:examples/sec: 7.95023
INFO:tensorflow:global_step/sec: 2.64976
INFO:tensorflow:examples/sec: 7.94928
INFO:tensorflow:global_step/sec: 2.64984
INFO:tensorflow:examples/sec: 7.94952
INFO:tensorflow:global_step/sec: 2.65015
INFO:tensorflow:examples/sec: 7.95045
INFO:tensorflow:global_step/sec: 2.64938
INFO:tensorflow:examples/sec: 7.94814
INFO:tensorflow:global_step/sec: 2.64895
INFO:tensorflow:examples/sec: 7.94684

[2019-05-22 23:08:13,485] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:08:23,541] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:08:33,612] {kubernetes_operator.py:709} INFO - INFO:tensorflow:global_step/sec: 2.64335
INFO:tensorflow:examples/sec: 7.93005
INFO:tensorflow:global_step/sec: 2.6434
INFO:tensorflow:examples/sec: 7.93019
INFO:tensorflow:global_step/sec: 2.64348
INFO:tensorflow:global_step/sec: 2.64318
INFO:tensorflow:examples/sec: 7.93043
INFO:tensorflow:examples/sec: 7.92953
INFO:tensorflow:global_step/sec: 2.64384
INFO:tensorflow:examples/sec: 7.93153
INFO:tensorflow:global_step/sec: 2.64292
INFO:tensorflow:examples/sec: 7.92875
INFO:tensorflow:global_step/sec: 2.64309
INFO:tensorflow:examples/sec: 7.92927
INFO:tensorflow:global_step/sec: 2.64238
INFO:tensorflow:examples/sec: 7.92713

[2019-05-22 23:08:43,670] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:08:53,735] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:09:03,804] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:09:13,876] {kubernetes_operator.py:709} INFO - INFO:tensorflow:global_step/sec: 2.63795
INFO:tensorflow:examples/sec: 7.91385
INFO:tensorflow:global_step/sec: 2.63741
INFO:tensorflow:examples/sec: 7.91224
INFO:tensorflow:global_step/sec: 2.63711
INFO:tensorflow:examples/sec: 7.91133
INFO:tensorflow:global_step/sec: 2.63704
INFO:tensorflow:examples/sec: 7.91113
INFO:tensorflow:global_step/sec: 2.63696
INFO:tensorflow:examples/sec: 7.91089
INFO:tensorflow:global_step/sec: 2.63753
INFO:tensorflow:examples/sec: 7.9126
INFO:tensorflow:global_step/sec: 2.63663
INFO:tensorflow:examples/sec: 7.9099
INFO:tensorflow:global_step/sec: 2.54616
INFO:tensorflow:examples/sec: 7.63848

[2019-05-22 23:09:23,940] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:09:34,006] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:09:44,070] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:09:54,133] {kubernetes_operator.py:709} INFO - INFO:tensorflow:global_step/sec: 2.55341
INFO:tensorflow:global_step/sec: 2.5531
INFO:tensorflow:examples/sec: 7.65931
INFO:tensorflow:examples/sec: 7.66024
INFO:tensorflow:global_step/sec: 2.5527
INFO:tensorflow:examples/sec: 7.65811
INFO:tensorflow:global_step/sec: 2.5534
INFO:tensorflow:examples/sec: 7.66019
INFO:tensorflow:global_step/sec: 2.55294
INFO:tensorflow:examples/sec: 7.65883
INFO:tensorflow:global_step/sec: 2.64426
INFO:tensorflow:examples/sec: 7.93277
INFO:tensorflow:global_step/sec: 2.55252
INFO:tensorflow:examples/sec: 7.65755
INFO:tensorflow:global_step/sec: 2.55204
INFO:tensorflow:examples/sec: 7.65612

[2019-05-22 23:10:04,202] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:10:14,266] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:10:24,324] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:10:34,422] {kubernetes_operator.py:709} INFO - INFO:tensorflow:global_step/sec: 2.66335
INFO:tensorflow:global_step/sec: 2.66344
INFO:tensorflow:examples/sec: 7.99005
INFO:tensorflow:global_step/sec: 2.66273
INFO:tensorflow:examples/sec: 7.98818
INFO:tensorflow:examples/sec: 7.99031
INFO:tensorflow:global_step/sec: 2.66298
INFO:tensorflow:examples/sec: 7.98895
INFO:tensorflow:global_step/sec: 2.66289
INFO:tensorflow:examples/sec: 7.98866
INFO:tensorflow:global_step/sec: 2.66261
INFO:tensorflow:examples/sec: 7.98783
INFO:tensorflow:global_step/sec: 2.66285
INFO:tensorflow:examples/sec: 7.98855
INFO:tensorflow:global_step/sec: 2.66245
INFO:tensorflow:examples/sec: 7.98735

[2019-05-22 23:10:44,485] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:10:54,579] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:11:04,644] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:11:14,702] {kubernetes_operator.py:709} INFO - INFO:tensorflow:global_step/sec: 2.65107
INFO:tensorflow:examples/sec: 7.95322
INFO:tensorflow:global_step/sec: 2.65098
INFO:tensorflow:global_step/sec: 2.65111
INFO:tensorflow:examples/sec: 7.95295
INFO:tensorflow:examples/sec: 7.95334
INFO:tensorflow:global_step/sec: 2.65094
INFO:tensorflow:examples/sec: 7.95282
INFO:tensorflow:global_step/sec: 2.65092
INFO:tensorflow:examples/sec: 7.95276
INFO:tensorflow:global_step/sec: 2.65099
INFO:tensorflow:examples/sec: 7.95298
INFO:tensorflow:global_step/sec: 2.65106
INFO:tensorflow:examples/sec: 7.95318
INFO:tensorflow:global_step/sec: 2.65087
INFO:tensorflow:examples/sec: 7.95262

[2019-05-22 23:11:24,778] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:11:34,847] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:11:44,918] {kubernetes_operator.py:709} INFO - INFO:tensorflow:global_step/sec: 2.65187
INFO:tensorflow:examples/sec: 7.9556
INFO:tensorflow:global_step/sec: 2.6518
INFO:tensorflow:examples/sec: 7.9554
INFO:tensorflow:global_step/sec: 2.65189
INFO:tensorflow:global_step/sec: 2.65191
INFO:tensorflow:examples/sec: 7.95574
INFO:tensorflow:examples/sec: 7.95568
INFO:tensorflow:global_step/sec: 2.65155
INFO:tensorflow:examples/sec: 7.95465
INFO:tensorflow:global_step/sec: 2.65136
INFO:tensorflow:examples/sec: 7.95409
INFO:tensorflow:global_step/sec: 2.65127
INFO:tensorflow:examples/sec: 7.95382

[2019-05-22 23:11:54,978] {kubernetes_operator.py:709} INFO - INFO:tensorflow:global_step/sec: 2.53725
INFO:tensorflow:examples/sec: 7.61175

[2019-05-22 23:12:05,044] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:12:15,108] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:12:25,168] {kubernetes_operator.py:709} INFO - INFO:tensorflow:Saving checkpoints for 1000 into PATH//model.ckpt.

while in the 1x case, I get 9 sets of 1 pair, e.g.


[2019-05-22 23:33:42,235] {kubernetes_operator.py:709} INFO - INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.

[2019-05-22 23:33:52,308] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:34:02,379] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:34:12,462] {kubernetes_operator.py:709} INFO - INFO:tensorflow:Saving checkpoints for 0 into PATH/model.ckpt.

[2019-05-22 23:34:22,554] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:34:32,618] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:34:42,686] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:34:52,771] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:35:02,846] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:35:12,935] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:35:22,994] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:35:33,054] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:35:43,114] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:35:53,175] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:36:03,285] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:36:13,369] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:36:23,450] {kubernetes_operator.py:709} INFO - 2019-05-22 23:36:13.588130: I tensorflow/stream_executor/dso_loader.cc:153] successfully opened CUDA library libcublas.so.10 locally

[2019-05-22 23:36:33,523] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:36:43,575] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:36:53,629] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:37:03,683] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:37:13,752] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:37:23,803] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:37:33,856] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:37:43,930] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:37:53,993] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:38:04,048] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:38:14,117] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:38:24,184] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:38:34,234] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:38:44,292] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:38:54,347] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:39:04,408] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:39:14,464] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:39:24,531] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:39:34,602] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:39:44,651] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:39:54,711] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:40:04,764] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:40:14,820] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:40:24,881] {kubernetes_operator.py:709} INFO - INFO:tensorflow:global_step/sec: 0.62738
INFO:tensorflow:examples/sec: 1.88214

[2019-05-22 23:40:34,946] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:40:45,020] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:40:55,078] {kubernetes_operator.py:709} INFO - INFO:tensorflow:global_step/sec: 2.97402
INFO:tensorflow:examples/sec: 8.92207

[2019-05-22 23:41:05,137] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:41:15,200] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:41:25,258] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:41:35,309] {kubernetes_operator.py:709} INFO - INFO:tensorflow:global_step/sec: 2.96774
INFO:tensorflow:examples/sec: 8.90321

[2019-05-22 23:41:45,368] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:41:55,444] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:42:05,494] {kubernetes_operator.py:709} INFO - INFO:tensorflow:global_step/sec: 2.96544
INFO:tensorflow:examples/sec: 8.89632

[2019-05-22 23:42:15,553] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:42:25,604] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:42:35,669] {kubernetes_operator.py:709} INFO - INFO:tensorflow:global_step/sec: 2.8666
INFO:tensorflow:examples/sec: 8.5998

[2019-05-22 23:42:45,732] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:42:55,785] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:43:05,883] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:43:15,935] {kubernetes_operator.py:709} INFO - INFO:tensorflow:global_step/sec: 2.96885
INFO:tensorflow:examples/sec: 8.90656

[2019-05-22 23:43:25,999] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:43:36,069] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:43:46,130] {kubernetes_operator.py:709} INFO - INFO:tensorflow:global_step/sec: 2.96784
INFO:tensorflow:examples/sec: 8.90351

[2019-05-22 23:43:56,222] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:44:06,275] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:44:16,334] {kubernetes_operator.py:709} INFO - INFO:tensorflow:global_step/sec: 2.96816
INFO:tensorflow:examples/sec: 8.90447

[2019-05-22 23:44:26,391] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:44:36,451] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:44:46,510] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:44:56,569] {kubernetes_operator.py:709} INFO - INFO:tensorflow:global_step/sec: 2.85977
INFO:tensorflow:examples/sec: 8.5793

[2019-05-22 23:45:06,627] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:45:16,767] {kubernetes_operator.py:709} INFO -  
[2019-05-22 23:45:26,841] {kubernetes_operator.py:709} INFO - INFO:tensorflow:Saving checkpoints for 1000 into PATH/usr.src.bert.run_squad.py/model.ckpt.
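
One way to sanity-check the per-GPU reading is to pull the rate lines out of the logs and compare them. The sketch below assumes the log format shown above; `parse_rates` is a hypothetical helper, not part of run_squad.py. On the numbers in this thread, each examples/sec value is roughly global_step/sec times the `--train_batch_size` of 3, which is consistent with each process reporting only its own batch. That is an observation from the logs, not a statement about how the script computes the value.

```python
import re

# Matches "INFO:tensorflow:global_step/sec: 2.64614" and
# "INFO:tensorflow:examples/sec: 7.93842", with or without a scheduler prefix.
RATE_RE = re.compile(r"INFO:tensorflow:(global_step|examples)/sec:\s*([0-9.]+)")


def parse_rates(log_text):
    """Return (global_step_rates, example_rates) in the order they appear."""
    steps, examples = [], []
    for kind, value in RATE_RE.findall(log_text):
        (steps if kind == "global_step" else examples).append(float(value))
    return steps, examples


# Two pairs copied from the 8x V100 log above.
sample = """\
INFO:tensorflow:global_step/sec: 2.64614
INFO:tensorflow:examples/sec: 7.93842
INFO:tensorflow:global_step/sec: 2.54988
INFO:tensorflow:examples/sec: 7.64965
"""

steps, examples = parse_rates(sample)
# Ratio of examples/sec to global_step/sec for each pair.
ratios = [e / s for s, e in zip(steps, examples)]
print(ratios)
```

Because the ranks interleave their output in the 8x log, pairing lines by position is only approximate there, but the ratio stays close to 3 for every pair shown.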

swethmandava avatar swethmandava commented on August 26, 2024

Sorry, I misspoke. The examples/sec you're seeing is per GPU. Your scaling makes sense now :) At the end of training, we print out sentences/sec, which is cumulative (all GPUs combined); this is the number reported in our README.
It looks something like this:

INFO:tensorflow:0 Total Training Time = YY Training Time W/O start up overhead = ZZ Sentences processed = AA
INFO:tensorflow:0 Training Performance = XX sentences/sec

Flags for AMP are being set here for your reference.

In addition to --use_fp16, you also have to set the environment variable TF_ENABLE_AUTO_MIXED_PRECISION_GRAPH_REWRITE=1.
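
A minimal sketch of setting that variable from a Python launcher, assuming the job is started through Open MPI's mpirun as in the command earlier in the thread. The helper names `with_amp` and `mpirun_export_args` are hypothetical. The `-x NAME=value` form is the same mpirun option the original command already uses for `NCCL_DEBUG=INFO`; whether it is needed in addition to setting the variable in the parent environment depends on how the job is launched.

```python
import os

# Variable name taken from the comment above.
AMP_VAR = "TF_ENABLE_AUTO_MIXED_PRECISION_GRAPH_REWRITE"


def with_amp(env=None):
    """Return a copy of env (default: os.environ) with the AMP variable set to "1"."""
    new_env = dict(os.environ if env is None else env)
    new_env[AMP_VAR] = "1"
    return new_env


def mpirun_export_args():
    """Arguments asking mpirun to export the variable to every launched rank."""
    return ["-x", f"{AMP_VAR}=1"]


env = with_amp({})
print(env)
print(" ".join(mpirun_export_args()))
```

The returned dictionary can be passed as `env=` to `subprocess.run`, and the extra arguments can be placed next to the existing `-x` options in the mpirun command.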

rllin-fathom avatar rllin-fathom commented on August 26, 2024

@swethmandava

  1. To triple confirm: each example is a sentence?
  2. Got it. I'll enable AMP as well and report back on whether I reach the same speed, as I'm still around 30 examples/sec behind.

swethmandava avatar swethmandava commented on August 26, 2024

Yes, that's correct.

rllin-fathom avatar rllin-fathom commented on August 26, 2024

@swethmandava thanks for mentioning AMP!

With AMP enabled, I get

(8x V100, large, batch size 3): 11.15 examples/sec/gpu

That works out to about 89.2 examples/sec across all 8 GPUs, which is much closer to the benchmark (98.75 examples/sec) but still ~10 examples/sec off. Is that worrisome?

Looking at

tf.logging.info("%d Training Performance = %0.4f sentences/sec", hvd_rank, avg_sentences_per_second)
it looks like the calculation is based on total training time rather than an average of the per-interval averages. I'm leaning towards treating the numbers as close enough, unless you can tell me otherwise :)
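
The arithmetic behind the "~30 behind" and "~10 off" figures can be sketched as follows. It assumes every GPU sustains the same per-GPU rate, so that the combined rate is simply the per-GPU rate times the GPU count; `aggregate` is a hypothetical helper, and all numbers are the ones quoted in this thread.

```python
NUM_GPUS = 8
BENCHMARK = 98.75  # examples/sec figure quoted in the thread


def aggregate(per_gpu_rate, num_gpus=NUM_GPUS):
    """Combined rate, assuming all GPUs run at the same per-GPU rate."""
    return per_gpu_rate * num_gpus


# Per-GPU rates reported for 8x V100, large model, batch size 3.
runs = {"fp16 flag only": 7.95, "with AMP": 11.15}

gaps = {}
for label, per_gpu in runs.items():
    total = aggregate(per_gpu)
    gaps[label] = BENCHMARK - total
    print(f"{label}: {total:.2f} examples/sec, {gaps[label]:.2f} below the benchmark")
```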

swethmandava avatar swethmandava commented on August 26, 2024

The initial few iterations are slow due to warmup. Our numbers are based on training time from 2 epochs. Please verify with the final performance number that the script returns. Marking issue as resolved!
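
To illustrate why warmup and the averaging method matter, here is a rough sketch using the nine examples/sec values from the 1x V100 log earlier in the thread. It assumes each report covers the same number of examples (100 steps at batch size 3); the actual reporting interval is not shown in the thread, and the function names are hypothetical. This is not the calculation run_squad.py performs, only a comparison of the two ways of averaging.

```python
# examples/sec values from the 1x V100 log, one per report.
# The first value includes startup and warmup.
rates = [1.88214, 8.92207, 8.90321, 8.89632, 8.5998,
         8.90656, 8.90351, 8.90447, 8.5793]

# Assumption: every report covers 100 steps x batch size 3 examples.
EXAMPLES_PER_REPORT = 100 * 3


def mean_of_rates(values):
    """Plain average of the per-interval rates ("average of the averages")."""
    return sum(values) / len(values)


def time_based_rate(values, examples_per_report=EXAMPLES_PER_REPORT):
    """Total examples divided by total elapsed time across all intervals."""
    total_examples = examples_per_report * len(values)
    total_seconds = sum(examples_per_report / v for v in values)
    return total_examples / total_seconds


print("all intervals:  ", mean_of_rates(rates), time_based_rate(rates))
print("without warmup: ", mean_of_rates(rates[1:]), time_based_rate(rates[1:]))
```

With the slow first interval included, the time-based figure falls well below the plain average, because the slow interval takes up a disproportionate share of the elapsed time. Once that interval is dropped, the two methods nearly agree.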

rllin-fathom avatar rllin-fathom commented on August 26, 2024

thanks!
