Comments (8)
Hi @oscarbg , this is something that we are actively looking into. As you noticed, tensorflow-directml's memory usage is very high at the moment, which is a problem when training with many batches. We will update this issue once we release a package that addresses these crashes.
from tensorflow-directml.
It fails similarly on Vega:
>> AI-Benchmark-v.0.1.2
>> Let the AI Games begin..
* TF Version: 1.15.3
* Platform: Windows-10-10.0.19564-SP0
* CPU: N/A
* CPU RAM: 32 GB
* GPU/0: N/A
* GPU RAM: N/A GB
* CUDA Version: 11.0
* CUDA Build: V11.0.167
The benchmark is running...
The tests might take up to 20 minutes
Please don't interrupt the script
1/19. MobileNet-V2
1.1 - inference | batch=50, size=224x224: 106 ± 33 ms
1.2 - training | batch=50, size=224x224: 10541 ± 176 ms
2/19. Inception-V3
2.1 - inference | batch=20, size=346x346: 1238 ± 29 ms
2020-06-28 23:35:55.556323: F tensorflow/core/common_runtime/dml/dml_command_recorder.cc:150] Check failed: (((HRESULT)((((HRESULT)0x8007000EL)))) >= 0) == true (0 vs. 1)
EDIT: On WSL2 it fails even earlier on Vega. After
1.1 - inference | batch=50, size=224x224: 136 ± 22 ms
it crashes and terminates the WSL2 process.
Thanks @PatriceVignola!
Good to know the devs are aware and working on it.
Hey @oscarbg , we just released tensorflow-directml 1.15.3.dev200911 with many improvements to the memory allocator. You can try it out and tell us how it goes!
Also, since we have now open-sourced our fork, new tensorflow-directml issues should be opened over here.
Hi @PatriceVignola,
Thanks for the update! The new build works very well, and memory usage is good now.
The only remaining issue seems to be closing the performance gap with CUDA.
On a Titan V with DirectML I get:
Device Inference Score: 6468
Device Training Score: 5271
Device AI Score: 11739
On CUDA I got:
Device Inference Score: 15245
Device Training Score: 15619
Device AI Score: 30864
so DirectML currently takes roughly a 2x-3x performance hit versus CUDA.
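For what it's worth, the "2x-3x" figure follows directly from the scores quoted above:

```python
# Ratio of the CUDA scores to the DirectML scores reported above.
dml = {"inference": 6468, "training": 5271, "ai": 11739}
cuda = {"inference": 15245, "training": 15619, "ai": 30864}

ratios = {k: cuda[k] / dml[k] for k in dml}
for k, r in ratios.items():
    print(f"{k}: {r:.2f}x")  # inference: 2.36x, training: 2.96x, ai: 2.63x
```

So the gap is about 2.4x for inference and about 3x for training on this hardware.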
Posting the full benchmark run on the Titan V with 460.15 drivers:
>>> from ai_benchmark import AIBenchmark
>>> results = AIBenchmark().run()
>> AI-Benchmark-v.0.1.2
>> Let the AI Games begin..
* TF Version: 1.15.3
* Platform: Windows-10-10.0.20180-SP0
* CPU: N/A
* CPU RAM: 32 GB
* GPU/0: N/A
* GPU RAM: N/A GB
* CUDA Version: N/A
* CUDA Build: N/A
The benchmark is running...
The tests might take up to 20 minutes
Please don't interrupt the script
1/19. MobileNet-V2
1.1 - inference | batch=50, size=224x224: 56.2 ± 7.3 ms
1.2 - training | batch=50, size=224x224: 1268 ± 10 ms
2/19. Inception-V3
2.1 - inference | batch=20, size=346x346: 87.4 ± 5.0 ms
2.2 - training | batch=20, size=346x346: 447 ± 7 ms
3/19. Inception-V4
3.1 - inference | batch=10, size=346x346: 89.6 ± 4.8 ms
3.2 - training | batch=10, size=346x346: 412 ± 6 ms
4/19. Inception-ResNet-V2
4.1 - inference | batch=10, size=346x346: 89.4 ± 1.8 ms
4.2 - training | batch=8, size=346x346: 370 ± 5 ms
5/19. ResNet-V2-50
5.1 - inference | batch=10, size=346x346: 68.5 ± 2.6 ms
5.2 - training | batch=10, size=346x346: 276 ± 5 ms
6/19. ResNet-V2-152
6.1 - inference | batch=10, size=256x256: 109 ± 4 ms
6.2 - training | batch=10, size=256x256: 403 ± 8 ms
7/19. VGG-16
7.1 - inference | batch=20, size=224x224: 112 ± 2 ms
7.2 - training | batch=2, size=224x224: 86.8 ± 1.9 ms
8/19. SRCNN 9-5-5
8.1 - inference | batch=10, size=512x512: 131 ± 3 ms
8.2 - inference | batch=1, size=1536x1536: 117 ± 4 ms
8.3 - training | batch=10, size=512x512: 719 ± 13 ms
9/19. VGG-19 Super-Res
9.1 - inference | batch=10, size=256x256: 151 ± 3 ms
9.2 - inference | batch=1, size=1024x1024: 242 ± 4 ms
9.3 - training | batch=10, size=224x224: 843 ± 9 ms
10/19. ResNet-SRGAN
10.1 - inference | batch=10, size=512x512: 176 ± 6 ms
10.2 - inference | batch=1, size=1536x1536: 159 ± 5 ms
10.3 - training | batch=5, size=512x512: 479 ± 8 ms
11/19. ResNet-DPED
11.1 - inference | batch=10, size=256x256: 203 ± 2 ms
11.2 - inference | batch=1, size=1024x1024: 329 ± 5 ms
11.3 - training | batch=15, size=128x128: 484 ± 5 ms
12/19. U-Net
12.1 - inference | batch=4, size=512x512: 493 ± 7 ms
12.2 - inference | batch=1, size=1024x1024: 550 ± 16 ms
12.3 - training | batch=4, size=256x256: 488 ± 12 ms
13/19. Nvidia-SPADE
13.1 - inference | batch=5, size=128x128: 233 ± 6 ms
13.2 - training | batch=1, size=128x128: 556 ± 6 ms
14/19. ICNet
14.1 - inference | batch=5, size=1024x1536: 349 ± 4 ms
14.2 - training | batch=10, size=1024x1536: 1506 ± 7 ms
15/19. PSPNet
15.1 - inference | batch=5, size=720x720: 1086 ± 10 ms
15.2 - training | batch=1, size=512x512: 398 ± 7 ms
16/19. DeepLab
16.1 - inference | batch=2, size=512x512: 672 ± 4 ms
16.2 - training | batch=1, size=384x384: 474 ± 4 ms
17/19. Pixel-RNN
17.1 - inference | batch=50, size=64x64: 989 ± 7 ms
17.2 - training | batch=10, size=64x64: 2643 ± 7 ms
18/19. LSTM-Sentiment
18.1 - inference | batch=100, size=1024x300: 681 ± 13 ms
18.2 - training | batch=10, size=1024x300: 1388 ± 10 ms
19/19. GNMT-Translation
19.1 - inference | batch=1, size=1x20: 335 ± 5 ms
Device Inference Score: 6468
Device Training Score: 5271
Device AI Score: 11739
For more information and results, please visit http://ai-benchmark.com/alpha
How do I run a single model using ai-benchmarks?
> How do I run a single model using ai-benchmarks?
I don't think it's possible without modifying the AI Benchmark scripts. You could (after pip-installing the package, for example) modify the loop in run_tests (ai_benchmark/utils.py) to skip the models you're not interested in.
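As a sketch of what that filtering could look like (the internals of run_tests vary between versions, so the helper and test list below are hypothetical; the model names are taken from the benchmark's console output):

```python
# Hypothetical sketch: keep only the benchmark tests you care about.
# Inside run_tests (ai_benchmark/utils.py) the tests are iterated in a
# loop; filtering the list before that loop lets you run a single model.

def select_tests(tests, wanted):
    """Return only the tests whose name contains `wanted` (case-insensitive)."""
    return [t for t in tests if wanted.lower() in t.lower()]

all_tests = ["MobileNet-V2", "Inception-V3", "Inception-V4", "U-Net", "LSTM-Sentiment"]
print(select_tests(all_tests, "u-net"))  # ['U-Net']
```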
My benchmark run fails after the 8th test...
>> AI-Benchmark-v.0.1.2
>> Let the AI Games begin..
* TF Version: 1.15.5
* Platform: Windows-10-10.0.22000-SP0
* CPU: N/A
* CPU RAM: 7 GB
The benchmark is running...
The tests might take up to 20 minutes
Please don't interrupt the script
1/19. MobileNet-V2
1.1 - inference | batch=50, size=224x224: 132 ± 2 ms
1.2 - training | batch=50, size=224x224: 693 ± 1 ms
2/19. Inception-V3
2.1 - inference | batch=20, size=346x346: 150 ± 2 ms
2.2 - training | batch=20, size=346x346: 483 ± 2 ms
3/19. Inception-V4
3.1 - inference | batch=10, size=346x346: 162 ± 2 ms
3.2 - training | batch=10, size=346x346: 555 ± 11 ms
4/19. Inception-ResNet-V2
4.1 - inference | batch=10, size=346x346: 182 ± 2 ms
4.2 - training | batch=8, size=346x346: 514 ± 2 ms
5/19. ResNet-V2-50
5.1 - inference | batch=10, size=346x346: 80.4 ± 2.9 ms
5.2 - training | batch=10, size=346x346: 266 ± 1 ms
6/19. ResNet-V2-152
6.1 - inference | batch=10, size=256x256: 117 ± 2 ms
6.2 - training | batch=10, size=256x256: 498 ± 3 ms
7/19. VGG-16
7.1 - inference | batch=20, size=224x224: 116 ± 1 ms
7.2 - training | batch=2, size=224x224: 96.9 ± 1.5 ms
8/19. SRCNN 9-5-5
8.1 - inference | batch=10, size=512x512: 203 ± 4 ms
8.2 - inference | batch=1, size=1536x1536: 183 ± 5 ms
Traceback (most recent call last):
File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\client\session.py", line 1365, in _do_call
return fn(*args)
File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\client\session.py", line 1350, in _run_fn
target_list, run_metadata)
File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\client\session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[10,64,512,512] and type float on /job:localhost/replica:0/task:0/device:DML:0 by allocator DmlAllocator
[[{{node gradients/generator/Relu_grad/ReluGrad}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "ai-test.py", line 3, in <module>
b.run()
File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\ai_benchmark\__init__.py", line 64, in run
use_CPU=self.use_CPU, precision=precision, _type="full", start_dir=self.cwd)
File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\ai_benchmark\utils.py", line 635, in run_tests
sess.run(train_step, feed_dict={input_: data, target_: target})
File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\client\session.py", line 956, in run
run_metadata_ptr)
File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\client\session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\client\session.py", line 1359, in _do_run
run_metadata)
File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\client\session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[10,64,512,512] and type float on /job:localhost/replica:0/task:0/device:DML:0 by allocator DmlAllocator
[[node gradients/generator/Relu_grad/ReluGrad (defined at C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\framework\ops.py:1762) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
Original stack trace for 'gradients/generator/Relu_grad/ReluGrad':
File "ai-test.py", line 3, in <module>
b.run()
File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\ai_benchmark\__init__.py", line 64, in run
use_CPU=self.use_CPU, precision=precision, _type="full", start_dir=self.cwd)
File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\ai_benchmark\utils.py", line 615, in run_tests
subTest.optimizer, subTest.learning_rate, testInfo.tf_ver_2)
File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\ai_benchmark\utils.py", line 202, in constructOptimizer
train_step = optimizer.minimize(loss_)
File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\training\optimizer.py", line 403, in minimize
grad_loss=grad_loss)
File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\training\optimizer.py", line 512, in compute_gradients
colocate_gradients_with_ops=colocate_gradients_with_ops)
File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\ops\gradients_impl.py", line 158, in gradients
unconnected_gradients)
File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\ops\gradients_util.py", line 679, in _GradientsHelper
lambda: grad_fn(op, *out_grads))
File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\ops\gradients_util.py", line 350, in _MaybeCompile
return grad_fn() # Exit early
File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\ops\gradients_util.py", line 679, in <lambda>
lambda: grad_fn(op, *out_grads))
File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\ops\nn_grad.py", line 415, in _ReluGrad
return gen_nn_ops.relu_grad(grad, op.outputs[0])
File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\ops\gen_nn_ops.py", line 11732, in relu_grad
"ReluGrad", gradients=gradients, features=features, name=name)
File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\framework\op_def_library.py", line 794, in _apply_op_helper
op_def=op_def)
File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\util\deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\framework\ops.py", line 3371, in create_op
attrs, op_def, compute_device)
File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\framework\ops.py", line 3440, in _create_op_internal
op_def=op_def)
File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\framework\ops.py", line 1762, in __init__
self._traceback = tf_stack.extract_stack()
...which was originally created as op 'generator/Relu', defined at:
File "ai-test.py", line 3, in <module>
b.run()
[elided 0 identical lines from previous traceback]
File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\ai_benchmark\__init__.py", line 64, in run
use_CPU=self.use_CPU, precision=precision, _type="full", start_dir=self.cwd)
File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\ai_benchmark\utils.py", line 557, in run_tests
input_, output_, train_vars_ = getModelSrc(test, testInfo, sess)
File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\ai_benchmark\utils.py", line 241, in getModelSrc
tf.train.import_meta_graph(test.model_src, clear_devices=True)
File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\training\saver.py", line 1453, in import_meta_graph
**kwargs)[0]
File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\training\saver.py", line 1477, in _import_meta_graph_with_return_elements
**kwargs))
File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\framework\meta_graph.py", line 809, in import_scoped_meta_graph_with_return_elements
return_elements=return_elements)
File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\util\deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\framework\importer.py", line 405, in import_graph_def
producer_op_list=producer_op_list)
File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\framework\importer.py", line 517, in _import_graph_def_internal
_ProcessNewOps(graph)
File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\framework\importer.py", line 243, in _ProcessNewOps
for new_op in graph._add_new_tf_operations(compute_devices=False): # pylint: disable=protected-access
File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\framework\ops.py", line 3575, in _add_new_tf_operations
for c_op in c_api_util.new_tf_operations(self)
File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\framework\ops.py", line 3575, in <listcomp>
for c_op in c_api_util.new_tf_operations(self)
File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\framework\ops.py", line 3465, in _create_op_from_tf_operation
ret = Operation(c_op, self)
File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\framework\ops.py", line 1762, in __init__
self._traceback = tf_stack.extract_stack()
What should I do?
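(As the hint in the traceback suggests, you can make TensorFlow dump the live allocations when an OOM happens by setting report_tensor_allocations_upon_oom in RunOptions. A minimal TF1-style sketch, shown on a toy graph rather than the benchmark itself:)

```python
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

# The RunOptions flag from the OOM hint above: if an allocation fails,
# TensorFlow reports the tensors currently allocated on the device.
run_options = tf.RunOptions(report_tensor_allocations_upon_oom=True)

x = tf.constant([1.0, 2.0, 3.0])
with tf.Session() as sess:
    # Pass the options to every sess.run you want instrumented.
    result = sess.run(x * 2.0, options=run_options)
print(result)  # [2. 4. 6.]
```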