
Comments (8)

PatriceVignola commented on July 28, 2024

Hi @oscarbg, this is something we are actively looking into. As you noticed, tensorflow-directml's memory usage is very high at the moment, which is a problem when training with many batches. We will update this issue once we release a package that addresses these crashes.

oscarbg commented on July 28, 2024

It fails similarly on Vega:

>>   AI-Benchmark-v.0.1.2
>>   Let the AI Games begin..

*  TF Version: 1.15.3
*  Platform: Windows-10-10.0.19564-SP0
*  CPU: N/A
*  CPU RAM: 32 GB
*  GPU/0: N/A
*  GPU RAM: N/A GB
*  CUDA Version: 11.0
*  CUDA Build: V11.0.167

The benchmark is running...
The tests might take up to 20 minutes
Please don't interrupt the script

1/19. MobileNet-V2

1.1 - inference | batch=50, size=224x224: 106 ± 33 ms
1.2 - training  | batch=50, size=224x224: 10541 ± 176 ms

2/19. Inception-V3

2.1 - inference | batch=20, size=346x346: 1238 ± 29 ms
2020-06-28 23:35:55.556323: F tensorflow/core/common_runtime/dml/dml_command_recorder.cc:150] Check failed: (((HRESULT)((((HRESULT)0x8007000EL)))) >= 0) == true (0 vs. 1)

EDIT: on WSL2 it fails even earlier on Vega:
after
1.1 - inference | batch=50, size=224x224: 136 ± 22 ms
it crashes and kills the WSL2 process.

oscarbg commented on July 28, 2024

Thanks @PatriceVignola!
Good to know the devs are aware and working on it.

PatriceVignola commented on July 28, 2024

Hey @oscarbg, we just released tensorflow-directml 1.15.3.dev200911 with many improvements to the memory allocator. You can try it out and let us know how it goes!
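
For anyone following along, upgrading to that build is a one-line pip command (the package name and version string are the ones given above):

pip install tensorflow-directml==1.15.3.dev200911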

Also, since we have now open-sourced our fork, new tensorflow-directml issues should be opened over here.

oscarbg commented on July 28, 2024

Hi @PatriceVignola,
thanks for the update!
The new build works very nicely, and memory usage is good now.
The only remaining issue seems to be closing the performance gap with CUDA.

On a Titan V with DirectML I get:

Device Inference Score: 6468
Device Training Score: 5271
Device AI Score: 11739

On CUDA I got:

Device Inference Score: 15245
Device Training Score: 15619
Device AI Score: 30864

So basically a 2x-3x performance loss using DirectML vs. CUDA right now.

Posting the full benchmark run on the Titan V with 460.15 drivers:

>>> from ai_benchmark import AIBenchmark
>>> results = AIBenchmark().run()

>>   AI-Benchmark-v.0.1.2
>>   Let the AI Games begin..

*  TF Version: 1.15.3
*  Platform: Windows-10-10.0.20180-SP0
*  CPU: N/A
*  CPU RAM: 32 GB
*  GPU/0: N/A
*  GPU RAM: N/A GB
*  CUDA Version: N/A
*  CUDA Build: N/A

The benchmark is running...
The tests might take up to 20 minutes
Please don't interrupt the script

1/19. MobileNet-V2

1.1 - inference | batch=50, size=224x224: 56.2 ± 7.3 ms
1.2 - training  | batch=50, size=224x224: 1268 ± 10 ms

2/19. Inception-V3

2.1 - inference | batch=20, size=346x346: 87.4 ± 5.0 ms
2.2 - training  | batch=20, size=346x346: 447 ± 7 ms

3/19. Inception-V4

3.1 - inference | batch=10, size=346x346: 89.6 ± 4.8 ms
3.2 - training  | batch=10, size=346x346: 412 ± 6 ms

4/19. Inception-ResNet-V2

4.1 - inference | batch=10, size=346x346: 89.4 ± 1.8 ms
4.2 - training  | batch=8, size=346x346: 370 ± 5 ms

5/19. ResNet-V2-50

5.1 - inference | batch=10, size=346x346: 68.5 ± 2.6 ms
5.2 - training  | batch=10, size=346x346: 276 ± 5 ms

6/19. ResNet-V2-152

6.1 - inference | batch=10, size=256x256: 109 ± 4 ms
6.2 - training  | batch=10, size=256x256: 403 ± 8 ms

7/19. VGG-16

7.1 - inference | batch=20, size=224x224: 112 ± 2 ms
7.2 - training  | batch=2, size=224x224: 86.8 ± 1.9 ms

8/19. SRCNN 9-5-5

8.1 - inference | batch=10, size=512x512: 131 ± 3 ms
8.2 - inference | batch=1, size=1536x1536: 117 ± 4 ms
8.3 - training  | batch=10, size=512x512: 719 ± 13 ms

9/19. VGG-19 Super-Res

9.1 - inference | batch=10, size=256x256: 151 ± 3 ms
9.2 - inference | batch=1, size=1024x1024: 242 ± 4 ms
9.3 - training  | batch=10, size=224x224: 843 ± 9 ms

10/19. ResNet-SRGAN

10.1 - inference | batch=10, size=512x512: 176 ± 6 ms
10.2 - inference | batch=1, size=1536x1536: 159 ± 5 ms
10.3 - training  | batch=5, size=512x512: 479 ± 8 ms

11/19. ResNet-DPED

11.1 - inference | batch=10, size=256x256: 203 ± 2 ms
11.2 - inference | batch=1, size=1024x1024: 329 ± 5 ms
11.3 - training  | batch=15, size=128x128: 484 ± 5 ms

12/19. U-Net

12.1 - inference | batch=4, size=512x512: 493 ± 7 ms
12.2 - inference | batch=1, size=1024x1024: 550 ± 16 ms
12.3 - training  | batch=4, size=256x256: 488 ± 12 ms

13/19. Nvidia-SPADE

13.1 - inference | batch=5, size=128x128: 233 ± 6 ms
13.2 - training  | batch=1, size=128x128: 556 ± 6 ms

14/19. ICNet

14.1 - inference | batch=5, size=1024x1536: 349 ± 4 ms
14.2 - training  | batch=10, size=1024x1536: 1506 ± 7 ms

15/19. PSPNet

15.1 - inference | batch=5, size=720x720: 1086 ± 10 ms
15.2 - training  | batch=1, size=512x512: 398 ± 7 ms

16/19. DeepLab

16.1 - inference | batch=2, size=512x512: 672 ± 4 ms
16.2 - training  | batch=1, size=384x384: 474 ± 4 ms

17/19. Pixel-RNN

17.1 - inference | batch=50, size=64x64: 989 ± 7 ms
17.2 - training  | batch=10, size=64x64: 2643 ± 7 ms

18/19. LSTM-Sentiment

18.1 - inference | batch=100, size=1024x300: 681 ± 13 ms
18.2 - training  | batch=10, size=1024x300: 1388 ± 10 ms

19/19. GNMT-Translation

19.1 - inference | batch=1, size=1x20: 335 ± 5 ms

Device Inference Score: 6468
Device Training Score: 5271
Device AI Score: 11739

For more information and results, please visit http://ai-benchmark.com/alpha

megha1906 commented on July 28, 2024

How do I run a single model using ai-benchmark?

jstoecker commented on July 28, 2024

How do I run a single model using ai-benchmark?

I don't think it's possible without modifying the AIBenchmark scripts. You could (after pip-installing the package, for example) modify the loop in run_tests (ai_benchmark/utils.py) to skip the models that you're not interested in.
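
A minimal sketch of that approach, assuming the run_tests loop iterates over test objects that carry a model name (the list and attribute names below are hypothetical; only run_tests and the getModelSrc(test, ...) call are visible in the tracebacks in this thread):

# Inside run_tests() in ai_benchmark/utils.py -- names here are hypothetical
wanted = {"MobileNet-V2"}          # the single model you want to run
for test in tests:                 # the existing per-model loop
    if test.model not in wanted:   # skip every model you're not interested in
        continue
    ...                            # original per-model benchmarking body, unchanged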

darkar18 commented on July 28, 2024

My benchmark run fails after the 8th test...

>>   AI-Benchmark-v.0.1.2
>>   Let the AI Games begin..

*  TF Version: 1.15.5
*  Platform: Windows-10-10.0.22000-SP0
*  CPU: N/A
*  CPU RAM: 7 GB

The benchmark is running...
The tests might take up to 20 minutes
Please don't interrupt the script

1/19. MobileNet-V2

1.1 - inference | batch=50, size=224x224: 132 ± 2 ms
1.2 - training  | batch=50, size=224x224: 693 ± 1 ms

2/19. Inception-V3

2.1 - inference | batch=20, size=346x346: 150 ± 2 ms
2.2 - training  | batch=20, size=346x346: 483 ± 2 ms

3/19. Inception-V4

3.1 - inference | batch=10, size=346x346: 162 ± 2 ms
3.2 - training  | batch=10, size=346x346: 555 ± 11 ms

4/19. Inception-ResNet-V2

4.1 - inference | batch=10, size=346x346: 182 ± 2 ms
4.2 - training  | batch=8, size=346x346: 514 ± 2 ms

5/19. ResNet-V2-50

5.1 - inference | batch=10, size=346x346: 80.4 ± 2.9 ms
5.2 - training  | batch=10, size=346x346: 266 ± 1 ms

6/19. ResNet-V2-152

6.1 - inference | batch=10, size=256x256: 117 ± 2 ms
6.2 - training  | batch=10, size=256x256: 498 ± 3 ms

7/19. VGG-16

7.1 - inference | batch=20, size=224x224: 116 ± 1 ms
7.2 - training  | batch=2, size=224x224: 96.9 ± 1.5 ms

8/19. SRCNN 9-5-5

8.1 - inference | batch=10, size=512x512: 203 ± 4 ms
8.2 - inference | batch=1, size=1536x1536: 183 ± 5 ms
Traceback (most recent call last):
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\client\session.py", line 1365, in _do_call
    return fn(*args)
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\client\session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\client\session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[10,64,512,512] and type float on /job:localhost/replica:0/task:0/device:DML:0 by allocator DmlAllocator
         [[{{node gradients/generator/Relu_grad/ReluGrad}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "ai-test.py", line 3, in <module>
    b.run()
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\ai_benchmark\__init__.py", line 64, in run
    use_CPU=self.use_CPU, precision=precision, _type="full", start_dir=self.cwd)
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\ai_benchmark\utils.py", line 635, in run_tests
    sess.run(train_step, feed_dict={input_: data, target_: target})
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\client\session.py", line 956, in run
    run_metadata_ptr)
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\client\session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\client\session.py", line 1359, in _do_run
    run_metadata)
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\client\session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[10,64,512,512] and type float on /job:localhost/replica:0/task:0/device:DML:0 by allocator DmlAllocator
         [[node gradients/generator/Relu_grad/ReluGrad (defined at C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\framework\ops.py:1762) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.


Original stack trace for 'gradients/generator/Relu_grad/ReluGrad':
  File "ai-test.py", line 3, in <module>
    b.run()
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\ai_benchmark\__init__.py", line 64, in run
    use_CPU=self.use_CPU, precision=precision, _type="full", start_dir=self.cwd)
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\ai_benchmark\utils.py", line 615, in run_tests
    subTest.optimizer, subTest.learning_rate, testInfo.tf_ver_2)
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\ai_benchmark\utils.py", line 202, in constructOptimizer
    train_step = optimizer.minimize(loss_)
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\training\optimizer.py", line 403, in minimize
    grad_loss=grad_loss)
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\training\optimizer.py", line 512, in compute_gradients
    colocate_gradients_with_ops=colocate_gradients_with_ops)
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\ops\gradients_impl.py", line 158, in gradients
    unconnected_gradients)
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\ops\gradients_util.py", line 679, in _GradientsHelper
    lambda: grad_fn(op, *out_grads))
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\ops\gradients_util.py", line 350, in _MaybeCompile
    return grad_fn()  # Exit early
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\ops\gradients_util.py", line 679, in <lambda>
    lambda: grad_fn(op, *out_grads))
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\ops\nn_grad.py", line 415, in _ReluGrad
    return gen_nn_ops.relu_grad(grad, op.outputs[0])
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\ops\gen_nn_ops.py", line 11732, in relu_grad
    "ReluGrad", gradients=gradients, features=features, name=name)
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\framework\op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\util\deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\framework\ops.py", line 3371, in create_op
    attrs, op_def, compute_device)
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\framework\ops.py", line 3440, in _create_op_internal
    op_def=op_def)
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\framework\ops.py", line 1762, in __init__
    self._traceback = tf_stack.extract_stack()

...which was originally created as op 'generator/Relu', defined at:
  File "ai-test.py", line 3, in <module>
    b.run()
[elided 0 identical lines from previous traceback]
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\ai_benchmark\__init__.py", line 64, in run
    use_CPU=self.use_CPU, precision=precision, _type="full", start_dir=self.cwd)
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\ai_benchmark\utils.py", line 557, in run_tests
    input_, output_, train_vars_ = getModelSrc(test, testInfo, sess)
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\ai_benchmark\utils.py", line 241, in getModelSrc
    tf.train.import_meta_graph(test.model_src, clear_devices=True)
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\training\saver.py", line 1453, in import_meta_graph
    **kwargs)[0]
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\training\saver.py", line 1477, in _import_meta_graph_with_return_elements
    **kwargs))
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\framework\meta_graph.py", line 809, in import_scoped_meta_graph_with_return_elements
    return_elements=return_elements)
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\util\deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\framework\importer.py", line 405, in import_graph_def
    producer_op_list=producer_op_list)
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\framework\importer.py", line 517, in _import_graph_def_internal
    _ProcessNewOps(graph)
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\framework\importer.py", line 243, in _ProcessNewOps
    for new_op in graph._add_new_tf_operations(compute_devices=False):  # pylint: disable=protected-access
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\framework\ops.py", line 3575, in _add_new_tf_operations
    for c_op in c_api_util.new_tf_operations(self)
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\framework\ops.py", line 3575, in <listcomp>
    for c_op in c_api_util.new_tf_operations(self)
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\framework\ops.py", line 3465, in _create_op_from_tf_operation
    ret = Operation(c_op, self)
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\framework\ops.py", line 1762, in __init__
    self._traceback = tf_stack.extract_stack()

What should I do?
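
The hint in the error message refers to TF1's RunOptions; a minimal sketch of enabling it around the failing session call (sess, train_step, and the feed dict appear in the traceback above):

import tensorflow as tf

# Ask TF to log the list of live tensor allocations if an OOM is raised
run_options = tf.RunOptions(report_tensor_allocations_upon_oom=True)
sess.run(train_step, feed_dict={input_: data, target_: target}, options=run_options)

Note that the failing tensor alone (shape [10, 64, 512, 512], float32) is about 640 MB, so on a machine reporting only 7 GB of RAM the usual mitigation is to lower the benchmark's batch sizes.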
