Hello! First, thanks for your code; it helps a lot!
I ran into a problem while installing the TensorFlow binding. I built libwarprnnt.so successfully, and it passes the test cases "test_cpu test_gpu test_time test_time_gpu".
I then ran "cd tensorflow_binding; sudo -E CUDA=/usr/local/cuda python3 setup.py install", which appeared to work, but "sudo -E CUDA=/usr/local/cuda python3 setup.py test" failed.
Here is the output log:
#####################################log start
setup.py:63: UserWarning: Assuming tensorflow was compiled without C++11 ABI. It is generally true if you are using binary pip package. If you compiled tensorflow from source with gcc >= 5 and didn't set -D_GLIBCXX_USE_CXX11_ABI=0 during compilation, you need to set environment variable TF_CXX11_ABI=1 when compiling this bindings. Also be sure to touch some files in src to trigger recompilation. Also, you need to set (or unsed) this environment variable if getting undefined symbol: _ZN10tensorflow... errors
warnings.warn("Assuming tensorflow was compiled without C++11 ABI. "
running test
running egg_info
writing dependency_links to warprnnt_tensorflow.egg-info/dependency_links.txt
writing warprnnt_tensorflow.egg-info/PKG-INFO
writing top-level names to warprnnt_tensorflow.egg-info/top_level.txt
reading manifest file 'warprnnt_tensorflow.egg-info/SOURCES.txt'
writing manifest file 'warprnnt_tensorflow.egg-info/SOURCES.txt'
running build_ext
copying build/lib.linux-x86_64-3.5/warprnnt_tensorflow/kernels.cpython-35m-x86_64-linux-gnu.so -> warprnnt_tensorflow
/disk2/syd/warp-transducer/tensorflow_binding/setup.py:63: UserWarning: Assuming tensorflow was compiled without C++11 ABI. It is generally true if you are using binary pip package. If you compiled tensorflow from source with gcc >= 5 and didn't set -D_GLIBCXX_USE_CXX11_ABI=0 during compilation, you need to set environment variable TF_CXX11_ABI=1 when compiling this bindings. Also be sure to touch some files in src to trigger recompilation. Also, you need to set (or unsed) this environment variable if getting undefined symbol: _ZN10tensorflow... errors
warnings.warn("Assuming tensorflow was compiled without C++11 ABI. "
running test
running egg_info
writing dependency_links to warprnnt_tensorflow.egg-info/dependency_links.txt
writing warprnnt_tensorflow.egg-info/PKG-INFO
writing top-level names to warprnnt_tensorflow.egg-info/top_level.txt
reading manifest file 'warprnnt_tensorflow.egg-info/SOURCES.txt'
writing manifest file 'warprnnt_tensorflow.egg-info/SOURCES.txt'
running build_ext
copying build/lib.linux-x86_64-3.5/warprnnt_tensorflow/kernels.cpython-35m-x86_64-linux-gnu.so -> warprnnt_tensorflow
2019-05-21 15:44:57.870631: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-05-21 15:45:02.016217: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties:
name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:02:00.0
totalMemory: 11.90GiB freeMemory: 11.74GiB
2019-05-21 15:45:02.607413: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 1 with properties:
name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:03:00.0
totalMemory: 11.90GiB freeMemory: 11.74GiB
2019-05-21 15:45:03.151200: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 2 with properties:
name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:82:00.0
totalMemory: 11.90GiB freeMemory: 11.74GiB
2019-05-21 15:45:03.778234: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 3 with properties:
name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:83:00.0
totalMemory: 11.90GiB freeMemory: 11.74GiB
2019-05-21 15:45:03.781227: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0, 1, 2, 3
2019-05-21 15:45:05.104860: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-05-21 15:45:05.104906: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0 1 2 3
2019-05-21 15:45:05.104912: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N Y N N
2019-05-21 15:45:05.104915: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 1: Y N N N
2019-05-21 15:45:05.104918: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 2: N N N Y
2019-05-21 15:45:05.104921: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 3: N N Y N
2019-05-21 15:45:05.105817: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11357 MB memory) -> physical GPU (device: 0, name: TITAN Xp, pci bus id: 0000:02:00.0, compute capability: 6.1)
2019-05-21 15:45:05.208888: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 11357 MB memory) -> physical GPU (device: 1, name: TITAN Xp, pci bus id: 0000:03:00.0, compute capability: 6.1)
2019-05-21 15:45:05.312473: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 11357 MB memory) -> physical GPU (device: 2, name: TITAN Xp, pci bus id: 0000:82:00.0, compute capability: 6.1)
2019-05-21 15:45:05.415492: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 11357 MB memory) -> physical GPU (device: 3, name: TITAN Xp, pci bus id: 0000:83:00.0, compute capability: 6.1)
[4.280653 3.9384367]
[array([[[[-1.86843872e-01, -6.25548586e-02, 2.49398723e-01],
[-2.03376621e-01, 2.02399313e-01, 9.77304531e-04],
[-1.41016081e-01, 7.91234747e-02, 6.18926138e-02]],
[[-1.15518123e-02, -8.12801942e-02, 9.28320065e-02],
[-1.54257059e-01, 2.29432628e-01, -7.51755759e-02],
[-2.46593088e-01, 1.46404624e-01, 1.00188479e-01]],
[[-1.29182916e-02, -6.15932457e-02, 7.45115355e-02],
[-5.59857599e-02, 2.19830781e-01, -1.63845018e-01],
[-4.97627079e-01, 2.09239975e-01, 2.88387090e-01]],
[[ 1.36048663e-02, -3.02196350e-02, 1.66147687e-02],
[ 1.13924518e-01, 6.27811924e-02, -1.76705718e-01],
[-6.67078257e-01, 3.67658824e-01, 2.99419463e-01]]],
[[[-3.56343716e-01, -5.53474464e-02, 4.11691159e-01],
[-9.69219282e-02, 2.94591114e-02, 6.74628168e-02],
[-6.35175407e-02, 2.76544876e-02, 3.58630568e-02]],
[[-1.54498979e-01, -7.39419907e-02, 2.28440970e-01],
[-1.66789874e-01, -8.78970968e-05, 1.66877761e-01],
[-1.72369599e-01, 1.05565295e-01, 6.68042973e-02]],
[[ 2.38749050e-02, -1.18255846e-01, 9.43809450e-02],
[-1.04707167e-01, -1.08934328e-01, 2.13641495e-01],
[-3.69844109e-01, 1.80117995e-01, 1.89726129e-01]],
[[ 2.57137045e-02, -7.94617534e-02, 5.37480488e-02],
[ 1.22328207e-01, -2.38788620e-01, 1.16460413e-01],
[-5.98686934e-01, 3.02203119e-01, 2.96483815e-01]]]],
dtype=float32)]
test_forward (test_warprnnt_op.WarpRNNTTest) ... 2019-05-21 15:45:05.675877: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0, 1, 2, 3
2019-05-21 15:45:05.676081: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-05-21 15:45:05.676101: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0 1 2 3
2019-05-21 15:45:05.676113: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N Y N N
2019-05-21 15:45:05.676123: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 1: Y N N N
2019-05-21 15:45:05.676132: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 2: N N N Y
2019-05-21 15:45:05.676142: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 3: N N Y N
2019-05-21 15:45:05.676874: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11357 MB memory) -> physical GPU (device: 0, name: TITAN Xp, pci bus id: 0000:02:00.0, compute capability: 6.1)
2019-05-21 15:45:05.677075: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 11357 MB memory) -> physical GPU (device: 1, name: TITAN Xp, pci bus id: 0000:03:00.0, compute capability: 6.1)
2019-05-21 15:45:05.677804: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 11357 MB memory) -> physical GPU (device: 2, name: TITAN Xp, pci bus id: 0000:82:00.0, compute capability: 6.1)
2019-05-21 15:45:05.678136: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 11357 MB memory) -> physical GPU (device: 3, name: TITAN Xp, pci bus id: 0000:83:00.0, compute capability: 6.1)
[4.4956665]
ok
test_multiple_batches_cpu (test_warprnnt_op.WarpRNNTTest) ... /disk2/syd/warp-transducer/tensorflow_binding/tests/test_warprnnt_op.py:14: DeprecationWarning: Please use assertEqual instead.
self.assertEquals(acts.shape, expected_grads.shape)
2019-05-21 15:45:05.746685: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0, 1, 2, 3
2019-05-21 15:45:05.746852: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-05-21 15:45:05.746871: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0 1 2 3
2019-05-21 15:45:05.746882: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N Y N N
2019-05-21 15:45:05.746891: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 1: Y N N N
2019-05-21 15:45:05.746899: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 2: N N N Y
2019-05-21 15:45:05.746908: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 3: N N Y N
2019-05-21 15:45:05.747531: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11357 MB memory) -> physical GPU (device: 0, name: TITAN Xp, pci bus id: 0000:02:00.0, compute capability: 6.1)
2019-05-21 15:45:05.747657: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 11357 MB memory) -> physical GPU (device: 1, name: TITAN Xp, pci bus id: 0000:03:00.0, compute capability: 6.1)
2019-05-21 15:45:05.747763: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 11357 MB memory) -> physical GPU (device: 2, name: TITAN Xp, pci bus id: 0000:82:00.0, compute capability: 6.1)
2019-05-21 15:45:05.747888: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 11357 MB memory) -> physical GPU (device: 3, name: TITAN Xp, pci bus id: 0000:83:00.0, compute capability: 6.1)
ok
test_multiple_batches_gpu (test_warprnnt_op.WarpRNNTTest) ... 2019-05-21 15:45:05.769992: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0, 1, 2, 3
2019-05-21 15:45:05.770181: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-05-21 15:45:05.770202: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0 1 2 3
2019-05-21 15:45:05.770215: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N Y N N
2019-05-21 15:45:05.770226: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 1: Y N N N
2019-05-21 15:45:05.770237: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 2: N N N Y
2019-05-21 15:45:05.770247: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 3: N N Y N
2019-05-21 15:45:05.770935: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/device:GPU:0 with 11357 MB memory) -> physical GPU (device: 0, name: TITAN Xp, pci bus id: 0000:02:00.0, compute capability: 6.1)
2019-05-21 15:45:05.771051: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/device:GPU:1 with 11357 MB memory) -> physical GPU (device: 1, name: TITAN Xp, pci bus id: 0000:03:00.0, compute capability: 6.1)
2019-05-21 15:45:05.771161: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/device:GPU:2 with 11357 MB memory) -> physical GPU (device: 2, name: TITAN Xp, pci bus id: 0000:82:00.0, compute capability: 6.1)
2019-05-21 15:45:05.771274: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/device:GPU:3 with 11357 MB memory) -> physical GPU (device: 3, name: TITAN Xp, pci bus id: 0000:83:00.0, compute capability: 6.1)
2019-05-21 15:45:05.789642: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0, 1, 2, 3
2019-05-21 15:45:05.789855: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-05-21 15:45:05.789882: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0 1 2 3
2019-05-21 15:45:05.789896: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N Y N N
2019-05-21 15:45:05.789908: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 1: Y N N N
2019-05-21 15:45:05.789919: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 2: N N N Y
2019-05-21 15:45:05.789931: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 3: N N Y N
2019-05-21 15:45:05.790727: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11357 MB memory) -> physical GPU (device: 0, name: TITAN Xp, pci bus id: 0000:02:00.0, compute capability: 6.1)
2019-05-21 15:45:05.790897: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 11357 MB memory) -> physical GPU (device: 1, name: TITAN Xp, pci bus id: 0000:03:00.0, compute capability: 6.1)
2019-05-21 15:45:05.791044: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 11357 MB memory) -> physical GPU (device: 2, name: TITAN Xp, pci bus id: 0000:82:00.0, compute capability: 6.1)
2019-05-21 15:45:05.791222: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 11357 MB memory) -> physical GPU (device: 3, name: TITAN Xp, pci bus id: 0000:83:00.0, compute capability: 6.1)
ok
test_session (test_warprnnt_op.WarpRNNTTest)
Returns a TensorFlow Session for use in executing tests. ... ok
Ran 4 tests in 0.159s
OK
test_basic (unittest.loader._FailedTest) ... ERROR
test_warprnnt_op (unittest.loader._FailedTest) ... ERROR
======================================================================
ERROR: test_basic (unittest.loader._FailedTest)
ImportError: Failed to import test module: test_basic
Traceback (most recent call last):
File "/usr/lib/python3.5/unittest/loader.py", line 428, in _find_test_path
module = self._get_module_from_name(name)
File "/usr/lib/python3.5/unittest/loader.py", line 369, in _get_module_from_name
__import__(name)
File "/disk2/syd/warp-transducer/tensorflow_binding/tests/test_basic.py", line 3, in <module>
from warprnnt_tensorflow import rnnt_loss
File "/disk2/syd/warp-transducer/tensorflow_binding/warprnnt_tensorflow/__init__.py", line 37, in <module>
@ops.RegisterGradient("WarpRNNT")
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 2285, in __call__
_gradient_registry.register(f, self._op_type)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/registry.py", line 62, in register
(self._name, name, function_name, filename, line_number))
KeyError: "Registering two gradient with name 'WarpRNNT' !(Previous registration was in setup /usr/lib/python3.5/distutils/core.py:148)"
======================================================================
ERROR: test_warprnnt_op (unittest.loader._FailedTest)
ImportError: Failed to import test module: test_warprnnt_op
Traceback (most recent call last):
File "/usr/lib/python3.5/unittest/loader.py", line 428, in _find_test_path
module = self._get_module_from_name(name)
File "/usr/lib/python3.5/unittest/loader.py", line 369, in _get_module_from_name
__import__(name)
File "/disk2/syd/warp-transducer/tensorflow_binding/tests/test_warprnnt_op.py", line 3, in <module>
from warprnnt_tensorflow import rnnt_loss
File "/disk2/syd/warp-transducer/tensorflow_binding/warprnnt_tensorflow/__init__.py", line 37, in <module>
@ops.RegisterGradient("WarpRNNT")
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 2285, in __call__
_gradient_registry.register(f, self._op_type)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/registry.py", line 62, in register
(self._name, name, function_name, filename, line_number))
KeyError: "Registering two gradient with name 'WarpRNNT' !(Previous registration was in setup /usr/lib/python3.5/distutils/core.py:148)"
Ran 2 tests in 0.000s
FAILED (errors=2)
Test failed: <unittest.runner.TextTestResult run=2 errors=2 failures=0>
error: Test failed: <unittest.runner.TextTestResult run=2 errors=2 failures=0>
####################################log end
BTW, when I ran "python3 tests/test_basic.py" and "python3 tests/test_warprnnt_op.py" directly, both passed. I'm confused about the difference: it seems that "python setup.py test" runs the test cases twice and reports the errors on the second run.
Do you have any idea what is happening here?
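My guess at what the KeyError means, as a minimal sketch: the toy registry below is not TensorFlow's code, and the names Registry, load_package_body and run_twice are made up for illustration. It only shows that if the body of the package's __init__.py were executed twice in one process (for example, once from the installed copy and once from the source tree), the second registration of the same name would raise. I have not confirmed that this is what "setup.py test" actually does.

```python
class Registry:
    """Toy name -> function registry (an illustration, not TensorFlow's class)."""

    def __init__(self, name):
        self._name = name
        self._entries = {}

    def register(self, key, func):
        # Refuse a second registration under the same key.
        if key in self._entries:
            raise KeyError("Registering two %s with name '%s'!" % (self._name, key))
        self._entries[key] = func


def load_package_body(registry):
    # Plays the role of the package's __init__.py, whose body registers the gradient.
    def _warprnnt_grad(op, grad):  # hypothetical gradient function
        return grad

    registry.register("WarpRNNT", _warprnnt_grad)


def run_twice():
    """Execute the 'package body' twice against one registry; return the error text."""
    registry = Registry("gradient")
    load_package_body(registry)        # first execution succeeds
    try:
        load_package_body(registry)    # second execution in the same process fails
    except KeyError as exc:
        return exc.args[0]
    return None


print(run_twice())
```

Running each test file directly imports the package only once, which would be consistent with those runs passing.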
When I use this interface (warprnnt_tensorflow.rnnt_loss) in my own TensorFlow training program (with TIMIT as the training dataset), I get a "CUDA_ERROR_ILLEGAL_ADDRESS" error at random: sometimes it occurs after 1 step, sometimes after several steps. The log is:
###start###
2019-05-21 15:05:25.985187: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:649] failed to record completion event; therefore, failed to create inter-stream dependency
2019-05-21 15:05:25.985270: I tensorflow/stream_executor/stream.cc:4793] stream 0x7fdeba0adbb0 did not memcpy host-to-device; source: 0x104e2486900
2019-05-21 15:05:25.985289: E tensorflow/stream_executor/stream.cc:318] Error recording event in stream: error recording CUDA event on stream 0x7fdeba0adc80: CUDA_ERROR_ILLEGAL_ADDRESS; not marking stream as bad, as the Event object may be at fault. Monitor for further errors.
2019-05-21 15:05:25.985312: E tensorflow/stream_executor/cuda/cuda_event.cc:48] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS
2019-05-21 15:05:25.985329: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:206] Unexpected Event status: 1
Aborted
###end###
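In case the illegal address comes from my inputs rather than from the binding, a sanity check along these lines could be run on each batch before calling rnnt_loss. It is only a sketch under assumptions: that acts has shape (batch, T, U+1, vocab), that labels are padded per utterance, and that the blank index is 0. check_rnnt_inputs is a hypothetical helper, not part of warprnnt_tensorflow.

```python
def check_rnnt_inputs(acts_shape, labels, input_lengths, label_lengths, blank_label=0):
    """Return a list of problems; an empty list means every check passed.

    Assumed layout (not confirmed against the binding): acts_shape is
    (batch, max_T, max_U + 1, vocab) and labels[b] holds the padded targets.
    """
    problems = []
    batch, max_t, max_u_plus_1, vocab = acts_shape
    if not (len(labels) == len(input_lengths) == len(label_lengths) == batch):
        return ["batch size differs between acts, labels and length vectors"]
    for b in range(batch):
        t_len, u_len = input_lengths[b], label_lengths[b]
        # The number of frames must fit inside the time axis of acts.
        if not 0 < t_len <= max_t:
            problems.append("utt %d: input length %d outside (0, %d]" % (b, t_len, max_t))
        # The label axis of acts must hold u_len labels plus one extra slot.
        if not 0 <= u_len < max_u_plus_1:
            problems.append("utt %d: label length %d does not fit U+1=%d" % (b, u_len, max_u_plus_1))
        if u_len > len(labels[b]):
            problems.append("utt %d: label length %d exceeds padded width" % (b, u_len))
        # Every real (unpadded) label must be a valid, non-blank class index.
        for u, lab in enumerate(labels[b][:u_len]):
            if not 0 <= lab < vocab:
                problems.append("utt %d: label %d at position %d outside [0, %d)" % (b, lab, u, vocab))
            elif lab == blank_label:
                problems.append("utt %d: blank label at position %d" % (b, u))
    return problems


ok = check_rnnt_inputs((2, 4, 3, 61), [[5, 7], [9, 0]], [4, 3], [2, 1])
bad = check_rnnt_inputs((2, 4, 3, 61), [[5, 61], [9, 0]], [4, 5], [2, 1])
print(ok)
print(bad)
```

The second call is deliberately broken in two ways: a label index equal to the vocabulary size, and an input length longer than the time axis.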
Thanks for your time. I would really appreciate a reply.
tf.version: 1.10.1
cuda: 9.0
GPU: TITAN Xp 12G
platform: Linux Ubuntu 16.04.3 LTS
corpus: TIMIT (123-dim fbank to 61 phones)