
Comments (17)

DavidDoukhan commented on June 23, 2024

Dear @miguel-negrao , thanks for this report.
In order to help us fix this issue, could you provide the output of the command nvidia-smi, launched outside of the docker container?

miguel-negrao commented on June 23, 2024
Mon Feb 15 19:47:38 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.39       Driver Version: 460.39       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 2060    Off  | 00000000:01:00.0  On |                  N/A |
| N/A   52C    P8     8W /  N/A |    496MiB /  5926MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               

Note that I don't have CUDA 11.2 installed. From what I've read in multiple places, nvidia-smi reports the highest CUDA version the driver supports, not the CUDA toolkit version actually installed.

apt list --installed | grep cuda

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

cuda-cudart-10-1/unknown,now 10.1.243-1 amd64 [installed,automatic]
cuda-cufft-10-1/unknown,now 10.1.243-1 amd64 [installed,automatic]
cuda-curand-10-1/unknown,now 10.1.243-1 amd64 [installed,automatic]
cuda-cusolver-10-1/unknown,now 10.1.243-1 amd64 [installed,automatic]
cuda-cusparse-10-1/unknown,now 10.1.243-1 amd64 [installed,automatic]
cuda-libraries-10-1/unknown,now 10.1.243-1 amd64 [installed]
cuda-license-10-1/unknown,now 10.1.243-1 amd64 [installed,automatic]
cuda-npp-10-1/unknown,now 10.1.243-1 amd64 [installed,automatic]
cuda-nvgraph-10-1/unknown,now 10.1.243-1 amd64 [installed,automatic]
cuda-nvjpeg-10-1/unknown,now 10.1.243-1 amd64 [installed,automatic]
cuda-nvrtc-10-1/unknown,now 10.1.243-1 amd64 [installed,automatic]
cuda-repo-ubuntu1804/unknown,now 10.0.130-1 amd64 [installed,upgradable to: 10.2.89-1]
libcudnn7/now 7.6.5.32-1+cuda10.1 amd64 [installed,local]
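
For a cross-check of what TensorFlow itself was built against (independent of the CUDA version nvidia-smi advertises), a couple of lines like the following can be run in the environment or container in question. This is a minimal sketch and assumes TensorFlow >= 2.3, where tf.sysconfig.get_build_info() is available:

import tensorflow as tf

# CUDA/cuDNN versions this TensorFlow binary was compiled against
print(tf.sysconfig.get_build_info())
# GPUs TensorFlow can actually see at runtime
print(tf.config.list_physical_devices('GPU'))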

DavidDoukhan commented on June 23, 2024

It's quite strange behavior.
I believe the problems you're facing are related to a mismatch between your NVIDIA drivers and your CUDA version.

I guess the docker image ships its own CUDA components, but they still need to be compatible with your drivers.
It would be difficult for me to change the drivers on my machines to run the tests for you; instead, I would suggest changing the tensorflow image used in the Dockerfile to a more recent one.

In other words, I would suggest modifying the first line of the Dockerfile:
FROM tensorflow/tensorflow:2.3.0-gpu-jupyter

into
FROM tensorflow/tensorflow:2.4.0-gpu-jupyter
or
FROM tensorflow/tensorflow:2.4.1-gpu-jupyter

Could you try this trick and let me know if it fixes your problem?

DavidDoukhan commented on June 23, 2024

You'll find below the driver information for machines on which the docker image works fine.
It seems these drivers are older than yours.
I guess your driver configuration may require a more recent version of tensorflow than the one provided in the default docker image.

NVIDIA-SMI 418.56 Driver Version: 418.56 CUDA Version: 10.1
NVIDIA-SMI 450.80.02 Driver Version: 450.80.02 CUDA Version: 11.0

miguel-negrao commented on June 23, 2024

I will try downgrading the driver. I thought that drivers were backwards compatible with older versions of the CUDA runtime...

miguel-negrao commented on June 23, 2024

Downgraded to 418.56

Mon Feb 15 20:50:31 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.56       Driver Version: 418.56       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 2060    Off  | 00000000:01:00.0  On |                  N/A |
| N/A   52C    P8     1W /  N/A |    316MiB /  5896MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               

I get the same error:

coreClock: 1.2GHz coreCount: 30 deviceMemorySize: 5.76GiB deviceMemoryBandwidth: 312.97GiB/s
2021-02-15 20:49:20.360036: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2021-02-15 20:49:20.360068: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2021-02-15 20:49:20.360085: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2021-02-15 20:49:20.360101: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2021-02-15 20:49:20.360117: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2021-02-15 20:49:20.360133: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2021-02-15 20:49:20.360149: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2021-02-15 20:49:20.360207: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-15 20:49:20.360863: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-15 20:49:20.361474: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2021-02-15 20:49:20.361495: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2021-02-15 20:49:21.005766: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-02-15 20:49:21.005795: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263]      0 
2021-02-15 20:49:21.005804: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0:   N 
2021-02-15 20:49:21.006184: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-15 20:49:21.006992: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-15 20:49:21.007698: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 5045 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2060, pci bus id: 0000:01:00.0, compute capability: 7.5)
batch_processing 1 files
1/1 [('./dn-1-44.1-10.csv', 0, 'ok')]
2021-02-15 20:49:23.483746: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 43416912 exceeds 10% of free system memory.
2021-02-15 20:49:23.728978: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2021-02-15 20:49:24.033760: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2021-02-15 20:49:25.223208: E tensorflow/stream_executor/cuda/cuda_dnn.cc:328] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2021-02-15 20:49:25.232554: E tensorflow/stream_executor/cuda/cuda_dnn.cc:328] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
Traceback (most recent call last):
  File "/usr/local/bin/ina_speech_segmenter.py", line 77, in <module>
    seg.batch_process(input_files, output_files, verbose=True)
  File "/usr/local/lib/python3.6/dist-packages/inaSpeechSegmenter/segmenter.py", line 288, in batch_process
    lseg = self.segment_feats(mspec, loge, difflen, 0)
  File "/usr/local/lib/python3.6/dist-packages/inaSpeechSegmenter/segmenter.py", line 239, in segment_feats
    lseg = self.vad(mspec, lseg, difflen)
  File "/usr/local/lib/python3.6/dist-packages/inaSpeechSegmenter/segmenter.py", line 138, in __call__
    rawpred = self.nn.predict(batch, batch_size=self.batch_size)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py", line 130, in _method_wrapper
    return method(self, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py", line 1599, in predict
    tmp_batch_outputs = predict_function(iterator)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 780, in __call__
    result = self._call(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 846, in _call
    return self._concrete_stateful_fn._filtered_call(canon_args, canon_kwds)  # pylint: disable=protected-access
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1848, in _filtered_call
    cancellation_manager=cancellation_manager)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1924, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 550, in call
    ctx=ctx)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.UnknownError:  Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[node sequential_3/conv2d_12/Conv2D (defined at /lib/python3.6/dist-packages/inaSpeechSegmenter/segmenter.py:138) ]] [Op:__inference_predict_function_2269]

Function call stack:
predict_function

DavidDoukhan commented on June 23, 2024

Did you manage to run any other program using tensorflow 2.3.0 and cudnn?

miguel-negrao commented on June 23, 2024

I did not attempt that.

miguel-negrao commented on June 23, 2024

Sorry, actually I did. I can run Mozilla's DeepSpeech, which is built with tensorflow v2.3.0-6-g23ad988, but it only runs with a driver version higher than 418; with 460, for instance, it runs fine.

2021-02-15 21:29:12.507114: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
TensorFlow: v2.3.0-6-g23ad988
DeepSpeech: v0.9.3-0-gf2e9c85
2021-02-15 21:29:12.512584: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-02-15 21:29:12.514048: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2021-02-15 21:29:12.568676: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-15 21:29:12.568996: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce RTX 2060 computeCapability: 7.5
coreClock: 1.2GHz coreCount: 30 deviceMemorySize: 5.79GiB deviceMemoryBandwidth: 312.97GiB/s
2021-02-15 21:29:12.569012: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2021-02-15 21:29:12.574885: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2021-02-15 21:29:12.577498: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2021-02-15 21:29:12.579103: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2021-02-15 21:29:12.581661: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2021-02-15 21:29:12.582685: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2021-02-15 21:29:12.585920: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2021-02-15 21:29:12.585996: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-15 21:29:12.586391: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-15 21:29:12.587223: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2021-02-15 21:29:12.961778: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-02-15 21:29:12.961797: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263]      0 
2021-02-15 21:29:12.961804: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0:   N 
2021-02-15 21:29:12.961903: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-15 21:29:12.962204: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-15 21:29:12.962478: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-15 21:29:12.962813: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 5138 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2060, pci bus id: 0000:01:00.0, compute capability: 7.5)
2021-02-15 21:29:13.044230: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10

With a 418.x driver I get "attempting to perform BLAS operation using StreamExecutor without BLAS support".

miguel-negrao commented on June 23, 2024

I've compiled a simple cudnn example from here.

export LIBRARY_PATH=/usr/local/cuda-10.2/lib64:${LIBRARY_PATH}
make

It runs without problem:

./RNN 20 2 512 64 2
Forward: 2479 GFLOPS
Backward: 2679 GFLOPS, (2261 GFLOPS), (3287 GFLOPS)
i checksum 5.749536E+05     c checksum 4.365091E+05     h checksum 5.774818E+04
di checksum 3.842186E+02    dc checksum 9.323786E+03    dh checksum 1.182566E+01
dw checksum 4.313455E+08

ldd RNN
        linux-vdso.so.1 (0x00007ffcc63a7000)
        libcublas.so.10 => /usr/local/cuda-10.2/lib64/libcublas.so.10 (0x00007f9ff4f23000)
        libcudnn.so.7 => /usr/lib/x86_64-linux-gnu/libcudnn.so.7 (0x00007f9fdb36a000)
        libcudart.so.10.1 => /usr/local/cuda-10.1/lib64/libcudart.so.10.1 (0x00007f9fdb0ee000)
        libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f9fdaf6a000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f9fdaf60000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f9fdaf3f000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f9fdaf38000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f9fdadb5000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f9fdad9b000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f9fdabda000)
        libcublasLt.so.10 => /usr/local/cuda-10.2/lib64/libcublasLt.so.10 (0x00007f9fd8d45000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f9ff91e8000)

If I add some code to print GPU info:

  // Query the number of CUDA devices and print basic properties for each one.
  int nDevices;

  cudaGetDeviceCount(&nDevices);
  for (int i = 0; i < nDevices; i++) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, i);
    printf("Device Number: %d\n", i);
    printf("  Device name: %s\n", prop.name);
    printf("  Memory Clock Rate (KHz): %d\n",
           prop.memoryClockRate);
    printf("  Memory Bus Width (bits): %d\n",
           prop.memoryBusWidth);
    // Peak bandwidth = 2 (DDR) * memory clock * bus width in bytes
    printf("  Peak Memory Bandwidth (GB/s): %f\n\n",
           2.0*prop.memoryClockRate*(prop.memoryBusWidth/8)/1.0e6);
  }

Then I get

./RNN 20 2 512 64 3
Device Number: 0
  Device name: GeForce RTX 2060
  Memory Clock Rate (KHz): 7001000
  Memory Bus Width (bits): 192
  Peak Memory Bandwidth (GB/s): 336.048000

Forward: 2238 GFLOPS
Backward: 2635 GFLOPS, (2255 GFLOPS), (3169 GFLOPS)
i checksum 6.358978E+05     h checksum 6.281680E+04
di checksum 6.296609E+00    dh checksum 2.289960E+05
dw checksum 5.397424E+07

This was done using
nvidia driver 460.4
CUDA: 10.1.243
cuDNN: 7.6.5.32

Info on Cuda installation:

$ apt list --installed | grep cuda

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

cuda-command-line-tools-10-1/unknown,now 10.1.243-1 amd64 [installed,automatic]
cuda-compiler-10-1/unknown,now 10.1.243-1 amd64 [installed,automatic]
cuda-cudart-10-1/unknown,now 10.1.243-1 amd64 [installed,automatic]
cuda-cudart-dev-10-1/unknown,now 10.1.243-1 amd64 [installed,automatic]
cuda-cufft-10-1/unknown,now 10.1.243-1 amd64 [installed,automatic]
cuda-cufft-dev-10-1/unknown,now 10.1.243-1 amd64 [installed,automatic]
cuda-cuobjdump-10-1/unknown,now 10.1.243-1 amd64 [installed,automatic]
cuda-cupti-10-1/unknown,now 10.1.243-1 amd64 [installed,automatic]
cuda-curand-10-1/unknown,now 10.1.243-1 amd64 [installed,automatic]
cuda-curand-dev-10-1/unknown,now 10.1.243-1 amd64 [installed,automatic]
cuda-cusolver-10-1/unknown,now 10.1.243-1 amd64 [installed,automatic]
cuda-cusolver-dev-10-1/unknown,now 10.1.243-1 amd64 [installed,automatic]
cuda-cusparse-10-1/unknown,now 10.1.243-1 amd64 [installed,automatic]
cuda-cusparse-dev-10-1/unknown,now 10.1.243-1 amd64 [installed,automatic]
cuda-documentation-10-1/unknown,now 10.1.243-1 amd64 [installed,automatic]
cuda-driver-dev-10-1/unknown,now 10.1.243-1 amd64 [installed,automatic]
cuda-gdb-10-1/unknown,now 10.1.243-1 amd64 [installed,automatic]
cuda-gpu-library-advisor-10-1/unknown,now 10.1.243-1 amd64 [installed,automatic]
cuda-libraries-10-1/unknown,now 10.1.243-1 amd64 [installed]
cuda-libraries-dev-10-1/unknown,now 10.1.243-1 amd64 [installed,automatic]
cuda-license-10-1/unknown,now 10.1.243-1 amd64 [installed,automatic]
cuda-memcheck-10-1/unknown,now 10.1.243-1 amd64 [installed,automatic]
cuda-misc-headers-10-1/unknown,now 10.1.243-1 amd64 [installed,automatic]
cuda-npp-10-1/unknown,now 10.1.243-1 amd64 [installed,automatic]
cuda-npp-dev-10-1/unknown,now 10.1.243-1 amd64 [installed,automatic]
cuda-nsight-10-1/unknown,now 10.1.243-1 amd64 [installed,automatic]
cuda-nsight-compute-10-1/unknown,now 10.1.243-1 amd64 [installed,automatic]
cuda-nsight-systems-10-1/unknown,now 10.1.243-1 amd64 [installed,automatic]
cuda-nvcc-10-1/unknown,now 10.1.243-1 amd64 [installed,automatic]
cuda-nvdisasm-10-1/unknown,now 10.1.243-1 amd64 [installed,automatic]
cuda-nvgraph-10-1/unknown,now 10.1.243-1 amd64 [installed,automatic]
cuda-nvgraph-dev-10-1/unknown,now 10.1.243-1 amd64 [installed,automatic]
cuda-nvjpeg-10-1/unknown,now 10.1.243-1 amd64 [installed,automatic]
cuda-nvjpeg-dev-10-1/unknown,now 10.1.243-1 amd64 [installed,automatic]
cuda-nvml-dev-10-1/unknown,now 10.1.243-1 amd64 [installed,automatic]
cuda-nvprof-10-1/unknown,now 10.1.243-1 amd64 [installed,automatic]
cuda-nvprune-10-1/unknown,now 10.1.243-1 amd64 [installed,automatic]
cuda-nvrtc-10-1/unknown,now 10.1.243-1 amd64 [installed,automatic]
cuda-nvrtc-dev-10-1/unknown,now 10.1.243-1 amd64 [installed,automatic]
cuda-nvtx-10-1/unknown,now 10.1.243-1 amd64 [installed,automatic]
cuda-nvvp-10-1/unknown,now 10.1.243-1 amd64 [installed,automatic]
cuda-repo-ubuntu1804/unknown,now 10.0.130-1 amd64 [installed,upgradable to: 10.2.89-1]
cuda-samples-10-1/unknown,now 10.1.243-1 amd64 [installed,automatic]
cuda-sanitizer-api-10-1/unknown,now 10.1.243-1 amd64 [installed,automatic]
cuda-toolkit-10-1/unknown,now 10.1.243-1 amd64 [installed]
cuda-tools-10-1/unknown,now 10.1.243-1 amd64 [installed,automatic]
cuda-visual-tools-10-1/unknown,now 10.1.243-1 amd64 [installed,automatic]
libcudnn7-dev/now 7.6.5.32-1+cuda10.1 amd64 [installed,local]
libcudnn7/now 7.6.5.32-1+cuda10.1 amd64 [installed,local]

nvidia-smi:
| NVIDIA-SMI 460.39       Driver Version: 460.39       CUDA Version: 11.2     |

DavidDoukhan commented on June 23, 2024

Right now, I do not know the reason for this issue...
I may do some tests tomorrow on a GPU machine using different driver configurations.

Here are my suggestions:

If you need to process a small number of files, you could use only the CPU and avoid these cuDNN issues altogether. To do this, remove the "--gpus all" option from docker's command line. This is the quickest trick to try if you only need results on small amounts of data.
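
For anyone running the segmenter directly from Python rather than through Docker, a minimal sketch of the same CPU-only fallback is to hide the GPU before inaSpeechSegmenter (and therefore TensorFlow) is imported; the input path below is just a placeholder:

import os

# Hide all GPUs from CUDA before TensorFlow initialises, forcing CPU-only inference
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

from inaSpeechSegmenter import Segmenter

seg = Segmenter()
segmentation = seg('./example.wav')  # placeholder input file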

If you have some more time, I would suggest changing the Dockerfile to use more recent versions of tensorflow & cudnn.
In other words, changing the line

FROM tensorflow/tensorflow:2.3.0-gpu-jupyter

into
FROM tensorflow/tensorflow:2.4.0-gpu-jupyter
or
FROM tensorflow/tensorflow:2.4.1-gpu-jupyter

Let me know if any of these options works for you.

Kind regards, and sorry for the inconvenience.

miguel-negrao commented on June 23, 2024

I've also tested a simple keras example from here.

It runs correctly.

The environment was set up as follows:

virtualenv -p python3 kerasEnv
source kerasEnv/bin/activate
pip install tensorflow-gpu==2.3.0 keras

The output:

python3 keras-io/examples/keras_recipes/antirectifier.py 
2021-02-16 10:18:44.746433: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
11493376/11490434 [==============================] - 1s 0us/step
60000 train samples
10000 test samples
2021-02-16 10:18:47.734782: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2021-02-16 10:18:47.770107: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-16 10:18:47.770420: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce RTX 2060 computeCapability: 7.5
coreClock: 1.2GHz coreCount: 30 deviceMemorySize: 5.79GiB deviceMemoryBandwidth: 312.97GiB/s
2021-02-16 10:18:47.770440: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2021-02-16 10:18:47.771565: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2021-02-16 10:18:47.779928: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2021-02-16 10:18:47.791318: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2021-02-16 10:18:47.804959: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2021-02-16 10:18:47.812665: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2021-02-16 10:18:47.817837: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2021-02-16 10:18:47.817941: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-16 10:18:47.818321: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-16 10:18:47.818606: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2021-02-16 10:18:47.818895: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-02-16 10:18:47.824684: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2208000000 Hz
2021-02-16 10:18:47.825492: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4ad1b10 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-02-16 10:18:47.825510: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2021-02-16 10:18:47.907968: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-16 10:18:47.908435: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4b3d530 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2021-02-16 10:18:47.908451: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): GeForce RTX 2060, Compute Capability 7.5
2021-02-16 10:18:47.908582: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-16 10:18:47.908886: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce RTX 2060 computeCapability: 7.5
coreClock: 1.2GHz coreCount: 30 deviceMemorySize: 5.79GiB deviceMemoryBandwidth: 312.97GiB/s
2021-02-16 10:18:47.908906: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2021-02-16 10:18:47.908921: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2021-02-16 10:18:47.908930: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2021-02-16 10:18:47.908940: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2021-02-16 10:18:47.908948: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2021-02-16 10:18:47.908957: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2021-02-16 10:18:47.908966: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2021-02-16 10:18:47.909002: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-16 10:18:47.909288: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-16 10:18:47.909545: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2021-02-16 10:18:47.909565: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2021-02-16 10:18:48.253697: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-02-16 10:18:48.253727: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263]      0 
2021-02-16 10:18:48.253733: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0:   N 
2021-02-16 10:18:48.253877: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-16 10:18:48.254196: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-16 10:18:48.254472: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4769 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2060, pci bus id: 0000:01:00.0, compute capability: 7.5)
2021-02-16 10:18:48.377276: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 159936000 exceeds 10% of free system memory.
Epoch 1/20
2021-02-16 10:18:48.833628: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
377/399 [===========================>..] - ETA: 0s - loss: 0.3892 - sparse_categorical_accuracy: 0.88582021-02-16 10:18:49.757790: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 28224000 exceeds 10% of free system memory.
399/399 [==============================] - 1s 2ms/step - loss: 0.3799 - sparse_categorical_accuracy: 0.8887 - val_loss: 0.1907 - val_sparse_categorical_accuracy: 0.9436
Epoch 2/20
399/399 [==============================] - 1s 2ms/step - loss: 0.1768 - sparse_categorical_accuracy: 0.9511 - val_loss: 0.3933 - val_sparse_categorical_accuracy: 0.8974
Epoch 3/20
399/399 [==============================] - 1s 2ms/step - loss: 0.1414 - sparse_categorical_accuracy: 0.9617 - val_loss: 0.1320 - val_sparse_categorical_accuracy: 0.9689
Epoch 4/20
399/399 [==============================] - 1s 2ms/step - loss: 0.1144 - sparse_categorical_accuracy: 0.9695 - val_loss: 0.1651 - val_sparse_categorical_accuracy: 0.9607
Epoch 5/20
399/399 [==============================] - 1s 2ms/step - loss: 0.1012 - sparse_categorical_accuracy: 0.9734 - val_loss: 0.1102 - val_sparse_categorical_accuracy: 0.9741
Epoch 6/20
399/399 [==============================] - 1s 2ms/step - loss: 0.0890 - sparse_categorical_accuracy: 0.9765 - val_loss: 0.1160 - val_sparse_categorical_accuracy: 0.9751
Epoch 7/20
399/399 [==============================] - 1s 2ms/step - loss: 0.0830 - sparse_categorical_accuracy: 0.9787 - val_loss: 0.1200 - val_sparse_categorical_accuracy: 0.9767
Epoch 8/20
399/399 [==============================] - 1s 2ms/step - loss: 0.0739 - sparse_categorical_accuracy: 0.9810 - val_loss: 0.1387 - val_sparse_categorical_accuracy: 0.9719
Epoch 9/20
399/399 [==============================] - 1s 2ms/step - loss: 0.0699 - sparse_categorical_accuracy: 0.9824 - val_loss: 0.1475 - val_sparse_categorical_accuracy: 0.9746
Epoch 10/20
399/399 [==============================] - 1s 2ms/step - loss: 0.0701 - sparse_categorical_accuracy: 0.9835 - val_loss: 0.1089 - val_sparse_categorical_accuracy: 0.9803
Epoch 11/20
399/399 [==============================] - 1s 2ms/step - loss: 0.0616 - sparse_categorical_accuracy: 0.9849 - val_loss: 0.1276 - val_sparse_categorical_accuracy: 0.9747
Epoch 12/20
399/399 [==============================] - 1s 2ms/step - loss: 0.0584 - sparse_categorical_accuracy: 0.9861 - val_loss: 0.1671 - val_sparse_categorical_accuracy: 0.9720
Epoch 13/20
399/399 [==============================] - 1s 2ms/step - loss: 0.0562 - sparse_categorical_accuracy: 0.9874 - val_loss: 0.1305 - val_sparse_categorical_accuracy: 0.9776
Epoch 14/20
399/399 [==============================] - 1s 2ms/step - loss: 0.0567 - sparse_categorical_accuracy: 0.9873 - val_loss: 0.2097 - val_sparse_categorical_accuracy: 0.9718
Epoch 15/20
399/399 [==============================] - 1s 2ms/step - loss: 0.0529 - sparse_categorical_accuracy: 0.9884 - val_loss: 0.1503 - val_sparse_categorical_accuracy: 0.9780
Epoch 16/20
399/399 [==============================] - 1s 2ms/step - loss: 0.0550 - sparse_categorical_accuracy: 0.9885 - val_loss: 0.1728 - val_sparse_categorical_accuracy: 0.9738
Epoch 17/20
399/399 [==============================] - 1s 2ms/step - loss: 0.0514 - sparse_categorical_accuracy: 0.9895 - val_loss: 0.1413 - val_sparse_categorical_accuracy: 0.9796
Epoch 18/20
399/399 [==============================] - 1s 2ms/step - loss: 0.0543 - sparse_categorical_accuracy: 0.9896 - val_loss: 0.1742 - val_sparse_categorical_accuracy: 0.9791
Epoch 19/20
399/399 [==============================] - 1s 2ms/step - loss: 0.0522 - sparse_categorical_accuracy: 0.9900 - val_loss: 0.1967 - val_sparse_categorical_accuracy: 0.9769
Epoch 20/20
399/399 [==============================] - 1s 2ms/step - loss: 0.0562 - sparse_categorical_accuracy: 0.9892 - val_loss: 0.2591 - val_sparse_categorical_accuracy: 0.9687
2021-02-16 10:19:02.702304: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 31360000 exceeds 10% of free system memory.
313/313 [==============================] - 0s 742us/step - loss: 0.2965 - sparse_categorical_accuracy: 0.9659
Tue Feb 16 10:19:00 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.39       Driver Version: 460.39       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 2060    Off  | 00000000:01:00.0  On |                  N/A |
| N/A   77C    P2    58W /  N/A |   5694MiB /  5926MiB |     42%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
...
|    0   N/A  N/A     23819      C   python3                          5069MiB |
...
+-----------------------------------------------------------------------------+

I think the NVIDIA drivers are backwards compatible with older CUDA versions, since I have no problem running these examples or Mozilla's DeepSpeech.

DavidDoukhan commented on June 23, 2024

I've got an additional question: the keras example you used does not seem to take advantage of convolutional layers.
If you've got some more time, could you do a similar test using the following keras example:
https://github.com/keras-team/keras-io/blob/master/examples/vision/mnist_convnet.py
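
If downloading that script is inconvenient, a minimal sketch along the following lines (arbitrary layer sizes and random data, not the linked example) should exercise the same cuDNN convolution path and reproduce the failure whenever cuDNN cannot initialise:

import numpy as np
import tensorflow as tf

# Tiny Conv2D model fitted on random data, just to force a cuDNN convolution
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(8, 3, activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
x = np.random.rand(64, 28, 28, 1).astype("float32")
y = np.random.randint(0, 10, size=(64,))
model.fit(x, y, epochs=1, verbose=2)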

miguel-negrao commented on June 23, 2024

I've got an additional question: the keras example you used does not seem to take advantage of convolutional layers.
If you've got some more time, could you do a similar test using the following keras example:
https://github.com/keras-team/keras-io/blob/master/examples/vision/mnist_convnet.py

Thanks a lot for looking into this issue. :-) Indeed that example doesn't run, so we are getting closer to understanding the issue.

 python3 keras-io/examples/vision/mnist_convnet.py 
2021-02-16 10:37:38.970902: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
x_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples
2021-02-16 10:37:40.197059: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2021-02-16 10:37:40.235572: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-16 10:37:40.235965: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce RTX 2060 computeCapability: 7.5
coreClock: 1.2GHz coreCount: 30 deviceMemorySize: 5.79GiB deviceMemoryBandwidth: 312.97GiB/s
2021-02-16 10:37:40.235983: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2021-02-16 10:37:40.238312: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2021-02-16 10:37:40.239422: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2021-02-16 10:37:40.239626: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2021-02-16 10:37:40.240848: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2021-02-16 10:37:40.241520: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2021-02-16 10:37:40.244078: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2021-02-16 10:37:40.244174: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-16 10:37:40.244521: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-16 10:37:40.244804: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2021-02-16 10:37:40.245005: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-02-16 10:37:40.249929: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2208000000 Hz
2021-02-16 10:37:40.250506: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4c588e0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-02-16 10:37:40.250526: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2021-02-16 10:37:40.328646: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-16 10:37:40.329059: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4212400 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2021-02-16 10:37:40.329073: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): GeForce RTX 2060, Compute Capability 7.5
2021-02-16 10:37:40.329191: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-16 10:37:40.329469: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce RTX 2060 computeCapability: 7.5
coreClock: 1.2GHz coreCount: 30 deviceMemorySize: 5.79GiB deviceMemoryBandwidth: 312.97GiB/s
2021-02-16 10:37:40.329489: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2021-02-16 10:37:40.329501: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2021-02-16 10:37:40.329510: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2021-02-16 10:37:40.329519: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2021-02-16 10:37:40.329528: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2021-02-16 10:37:40.329536: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2021-02-16 10:37:40.329545: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2021-02-16 10:37:40.329578: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-16 10:37:40.329863: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-16 10:37:40.330123: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2021-02-16 10:37:40.330142: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2021-02-16 10:37:40.676391: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-02-16 10:37:40.676421: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263]      0 
2021-02-16 10:37:40.676428: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0:   N 
2021-02-16 10:37:40.676570: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-16 10:37:40.676919: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-02-16 10:37:40.677208: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4759 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2060, pci bus id: 0000:01:00.0, compute capability: 7.5)
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d (Conv2D)              (None, 26, 26, 32)        320       
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 13, 13, 32)        0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 11, 11, 64)        18496     
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 5, 5, 64)          0         
_________________________________________________________________
flatten (Flatten)            (None, 1600)              0         
_________________________________________________________________
dropout (Dropout)            (None, 1600)              0         
_________________________________________________________________
dense (Dense)                (None, 10)                16010     
=================================================================
Total params: 34,826
Trainable params: 34,826
Non-trainable params: 0
_________________________________________________________________
2021-02-16 10:37:40.778528: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 169344000 exceeds 10% of free system memory.
Epoch 1/15
2021-02-16 10:37:41.153945: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2021-02-16 10:37:41.324921: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2021-02-16 10:37:41.813508: E tensorflow/stream_executor/cuda/cuda_dnn.cc:328] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2021-02-16 10:37:41.822206: E tensorflow/stream_executor/cuda/cuda_dnn.cc:328] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
Traceback (most recent call last):
  File "keras-io/examples/vision/mnist_convnet.py", line 71, in <module>
    model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs, validation_split=0.1)
  File "/home/miguel/Development/IPL/investigacao/speech_to_text_dev/ina_speech_related/kerasEnv/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py", line 108, in _method_wrapper
    return method(self, *args, **kwargs)
  File "/home/miguel/Development/IPL/investigacao/speech_to_text_dev/ina_speech_related/kerasEnv/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py", line 1098, in fit
    tmp_logs = train_function(iterator)
  File "/home/miguel/Development/IPL/investigacao/speech_to_text_dev/ina_speech_related/kerasEnv/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 780, in __call__
    result = self._call(*args, **kwds)
  File "/home/miguel/Development/IPL/investigacao/speech_to_text_dev/ina_speech_related/kerasEnv/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 840, in _call
    return self._stateless_fn(*args, **kwds)
  File "/home/miguel/Development/IPL/investigacao/speech_to_text_dev/ina_speech_related/kerasEnv/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 2829, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "/home/miguel/Development/IPL/investigacao/speech_to_text_dev/ina_speech_related/kerasEnv/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1848, in _filtered_call
    cancellation_manager=cancellation_manager)
  File "/home/miguel/Development/IPL/investigacao/speech_to_text_dev/ina_speech_related/kerasEnv/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1924, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/home/miguel/Development/IPL/investigacao/speech_to_text_dev/ina_speech_related/kerasEnv/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 550, in call
    ctx=ctx)
  File "/home/miguel/Development/IPL/investigacao/speech_to_text_dev/ina_speech_related/kerasEnv/lib/python3.7/site-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.UnknownError:  Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[node sequential/conv2d/Conv2D (defined at keras-io/examples/vision/mnist_convnet.py:71) ]] [Op:__inference_train_function_777]

Function call stack:
train_function

DavidDoukhan commented on June 23, 2024

Thanks a lot for looking into this issue. :-) Indeed that example doesn't run, so we are getting closer to understanding the issue.

OK.
So I guess this means you'll need to find the right mix of NVIDIA drivers, tensorflow installation, and cudnn version to make this keras example work.

I would also suggest trying to make this work outside of a docker image first;
unfortunately, docker images do not embed NVIDIA drivers.

miguel-negrao commented on June 23, 2024

OK, after searching a bit more for the types of errors I'm getting, I was able to get the example and inaSpeechSegmenter to work fine on tensorflow 2.3.0 with CUDA 10.1 by adding the following code:

import tensorflow as tf
from tensorflow.compat.v1.keras.backend import set_session

# Allocate GPU memory on demand instead of grabbing it all at startup
config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True
config.log_device_placement = True
sess = tf.compat.v1.Session(config=config)
set_session(sess)

I have no idea why, but the problem seems to be related to the fact that, without this code, TF tries to allocate all of the GPU memory. I'm using the NVIDIA card as the main graphics card on my system, so it is also being used by graphical programs. In any case, this solves it for me. By the way, I also get the same errors on TF 2.4 and CUDA 11.0, and they are fixed in the same way. It might be worth mentioning this in the documentation, although it is difficult to determine in which cases the change is needed.
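
For reference, a TF2-style equivalent of allow_growth (an alternative sketch, not the project's official fix) is the per-GPU memory-growth flag, which must be set before any GPU operation runs:

import tensorflow as tf

# Grow GPU memory on demand instead of reserving the whole card up front
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)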

Another suggestion would be to document which version of tensorflow each version of inaSpeechSegmenter uses, at least by placing that info in the README, so that when going back to previous git versions it's possible to know which TF to use.

Again, thanks for all the help!

yujack333 commented on June 23, 2024

OK, after searching a bit more for the types of errors I'm getting, I was able to get the example and inaSpeechSegmenter to work fine on tensorflow 2.3.0 with CUDA 10.1 by adding the following code: [...]

Hi, I'm hitting the same problem. I want to ask where the code should be added. Should it go in my own code, like this:

from inaSpeechSegmenter import Segmenter
import tensorflow as tf
from tensorflow.compat.v1.keras.backend import set_session

config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True
config.log_device_placement = True
sess = tf.compat.v1.Session(config=config)
set_session(sess)

if __name__ == "__main__":
    media = './cache.wav'
    seg = Segmenter()
    segmentation = seg(media)

Looking forward to your reply! It will help me a lot. Thank you.
