$ nvidia-docker run --ipc=host -it -v /home/ec2-user/data:/data --network=host -v /home/ec2-user/DALI:/DALI nvcr.io/nvidia/pytorch:21.05-py3
$ cd /DALI/docs/examples/use_cases/pytorch/resnet50/
$ python -m torch.distributed.launch --nproc_per_node=1 \
--nnodes=2 --node_rank=1 \
--master_addr="ip-172-31-44-53.ec2.internal" --master_port=443 \
main.py --dali_cpu --arch resnet50 --workers 1 --batch-size 16 --epochs 1 --lr 4.096 /data
I removed the GPU training logic from the script and modified it into a version that only reads the ImageNet data with the DALI CPU backend. The updated script is located here.
The updated script works with the same nvidia-docker command as above.
$ docker run --ipc=host -it -v /home/ec2-user/data:/data -v /home/ec2-user/DALI:/DALI nvcr.io/nvidia/pytorch:21.05-py3 bash
$ cd /DALI/docs/examples/use_cases/pytorch/resnet50/
$ python main.py --dali_cpu --arch resnet50 --workers 1 --batch-size 16 --epochs 1 --lr 4.096 /data
dali device is cpu, decoder device is cpu
dlopen "libcuda.so" failed!
Traceback (most recent call last):
  File "main.py", line 291, in <module>
    main()
  File "main.py", line 186, in main
    pipe.build()
  File "/opt/conda/lib/python3.8/site-packages/nvidia/dali/pipeline.py", line 657, in build
    self._init_pipeline_backend()
  File "/opt/conda/lib/python3.8/site-packages/nvidia/dali/pipeline.py", line 562, in _init_pipeline_backend
    self._pipe = b.Pipeline(self._max_batch_size,
RuntimeError: [/opt/dali/dali/core/device_guard.cc:33] Assert on "cuInitChecked()" failed: Failed to load libcuda.so. Check your library paths and if the driver is installed correctly.
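As the traceback shows, the failure happens when `pipe.build()` tries to dlopen `libcuda.so`, even though the pipeline itself is CPU-only. As a quick diagnostic on a CPU-only host, a stdlib-only sketch (the helper name `cuda_driver_available` is my own, not a DALI API) reproduces what the backend attempts:

```python
import ctypes

def cuda_driver_available() -> bool:
    """Try to dlopen the CUDA driver library, mirroring what DALI's
    backend does during Pipeline.build().

    Returns False on a host without an NVIDIA driver, which is the
    situation that produces the RuntimeError above.
    """
    for name in ("libcuda.so", "libcuda.so.1"):
        try:
            ctypes.CDLL(name)
            return True
        except OSError:
            continue
    return False

print(cuda_driver_available())
```

Running this inside the plain `docker run` container prints `False`, matching the `dlopen "libcuda.so" failed!` message in the log.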
In short, my goal is to run the DALI data loader on an instance without a GPU, inside a Docker image without GPU support.
My questions are: