baidu-research / deepbench

Benchmarking Deep Learning operations on different hardware

License: Apache License 2.0

Makefile 4.14% Shell 6.67% C 10.23% C++ 51.17% Cuda 27.78%

deepbench's Issues

Make DeepBench compile cleanly with provided libs on RHEL

The DeepBench makefiles assume that libraries and header files (e.g. for libmpi) are located in the same directory tree. For reasons I have yet to discern, RHEL installs the libraries in one directory structure and the headers in another (from the -devel packages). After applying the patch below and adding "MPI_INCLUDE_PATH=/usr/include/openmpi-x86_64", DeepBench builds cleanly on RHEL. I've coded the change so that it shouldn't require any changes on systems that keep libraries and header files in the same directory.

diff -ur DeepBench.orig/code/baidu_allreduce/Makefile DeepBench/code/baidu_allreduce/Makefile
--- DeepBench.orig/code/baidu_allreduce/Makefile 2017-12-14 14:53:03.255428367 -0500
+++ DeepBench/code/baidu_allreduce/Makefile 2017-12-12 12:15:19.396655494 -0500
@@ -6,6 +6,7 @@
 CUDA_PATH?=/usr/local/cuda
 CUDA_LIB64=$(CUDA_PATH)/lib64
 MPI_PATH?=/usr/local/openmpi
+MPI_INCLUDE_PATH?=/usr/local/openmpi
 BAIDU_ALLREDUCE_PATH?=/local/baidu-allreduce
 BIN_DIR?=bin
 MKDIR=mkdir -p
@@ -21,8 +22,8 @@
 
 ring_all_reduce:
 	$(MKDIR) $(BIN_DIR)
-	$(MPI_PATH)/bin/$(CC) -c -std=c++11 -I $(MPI_PATH)/include -I $(BAIDU_ALLREDUCE_PATH) -I $(CUDA_PATH)/include -I $(KERNELS_DIR) -DOMPI_SKIP_MPICXX= ring_all_reduce_mpi.cpp -o $(BIN_DIR)/ring_all_reduce_mpi.o
-	$(CUDA_PATH)/bin/$(NVCC) -c -std=c++11 -I $(MPI_PATH)/include -I $(BAIDU_ALLREDUCE_PATH) -I $(CUDA_PATH)/include -DOMPI_SKIP_MPICXX= $(BAIDU_ALLREDUCE_PATH)/collectives.cu -o $(BIN_DIR)/collectives.o
+	$(MPI_PATH)/bin/$(CC) -c -std=c++11 -I $(MPI_INCLUDE_PATH) -I $(BAIDU_ALLREDUCE_PATH) -I $(CUDA_PATH)/include -I $(KERNELS_DIR) -DOMPI_SKIP_MPICXX= ring_all_reduce_mpi.cpp -o $(BIN_DIR)/ring_all_reduce_mpi.o
+	$(CUDA_PATH)/bin/$(NVCC) -c -std=c++11 -I $(MPI_INCLUDE_PATH) -I $(BAIDU_ALLREDUCE_PATH) -I $(CUDA_PATH)/include -DOMPI_SKIP_MPICXX= $(BAIDU_ALLREDUCE_PATH)/collectives.cu -o $(BIN_DIR)/collectives.o
 	$(MPI_PATH)/bin/$(CC) -o $(BIN_DIR)/ring_all_reduce $(BIN_DIR)/ring_all_reduce_mpi.o $(BIN_DIR)/collectives.o -L$(CUDA_PATH)/lib64 -L$(MPI_PATH)/lib -lcudart -lmpi -DOMPI_SKIP_MPICXX=
 
 clean:
diff -ur DeepBench.orig/code/nvidia/Makefile DeepBench/code/nvidia/Makefile
--- DeepBench.orig/code/nvidia/Makefile 2017-12-14 14:53:03.257428373 -0500
+++ DeepBench/code/nvidia/Makefile 2017-12-12 11:49:20.379360097 -0500
@@ -7,6 +7,7 @@
 CUDNN_PATH?=/usr/local/cudnn
 NCCL_PATH?=/usr/local/nccl
 MPI_PATH?=/usr/local/openmpi
+MPI_INCLUDE_PATH?=/usr/include/openmpi
 BIN_DIR?=bin
 MKDIR=mkdir -p
 #BLAS
@@ -45,7 +46,7 @@
 
 nccl_mpi:
 	$(MKDIR) $(BIN_DIR)
-	$(CUDA_PATH)/bin/$(NVCC) nccl_mpi_all_reduce.cu -o $(BIN_DIR)/nccl_mpi_all_reduce -I $(KERNELS_DIR) -I $(NCCL_PATH)/include/ -I $(CUDNN_PATH)/include/ -I $(MPI_PATH)/include -L $(NCCL_PATH)/lib/ -L $(CUDNN_PATH)/lib64 -L $(MPI_PATH)/lib -lnccl -lcurand -lcudart -lmpi $(NVCC_ARCH_ARGS) -std=c++11
+	$(CUDA_PATH)/bin/$(NVCC) nccl_mpi_all_reduce.cu -o $(BIN_DIR)/nccl_mpi_all_reduce -I $(KERNELS_DIR) -I $(NCCL_PATH)/include/ -I $(CUDNN_PATH)/include/ -I $(MPI_INCLUDE_PATH) -L $(NCCL_PATH)/lib/ -L $(CUDNN_PATH)/lib64 -L $(MPI_PATH)/lib -lnccl -lcurand -lcudart -lmpi $(NVCC_ARCH_ARGS) -std=c++11
 
 sparse:
 	$(MKDIR) $(BIN_DIR)
diff -ur DeepBench.orig/code/osu_allreduce/Makefile DeepBench/code/osu_allreduce/Makefile
--- DeepBench.orig/code/osu_allreduce/Makefile 2017-12-14 14:53:03.258428376 -0500
+++ DeepBench/code/osu_allreduce/Makefile 2017-12-12 11:51:07.655654032 -0500
@@ -3,6 +3,7 @@
 CC_FLAGS= -c -O2 -pthread -Wall -march=native
 
 MPI_PATH?=/usr/local/openmpi
+MPI_INCLUDE_PATH?=/usr/local/openmpi
 CUDA_PATH?=/usr/local/cuda
 MKDIR=mkdir -p
 BIN_DIR?=bin
@@ -17,10 +18,10 @@
 
 coll:
 	$(MKDIR) $(BIN_DIR)
-	$(CC) -o $(BIN_DIR)/osu_coll.o $(CC_FLAGS) -I$(CUDA_PATH)/include -I$(MPI_PATH)/include osu_coll.c
+	$(CC) -o $(BIN_DIR)/osu_coll.o $(CC_FLAGS) -I$(CUDA_PATH)/include -I$(MPI_INCLUDE_PATH) osu_coll.c
 
 allreduce:
-	$(CC) -o $(BIN_DIR)/osu_allreduce.o $(CC_FLAGS) -I $(KERNELS_DIR) -I$(CUDA_PATH)/include -I$(MPI_PATH)/include osu_allreduce.c
+	$(CC) -o $(BIN_DIR)/osu_allreduce.o $(CC_FLAGS) -I $(KERNELS_DIR) -I$(CUDA_PATH)/include -I$(MPI_INCLUDE_PATH) osu_allreduce.c
 
 clean:
 	rm -rf $(BIN_DIR)

Error while executing the binary files

Hi,
I get errors when running the benchmarks. I compiled with CUDA 7.5, gcc 4.9.4, and cuDNN 5.0.5, and I am trying to run the generated binaries
"nccl_mpi_all_reduce" and
"nccl_single_all_reduce"
with
mpirun -np 1 bin/nccl_mpi_all_reduce
and
srun ./bin/nccl_single_all_reduce 1
respectively, and I am getting the following errors:

load gcc/4.9.4 (PATH, MANPATH, LD_LIBRARY_PATH)
Set GNU compilers as MPI wrappers backend
load CUDNN/5.0.5 (LD_LIBRARY_PATH, LIBRARY_PATH, C_INCLUDE_PATH, CPLUS_INCLUDE_PATH)
terminate called after throwing an instance of 'thrust::system::system_error'
what(): function_attributes(): after cudaFuncGetAttributes: invalid device function
[nva10:06393] *** Process received signal ***
[nva10:06393] Signal: Aborted (6)
[nva10:06393] Signal code: (-6)
[nva10:06393] [ 0] /lib64/libpthread.so.0() [0x3b3ee0f790]
[nva10:06393] [ 1] /lib64/libc.so.6(gsignal+0x35) [0x3b3e632625]
[nva10:06393] [ 2] /lib64/libc.so.6(abort+0x175) [0x3b3e633e05]
[nva10:06393] [ 3] /apps/GCC/4.9.4/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x15d) [0x2b3c2f147acd]

and
load gcc/4.9.4 (PATH, MANPATH, LD_LIBRARY_PATH)
Set GNU compilers as MPI wrappers backend
load CUDNN/5.0.5 (LD_LIBRARY_PATH, LIBRARY_PATH, C_INCLUDE_PATH, CPLUS_INCLUDE_PATH)
terminate called after throwing an instance of 'thrust::system::system_error'
terminate called after throwing an instance of 'thrust::system::system_error'
what(): function_attributes(): after cudaFuncGetAttributes: invalid device function what(): function_attributes(): after cudaFuncGetAttributes: invalid device function
respectively.
Is there any solution for this?
Thank you so much.

Running the benchmark on AMD hardware and Windows machines?

Is there any way to do that?

Intel gemm test sizes matrices incorrectly, leading to segmentation fault

In code/intel/gemm/bench.cpp:128, the size of matrix b is calculated as std::max(sizea, max_sizeb). This is wrong and leads to a segmentation fault if the test is run with a subset of the matrix sizes from the default input set. Line 128 should read:
max_sizeb = std::max(sizeb, max_sizeb);
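
The effect of the one-line fix can be sketched outside the benchmark. The Python below is only an illustration, not code from bench.cpp: it assumes that A holds m*k elements and B holds k*n elements for each (m, n, k) problem, and the names `buffer_sizes` and `buggy` are hypothetical.

```python
def buffer_sizes(problems, buggy=False):
    """Return (max_sizea, max_sizeb) over a list of (m, n, k) problems.

    Assumption: A has m*k elements and B has k*n elements.
    With buggy=True, max_sizeb is updated from sizea, as in the reported line.
    """
    max_sizea = 0
    max_sizeb = 0
    for m, n, k in problems:
        sizea = m * k
        sizeb = k * n
        max_sizea = max(sizea, max_sizea)
        if buggy:
            max_sizeb = max(sizea, max_sizeb)  # reported behaviour
        else:
            max_sizeb = max(sizeb, max_sizeb)  # proposed fix
    return max_sizea, max_sizeb


# A single problem from the default set, run on its own.
subset = [(512, 48000, 2816)]
print(buffer_sizes(subset, buggy=True))   # B buffer sized from A
print(buffer_sizes(subset, buggy=False))  # B buffer sized from B
```

Under these assumptions the buggy variant sizes the B buffer from A, which is far too small for this problem. With the full default set, another problem with a large A may happen to hide the undersizing, which would fit the report that only subsets crash.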

Interpreting the results

I was able to compile and run the benchmarks, but I have trouble interpreting the results. The output of the Intel benchmarks is copied below. I am also not sure how to specify the variables for the benchmark.

Intel GEMM Benchmark:
~~~~~~~~~~~~~~~~~~

SGEMM(N,N,512,32,512) 24.6 usec 682.47345 GFlop/sec
SGEMM(N,N,1024,32,512) 29.6 usec 1133.19128 GFlop/sec
SGEMM(T,N,512,48000,2816) 30200.0 usec 4583.18308 GFlop/sec
SGEMM(T,N,512,48000,2048) 22696.9 usec 4435.11591 GFlop/sec
SGEMM(T,N,512,48000,2560) 27688.0 usec 4544.54369 GFlop/sec
SGEMM(T,N,512,48000,1536) 17718.4 usec 4260.96345 GFlop/sec
SGEMM(T,N,1024,48000,2816) 59694.3 usec 4637.35822 GFlop/sec
SGEMM(T,N,1024,48000,2048) 44873.5 usec 4486.53753 GFlop/sec
SGEMM(T,N,1024,48000,2560) 54299.2 usec 4634.65544 GFlop/sec
SGEMM(T,N,1024,48000,1536) 34313.9 usec 4400.40483 GFlop/sec
SGEMM(N,T,512,32,512) 25.9 usec 647.68659 GFlop/sec
SGEMM(N,T,1024,32,512) 29.5 usec 1136.88506 GFlop/sec
Total time 7512760.1 usec, Overall Performance: 3554.80515 GFlop/sec


Intel Convolution Benchmark
~~~~~~~~~~~~~~~~~~~~~~



##########################################
#   Performance - FWD (custom-Storage)   #
##########################################
GFLOP  = 1.6442
fp time = 0.00063374
GFLOPS  = 2594.4
PERFDUMP,FP,1.8.2-185,64,16,2048,512,7,7,1,1,1,0,0,0.00063374,2594.4,15346.564300,15346.564276,0.000037,1.749877,0.000001,1.745342,0.000031
##########################################
#   Performance - BWD (custom-Storage)   #
##########################################
GFLOP  = 1.6442
bp time = 0.047238
GFLOPS  = 34.806
PERFDUMP,BP,1.8.2-185,64,16,2048,512,7,7,1,1,1,0,0,0.047238,34.806,45863.489576,45863.489619,0.000034,0.138520,0.000000,0.101015,0.000031
##########################################
#   Performance - UPD (custom-Storage)   #
##########################################
GFLOP  = 1.6442
wu time = 0.00059244
GFLOPS  = 2775.2
PERFDUMP,WU,1.8.2-185,64,16,2048,512,7,7,1,1,1,0,0,0.00059244,2775.2,31919.598646,31919.598646,0.000000,0.000000,0.000000,0.000000,0.000000
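
One way to read the numbers above is to recompute the rates from the printed sizes and times. The sketch below assumes the three SGEMM arguments are the matrix dimensions and that a GEMM costs 2*m*n*k floating-point operations; it also assumes the convolution "fp time" is in seconds. The function names are hypothetical. The product 2*16*2048*512*7*7 reproduces the printed GFLOP value, but which PERFDUMP field is which dimension is a guess, not something the output states.

```python
def gemm_gflops(m, n, k, usec):
    """GFLOP/s for one GEMM, counting 2*m*n*k floating-point operations."""
    flops = 2.0 * m * n * k
    return flops / (usec * 1e-6) / 1e9


def conv_gflops(gflop, seconds):
    """GFLOP/s from a GFLOP count and a time in seconds."""
    return gflop / seconds


# SGEMM(N,N,512,32,512) at 24.6 usec, reported as 682.47345 GFlop/sec.
print(round(gemm_gflops(512, 32, 512, 24.6), 1))
# FWD convolution: GFLOP = 1.6442, fp time = 0.00063374, reported GFLOPS = 2594.4.
print(round(conv_gflops(1.6442, 0.00063374), 1))
# Operation count built from the PERFDUMP fields 16, 2048, 512, 7, 7.
print(2 * 16 * 2048 * 512 * 7 * 7 / 1e9)
```

Small differences from the reported figures are expected, because the printed times are rounded.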

How to submit results?

Hi,

I can provide results for a Tesla P100 and multiple Intel CPUs, but I was wondering what the best way to submit them is.

Thanks,
Jerome

Fail to compile on Ubuntu 16.04

OS: Ubuntu 16.04
GPU: 4x Tesla V100-SXM2, one node
CUDA: 9.1
NCCL: 2.1.15
cuDNN: 7.0
Command:
make CUDA_PATH=/usr/local/cuda CUDNN_PATH=/usr/lib/x86_64-linux-gnu MPI_PATH=/usr/local/openmpi-1.10.2_cuda9.1 NCCL_PATH=/usr/lib/x86_64-linux-gnu USE_TENSOR_CORES=1 ARCH=sm_70

Compiling the benchmark fails with the following error:
./kernels/gemm_problems.h:2:0: required from here
/usr/include/c++/6/tuple:489:65: error: mismatched argument pack lengths while expanding ‘std::is_convertible<_UElements&&, _Elements>’
return __and_<is_convertible<_UElements&&, _Elements>...>::value;
^~~~~
/usr/include/c++/6/tuple:490:1: error: body of constexpr function ‘static constexpr bool std::_TC<<anonymous>, _Elements>::_ImplicitlyMoveConvertibleTuple() [with _UElements = {const std::tuple<int, int, int, bool, bool>&}; bool <anonymous> = true; _Elements = {int, int, int, bool, bool}]’ not a return-statement
}
^
Makefile:30: recipe for target 'gemm' failed
make[1]: *** [gemm] Error 1
make[1]: Leaving directory '/root/DeepBench/code/nvidia'

Can you give me any suggestions?

Exact versions of ICC, MKL, and MPI used for the Intel benchmarks

Hi, I'm a beginner with DeepBench and I'm confused about its working environment:
1. Which exact versions of ICC, MKL, and MPI are required? Is Ubuntu 16.10 (GNU/Linux 4.8.0-22-generic x86_64) suitable, or which Ubuntu release do you suggest?
2. I'm very interested in All-Reduce, but in the command "run_allreduce_ia.sh <osu_allreduce binary>", what do the osu_allreduce binary and the hostfile refer to?

I look forward to your reply. Thank you!

Function calc_flops does not include backward flops

Hi, in DeepBench/code/intel/convolution/mkl_conv/std_conv_bench.cpp, the function calc_flops only calculates the forward flops and does not include the backward flops. Is this a bug?

Can it be compiled to an executable binary and run on the gem5 simulator, and how do I compile it?

Cuda 7.5.18 doesn't appear to support the ARCH=sm_61 compile flag

I'm looking to keep things consistent when running your benchmark against a GTX 1060; were any of the library versions changed from those listed (such as using cuda 8.0 instead of 7.5) when testing the TitanX Pascal?

Computation complexity of operators

I noticed that a formula is used to calculate the TFLOPS of each operator in the XLS files.
For recurrent layers (LSTM), it is calculated as =(8*$E296*$D296*$C296*$C296)/(G296/1000)/10^12.
It seems to include only the GEMM computation, not the computation of sigmoid or tanh.
Could you explain the formula in more detail, especially for the LSTM/GRU items? Thanks.
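
For reference, the quoted spreadsheet formula can be written out as a small function. This sketch only evaluates the formula as given. The column meanings (C as hidden size, D as batch size, E as time steps, G as time in milliseconds) are assumptions, and the sample row is made up for illustration.

```python
def lstm_tflops(c, d, e, g_ms):
    """Evaluate (8*E*D*C*C)/(G/1000)/10^12 from the spreadsheet formula.

    Assumed column meanings (not confirmed by the source):
    c = hidden size, d = batch size, e = time steps, g_ms = time in ms.
    """
    flops = 8 * e * d * c * c          # operation count used by the formula
    seconds = g_ms / 1000.0            # G is divided by 1000 in the formula
    return flops / seconds / 10**12


# Hypothetical row: C=1760, D=16, E=50, G=10 (values chosen for illustration).
print(lstm_tflops(1760, 16, 50, 10.0))
```

Written this way, the operation count depends only on the layer dimensions and has no term for the element-wise sigmoid or tanh work, which is the point the question raises.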

scripts [run_mkl_conv_ia_SKX.sh] [run_mkl_conv_ia_KNL.sh] [run_mkl_conv_ia_generic.sh] not found

In DeepBench/code/intel/convolution/mkl_conv/run_mkl_conv_ia.sh,

if lscpu | grep Flags | grep -qs avx512dq; then
    ./run_mkl_conv_ia_SKX.sh
elif lscpu | grep Flags | grep -qa avx512f; then
    ./run_mkl_conv_ia_KNL.sh
else
    ./run_mkl_conv_ia_generic.sh
fi

However, run_mkl_conv_ia_SKX.sh, run_mkl_conv_ia_KNL.sh, and run_mkl_conv_ia_generic.sh cannot be found in the repo.
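
The dispatch logic in that script can be restated as a small function, which makes the branch order explicit. This is a sketch: `pick_conv_script` is a hypothetical name, the generic script name is taken from the issue title, and the flag strings in the usage lines are made up.

```python
def pick_conv_script(flags):
    """Choose a run script name from a CPU flags string.

    Mirrors the shell branches: avx512dq first, then avx512f, else generic.
    Substring matching is used to imitate grep.
    """
    if "avx512dq" in flags:
        return "run_mkl_conv_ia_SKX.sh"
    if "avx512f" in flags:
        return "run_mkl_conv_ia_KNL.sh"
    return "run_mkl_conv_ia_generic.sh"


print(pick_conv_script("fpu sse4_2 avx2 avx512f avx512dq"))
print(pick_conv_script("fpu sse4_2 avx2 avx512f avx512pf"))
print(pick_conv_script("fpu sse4_2 avx2"))
```

The order matters: a CPU that reports both avx512f and avx512dq takes the SKX branch, so the avx512dq test has to come first.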

Deepbench with Theano/Keras on

Hello,

I want to use DeepBench for matrix multiplications and convolutions at low precision (8-bit, float16) with Theano and Keras. Can you please let me know how to integrate and use this library with Theano and Keras? What do I need to change in Theano and Keras? I am only doing inference, not training, at 8-bit etc., i.e. I don't need to compute backward propagation in low precision.

Setup: a CNN trained in Keras/Theano with FP32 weights, biases, and activations, which is then quantized to 8-bit, 16-bit, etc., so I have a quantized model. I want to run inference (the forward path) using matrix multiplications on these quantized 8-bit or 16-bit weights, biases, and activations (i.e. the matrices hold 8-bit values). This means I have to do 8-bit multiplies with 32-bit accumulation using DeepBench. I will use cuDNN v6 and CUDA 8.0 on an NVIDIA GTX 1080 Ti GPU (Pascal architecture), which supports DP4A.
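
The arithmetic described here, 8-bit multiplies accumulated into a 32-bit integer, can be modelled in a few lines. The sketch below imitates what a DP4A-style operation computes (a four-element int8 dot product added to a 32-bit accumulator). It is a plain Python model with hypothetical function names, and it uses no DeepBench, cuDNN, Theano, or Keras API.

```python
def to_int32(x):
    """Wrap a Python int to signed 32-bit two's complement."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x


def dp4a(a4, b4, acc):
    """Four-element int8 dot product added to a 32-bit accumulator."""
    for a, b in zip(a4, b4):
        assert -128 <= a <= 127 and -128 <= b <= 127, "inputs must fit in int8"
    return to_int32(acc + sum(a * b for a, b in zip(a4, b4)))


def int8_dot(a, b):
    """Dot product of two int8 vectors, four elements at a time, int32 result."""
    acc = 0
    for i in range(0, len(a), 4):
        acc = dp4a(a[i:i + 4], b[i:i + 4], acc)
    return acc


print(dp4a([1, 2, 3, 4], [5, 6, 7, 8], 10))
print(int8_dot([127] * 8, [127] * 8))
```

Accumulating in 32 bits matters because a single int8 product can already reach 16384 in magnitude, so sums over realistic vector lengths would overflow an 8-bit or 16-bit accumulator.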

unrecognized command line option ‘-mfpu=neon-fp-armv8’

./run_gemm_bench.sh
mkdir -p bin
g++ -O3 --std=c++11 -I ../kernels/ -lpthread -mfpu=neon-fp-armv8 -o bin/gemm_bench gemm_bench.cc
g++: error: unrecognized command line option ‘-mfpu=neon-fp-armv8’
make: *** [gemm] Error 1
build success!
start running!
./run_gemm_bench.sh: line 8: bin/gemm_bench: No such file or directory
running complete!
Sorry, I have been debugging this for a long time, but I can't tell where the problem is.

Cudnn 7 and Volta support/results?

V100 benchmark results: single Tesla V100 or 8 of them in DGX-1?

It's exciting to see benchmarks for the V100, especially given the recent release of the Titan V, which is expected to have similar performance.

The numbers in DeepBench look amazingly good: basically, the V100 is 3 times faster than the 1080 Ti in RNNs.

In fact, the numbers are so good that I started to doubt them: did you benchmark a single V100 or a full DGX-1 with 8 Tesla V100s?

No BackwardWeights in RNNs

What is the reason for not incorporating/benchmarking BackwardWeights at least for NVIDIA? There is no use of cudnnRNNBackwardWeights.

How to execute the benchmarks on CPU? Is a change to the makefile required?

Error in running DeepBench

Hi,
I used CUDA 8.0, cuDNN 5.0 for CUDA 8.0, and OpenMPI 1.10.2. Running gemm_bench fails with the error below. I used the following command to compile the code:
make CUDA_PATH=/usr/local/cuda-8.0 CUDNN_PATH=/home/wangweiw/cuda MPI_PATH=/home/wangweiw/openmpi-1.10.2/build NCCL_PATH=/home/wangweiw/nccl/build ARCH=sm_61
The error is:
Running training benchmark
Times

m       n      k      a_t     b_t      precision        time (usec)

terminate called after throwing an instance of 'thrust::system::system_error'
what(): function_attributes(): after cudaFuncGetAttributes: invalid device function
1760 16 1760 0 0Aborted
I don't know what goes wrong.

How long does it take to run the DeepBench script on a Xeon machine?

I am seeing it run for several hours on a KNL machine. Can someone advise which configuration settings I should check before running, from the BIOS, software, and hardware standpoints?

Support for Mobile Platforms

Recent mobile processors have CPUs, GPUs, DSPs and NPUs that are capable of Teraflop-level deep learning operations.

It would be great to have support for high-end mobile processors, such as the Snapdragon 845, with their acceleration cores.

Failed to compile NVIDIA benchmark

I'm trying to run DeepBench on my NVIDIA GTX 1080 with CUDA 8, cuDNN 5, NCCL, and OpenMPI 2.1, but I ran into the problems shown below. Can anyone help with this?

/DeepBench/code# make CUDA_PATH=/usr/local/cuda-8.0 CUDNN_PATH=/usr/local/cuda-8.0/targets/x86_64-linux MPI_PATH=/usr/local/open[196/1916]
ATH=/usr/local/nccl
mkdir -p bin
make -C nvidia
make[1]: Entering directory '/DeepBench/code/nvidia'
mkdir -p bin
/usr/local/cuda-8.0/bin/nvcc gemm_bench.cu -DUSE_TENSOR_CORES=0 -DPAD_KERNELS=1 -o bin/gemm_bench -I ../kernels/ -I /usr/local/cuda-8.0/include -L /usr/loc$
l/cuda-8.0/lib64 -lcublas -L /usr/local/cuda-8.0/lib64 -lcurand --generate-code arch=compute_52,code=sm_52 -std=c++11
mkdir -p bin
/usr/local/cuda-8.0/bin/nvcc conv_bench.cu -DUSE_TENSOR_CORES=0 -DPAD_KERNELS=1 -o bin/conv_bench -I ../kernels/ -I /usr/local/cuda-8.0/include -I /usr/loc$
l/cuda-8.0/targets/x86_64-linux/include/ -L /usr/local/cuda-8.0/targets/x86_64-linux/lib64/ -L /usr/local/cuda-8.0/lib64 -lcurand -lcudnn --generate-code a$
ch=compute_52,code=sm_52 -std=c++11
mkdir -p bin
/usr/local/cuda-8.0/bin/nvcc rnn_bench.cu -DUSE_TENSOR_CORES=0 -o bin/rnn_bench -I ../kernels/ -I /usr/local/cuda-8.0/include -I /usr/local/cuda-8.0/target$
/x86_64-linux/include/ -L /usr/local/cuda-8.0/targets/x86_64-linux/lib64/ -L /usr/local/cuda-8.0/lib64 -lcurand -lcudnn --generate-code arch=compute_52,cod$
=sm_52 -std=c++11
mkdir -p bin
/usr/local/cuda-8.0/bin/nvcc nccl_single_all_reduce.cu -o bin/nccl_single_all_reduce -I ../kernels/ -I /usr/local/nccl/include/ -I /usr/local/cuda-8.0/targ$
ts/x86_64-linux/include/ -L /usr/local/nccl/lib/ -L /usr/local/cuda-8.0/targets/x86_64-linux/lib64 -lnccl -lcudart -lcurand --generate-code arch=compute_52$
code=sm_52 -std=c++11
mkdir -p bin
/usr/local/cuda-8.0/bin/nvcc nccl_mpi_all_reduce.cu -o bin/nccl_mpi_all_reduce -I ../kernels/ -I /usr/local/nccl/include/ -I /usr/local/cuda-8.0/targets/x86
_64-linux/include/ -I /usr/local/openmpi/include -L /usr/local/nccl/lib/ -L /usr/local/cuda-8.0/targets/x86_64-linux/lib64 -L /usr/local/openmpi/lib -lnccl
-lcurand -lcudart -lmpi --generate-code arch=compute_52,code=sm_52 -std=c++11
/usr/bin/ld: warning: libopen-rte.so.20, needed by /usr/local/openmpi/lib/libmpi.so, not found (try using -rpath or -rpath-link)
/usr/bin/ld: warning: libopen-pal.so.20, needed by /usr/local/openmpi/lib/libmpi.so, not found (try using -rpath or -rpath-link)
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_argv_append' /usr/local/openmpi/lib/libmpi.so: undefined reference to mca_base_var_get'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_strerror' /usr/local/openmpi/lib/libmpi.so: undefined reference to orte_process_info'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_getcwd' /usr/local/openmpi/lib/libmpi.so: undefined reference to opal_pmix_pdata_t_class'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_hash_table_get_first_key_uint32' /usr/local/openmpi/lib/libmpi.so: undefined reference to opal_free_list_t_class'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_show_help' /usr/local/openmpi/lib/libmpi.so: undefined reference to mca_allocator_component_lookup'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_hwloc1112_hwloc_bitmap_compare' /usr/local/openmpi/lib/libmpi.so: undefined reference to opal_backtrace_buffer'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_datatype_predefined_elem_desc' /usr/local/openmpi/lib/libmpi.so: undefined reference to opal_mpool_base_framework'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_hash_table_get_value_ptr' /usr/local/openmpi/lib/libmpi.so: undefined reference to opal_thread_get_self'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_process_name_print' /usr/local/openmpi/lib/libmpi.so: undefined reference to orte_session_dir_cleanup'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_hwloc_base_get_topology' /usr/local/openmpi/lib/libmpi.so: undefined reference to opal_convertor_pack'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_hwloc_base_single_cpu' /usr/local/openmpi/lib/libmpi.so: undefined reference to opal_path_nfs'
/usr/local/openmpi/lib/libmpi.so: undefined reference to orte_errmgr' /usr/local/openmpi/lib/libmpi.so: undefined reference to opal_namelist_t_class'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_info_close_components' /usr/local/openmpi/lib/libmpi.so: undefined reference to mca_mpool_base_alloc'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_value_load' /usr/local/openmpi/lib/libmpi.so: undefined reference to opal_bitmap_clear_bit'
/usr/local/openmpi/lib/libmpi.so: undefined reference to mca_mpool_base_free' /usr/local/openmpi/lib/libmpi.so: undefined reference to opal_proc_local_set'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_pointer_array_t_class' /usr/local/openmpi/lib/libmpi.so: undefined reference to mca_base_pvar_find_by_name'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_datatype_create_desc' /usr/local/openmpi/lib/libmpi.so: undefined reference to opal_convertor_prepare_for_send'
/usr/local/openmpi/lib/libmpi.so: undefined reference to mca_base_var_enum_create_flag' /usr/local/openmpi/lib/libmpi.so: undefined reference to opal_progress_set_event_poll_rate'
/usr/local/openmpi/lib/libmpi.so: undefined reference to orte_info_show_orte_version' /usr/local/openmpi/lib/libmpi.so: undefined reference to opal_hash_table_set_value_uint32'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_pointer_array_test_and_set_item' /usr/local/openmpi/lib/libmpi.so: undefined reference to opal_free_list_grow_st'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_progress' /usr/local/openmpi/lib/libmpi.so: undefined reference to opal_mutex_t_class'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_datatype_set_element_count' /usr/local/openmpi/lib/libmpi.so: undefined reference to opal_datatype_commit'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_finalize_util' /usr/local/openmpi/lib/libmpi.so: undefined reference to opal_progress_set_yield_when_idle'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_pmix_collect_all_data' /usr/local/openmpi/lib/libmpi.so: undefined reference to opal_argv_append_unique_nosize'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_hwloc1112_hwloc_bitmap_isincluded' /usr/local/openmpi/lib/libmpi.so: undefined reference to opal_init_util'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_pmix' /usr/local/openmpi/lib/libmpi.so: undefined reference to opal_hash_table_init'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_bitmap_t_class' /usr/local/openmpi/lib/libmpi.so: undefined reference to orte_name_wildcard'
/usr/local/openmpi/lib/libmpi.so: undefined reference to mca_base_var_register' /usr/local/openmpi/lib/libmpi.so: undefined reference to opal_hash_table_remove_value_uint32'
/usr/local/openmpi/lib/libmpi.so: undefined reference to orte_rml' /usr/local/openmpi/lib/libmpi.so: undefined reference to mca_base_pvar_handle_reset'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_pmix_base_exchange' /usr/local/openmpi/lib/libmpi.so: undefined reference to opal_hash_table_set_value_ptr'
/usr/local/openmpi/lib/libmpi.so: undefined reference to orte_info_close_components' /usr/local/openmpi/lib/libmpi.so: undefined reference to opal_built_with_cuda_support'
/usr/local/openmpi/lib/libmpi.so: undefined reference to orte_in_parallel_debugger' /usr/local/openmpi/lib/libmpi.so: undefined reference to mca_mpool_base_tree_print'
/usr/local/openmpi/lib/libmpi.so: undefined reference to mca_base_framework_components_close' /usr/local/openmpi/lib/libmpi.so: undefined reference to opal_strncpy'
/usr/local/openmpi/lib/libmpi.so: undefined reference to mca_base_var_enum_create' /usr/local/openmpi/lib/libmpi.so: undefined reference to opal_convertor_raw'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_bitmap_set_bit' /usr/local/openmpi/lib/libmpi.so: undefined reference to opal_rand'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_proc_t_class' /usr/local/openmpi/lib/libmpi.so: undefined reference to opal_progress_unregister'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_argv_free' /usr/local/openmpi/lib/libmpi.so: undefined reference to opal_info_out'
/usr/local/openmpi/lib/libmpi.so: undefined reference to orte_util_print_name_args' /usr/local/openmpi/lib/libmpi.so: undefined reference to mca_base_pvar_handle_stop'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_arch_set_fortran_logical_size' /usr/local/openmpi/lib/libmpi.so: undefined reference to opal_memory_base_malloc_init_hook'
/usr/local/openmpi/lib/libmpi.so: undefined reference to mca_base_component_close' /usr/local/openmpi/lib/libmpi.so: undefined reference to opal_hwloc_topology'
/usr/local/openmpi/lib/libmpi.so: undefined reference to mca_base_var_group_get_stamp' /usr/local/openmpi/lib/libmpi.so: undefined reference to opal_allocator_base_framework'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_hwloc1112_hwloc_get_obj_by_depth' /usr/local/openmpi/lib/libmpi.so: undefined reference to mca_base_framework_is_open'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_hwloc_base_get_available_cpus' /usr/local/openmpi/lib/libmpi.so: undefined reference to mca_base_pvar_handle_read_value'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_install_dirs' /usr/local/openmpi/lib/libmpi.so: undefined reference to mca_base_var_get_count'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_thread_self_compare' /usr/local/openmpi/lib/libmpi.so: undefined reference to opal_hwloc1112_hwloc_get_type_depth'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_local_arch' /usr/local/openmpi/lib/libmpi.so: undefined reference to orte_proc_is_bound'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_free_list_init' /usr/local/openmpi/lib/libmpi.so: undefined reference to mca_base_var_group_get'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_info_register_framework_params' /usr/local/openmpi/lib/libmpi.so: undefined reference to orte_rml_recv_cb_t_class'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_pmix_base_async_modex' /usr/local/openmpi/lib/libmpi.so: undefined reference to opal_progress_event_users_increment'
/usr/local/openmpi/lib/libmpi.so: undefined reference to orte_init' /usr/local/openmpi/lib/libmpi.so: undefined reference to opal_convert_process_name_to_string'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_output_close' /usr/local/openmpi/lib/libmpi.so: undefined reference to opal_backtrace_print'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_hash_table_set_value_uint64' /usr/local/openmpi/lib/libmpi.so: undefined reference to opal_convert_string_to_process_name'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_progress_set_event_flag' /usr/local/openmpi/lib/libmpi.so: undefined reference to orte_rml_recv_callback'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_datatype_dump_data_desc' /usr/local/openmpi/lib/libmpi.so: undefined reference to opal_value_t_class'

/usr/local/openmpi/lib/libmpi.so: undefined reference to orte_util_convert_process_name_to_string' /usr/local/openmpi/lib/libmpi.so: undefined reference to opal_cuda_support'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_output_stream_t_class' /usr/local/openmpi/lib/libmpi.so: undefined reference to mca_base_var_dump'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_datatype_resize' /usr/local/openmpi/lib/libmpi.so: undefined reference to opal_argv_split'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_srand' /usr/local/openmpi/lib/libmpi.so: undefined reference to opal_datatype_dump_data_flags'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_hwloc1112_hwloc_get_cpubind' /usr/local/openmpi/lib/libmpi.so: undefined reference to opal_list_item_t_class'
/usr/local/openmpi/lib/libmpi.so: undefined reference to mca_base_var_find_by_name' /usr/local/openmpi/lib/libmpi.so: undefined reference to mca_base_select'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_btl_base_framework' /usr/local/openmpi/lib/libmpi.so: undefined reference to opal_hash_table_remove_value_ptr'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_progress_event_users_decrement' /usr/local/openmpi/lib/libmpi.so: undefined reference to opal_setenv'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_bitmap_is_set_bit' /usr/local/openmpi/lib/libmpi.so: undefined reference to opal_hash_table_get_value_uint64'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_bitmap_init' /usr/local/openmpi/lib/libmpi.so: undefined reference to orte_util_compare_name_fields'
/usr/local/openmpi/lib/libmpi.so: undefined reference to mca_base_component_to_string' /usr/local/openmpi/lib/libmpi.so: undefined reference to opal_free_list_item_t_class'
/usr/local/openmpi/lib/libmpi.so: undefined reference to mca_base_var_set_value' /usr/local/openmpi/lib/libmpi.so: undefined reference to opal_proc_for_name'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_hash_table_get_next_key_uint32' /usr/local/openmpi/lib/libmpi.so: undefined reference to mca_base_pvar_handle_write_value'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_convertor_t_class' /usr/local/openmpi/lib/libmpi.so: undefined reference to orte_proc_applied_binding'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_datatype_finalize' /usr/local/openmpi/lib/libmpi.so: undefined reference to opal_argv_join'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_condition_t_class' /usr/local/openmpi/lib/libmpi.so: undefined reference to mca_base_components_close'
/usr/local/openmpi/lib/libmpi.so: undefined reference to mca_base_var_get_value' /usr/local/openmpi/lib/libmpi.so: undefined reference to mca_base_var_register_synonym'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_list_sort' /usr/local/openmpi/lib/libmpi.so: undefined reference to mca_base_component_repository_release'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_pointer_array_add' /usr/local/openmpi/lib/libmpi.so: undefined reference to opal_output'
/usr/local/openmpi/lib/libmpi.so: undefined reference to orte_finalize' /usr/local/openmpi/lib/libmpi.so: undefined reference to opal_datatype_contain_basic_datatypes'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_abort_print_stack' /usr/local/openmpi/lib/libmpi.so: undefined reference to opal_hwloc1112_hwloc_bitmap_alloc'
/usr/local/openmpi/lib/libmpi.so: undefined reference to mca_base_component_var_register' /usr/local/openmpi/lib/libmpi.so: undefined reference to opal_compare_proc'
/usr/local/openmpi/lib/libmpi.so: undefined reference to `mca_base_var_group_get_count'

/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_info_register_project_frameworks' /usr/local/openmpi/lib/libmpi.so: undefined reference to mca_base_framework_close'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_abort_delay' /usr/local/openmpi/lib/libmpi.so: undefined reference to mca_base_pvar_get_count'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_datatype_copy_content_same_ddt' /usr/local/openmpi/lib/libmpi.so: undefined reference to opal_convertor_create'
/usr/local/openmpi/lib/libmpi.so: undefined reference to mca_base_framework_open' /usr/local/openmpi/lib/libmpi.so: undefined reference to opal_output_verbose'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_output_open' /usr/local/openmpi/lib/libmpi.so: undefined reference to mca_base_pvar_session_t_class'
/usr/local/openmpi/lib/libmpi.so: undefined reference to orte_odls' /usr/local/openmpi/lib/libmpi.so: undefined reference to opal_hwloc1112_hwloc_bitmap_free'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_pointer_array_init' /usr/local/openmpi/lib/libmpi.so: undefined reference to opal_info_make_version_str'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_argv_append_nosize' /usr/local/openmpi/lib/libmpi.so: undefined reference to mca_base_pvar_handle_alloc'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_hash_table_get_value_uint32' /usr/local/openmpi/lib/libmpi.so: undefined reference to mca_base_pvar_get'
/usr/local/openmpi/lib/libmpi.so: undefined reference to mca_base_pvar_handle_free' /usr/local/openmpi/lib/libmpi.so: undefined reference to opal_value_unload'
/usr/local/openmpi/lib/libmpi.so: undefined reference to orte_info_register_framework_params' /usr/local/openmpi/lib/libmpi.so: undefined reference to orte_session_dir_finalize'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_class_initialize' /usr/local/openmpi/lib/libmpi.so: undefined reference to orte_util_convert_string_to_process_name'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_hash_table_t_class' /usr/local/openmpi/lib/libmpi.so: undefined reference to mca_base_component_list_item_t_class'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_progress_register' /usr/local/openmpi/lib/libmpi.so: undefined reference to mca_base_var_find'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_datatype_init' /usr/local/openmpi/lib/libmpi.so: undefined reference to opal_datatype_add'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_hwloc_base_cset2mapstr' /usr/local/openmpi/lib/libmpi.so: undefined reference to opal_datatype_get_element_count'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_datatype_clone' /usr/local/openmpi/lib/libmpi.so: undefined reference to opal_info_show_opal_version'
/usr/local/openmpi/lib/libmpi.so: undefined reference to mca_base_var_group_find_by_name' /usr/local/openmpi/lib/libmpi.so: undefined reference to opal_buffer_t_class'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_convertor_prepare_for_recv' /usr/local/openmpi/lib/libmpi.so: undefined reference to mca_base_framework_components_open'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_datatype_t_class' /usr/local/openmpi/lib/libmpi.so: undefined reference to opal_bitmap_find_and_set_first_unset_bit'
/usr/local/openmpi/lib/libmpi.so: undefined reference to mca_base_pvar_handle_start' /usr/local/openmpi/lib/libmpi.so: undefined reference to opal_pointer_array_set_item'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_dss' /usr/local/openmpi/lib/libmpi.so: undefined reference to opal_rcache_base_framework'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_process_info' /usr/local/openmpi/lib/libmpi.so: undefined reference to opal_bitmap_set_max_size'
/usr/local/openmpi/lib/libmpi.so: undefined reference to opal_object_t_class' /usr/local/openmpi/lib/libmpi.so: undefined reference to opal_list_t_class'
/usr/local/openmpi/lib/libmpi.so: undefined reference to orte_standalone_operation' /usr/local/openmpi/lib/libmpi.so: undefined reference to opal_pmix_app_t_class'
/usr/local/openmpi/lib/libmpi.so: undefined reference to orte_ess' /usr/local/openmpi/lib/libmpi.so: undefined reference to opal_hwloc_base_cset2str'
/usr/local/openmpi/lib/libmpi.so: undefined reference to `opal_convertor_unpack'
collect2: error: ld returned 1 exit status
Makefile:47: recipe for target 'nccl_mpi' failed
make[1]: *** [nccl_mpi] Error 1
make[1]: Leaving directory '/DeepBench/code/nvidia'
Makefile:6: recipe for target 'nvidia' failed
make: *** [nvidia] Error 2

Failed to build convolution.cpp on arm64 platform

I'm trying to run DeepBench on my arm64 platform. I installed the ARM Compute Library and pointed ARM_COMPUTE_INCLUDE_PATH and ARM_COMPUTE_LIB_PATH to the correct paths, as follows:
ARM_COMPUTE_INCLUDE_PATH?=/root/DeepBench/code/arm/arm_compute-v17.06-bin
ARM_COMPUTE_LIB_PATH?=/root/DeepBench/code/arm/arm_compute-v17.06-bin
ARM_COMPUTE_LIB=$(ARM_COMPUTE_LIB_PATH)/lib/linux-arm64-v8a-neon

The build log follows. Can someone help with this?

root@localhost:~/DeepBench/code/arm# make conv
mkdir -p bin
g++ -O3 -std=c++11 -I /root/DeepBench/code/arm/arm_compute-v17.06-bin -I ../kernels/ -std=c++11 -larm_compute -L /root/DeepBench/code/arm/arm_compute-v17.06-bin/lib/linux-arm64-v8a-neon convolution.cpp -o bin/conv_bench
/tmp/ccGfptls.o: In function time_cnn(unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, int, unsigned int, unsigned int, unsigned int, unsigned int, int)': convolution.cpp:(.text+0x5c): undefined reference to arm_compute::Tensor::Tensor()'
convolution.cpp:(.text+0x64): undefined reference to arm_compute::Tensor::Tensor()' convolution.cpp:(.text+0x6c): undefined reference to arm_compute::Tensor::Tensor()'
convolution.cpp:(.text+0x74): undefined reference to arm_compute::Tensor::Tensor()' convolution.cpp:(.text+0xb0): undefined reference to arm_compute::NEConvolutionLayer::NEConvolutionLayer()'
convolution.cpp:(.text+0x104): undefined reference to arm_compute::Tensor::allocator()' convolution.cpp:(.text+0x120): undefined reference to arm_compute::TensorInfo::TensorInfo(arm_compute::TensorShape const&, unsigned long, arm_compute::DataType, int)'
convolution.cpp:(.text+0x12c): undefined reference to arm_compute::ITensorAllocator::init(arm_compute::TensorInfo const&)' convolution.cpp:(.text+0x1e8): undefined reference to arm_compute::Tensor::allocator()'
convolution.cpp:(.text+0x204): undefined reference to arm_compute::TensorInfo::TensorInfo(arm_compute::TensorShape const&, unsigned long, arm_compute::DataType, int)' convolution.cpp:(.text+0x210): undefined reference to arm_compute::ITensorAllocator::init(arm_compute::TensorInfo const&)'
convolution.cpp:(.text+0x218): undefined reference to arm_compute::Tensor::allocator()' convolution.cpp:(.text+0x234): undefined reference to arm_compute::TensorInfo::TensorInfo(arm_compute::TensorShape const&, unsigned long, arm_compute::DataType, int)'
convolution.cpp:(.text+0x240): undefined reference to arm_compute::ITensorAllocator::init(arm_compute::TensorInfo const&)' convolution.cpp:(.text+0x248): undefined reference to arm_compute::Tensor::allocator()'
convolution.cpp:(.text+0x264): undefined reference to arm_compute::TensorInfo::TensorInfo(arm_compute::TensorShape const&, unsigned long, arm_compute::DataType, int)' convolution.cpp:(.text+0x270): undefined reference to arm_compute::ITensorAllocator::init(arm_compute::TensorInfo const&)'
convolution.cpp:(.text+0x2cc): undefined reference to arm_compute::NEConvolutionLayer::configure(arm_compute::ITensor const*, arm_compute::ITensor const*, arm_compute::ITensor const*, arm_compute::ITensor*, arm_compute::PadStrideInfo const&, arm_compute::WeightsInfo const&)' convolution.cpp:(.text+0x2d4): undefined reference to arm_compute::Tensor::allocator()'
convolution.cpp:(.text+0x2e8): undefined reference to arm_compute::Tensor::allocator()' convolution.cpp:(.text+0x2fc): undefined reference to arm_compute::Tensor::allocator()'
convolution.cpp:(.text+0x310): undefined reference to arm_compute::Tensor::allocator()' convolution.cpp:(.text+0x324): undefined reference to arm_compute::NEConvolutionLayer::run()'
convolution.cpp:(.text+0x340): undefined reference to arm_compute::NEConvolutionLayer::run()' convolution.cpp:(.text+0x364): undefined reference to vtable for arm_compute::NEConvolutionLayer'
convolution.cpp:(.text+0x368): undefined reference to vtable for arm_compute::Tensor' convolution.cpp:(.text+0x36c): undefined reference to vtable for arm_compute::TensorAllocator'
convolution.cpp:(.text+0x374): undefined reference to vtable for arm_compute::NEConvolutionLayer' convolution.cpp:(.text+0x378): undefined reference to vtable for arm_compute::Tensor'
convolution.cpp:(.text+0x37c): undefined reference to vtable for arm_compute::TensorAllocator' convolution.cpp:(.text+0x3c4): undefined reference to vtable for arm_compute::Tensor'
convolution.cpp:(.text+0x3c8): undefined reference to vtable for arm_compute::TensorAllocator' convolution.cpp:(.text+0x400): undefined reference to vtable for arm_compute::Tensor'
convolution.cpp:(.text+0x404): undefined reference to vtable for arm_compute::TensorAllocator' convolution.cpp:(.text+0x43c): undefined reference to vtable for arm_compute::Tensor'
convolution.cpp:(.text+0x440): undefined reference to vtable for arm_compute::TensorAllocator' convolution.cpp:(.text+0x474): undefined reference to vtable for arm_compute::NEConvolutionLayerReshapeWeights'
convolution.cpp:(.text+0x47c): undefined reference to vtable for arm_compute::NEConvolutionLayerReshapeWeights' convolution.cpp:(.text+0x480): undefined reference to vtable for arm_compute::Tensor'
convolution.cpp:(.text+0x484): undefined reference to vtable for arm_compute::TensorAllocator' convolution.cpp:(.text+0x4c0): undefined reference to vtable for arm_compute::Tensor'
convolution.cpp:(.text+0x4c4): undefined reference to vtable for arm_compute::TensorAllocator' convolution.cpp:(.text+0x4fc): undefined reference to vtable for arm_compute::Tensor'
convolution.cpp:(.text+0x500): undefined reference to vtable for arm_compute::TensorAllocator' convolution.cpp:(.text+0x538): undefined reference to vtable for arm_compute::Tensor'
convolution.cpp:(.text+0x53c): undefined reference to vtable for arm_compute::TensorAllocator' convolution.cpp:(.text+0x574): undefined reference to vtable for arm_compute::Tensor'
convolution.cpp:(.text+0x578): undefined reference to vtable for arm_compute::TensorAllocator' convolution.cpp:(.text+0xb3c): undefined reference to vtable for arm_compute::Tensor'
convolution.cpp:(.text+0xb40): undefined reference to vtable for arm_compute::TensorAllocator' convolution.cpp:(.text+0xb48): undefined reference to vtable for arm_compute::Tensor'
convolution.cpp:(.text+0xb4c): undefined reference to vtable for arm_compute::TensorAllocator' convolution.cpp:(.text+0xb64): undefined reference to vtable for arm_compute::Tensor'
convolution.cpp:(.text+0xb68): undefined reference to vtable for arm_compute::TensorAllocator' convolution.cpp:(.text+0xb80): undefined reference to vtable for arm_compute::Tensor'
convolution.cpp:(.text+0xb84): undefined reference to vtable for arm_compute::TensorAllocator' convolution.cpp:(.text+0xb9c): undefined reference to vtable for arm_compute::Tensor'
convolution.cpp:(.text+0xba0): undefined reference to vtable for arm_compute::TensorAllocator' convolution.cpp:(.text+0xbc0): undefined reference to vtable for arm_compute::Tensor'
convolution.cpp:(.text+0xbc4): undefined reference to vtable for arm_compute::TensorAllocator' convolution.cpp:(.text+0xbd0): undefined reference to vtable for arm_compute::Tensor'
convolution.cpp:(.text+0xbd4): undefined reference to vtable for arm_compute::TensorAllocator' convolution.cpp:(.text+0xbf0): undefined reference to vtable for arm_compute::Tensor'
convolution.cpp:(.text+0xbf4): undefined reference to vtable for arm_compute::TensorAllocator' /tmp/ccGfptls.o: In function time_cnn(unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, int, unsigned int, unsigned int, unsigned int, unsigned int, int) [clone .constprop.22]':
convolution.cpp:(.text+0xc78): undefined reference to arm_compute::Tensor::Tensor()' convolution.cpp:(.text+0xc80): undefined reference to arm_compute::Tensor::Tensor()'
convolution.cpp:(.text+0xc88): undefined reference to arm_compute::Tensor::Tensor()' convolution.cpp:(.text+0xc90): undefined reference to arm_compute::Tensor::Tensor()'
convolution.cpp:(.text+0xcc8): undefined reference to arm_compute::NEConvolutionLayer::NEConvolutionLayer()' convolution.cpp:(.text+0xd1c): undefined reference to arm_compute::Tensor::allocator()'
convolution.cpp:(.text+0xd38): undefined reference to arm_compute::TensorInfo::TensorInfo(arm_compute::TensorShape const&, unsigned long, arm_compute::DataType, int)' convolution.cpp:(.text+0xd44): undefined reference to arm_compute::ITensorAllocator::init(arm_compute::TensorInfo const&)'
convolution.cpp:(.text+0xe00): undefined reference to arm_compute::Tensor::allocator()' convolution.cpp:(.text+0xe1c): undefined reference to arm_compute::TensorInfo::TensorInfo(arm_compute::TensorShape const&, unsigned long, arm_compute::DataType, int)'
convolution.cpp:(.text+0xe28): undefined reference to arm_compute::ITensorAllocator::init(arm_compute::TensorInfo const&)' convolution.cpp:(.text+0xe30): undefined reference to arm_compute::Tensor::allocator()'
convolution.cpp:(.text+0xe4c): undefined reference to arm_compute::TensorInfo::TensorInfo(arm_compute::TensorShape const&, unsigned long, arm_compute::DataType, int)' convolution.cpp:(.text+0xe58): undefined reference to arm_compute::ITensorAllocator::init(arm_compute::TensorInfo const&)'
convolution.cpp:(.text+0xe60): undefined reference to arm_compute::Tensor::allocator()' convolution.cpp:(.text+0xe7c): undefined reference to arm_compute::TensorInfo::TensorInfo(arm_compute::TensorShape const&, unsigned long, arm_compute::DataType, int)'
convolution.cpp:(.text+0xe88): undefined reference to arm_compute::ITensorAllocator::init(arm_compute::TensorInfo const&)' convolution.cpp:(.text+0xee4): undefined reference to arm_compute::NEConvolutionLayer::configure(arm_compute::ITensor const*, arm_compute::ITensor const*, arm_compute::ITensor const*, arm_compute::ITensor*, arm_compute::PadStrideInfo const&, arm_compute::WeightsInfo const&)'
convolution.cpp:(.text+0xeec): undefined reference to arm_compute::Tensor::allocator()' convolution.cpp:(.text+0xf00): undefined reference to arm_compute::Tensor::allocator()'
convolution.cpp:(.text+0xf14): undefined reference to arm_compute::Tensor::allocator()' convolution.cpp:(.text+0xf28): undefined reference to arm_compute::Tensor::allocator()'
convolution.cpp:(.text+0xf3c): undefined reference to arm_compute::NEConvolutionLayer::run()' convolution.cpp:(.text+0xf50): undefined reference to arm_compute::NEConvolutionLayer::run()'
convolution.cpp:(.text+0xf70): undefined reference to vtable for arm_compute::NEConvolutionLayer' convolution.cpp:(.text+0xf74): undefined reference to vtable for arm_compute::Tensor'
convolution.cpp:(.text+0xf78): undefined reference to vtable for arm_compute::TensorAllocator' convolution.cpp:(.text+0xf80): undefined reference to vtable for arm_compute::NEConvolutionLayer'
convolution.cpp:(.text+0xf84): undefined reference to vtable for arm_compute::Tensor' convolution.cpp:(.text+0xf88): undefined reference to vtable for arm_compute::TensorAllocator'
convolution.cpp:(.text+0xfd0): undefined reference to vtable for arm_compute::Tensor' convolution.cpp:(.text+0xfd4): undefined reference to vtable for arm_compute::TensorAllocator'
convolution.cpp:(.text+0x100c): undefined reference to vtable for arm_compute::Tensor' convolution.cpp:(.text+0x1010): undefined reference to vtable for arm_compute::TensorAllocator'
convolution.cpp:(.text+0x1048): undefined reference to vtable for arm_compute::Tensor' convolution.cpp:(.text+0x104c): undefined reference to vtable for arm_compute::TensorAllocator'
convolution.cpp:(.text+0x1080): undefined reference to vtable for arm_compute::NEConvolutionLayerReshapeWeights' convolution.cpp:(.text+0x1088): undefined reference to vtable for arm_compute::NEConvolutionLayerReshapeWeights'
convolution.cpp:(.text+0x108c): undefined reference to vtable for arm_compute::Tensor' convolution.cpp:(.text+0x1090): undefined reference to vtable for arm_compute::TensorAllocator'
convolution.cpp:(.text+0x10cc): undefined reference to vtable for arm_compute::Tensor' convolution.cpp:(.text+0x10d0): undefined reference to vtable for arm_compute::TensorAllocator'
convolution.cpp:(.text+0x1108): undefined reference to vtable for arm_compute::Tensor' convolution.cpp:(.text+0x110c): undefined reference to vtable for arm_compute::TensorAllocator'
convolution.cpp:(.text+0x1144): undefined reference to vtable for arm_compute::Tensor' convolution.cpp:(.text+0x1148): undefined reference to vtable for arm_compute::TensorAllocator'
convolution.cpp:(.text+0x1180): undefined reference to vtable for arm_compute::Tensor' convolution.cpp:(.text+0x1184): undefined reference to vtable for arm_compute::TensorAllocator'
convolution.cpp:(.text+0x174c): undefined reference to vtable for arm_compute::Tensor' convolution.cpp:(.text+0x1750): undefined reference to vtable for arm_compute::TensorAllocator'
convolution.cpp:(.text+0x1758): undefined reference to vtable for arm_compute::Tensor' convolution.cpp:(.text+0x175c): undefined reference to vtable for arm_compute::TensorAllocator'
convolution.cpp:(.text+0x1774): undefined reference to vtable for arm_compute::Tensor' convolution.cpp:(.text+0x1778): undefined reference to vtable for arm_compute::TensorAllocator'
convolution.cpp:(.text+0x1790): undefined reference to vtable for arm_compute::Tensor' convolution.cpp:(.text+0x1794): undefined reference to vtable for arm_compute::TensorAllocator'
convolution.cpp:(.text+0x17ac): undefined reference to vtable for arm_compute::Tensor' convolution.cpp:(.text+0x17b0): undefined reference to vtable for arm_compute::TensorAllocator'
convolution.cpp:(.text+0x17d0): undefined reference to vtable for arm_compute::Tensor' convolution.cpp:(.text+0x17d4): undefined reference to vtable for arm_compute::TensorAllocator'
convolution.cpp:(.text+0x17e0): undefined reference to vtable for arm_compute::Tensor' convolution.cpp:(.text+0x17e4): undefined reference to vtable for arm_compute::TensorAllocator'
convolution.cpp:(.text+0x1800): undefined reference to vtable for arm_compute::Tensor' convolution.cpp:(.text+0x1804): undefined reference to vtable for arm_compute::TensorAllocator'
/tmp/ccGfptls.o: In function arm_compute::NEConvolutionLayer::~NEConvolutionLayer()': convolution.cpp:(.text._ZN11arm_compute18NEConvolutionLayerD2Ev[_ZN11arm_compute18NEConvolutionLayerD5Ev]+0x14): undefined reference to vtable for arm_compute::NEConvolutionLayer'
convolution.cpp:(.text._ZN11arm_compute18NEConvolutionLayerD2Ev[_ZN11arm_compute18NEConvolutionLayerD5Ev]+0x1c): undefined reference to vtable for arm_compute::Tensor' convolution.cpp:(.text._ZN11arm_compute18NEConvolutionLayerD2Ev[_ZN11arm_compute18NEConvolutionLayerD5Ev]+0x20): undefined reference to vtable for arm_compute::TensorAllocator'
convolution.cpp:(.text._ZN11arm_compute18NEConvolutionLayerD2Ev[_ZN11arm_compute18NEConvolutionLayerD5Ev]+0x28): undefined reference to vtable for arm_compute::NEConvolutionLayer' convolution.cpp:(.text._ZN11arm_compute18NEConvolutionLayerD2Ev[_ZN11arm_compute18NEConvolutionLayerD5Ev]+0x2c): undefined reference to vtable for arm_compute::Tensor'
convolution.cpp:(.text._ZN11arm_compute18NEConvolutionLayerD2Ev[_ZN11arm_compute18NEConvolutionLayerD5Ev]+0x30): undefined reference to vtable for arm_compute::TensorAllocator' convolution.cpp:(.text._ZN11arm_compute18NEConvolutionLayerD2Ev[_ZN11arm_compute18NEConvolutionLayerD5Ev]+0x6c): undefined reference to vtable for arm_compute::Tensor'
convolution.cpp:(.text._ZN11arm_compute18NEConvolutionLayerD2Ev[_ZN11arm_compute18NEConvolutionLayerD5Ev]+0x70): undefined reference to vtable for arm_compute::TensorAllocator' convolution.cpp:(.text._ZN11arm_compute18NEConvolutionLayerD2Ev[_ZN11arm_compute18NEConvolutionLayerD5Ev]+0xa8): undefined reference to vtable for arm_compute::Tensor'
convolution.cpp:(.text._ZN11arm_compute18NEConvolutionLayerD2Ev[_ZN11arm_compute18NEConvolutionLayerD5Ev]+0xac): undefined reference to vtable for arm_compute::TensorAllocator' convolution.cpp:(.text._ZN11arm_compute18NEConvolutionLayerD2Ev[_ZN11arm_compute18NEConvolutionLayerD5Ev]+0xe4): undefined reference to vtable for arm_compute::Tensor'
convolution.cpp:(.text._ZN11arm_compute18NEConvolutionLayerD2Ev[_ZN11arm_compute18NEConvolutionLayerD5Ev]+0xe8): undefined reference to vtable for arm_compute::TensorAllocator' convolution.cpp:(.text._ZN11arm_compute18NEConvolutionLayerD2Ev[_ZN11arm_compute18NEConvolutionLayerD5Ev]+0x11c): undefined reference to vtable for arm_compute::NEConvolutionLayerReshapeWeights'
convolution.cpp:(.text._ZN11arm_compute18NEConvolutionLayerD2Ev[_ZN11arm_compute18NEConvolutionLayerD5Ev]+0x124): undefined reference to vtable for arm_compute::NEConvolutionLayerReshapeWeights' convolution.cpp:(.text._ZN11arm_compute18NEConvolutionLayerD2Ev[_ZN11arm_compute18NEConvolutionLayerD5Ev]+0x128): undefined reference to vtable for arm_compute::Tensor'
convolution.cpp:(.text._ZN11arm_compute18NEConvolutionLayerD2Ev[_ZN11arm_compute18NEConvolutionLayerD5Ev]+0x12c): undefined reference to vtable for arm_compute::TensorAllocator' /tmp/ccGfptls.o: In function arm_compute::NEConvolutionLayer::~NEConvolutionLayer()':
convolution.cpp:(.text._ZN11arm_compute18NEConvolutionLayerD0Ev[_ZN11arm_compute18NEConvolutionLayerD5Ev]+0x14): undefined reference to vtable for arm_compute::NEConvolutionLayer' convolution.cpp:(.text._ZN11arm_compute18NEConvolutionLayerD0Ev[_ZN11arm_compute18NEConvolutionLayerD5Ev]+0x1c): undefined reference to vtable for arm_compute::Tensor'
convolution.cpp:(.text._ZN11arm_compute18NEConvolutionLayerD0Ev[_ZN11arm_compute18NEConvolutionLayerD5Ev]+0x20): undefined reference to vtable for arm_compute::TensorAllocator' convolution.cpp:(.text._ZN11arm_compute18NEConvolutionLayerD0Ev[_ZN11arm_compute18NEConvolutionLayerD5Ev]+0x28): undefined reference to vtable for arm_compute::NEConvolutionLayer'
convolution.cpp:(.text._ZN11arm_compute18NEConvolutionLayerD0Ev[_ZN11arm_compute18NEConvolutionLayerD5Ev]+0x2c): undefined reference to vtable for arm_compute::Tensor' convolution.cpp:(.text._ZN11arm_compute18NEConvolutionLayerD0Ev[_ZN11arm_compute18NEConvolutionLayerD5Ev]+0x30): undefined reference to vtable for arm_compute::TensorAllocator'
convolution.cpp:(.text._ZN11arm_compute18NEConvolutionLayerD0Ev[_ZN11arm_compute18NEConvolutionLayerD5Ev]+0x6c): undefined reference to vtable for arm_compute::Tensor' convolution.cpp:(.text._ZN11arm_compute18NEConvolutionLayerD0Ev[_ZN11arm_compute18NEConvolutionLayerD5Ev]+0x70): undefined reference to vtable for arm_compute::TensorAllocator'
convolution.cpp:(.text._ZN11arm_compute18NEConvolutionLayerD0Ev[_ZN11arm_compute18NEConvolutionLayerD5Ev]+0xa8): undefined reference to vtable for arm_compute::Tensor' convolution.cpp:(.text._ZN11arm_compute18NEConvolutionLayerD0Ev[_ZN11arm_compute18NEConvolutionLayerD5Ev]+0xac): undefined reference to vtable for arm_compute::TensorAllocator'
convolution.cpp:(.text._ZN11arm_compute18NEConvolutionLayerD0Ev[_ZN11arm_compute18NEConvolutionLayerD5Ev]+0xe4): undefined reference to vtable for arm_compute::Tensor' convolution.cpp:(.text._ZN11arm_compute18NEConvolutionLayerD0Ev[_ZN11arm_compute18NEConvolutionLayerD5Ev]+0xe8): undefined reference to vtable for arm_compute::TensorAllocator'
convolution.cpp:(.text._ZN11arm_compute18NEConvolutionLayerD0Ev[_ZN11arm_compute18NEConvolutionLayerD5Ev]+0x11c): undefined reference to vtable for arm_compute::NEConvolutionLayerReshapeWeights' convolution.cpp:(.text._ZN11arm_compute18NEConvolutionLayerD0Ev[_ZN11arm_compute18NEConvolutionLayerD5Ev]+0x124): undefined reference to vtable for arm_compute::NEConvolutionLayerReshapeWeights'
convolution.cpp:(.text._ZN11arm_compute18NEConvolutionLayerD0Ev[_ZN11arm_compute18NEConvolutionLayerD5Ev]+0x128): undefined reference to vtable for arm_compute::Tensor' convolution.cpp:(.text._ZN11arm_compute18NEConvolutionLayerD0Ev[_ZN11arm_compute18NEConvolutionLayerD5Ev]+0x12c): undefined reference to vtable for arm_compute::TensorAllocator'
collect2: error: ld returned 1 exit status
Makefile:21: recipe for target 'conv' failed
make: *** [conv] Error 1

Why is MKL much slower for backward than forward?

According to the benchmark results, MKL's convolution (not GEMM) has a much slower backward pass than forward pass.

For example, for W=341, H=79, C=32, N=4, K=32, R=5, S=10 on the KNL 7250 platform, forward takes 0.91 ms, backward with respect to inputs takes 68.79 ms, and backward with respect to weights takes 74.98 ms. So backward is roughly 75 to 82 times slower than forward.

As a comparison, on a Titan X, forward takes 0.74 ms, backward with respect to inputs takes 3.09 ms, and backward with respect to weights takes 0.76 ms. For forward, the KNL 7250 is only a little slower than the Titan X, but for backward it is far slower. The pattern is similar for other W, H, C configurations.

Can anyone explain why? Is it because MKL's backward pass has not been optimized much yet?

How to build sbench.c?

Hi,
I want to build the sbench.c file. Does any tool or package need to be installed before doing this?

GEMM dimensions, TN case

There seems to be a discrepancy between the dimensions printed and those passed to cublasSgemm for the TN case. I'm looking at the file gemm_bench.cu.

Effectively, at line 177
C is k_printed by n_printed
B is m_printed by n_printed
A is m_printed by k_printed

so that
m_to_cublas, k_to_cublas = k_printed, m_printed.

This possible discrepancy applies to 15 of the 78 cases.
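
For reference, a minimal sketch of how the stored matrix shapes follow from the transpose flags in the BLAS convention that cublasSgemm uses: op(A) is m x k, op(B) is k x n and C is m x n, so a transposed A is stored as k x m. The struct and function names are hypothetical and are not taken from gemm_bench.cu.

```cpp
#include <cassert>

// Hypothetical helper types for illustration; not part of DeepBench.
struct Shape {
    int rows;
    int cols;
};

struct GemmShapes {
    Shape a;  // A as stored in memory
    Shape b;  // B as stored in memory
    Shape c;  // C is always m x n
};

// BLAS/cuBLAS convention: C = op(A) * op(B), with op(A) m x k and op(B) k x n.
// A transposed operand is stored with its rows and columns swapped.
inline GemmShapes stored_shapes(int m, int n, int k, bool a_t, bool b_t) {
    assert(m > 0 && n > 0 && k > 0);
    GemmShapes s;
    s.a = a_t ? Shape{k, m} : Shape{m, k};
    s.b = b_t ? Shape{n, k} : Shape{k, n};
    s.c = Shape{m, n};
    return s;
}
```

For a TN call with m = 6144, n = 48000, k = 2048, this gives A stored as 2048 x 6144, B stored as 2048 x 48000 and C as 6144 x 48000, which can be compared against the dimensions the benchmark prints.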

NVIDIA has released cuDNN 7.0. Is there any plan to upgrade to this new version?

Matrix sizes for LSTMs and GRU

I am new to LSTMs and GRUs. As far as I understand, SGEMM (though on small matrices) can still take up a significant share of the end-to-end execution time of RNNs.

I was hoping to find the matrix sizes of the multiplications in the XLS file, but it only mentions hidden units and timesteps. Is there a way to figure out the SGEMM call parameters from the Excel sheet, or maybe from the code?
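
A rough sketch of how GEMM sizes can be derived from the hidden-unit count, under the usual textbook formulation rather than anything confirmed by the spreadsheet: at each timestep the recurrent product multiplies a (gates * hidden) x hidden weight matrix by a hidden x batch activation matrix, with 1 gate for a vanilla RNN, 3 for a GRU and 4 for an LSTM. All names below are hypothetical, and a real cuDNN implementation may fuse or split these products differently.

```cpp
#include <cassert>

// Hypothetical types for illustration; not part of DeepBench.
enum class CellType { Vanilla, Gru, Lstm };

struct GemmDims {
    int m;  // rows of the weight matrix: gates * hidden
    int n;  // columns of the activation matrix: batch size
    int k;  // shared dimension: hidden size
};

// Number of weight blocks stacked along the rows of the recurrent matrix.
inline int gate_count(CellType type) {
    switch (type) {
        case CellType::Vanilla: return 1;
        case CellType::Gru:     return 3;
        case CellType::Lstm:    return 4;
    }
    return 0;  // not reached for valid enum values
}

// Dimensions of the recurrent (hidden-to-hidden) GEMM for one timestep.
inline GemmDims recurrent_gemm(CellType type, int hidden, int batch) {
    assert(hidden > 0 && batch > 0);
    GemmDims d;
    d.m = gate_count(type) * hidden;
    d.n = batch;
    d.k = hidden;
    return d;
}

// Floating-point operations (2 per multiply-add) of the recurrent GEMMs
// over a whole sequence; the input-to-hidden products are not counted.
inline long long recurrent_flops(CellType type, int hidden, int batch, int timesteps) {
    GemmDims d = recurrent_gemm(type, hidden, batch);
    return 2LL * d.m * d.n * d.k * timesteps;
}
```

Under these assumptions, an LSTM with 512 hidden units and a batch of 16 would issue a 2048 x 16 x 512 (m x n x k) product at each timestep.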

NCCL 2?

I noticed that DeepBench is still using NCCL 1 for its benchmarking. Is anyone interested in NCCL 2 benchmarks or already working on them?

One of the interesting things about NCCL 2 is the ability to communicate across nodes (not just within a node). Unfortunately, I only have immediate access to a cluster of K80 machines, so my setup is not ideal for evaluating it on state-of-the-art processors. I'm curious whether anyone else is interested in working on NCCL 2.

Cannot find the script for running recurrent layer benchmarks on Intel hardware

For the recurrent layer benchmarks, I see that you have updated the results for Intel hardware, but I cannot find the scripts to run this workload on Intel hardware. Could you please add them?
Thanks.

DeepBench on a DGX-1

Hi everyone,
We will soon have an NVIDIA DGX-1 for benchmarking and I was wondering which modifications could be applied to DeepBench (parameters and so on) so that it can be used effectively on this supercomputer. We will make our results available for the community. Thanks!

I sourced the corresponding Intel paths (icc, mkl, mpi), but there are no results.

I did the following:
source .../compilervars.sh intel64
etc.

confusion regarding parameters used in Excel sheet

Is there any explanation of how the benchmarking is done, so that it is clear exactly which parameters are measured? I would like to understand how DeepBench works, and in particular what the P and Q parameters in the convolution results mean.

Wrong parameter when setting the cuDNN forward convolution algorithm

The parameter should be fwd_perf at line 137 of code/nvidia/conv_bench.cu.

how to solve this error?

I am trying to benchmark my GPU, but I ran into a problem:

terminate called after throwing an instance of 'std::runtime_error'
what(): CUDNN failure: CUDNN_STATUS_NOT_INITIALIZED in cudnn_helper.h at line: 32

This is the full output:
root@fe903e806138:/DeepBench/code/bin# ls
conv_bench gemm_bench nccl_mpi_all_reduce nccl_single_all_reduce rnn_bench
root@fe903e806138:/DeepBench/code/bin# ./conv_bench
Times

w h c n k r s pad_w pad_h stride_w stride_h fwd_time (usec) bwd_inputs_time (usec) bwd_params_time (usec) total_time (usec) fwd_algo

terminate called after throwing an instance of 'std::runtime_error'
what(): CUDNN failure: CUDNN_STATUS_NOT_INITIALIZED in cudnn_helper.h at line: 32

Aborted (core dumped)
root@fe903e806138:/DeepBench/code/bin# ^C
root@fe903e806138:/DeepBench/code/bin# ./rnn_bench
terminate called after throwing an instance of 'std::runtime_error'
what(): CUDNN failure: CUDNN_STATUS_NOT_INITIALIZED in rnn_bench.cu at line: 254

Aborted (core dumped)
root@fe903e806138:/DeepBench/code/bin#

If you have any solution, please tell me.

Results spreadsheets have incorrect configs

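Example below, with columns in the order w, h, c, n, k, r, s, pad_w, pad_h, stride_w, stride_h. The spreadsheet shows 3x3 padding and 2x2 stride, but the config in the problems file has 0x0 padding and 1x1 stride.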

Spreadsheet -

7	7	2048	8	512	1	1	3	3	2	2

File -
std::make_tuple(7, 7, 2048, 8, 512, 1, 1, 0, 0, 1, 1),

GEMM results and gemm_configs are out of sync

One example:

This spreadsheet has data for the config below:
6144 | 48000 | 2048 | T | N

But the above config is not found in the training_set here:
https://github.com/baidu-research/DeepBench/blob/master/code/kernels/gemm_problems.h

Improved GEMM performance using ISAAC

Hello!

I have just done a preliminary tuning of ISAAC (https://github.com/ptillet/isaac) for the Pascal Titan X, and it seems to outperform cuBLAS on many (but not all) shapes. What is the policy concerning which library should be used for the benchmarks ? (especially when there is not a clear winner!)

The benchmark is below. Non-overclocked Pascal Titan X
https://gist.github.com/ptillet/92caaeb0036cb2022e021da87e38b096

NCCL failure: invalid device pointer in nccl_mpi_all_reduce.cu at line: 76

Hi,
~~when trying to run nccl_mpi_all_reduce on a system with two P100s I get:~~
~~Any idea on what could be the problem?~~
Sorry, I read the instructions wrong.

Thanks

Compilation flag for KNL

Intel's mkl_conv Makefile is missing the -xMIC-AVX512 flag, which is required for evaluating this benchmark on KNL. Is this intentional?

error while compiling

Hi,
Sorry, this might be basic, but I am a beginner.
I am getting a "no such file or directory" error on the include of nccl.h, and I am a little confused about NCCL_PATH.
Could someone explain this issue?
Thank you so much.

confusion regarding parameters used in Excel sheet

Hi,
While looking at the results, I came across forward (ms) with respect to inputs and parameters. What does that actually mean? Does it mean testing the forward pass while varying the image size?
Could I get some insight into what exactly the Excel file means? Thanks.

Typo in convolution configs?

Hi,
Looks like there is a typo in spreadsheets:

Output of the first DeepSpeech convolution does not fit the second layer:
out_W = (W + 2 * pad_w - filter_w + 1) / stride_w = (700 + 0 - 5 + 1) / 2 = 348
out_H = (H + 2 * pad_h - filter_h + 1) / stride_h = (161 + 0 - 20 + 1) / 2 = 71

I guess R should be filter height and S should be filter width. In that case DeepSpeech layers fit perfectly:
out_W = (700 + 0 - 20 + 1) / 2 = 341
out_H = (161 + 0 - 5 + 1) / 2 = 79

Please also check the KWS case for the same issue.
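
The arithmetic above can be checked with a small helper. This sketch uses the common floor-based form, out = (in + 2 * pad - filter) / stride + 1 with integer division, which is an assumption here and is written slightly differently from the expressions above, although it yields the same four values. The function name is hypothetical.

```cpp
#include <cassert>

// Hypothetical helper: output size along one dimension of a convolution,
// using the floor-based formula (in + 2 * pad - filter) / stride + 1.
inline int conv_out_dim(int in, int pad, int filter, int stride) {
    assert(stride > 0 && in + 2 * pad >= filter);
    return (in + 2 * pad - filter) / stride + 1;
}
```

With W = 700, H = 161, no padding and stride 2, a filter of width 5 and height 20 gives 348 x 71, while a filter of width 20 and height 5 gives 341 x 79.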

Intel benchmark should use header files in kernels folder

I see the following header files present in the Intel benchmarks representing the kernels:

https://github.com/baidu-research/DeepBench/blob/master/code/intel/convolution/mkl_conv/input.h
https://github.com/baidu-research/DeepBench/blob/master/code/intel/convolution/mkl_conv/input_topologies.h
https://github.com/baidu-research/DeepBench/blob/master/code/intel/gemm/input.h
https://github.com/baidu-research/DeepBench/blob/master/code/intel/sgemm/input.h

These are still being used in the benchmark code. Could you please switch to using the header files in the kernels folder? It would make it a lot easier to maintain the kernels going forward. @dmudiger , could you please help make these changes?

Run coredumped

My system has 4 P100 GPUs and CUDA 8.0 installed. NCCL runs well. I compiled the benchmark with 'make CUDA_PATH=/usr/local/cuda CUDNN_PATH=/usr/local/cuda MPI_PATH=/home/userid/ompi NCCL_PATH=/home/userid/weike/nccl/ ARCH=sm_61'.

Can anyone help me with the core dump?

userid@ubuntu-WK-4xP100:~/weike/DeepBench/code$ source ~/.bashrc
userid@ubuntu-WK-4xP100:~/weike/DeepBench/code$ bin/gemm_bench
Times

m       n      k      a_t     b_t      time (usec)

main: #1.
main: #2.
main: #3.
terminate called after throwing an instance of 'thrust::system::system_error'
what(): function_attributes(): after cudaFuncGetAttributes: invalid device function
Aborted (core dumped)
userid@ubuntu-WK-4xP100:~/weike/DeepBench/code$
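
One possible cause, not confirmed in this thread: "invalid device function" is typically reported when the binary contains no code for the GPU it runs on. The Tesla P100 has compute capability 6.0, while the make line above passes ARCH=sm_61. The sketch below only shows how a compute capability maps to an sm_XY string so the two can be compared; the helper names are hypothetical, and whether rebuilding with a matching ARCH resolves this particular crash is an assumption.

```python
def arch_flag(major, minor):
    """Build the sm_XY string for a compute capability, e.g. (6, 0) -> 'sm_60'."""
    return f"sm_{major}{minor}"


def arch_matches(build_arch, major, minor):
    """Return True if the ARCH used at build time matches the device capability."""
    return build_arch == arch_flag(major, minor)


# Compute capability of the Tesla P100 is 6.0.
print(arch_flag(6, 0))              # sm_60
print(arch_matches("sm_61", 6, 0))  # False
```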

RNN Bench results poor on P100 vs M40 - doubt

Hello,

I ran the RNN benchmark on both an M40 and a P100 GPU. The results show that the vanilla RNN TeraFLOPS figure for the P100 is lower than for the M40.

The LSTM values are better.

Do you have any insights on this? Have you faced this issue? I re-ran the benchmark but got similar results. The conv and GEMM benchmark results favor the P100 by a huge margin.

Thanks,
Arun

AMD Vega support & results via MIOpen?

Where are the results of "run_DeepBench_ia.sh" stored?

Hello

I ran the above script, run_DeepBench_ia.sh, on an Intel KNL machine. It finished after 4.5 hours, but I am unable to find the results or a summary in any of the folders. Can someone advise how I can benchmark an Intel KNL machine against a lower-end Intel machine?

Thanks
Jay Mahalingam
Calligo Technologies
Bangalore/ India
