
cubert's Introduction

Fast implementation of BERT inference directly on NVIDIA (CUDA, CUBLAS) and Intel MKL


Highly customized and optimized BERT inference directly on NVIDIA (CUDA, CUBLAS) or Intel MKL, without tensorflow and its framework overhead.

ONLY BERT (Transformer) is supported.

Benchmark

Environment

  • Tesla P4
  • 28 * Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
  • Debian GNU/Linux 8 (jessie)
  • gcc (Debian 4.9.2-10+deb8u1) 4.9.2
  • CUDA: release 9.0, V9.0.176
  • MKL: 2019.0.1.20181227
  • tensorflow: 1.12.0
  • BERT: seq_length = 32

GPU (cuBERT)

batch size    128 (ms)    32 (ms)
tensorflow    255.2       70.0
cuBERT        184.6       54.5

CPU (mklBERT)

batch size    128 (ms)    1 (ms)
tensorflow    1504.0      69.9
mklBERT       984.9       24.0

Note: MKL should be run with OMP_NUM_THREADS=? set to control its number of threads. Other environment variables and their possible values include:

  • KMP_BLOCKTIME=0
  • KMP_AFFINITY=granularity=fine,verbose,compact,1,0
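
For example, the benchmark can be launched with these variables set explicitly (the thread count here is only illustrative; tune it to your machine):

OMP_NUM_THREADS=16 KMP_BLOCKTIME=0 KMP_AFFINITY=granularity=fine,verbose,compact,1,0 ./cuBERT_benchmark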

Mixed Precision

cuBERT can be accelerated by Tensor Cores and mixed precision on NVIDIA Volta and Turing GPUs. We support mixed precision as variables stored in fp16 with computation performed in fp32. The typical accuracy error is less than 1% compared with single-precision inference, while the speed achieves more than 2x acceleration.

API

API .h header

Pooler

We support the following two pooling methods.

  • The standard BERT pooler, which is defined as:
with tf.variable_scope("pooler"):
  # We "pool" the model by simply taking the hidden state corresponding
  # to the first token. We assume that this has been pre-trained
  first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1)
  self.pooled_output = tf.layers.dense(
    first_token_tensor,
    config.hidden_size,
    activation=tf.tanh,
    kernel_initializer=create_initializer(config.initializer_range))
  • Simple average pooler:
self.pooled_output = tf.reduce_mean(self.sequence_output, axis=1)

Output

The following outputs are supported:

cuBERT_OutputType          python code
cuBERT_LOGITS              model.get_pooled_output() * output_weights + output_bias
cuBERT_PROBS               probs = tf.nn.softmax(logits, axis=-1)
cuBERT_POOLED_OUTPUT       model.get_pooled_output()
cuBERT_SEQUENCE_OUTPUT     model.get_sequence_output()
cuBERT_EMBEDDING_OUTPUT    model.get_embedding_output()
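
For classification models, the LOGITS and PROBS outputs follow the usual classifier head from google-research/bert's run_classifier.py (see also the softmax issue below). A rough TF-equivalent sketch, assuming output_weights and output_bias are the classifier variables frozen into the graph (whether the weight matrix is transposed depends on how the graph was exported):

logits = tf.matmul(model.get_pooled_output(), output_weights, transpose_b=True)
logits = tf.nn.bias_add(logits, output_bias)   # cuBERT_LOGITS
probs = tf.nn.softmax(logits, axis=-1)         # cuBERT_PROBS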

Build from Source

mkdir build && cd build
# if build with CUDA
cmake -DCMAKE_BUILD_TYPE=Release -DcuBERT_ENABLE_GPU=ON -DCUDA_ARCH_NAME=Common ..
# or build with MKL
cmake -DCMAKE_BUILD_TYPE=Release -DcuBERT_ENABLE_MKL_SUPPORT=ON ..
make -j4

# install to /usr/local
# it will also install MKL if -DcuBERT_ENABLE_MKL_SUPPORT=ON
sudo make install

If you would like to run tfBERT_benchmark for performance comparison, please first install tensorflow C API from https://www.tensorflow.org/install/lang_c.

Run Unit Test

Download the BERT test model bert_frozen_seq32.pb and vocab.txt from Dropbox, and put them under the build directory before running make test or ./cuBERT_test.

Python

We provide a simple Python wrapper built with Cython; it can be built and installed after the C++ build as follows:

cd python
python setup.py bdist_wheel

# install
pip install dist/cuBERT-xxx.whl

# test
python cuBERT_test.py

Please check the Python API usage and examples at cuBERT_test.py for more details.

Java

The Java wrapper is implemented through JNA. After installing Maven and completing the C++ build, it can be built as follows:

cd java
mvn clean package # -DskipTests

When using the Java JAR, you need to set jna.library.path to the location of libcuBERT.so if it is not installed to a system path. jna.encoding should also be set to UTF8, via -Djna.encoding=UTF8 in the JVM start-up script.
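
For example, a start-up command might look like the following (the library path, JAR name, and main class are placeholders):

java -Djna.library.path=/usr/local/lib -Djna.encoding=UTF8 -cp your-app.jar your.Main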

Please check the Java API usage and example at ModelTest.java for more details.

Install

A pre-built Python binary package (currently only with MKL on Linux) can be installed as follows:

  • Download and install MKL to system path.

  • Download the wheel package and pip install cuBERT-xxx-linux_x86_64.whl

  • Run python -c 'import libcubert' to verify your installation.

Dependency

Protobuf

cuBERT is built with protobuf-c to avoid version and code conflicts with tensorflow's protobuf.

CUDA

Libraries compiled against different CUDA versions are not compatible.

MKL

MKL is dynamically linked. Both cuBERT and MKL are installed by sudo make install.

Threading

We assume the typical use case of cuBERT is online serving, where concurrent requests of different batch_size should be served as fast as possible. Thus, throughput and latency have to be balanced, especially in a pure CPU environment.

Because the vanilla Bert class is not thread-safe (it keeps internal buffers for computation), a wrapper class BertM holds locks on several Bert instances for thread safety. BertM chooses an underlying Bert instance in a round-robin manner, so consecutive requests hitting the same Bert instance may be queued on its lock.

GPU

One Bert instance is placed on each GPU card. The maximum number of concurrent requests equals the number of usable GPU cards on the machine, which can be controlled by CUDA_VISIBLE_DEVICES if specified.
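
For example, to restrict cuBERT to two cards (and hence two concurrent Bert instances), something like the following can be used (your_server is a placeholder for whatever process loads cuBERT):

CUDA_VISIBLE_DEVICES=0,1 ./your_server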

CPU

A pure CPU environment is more complicated than GPU. There are two levels of parallelism:

  1. Request level. Concurrent requests will compete for CPU resources if the online server itself is multi-threaded. If the server is single-threaded (for example, some server implementations in Python), things are much easier.

  2. Operation level. The matrix operations are parallelized by OpenMP and MKL. The maximum parallelism is controlled by OMP_NUM_THREADS, MKL_NUM_THREADS, and many other environment variables. We refer our users to first read Using Threaded Intel® MKL in Multi-Thread Application and Recommended settings for calling Intel MKL routines from multi-threaded applications.

Thus, we introduce CUBERT_NUM_CPU_MODELS for better control of request-level parallelism. This variable specifies the number of Bert instances created in CPU memory, acting much like CUDA_VISIBLE_DEVICES does for GPU.

  • If you have a limited number of CPU cores (old or desktop CPUs, or inside Docker), CUBERT_NUM_CPU_MODELS is not necessary. With 4 CPU cores, for example, a request-level parallelism of 1 and an operation-level parallelism of 4 should work quite well.

  • But if you have many CPU cores, say 40, it might be better to try a request-level parallelism of 5 and an operation-level parallelism of 8.

In summary, OMP_NUM_THREADS or MKL_NUM_THREADS defines how many threads one model can use, and CUBERT_NUM_CPU_MODELS defines how many models there are in total.
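
For the 40-core example above, a plausible configuration would be the following (the numbers are illustrative, not a recommendation for every machine):

CUBERT_NUM_CPU_MODELS=5 OMP_NUM_THREADS=8 MKL_NUM_THREADS=8 ./your_server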

Again, per-request latency and overall throughput should be balanced, and the sweet spot depends on the model seq_length, batch_size, your CPU cores, your server QPS, and many other things. You should run plenty of benchmarks to find the best trade-off. Good luck!

Authors

  • fanliwen
  • wangruixin
  • fangkuan
  • sunxian

cubert's People

Contributors

dependabot[bot], fangkuann, levyfan, wrxdm, xuqiang


cubert's Issues

mklBERT performs poorly with large batch sizes, inconsistent with your benchmark results

My test environment is as follows:
28 * Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
Ubuntu 16.04.5 LTS
gcc (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
MKL: 2019.0.1.20181227
The only differences from your environment are the OS and the gcc version. However, with batch=128 and no environment variables set, my mklBERT decode time is 5000ms+; after setting OMP_NUM_THREADS=28 it still takes 2000ms+. With batch=1 the decode time matches yours. What could be the reason?

I also tried setting OMP_NUM_THREADS, KMP_BLOCKTIME, and KMP_AFFINITY separately, but never reached the 984.9ms reported in the README.

I tested with cuBERT_benchmark. In my runs its decode time fluctuates a lot; how did you obtain timings precise to a decimal place? Any advice would be appreciated, thanks.

Does cubert support multi-class classification?

I use BERT for a 4-class classification task and want the probability of each instance over the 4 classes. I tried cuBERT_LOGITS, but the results are inconsistent with the TF predictions. Does cubert support this (per-class probabilities for each instance)?

Question about code changes

Hi, if I use your code for single-sentence sentiment prediction, can it be tested directly? How do I output 0/1 labels or probabilities? Can I use your pb model directly, or do I need a different one? If a different pb model is needed, do I have to change the input code?

What's the setting of the benchmark?

On my CPU, I use CUBERT_NUM_CPU_MODELS=1, OMP_NUM_THREADS=8, MKL_NUM_THREADS=8, but got a run time of 37ms.

P.S. The MKL-DNN library supports CPU int8; maybe you can improve the performance a little more.

How to run it with another trained model

Hello, I would like to run predictions with my own model (trained with the original BERT), converting it from .ckpt to .pb. But when I call model.tokenize_compute I get the error "The model does not have additional_output_layer, the output logits is wrong." It seems my model conversion is wrong. Can you tell me how to convert my trained model to pb?

Why is the embedding_table not shared by different Bert instances?

template<typename T, typename V>
Embedding<T, V>::Embedding(size_t vocab_size, size_t embedding_size, V *embedding_table) {
    this->vocab_size = vocab_size;
    this->embedding_size = embedding_size;
    this->embedding_table = static_cast<V *>(cuBERT::malloc(vocab_size * embedding_size * sizeof(V)));
    cuBERT::memcpy(this->embedding_table, embedding_table, vocab_size * embedding_size * sizeof(V), 1);
}

BertM creates several Bert instances, and each Bert instance mallocs a new embedding table. Why not share a common embedding table across the different Bert instances? This would save a lot of memory.

High compute time when testing mklBert from Java

My test environment:
32 * Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz
Ubuntu 16.04.5 LTS
gcc 5.4.0
MKL: 2019.0.3.20190220
Environment variables (tuned earlier for tensorflow MKL performance, somewhat better than the defaults):
KMP_BLOCKTIME=1
KMP_AFFINITY=granularity=fine,verbose,compact,1,0
KMP_SETTINGS=1
OMP_NUM_THREADS=4
I tested with the code in TestCuBERT from the java directory, timing each call as follows:

for(int i=1;i<=10;i++) {
    long start = System.currentTimeMillis();
    Float[] output = new Float[2];
    model.compute(1, input_ids, input_mask, segment_ids, output, OutputType.LOGITS);
    System.out.println("Compute Time" + i + ":" + (System.currentTimeMillis() - start));
}

The results are:

Compute Time1:472
Compute Time2:190
Compute Time3:178
Compute Time4:148
Compute Time5:240
Compute Time6:153
Compute Time7:215
Compute Time8:156
Compute Time9:171
Compute Time10:178

What could be causing this? Thanks.

Problems when building from source

I executed the following commands:
  • git clone https://github.com/zhihu/cuBERT.git
  • cd cuBERT
  • mkdir build && cd build
  • cmake -DCMAKE_BUILD_TYPE=Release -DcuBERT_ENABLE_GPU=ON -DCUDA_ARCH_NAME=Common ..
  • make -j4
  • make install
At the make -j4 step it reported a problem that I don't quite understand:
-- The C compiler identification is GNU 5.4.0
-- The CXX compiler identification is GNU 5.4.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Found OpenMP_C: -fopenmp (found version "4.0")
-- Found OpenMP_CXX: -fopenmp (found version "4.0")
-- Found OpenMP: TRUE (found version "4.0")
-- Found CUDA: /usr/local/cuda (found suitable version "9.0", minimum required is "9.0")
-- Using CUDA arch flags: sm_30 sm_35 sm_50 sm_52 sm_60 sm_61 sm_70 compute_70
-- Could NOT find tensorflow (missing: tensorflow_LIBRARIES tensorflow_INCLUDE_DIR)
-- Configuring done
-- Generating done
-- Build files have been written to: /cuBERT/build
Scanning dependencies of target googletest
Scanning dependencies of target cub
Scanning dependencies of target protobuf
[ 1%] Creating directories for 'protobuf'
[ 2%] Creating directories for 'googletest'
[ 3%] Creating directories for 'cub'
[ 5%] Performing download step (git clone) for 'protobuf'
[ 5%] Performing download step (git clone) for 'googletest'
[ 6%] Performing download step (download, verify and extract) for 'cub'
-- Downloading...
dst='/cuBERT/build/cub/src/1.8.0.zip'
timeout='none'
-- Using src='https://github.com/NVlabs/cub/archive/1.8.0.zip'
-- Retrying...
-- Using src='https://github.com/NVlabs/cub/archive/1.8.0.zip'
-- Retry after 5 seconds (attempt #2) ...
[ 7%] No patch step for 'googletest'
[ 8%] Performing update step for 'googletest'
[ 9%] Performing configure step for 'googletest'
loading initial cache file /cuBERT/build/googletest/tmp/googletest-cache-Release.cmake
-- The C compiler identification is GNU 5.4.0
-- The CXX compiler identification is GNU 5.4.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found PythonInterp: /usr/local/bin/python (found version "3.6.9")
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Configuring done
-- Generating done
-- Build files have been written to: /cuBERT/build/googletest/src/googletest
[ 10%] Performing build step for 'googletest'
Scanning dependencies of target gtest
[ 25%] Building CXX object googletest/CMakeFiles/gtest.dir/src/gtest-all.cc.o
-- Using src='https://github.com/NVlabs/cub/archive/1.8.0.zip'
-- Retry after 5 seconds (attempt #3) ...
Submodule 'third_party/benchmark' (https://github.com/google/benchmark.git) registered for path 'third_party/benchmark'
Submodule 'third_party/googletest' (https://github.com/google/googletest.git) registered for path 'third_party/googletest'
-- Using src='https://github.com/NVlabs/cub/archive/1.8.0.zip'
-- Retry after 15 seconds (attempt #4) ...
Submodule path 'third_party/benchmark': checked out '5b7683f49e1e9223cf9927b24f6fd3d6bd82e3f8'
Submodule path 'third_party/googletest': checked out 'c3bb0ee2a63279a803aaad956b9b26d74bf9e6e2'
[ 11%] No patch step for 'protobuf'
[ 12%] Performing update step for 'protobuf'
[ 13%] Performing configure step for 'protobuf'
[ 50%] Linking CXX static library libgtest.a
[ 50%] Built target gtest
Scanning dependencies of target gtest_main
[ 75%] Building CXX object googletest/CMakeFiles/gtest_main.dir/src/gtest_main.cc.o
[100%] Linking CXX static library libgtest_main.a
[100%] Built target gtest_main
[ 14%] No install step for 'googletest'
[ 15%] Completed 'googletest'
[ 15%] Built target googletest
CMakeFiles/protobuf.dir/build.make:106: recipe for target 'protobuf/src/protobuf-stamp/protobuf-configure' failed
CMakeFiles/Makefile2:262: recipe for target 'CMakeFiles/protobuf.dir/all' failed
-- Using src='https://github.com/NVlabs/cub/archive/1.8.0.zip'
-- Retry after 60 seconds (attempt #5) ...
-- Using src='https://github.com/NVlabs/cub/archive/1.8.0.zip'
CMakeFiles/cub.dir/build.make:90: recipe for target 'cub/src/cub-stamp/cub-download' failed
CMakeFiles/Makefile2:299: recipe for target 'CMakeFiles/cub.dir/all' failed
Makefile:140: recipe for target 'all' failed

Compilation fails

[ 75%] Building CXX object googletest/CMakeFiles/gtest_main.dir/src/gtest_main.cc.o
[100%] Linking CXX static library libgtest_main.a
[100%] Built target gtest_main
[ 18%] No install step for 'googletest'
[ 19%] Completed 'googletest'
[ 19%] Built target googletest
Note: checking out 'v3.6.0'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

git checkout -b new_branch_name

HEAD is now at ab8edf1... Merge pull request #4713 from acozzette/changelog
Submodule 'third_party/benchmark' (https://github.com/google/benchmark.git) registered for path 'third_party/benchmark'
Submodule 'third_party/googletest' (https://github.com/google/googletest.git) registered for path 'third_party/googletest'
Cloning into 'third_party/benchmark'...
Submodule path 'third_party/benchmark': checked out '5b7683f49e1e9223cf9927b24f6fd3d6bd82e3f8'
Cloning into 'third_party/googletest'...
Submodule path 'third_party/googletest': checked out 'c3bb0ee2a63279a803aaad956b9b26d74bf9e6e2'
[ 20%] No patch step for 'protobuf'
[ 21%] Performing update step for 'protobuf'
[ 22%] Performing configure step for 'protobuf'

  • autoreconf -f -i -Wall,no-obsolete
    /data/allenchen/cuBERT/build/protobuf/src/protobuf/autogen.sh: line 32: autoreconf: command not found
    make[2]: *** [protobuf/src/protobuf-stamp/protobuf-configure] Error 127
    make[1]: *** [CMakeFiles/protobuf.dir/all] Error 2
    make: *** [all] Error 2

Everyone else seems able to build this, but I get this error when compiling on CentOS 7. Please help.
My MKL version is 2019.0.

infinite loop when handling bad utf8 string

size_t len = utf8proc_iterate((const utf8proc_uint8_t *) text + subpos, word_bytes, &cp);
if (len < 0) {
    std::cerr << "UTF-8 decode error: " << text << std::endl;
    break;
}

When utf8proc_iterate meets bad UTF-8 characters, it returns a negative value. However, we assign it to a size_t, so (len < 0) can never be true and we end up in an infinite loop.

EMBEDDING_OUTPUT

output_type=cuBERT.OutputType.LOGITS
compute_type=cuBERT.ComputeType.FLOAT
output = np.zeros([batch_size], dtype=np.float32, order='C')
How should the parameters above be changed to output EMBEDDING_OUTPUT?

Compare with bert-as-service

We are currently using bert-as-service from Tencent AI Lab, which also supports inference on multiple GPUs (outputting the feature embedding layer).

Will cuBERT be faster? Can you add benchmarks for that?
Thanks

CuBERT not utilizing all threads with multi-cpu

Hi there,

I was running cuBERT_benchmark.py and noticed that CuBERT does not utilize all threads when using multiple CPUs (even when setting MKL_NUM_THREADS and OMP_NUM_THREADS). It seems that only CPU#1 is fully utilized in my case, while CPU#2 is almost idle (see attached image). Is there a reason for this behaviour?

[image: cubert_cpu_util_16thread]

I compared by running TF-BERT and it utilizes all threads of both CPUs.

Also, I am trying to use CuBERT in another application where I use multi-processing as well. Is it possible that python's multiprocessing is interfering with CuBERT's multi-threading? Somehow CuBERT is running slower in this application (and it utilizes only some threads totally irregularly) than TF-BERT, while it's faster when I run the benchmark.

Thanks for your help

output softmax in addition to logits

As discussed in #20, a softmax output can be added after the logits, the same as in

https://github.com/google-research/bert/blob/d66a146741588fb208450bde15aa7db143baaa69/run_classifier.py#L608

logits = tf.matmul(output_layer, output_weights, transpose_b=True)
logits = tf.nn.bias_add(logits, output_bias)
probabilities = tf.nn.softmax(logits, axis=-1)
log_probs = tf.nn.log_softmax(logits, axis=-1)

The final output could be logits and probabilities and log_probs at the same time.

java support error

When calling the compute method multiple times in a row from Java, the JVM crashes. The code is as follows:

    int[] input_ids = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
            32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63 };

    byte[] input_mask = { 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1,
            0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0 };

    byte[] segment_ids = { 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0,
            0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0 };

    Float[] output = new Float[2];
    for (int i = 0; i < 10; i++) {
        model.compute(2, input_ids, input_mask, segment_ids, output, OutputType.LOGITS);
        System.out.println(output[0] + " " + output[1]);

    }


About other model outputs in the test

On GPU, I want to replace the cuBERT_LOGITS output in the example cuBERT_test.cpp with cuBERT_SEQUENCE_OUTPUT:
cuBERT_tokenize_compute(model, tokenizer, batch_size,
                        text_a,
                        text_b,
                        output, cuBERT_SEQUENCE_OUTPUT);
It keeps giving a segmentation fault. How do I solve this?

Different results compared to bert/extract_features.py

Thanks for your great work.
I'm using the same input sequence, but get slightly different results from bert/extract_features.py and cuBERT.

eg.
cuBert id
101 6821 3221 6443 1557 102 872 3221 6443 1557 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
cuBert pooled emb
0.826238 -0.021815 -0.124155 0.455085 1.10134 -0.764902 -0.00428823 0.0152181 ...

bert extract features id
101 6821 3221 6443 1557 102 872 3221 6443 1557 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
bert extract features pooled emb
0.827281 -0.021752 -0.123682 0.454883 1.100451 -0.765023 -0.004001 0.015624 ...

Any ideas on this inconsistency?

Question about running training

I'd like to ask how to fine-tune and how to train. In which file is the data fed in? I've looked at it for a while but don't quite understand. Hoping the authors or other experts can answer. Thanks.

pooled_output

Is cubert's model.get_pooled_output() consistent with the original BERT's model.get_pooled_output? When calling model.tokenize_compute(text_a, text_b, output, output_type=output_type) with text_b set to None and output_type set to cuBERT_POOLED_OUTPUT, does it produce an embedding of text_a? When text_a is represented, are CLS and SEP added before and after it? Is bert_frozen_seq32.pb converted from the BERT-Base, Chinese model released with BERT? Could you provide the code for converting .ckpt to .pb?

Java API runs more slowly than Python

I compared Java and Python performance with CPU and MKL enabled; Java (20ms) is much slower than Python (10ms).

BERT parameters:
batch_size=1, seq_length=24, num_hidden_layers=6, num_attention_heads=12
ENV variables:
export OMP_NUM_THREADS=8
export KMP_BLOCKTIME=0
export KMP_AFFINITY=granularity=fine,verbose,compact,1,0

Then I tested the Java Native Interface (JNI) instead of JNA, and observed the same phenomenon. I suspect something about cuBERT.so (maybe the MKL configuration) makes Java run slower than Python?

Single-sentence representation in Java

When representing a single sentence with the Java version:
String[] textA = new String[]{"知乎", "知乎"};
String[] textB = new String[]{"", ""};
Is this correct?

Reproduce running time in README.

Hi, thanks for sharing this awesome project!

I have met a problem: when I try to use mklBERT with 32 * Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz, I cannot reproduce the running time reported in the README (2281ms vs 984.9ms). Can you give me some advice for reproducing it?

Error when running cmake -DCMAKE_BUILD_TYPE=Release -DcuBERT_ENABLE_MKL_SUPPORT=ON .. and make -j4 during the build

During the build I ran:
cmake -DCMAKE_BUILD_TYPE=Release -DcuBERT_ENABLE_MKL_SUPPORT=ON ..
make -j4
The make -j4 step reported the error below. How can I solve it?

Error message:
[ 34%] Performing update step for 'protobuf'
[ 34%] No patch step for 'protobuf'
[ 35%] Performing configure step for 'protobuf'

  • autoreconf -f -i -Wall,no-obsolete
    aclocal: couldn't open directory `m4': No such file or directory
    autoreconf: aclocal failed with exit status: 1
    make[2]: *** [protobuf/src/protobuf-stamp/protobuf-configure] Error 1
    make[1]: *** [CMakeFiles/protobuf.dir/all] Error 2

cubert inference on CPU is slower than tf_bert

Hi, I tested on a 48-thread CPU and cubert inference is slower than tf_bert. My MKL version is 2018; does the MKL version affect cubert's inference speed?
My CPU is 48 * Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz.

What MKL options have you used for the benchmark?

I am following your project now.
You report a 2x speed-up with MKL, but there is no information about which options you used.
Would you please let me know the MKL options used for the benchmark performance?
