nvidia / transformerengine

A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference.

Home Page: https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/index.html

License: Apache License 2.0

Shell 0.24% Python 54.04% CMake 0.31% Cuda 31.29% C++ 12.10% C 2.02%
cuda deep-learning gpu machine-learning python pytorch fp8 jax

transformerengine's Introduction

Transformer Engine

Quickstart | Installation | User Guide | Examples | FP8 Convergence | Integrations | Release notes


What is Transformer Engine?

Transformer Engine (TE) is a library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference. TE provides a collection of highly optimized building blocks for popular Transformer architectures and an automatic mixed precision-like API that can be used seamlessly with your framework-specific code. TE also includes a framework-agnostic C++ API that can be integrated with other deep learning libraries to enable FP8 support for Transformers.

As the number of parameters in Transformer models continues to grow, training and inference for architectures such as BERT, GPT and T5 become very memory- and compute-intensive. Most deep learning frameworks train with FP32 by default, but full FP32 precision is not essential to achieve full accuracy for many deep learning models. Using mixed-precision training, which combines single-precision (FP32) with lower-precision (e.g. FP16) formats when training a model, results in significant speedups with minimal differences in accuracy compared to FP32 training. The Hopper GPU architecture introduced FP8 precision, which offers improved performance over FP16 with no degradation in accuracy. Although all major deep learning frameworks support FP16, FP8 support is not available natively in frameworks today.

TE addresses the problem of FP8 support by providing APIs that integrate with popular Large Language Model (LLM) libraries. It provides a Python API consisting of modules to easily build a Transformer layer as well as a framework-agnostic library in C++ including structs and kernels needed for FP8 support. Modules provided by TE internally maintain scaling factors and other values needed for FP8 training, greatly simplifying mixed precision training for users.

Highlights

  • Easy-to-use modules for building Transformer layers with FP8 support
  • Optimizations (e.g. fused kernels) for Transformer models
  • Support for FP8 on NVIDIA Hopper and NVIDIA Ada GPUs
  • Support for optimizations across all precisions (FP16, BF16) on NVIDIA Ampere and later GPU architectures

Examples

PyTorch

import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# Set dimensions.
in_features = 768
out_features = 3072
hidden_size = 2048

# Initialize model and inputs.
model = te.Linear(in_features, out_features, bias=True)
inp = torch.randn(hidden_size, in_features, device="cuda")

# Create an FP8 recipe. Note: All input args are optional.
fp8_recipe = recipe.DelayedScaling(margin=0, interval=1, fp8_format=recipe.Format.E4M3)

# Enable autocasting for the forward pass
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = model(inp)

loss = out.sum()
loss.backward()

JAX

Flax

import flax
import jax
import jax.numpy as jnp
import transformer_engine.jax as te
import transformer_engine.jax.flax as te_flax
from transformer_engine.common import recipe

BATCH = 32
SEQLEN = 128
HIDDEN = 1024

# Initialize RNG and inputs.
rng = jax.random.PRNGKey(0)
init_rng, data_rng = jax.random.split(rng)
inp = jax.random.normal(data_rng, [BATCH, SEQLEN, HIDDEN], jnp.float32)

# Create an FP8 recipe. Note: All input args are optional.
fp8_recipe = recipe.DelayedScaling(margin=0, interval=1, fp8_format=recipe.Format.HYBRID)

# Enable autocasting for the forward pass
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    model = te_flax.DenseGeneral(features=HIDDEN)

    def loss_fn(params, other_vars, inp):
      out = model.apply({'params':params, **other_vars}, inp)
      return jnp.mean(out)

    # Initialize models.
    variables = model.init(init_rng, inp)
    other_variables, params = flax.core.pop(variables, 'params')

    # Construct the forward and backward function
    fwd_bwd_fn = jax.value_and_grad(loss_fn, argnums=(0, 1))

    for _ in range(10):
      loss, (param_grads, other_grads) = fwd_bwd_fn(params, other_variables, inp)

Installation

Pre-requisites

  • Linux x86_64
  • CUDA 11.8+ for Hopper and CUDA 12.1+ for Ada
  • NVIDIA Driver supporting CUDA 11.8 or later
  • cuDNN 8.1 or later
  • For fused attention, CUDA 12.1 or later, NVIDIA Driver supporting CUDA 12.1 or later, and cuDNN 8.9 or later.

Docker

The quickest way to get started with Transformer Engine is by using Docker images on the NVIDIA GPU Cloud (NGC) Catalog. For example, to use the NGC PyTorch container interactively:

docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:23.10-py3

where 23.10 is the container version; for example, 23.10 corresponds to the October 2023 release.

pip

To install the latest stable version of Transformer Engine,

pip install git+https://github.com/NVIDIA/TransformerEngine.git@stable

This will automatically detect if any supported deep learning frameworks are installed and build Transformer Engine support for them. To explicitly specify frameworks, set the environment variable NVTE_FRAMEWORK to a comma-separated list (e.g. NVTE_FRAMEWORK=jax,pytorch).

From source

See the installation guide.

Compiling with FlashAttention-2

Transformer Engine release v0.11.0 adds support for FlashAttention-2 in PyTorch for improved performance.

It is a known issue that FlashAttention-2 compilation is resource-intensive and requires a large amount of RAM (see bug), which may lead to out of memory errors during the installation of Transformer Engine. Please try setting MAX_JOBS=1 in the environment to circumvent the issue. If the errors persist, install a supported version of FlashAttention-1 (v1.0.6 to v1.0.9).

Note that NGC PyTorch 23.08+ containers include FlashAttention-2.

FP8 Convergence

FP8 has been tested extensively across different model architectures and configurations and we found no significant difference between FP8 and BF16 training loss curves. FP8 has also been validated for accuracy on downstream LLM tasks (e.g. LAMBADA and WikiText). Below are examples of models tested for convergence across different frameworks.

Model        Framework        Source
T5-770M      JAX/T5x          https://github.com/NVIDIA/JAX-Toolbox/tree/main/rosetta/rosetta/projects/t5x#convergence-and-performance
MPT-1.3B     Mosaic Composer  https://www.mosaicml.com/blog/coreweave-nvidia-h100-part-1
GPT-5B       JAX/Paxml        https://github.com/NVIDIA/JAX-Toolbox/tree/main/rosetta/rosetta/projects/pax#h100-results
GPT-5B       NeMo Framework   Available on request
LLama2-7B    Alibaba Pai      https://mp.weixin.qq.com/s/NQT0uKXLbXyh5031zBdeBQ
T5-11B       JAX/T5x          Available on request
MPT-13B      Mosaic Composer  https://www.databricks.com/blog/turbocharged-training-optimizing-databricks-mosaic-ai-stack-fp8
GPT-22B      NeMo Framework   Available on request
LLama2-70B   Alibaba Pai      https://mp.weixin.qq.com/s/NQT0uKXLbXyh5031zBdeBQ
GPT-175B     JAX/Paxml        https://github.com/NVIDIA/JAX-Toolbox/tree/main/rosetta/rosetta/projects/pax#h100-results

Integrations

Transformer Engine has been integrated with popular LLM frameworks.

Contributing

We welcome contributions to Transformer Engine! To contribute to Transformer Engine and make pull requests, follow the guidelines outlined in the CONTRIBUTING.rst guide.

Papers

Videos

transformerengine's People

Contributors

asfiyab-nvidia, cyanguwa, denera, erhoo82, galagam, hugo-syn, jeng1220, jinzex, ksivaman, marks101, mingxu1067, minitu, nouiz, nzmora-nvidia, phu0ngng, ptrendx, quentin-anthony, rachitgarg91, sanandaraj5597, sbhavani, schetlur-nv, sudhakarsingh27, timmoon10, tom-zheng, trevor-m, vasunvidia, victarry, wong4j, xrennvidia, zlsh80826


transformerengine's Issues

FP8 does not support all tensor shapes

When I use an input of shape (50, 4096) with a Linear layer of shape (4096, 16384), an assertion in the code fails. The dimensions need to be divisible by 8; how do I solve this problem?
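
One possible workaround is to pad the token dimension up to the next supported multiple and slice the output back. This is only a sketch: the helper name and the multiple of 16 are assumptions, so check the assertion message for the exact requirement.

import torch
import transformer_engine.pytorch as te

def fp8_friendly_forward(layer: te.Linear, x: torch.Tensor, multiple: int = 16) -> torch.Tensor:
    # Pad the leading (token) dimension up to a supported multiple, run the
    # layer, then drop the padded rows again.
    n = x.shape[0]
    pad = (-n) % multiple
    if pad:
        x = torch.nn.functional.pad(x, (0, 0, 0, pad))
    return layer(x)[:n]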

setup.py installation error

I am getting a cuda_to_hip_mappings.py error when I run the setup.py installation.

NVIDIA GeForce RTX 4090
5.15.90.1-microsoft-standard-WSL2 Ubuntu on Windows 11
venv
CUDA 12.1
Python 3.10.6

Traceback (most recent call last):
  File "/home/antman/DEV/llmenv/TransformerEngine/setup.py", line 212, in <module>
    from torch.utils.cpp_extension import CUDAExtension
  File "/home/antman/DEV/llmenv/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 19, in <module>
    from .hipify import hipify_python
  File "/home/antman/DEV/llmenv/lib/python3.10/site-packages/torch/utils/hipify/hipify_python.py", line 34, in <module>
    from .cuda_to_hip_mappings import CUDA_TO_HIP_MAPPINGS
  File "/home/antman/DEV/llmenv/lib/python3.10/site-packages/torch/utils/hipify/cuda_to_hip_mappings.py", line 34, in <module>
    rocm_path = subprocess.check_output(["hipconfig", "--rocmpath"]).decode("utf-8")
  File "/usr/lib/python3.10/subprocess.py", line 420, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib/python3.10/subprocess.py", line 501, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/usr/lib/python3.10/subprocess.py", line 969, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/lib/python3.10/subprocess.py", line 1845, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
NotADirectoryError: [Errno 20] Not a directory: 'hipconfig'

How does TransformerEngine interact with TensorRT/Onnx?

Hi,
We often convert PyTorch models to other engines in order to optimize inference speed. Is it possible to convert a PyTorch model that uses TransformerEngine into TensorRT for faster inference? Or do we currently have to choose between TransformerEngine and nn.Linear with TensorRT?

TF installation failed when using pip install but not apt-get

It seems we updated the installation instructions from

apt-get install ninja-build pybind11-dev

to

pip install pybind11
pip install ninja

This causes the TF build to fail with:

Building wheels for collected packages: transformer-engine
  Building wheel for transformer-engine (setup.py) ... error
  error: subprocess-exited-with-error

  × python setup.py bdist_wheel did not run successfully.
  │ exit code: 1
  ╰─> [75 lines of output]
      /usr/local/lib/python3.8/dist-packages/setuptools/dist.py:547: UserWarning: Normalizing '0.8.0dev' to '0.8.0.dev0'
        warnings.warn(tmpl.format(**locals()))
      running bdist_wheel
      running build
      running build_py
      copying transformer_engine/tensorflow/module.py -> build/lib.linux-x86_64-3.8/transformer_engine/tensorflow
      copying transformer_engine/tensorflow/transformer.py -> build/lib.linux-x86_64-3.8/transformer_engine/tensorflow
      running build_ext
      Building CMake extensions!
      Could not find a recent CMake to build Transformer Engine. Attempting to install CMake 3.18 to a temporary location via pip.
      WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
      Running CMake in build/temp.linux-x86_64-3.8/Release:
      /tmp/nvte-cmake-tmp1wzpr34f/bin/run_cmake /home/workspace/repo_zoo/TransformerEngine/transformer_engine -DCMAKE_BUILD_TYPE=Release -DCMAKE_LIBRARY_OUTPUT_DIRECTORY_RELEASE=/home/workspace/repo_zoo/TransformerEngine/build/lib.linux-x86_64-3.8 -GNinja -DENABLE_TENSORFLOW=ON
      /tmp/nvte-cmake-tmp1wzpr34f/bin/run_cmake --build . --config Release
      CMake Error at CMakeLists.txt:38 (find_package):
        Could not find a package configuration file provided by "pybind11" with any
        of the following names:
          pybind11Config.cmake
          pybind11-config.cmake

        Add the installation prefix of "pybind11" to CMAKE_PREFIX_PATH or set
        "pybind11_DIR" to a directory containing one of the above files.  If
        "pybind11" provides a separate development package or SDK, be sure it has
        been installed.
      -- Configuring incomplete, errors occurred!
      See also "/home/workspace/repo_zoo/TransformerEngine/build/temp.linux-x86_64-3.8/Release/CMakeFiles/CMakeOutput.log".
      See also "/home/workspace/repo_zoo/TransformerEngine/build/temp.linux-x86_64-3.8/Release/CMakeFiles/CMakeError.log".
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/home/workspace/repo_zoo/TransformerEngine/setup.py", line 393, in <module>
          setup(
        File "/usr/local/lib/python3.8/dist-packages/setuptools/__init__.py", line 108, in setup
          return distutils.core.setup(**attrs)
        File "/usr/lib/python3.8/distutils/core.py", line 148, in setup
          dist.run_commands()
        File "/usr/lib/python3.8/distutils/dist.py", line 966, in run_commands
          self.run_command(cmd)
        File "/usr/local/lib/python3.8/dist-packages/setuptools/dist.py", line 1221, in run_command
          super().run_command(command)
        File "/usr/lib/python3.8/distutils/dist.py", line 985, in run_command
          cmd_obj.run()
        File "/usr/local/lib/python3.8/dist-packages/wheel/bdist_wheel.py", line 343, in run
          self.run_command("build")
        File "/usr/lib/python3.8/distutils/cmd.py", line 313, in run_command
          self.distribution.run_command(command)
        File "/usr/local/lib/python3.8/dist-packages/setuptools/dist.py", line 1221, in run_command
          super().run_command(command)
        File "/usr/lib/python3.8/distutils/dist.py", line 985, in run_command
          cmd_obj.run()
        File "/usr/lib/python3.8/distutils/command/build.py", line 135, in run
          self.run_command(cmd_name)
        File "/usr/lib/python3.8/distutils/cmd.py", line 313, in run_command
          self.distribution.run_command(command)
        File "/usr/local/lib/python3.8/dist-packages/setuptools/dist.py", line 1221, in run_command
          super().run_command(command)
        File "/usr/lib/python3.8/distutils/dist.py", line 985, in run_command
          cmd_obj.run()
        File "/home/workspace/repo_zoo/TransformerEngine/setup.py", line 351, in run
          self.cmake_build_extensions.run()
        File "/usr/local/lib/python3.8/dist-packages/setuptools/command/build_ext.py", line 84, in run
          _build_ext.run(self)
        File "/usr/local/lib/python3.8/dist-packages/Cython/Distutils/old_build_ext.py", line 186, in run
          _build_ext.build_ext.run(self)
        File "/usr/lib/python3.8/distutils/command/build_ext.py", line 340, in run
          self.build_extensions()
        File "/home/workspace/repo_zoo/TransformerEngine/setup.py", line 312, in build_extensions
          subprocess.check_call(command, cwd=cmake_build_dir)
        File "/usr/lib/python3.8/subprocess.py", line 364, in check_call
          raise CalledProcessError(retcode, cmd)
      subprocess.CalledProcessError: Command '['/tmp/nvte-cmake-tmp1wzpr34f/bin/run_cmake', '/home/workspace/repo_zoo/TransformerEngine/transformer_engine', '-DCMAKE_BUILD_TYPE=Release', '-DCMAKE_LIBRARY_OUTPUT_DIRECTORY_RELEASE=/home/workspace/repo_zoo/TransformerEngine/build/lib.linux-x86_64-3.8', '-GNinja',
 '-DENABLE_TENSORFLOW=ON']' returned non-zero exit status 1.
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for transformer-engine
  Running setup.py clean for transformer-engine

Also, cc @trevor-m. I remember we ran into similar issues before with the versions of ninja or pybind11.

Why is the first dimension `sequence_length` as opposed to `batch_size` in the input tensor

Why does the library use sequence_length as the first dimension of the input-tensor as opposed to the batch_size?

Is this just a convention carried over from RNNs, or is the difference performance-related?

From the example code:

bmm1 = torch.bmm(query.transpose(0, 1), key.transpose(0, 1).transpose(1, 2)) / self.norm_factor
https://github.com/NVIDIA/TransformerEngine/blob/main/docs/examples/quickstart_utils.py#L93

I see two successive transpose(0, 1) operations?

Thanks!

build problem torch 2.x latest gcc12

Hi folks,

I'm hitting a strange issue. Did you try to build it with torch 2.x?

/home/spyroot/miniconda3/envs/test/lib/python3.10/site-packages/torch/include/pybind11/detail/../cast.h:42:120: error: expected template-name before ‘<’ token
   42 |     return caster.operator typename make_caster<T>::template cast_op_type<T>();
      |                                                                                                                        ^
/home/spyroot/miniconda3/envs/test/lib/python3.10/site-packages/torch/include/pybind11/detail/../cast.h:42:120: error: expected identifier before ‘<’ token
/home/spyroot/miniconda3/envs/test/lib/python3.10/site-packages/torch/include/pybind11/detail/../cast.h:42:123: error: expected primary-expression before ‘>’ token
   42 |     return caster.operator typename make_caster<T>::template cast_op_type<T>();
      |                                                                                                                           ^
/home/spyroot/miniconda3/envs/test/lib/python3.10/site-packages/torch/include/pybind11/detail/../cast.h:42:126: error: expected primary-expression before ‘)’ token
   42 |     return caster.operator typename make_caster<T>::template cast_op_type<T>();
      |

Compiling with CUDA 12.1, I didn't hit this issue with anything else.
    

CUDA_DIR=/usr/local/cuda
PATH="$CUDA_DIR/bin:$PATH"
CXXFLAGS='-Wno-maybe-uninitialized -Wno-uninitialized -Wno-free-nonheap-object -Wno-nonnull'
CFLAGS='-Wno-maybe-uninitialized -Wno-uninitialized -Wno-free-nonheap-object -Wno-nonnull'
TORCH_CUDA_ARCH_LIST="8.0 8.6 8.7 8.9 9.0"
CMAKE_CUDA_ARCHITECTURES="80;86;87;89;90"
CMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc
CMAKE_BUILD_TYPE=Release
python setup.py build -j 4

/home/spyroot/miniconda3/envs/test/lib/python3.10/site-packages/setuptools/dist.py:529: UserWarning: Normalizing '0.9.0dev' to '0.9.0.dev0'
warnings.warn(tmpl.format(**locals()))
running build
running build_py
running build_ext
Building CMake extensions!
Running CMake in build/temp.linux-x86_64-cpython-310/Release:
cmake /home/spyroot/dev/build/test/TransformerEngine/transformer_engine -DCMAKE_BUILD_TYPE=Release -DCMAKE_LIBRARY_OUTPUT_DIRECTORY_RELEASE=/home/spyroot/dev/dev/test/TransformerEngine/build/lib.linux-x86_64-cpython-310
cmake --build . --config Release
-- cudnn found at /usr/lib/x86_64-linux-gnu/libcudnn.so.
-- cudnn_adv_infer found at /usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.
-- cudnn_adv_train found at /usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.
-- cudnn_cnn_infer found at /usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.
-- cudnn_cnn_train found at /usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.
-- cudnn_ops_infer found at /usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.
-- cudnn_ops_train found at /usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.
-- cuDNN: /usr/lib/x86_64-linux-gnu/libcudnn.so
-- cuDNN: /usr/include
-- Configuring done
-- Generating done
-- Build files have been written to: /home/spyroot/dev/build/test/TransformerEngine/build/temp.linux-x86_64-cpython-310/Release

Preferred global encoding format getting changed in TE

When using the TransformerLayer or LayerNormMLP APIs, the preferred encoding format is getting changed. Here is a repro script:

import locale
import torch
from transformer_engine.pytorch import TransformerLayer 

H = 768
seqlen = 2048
batch_size = 2
nheads = 12

inp = torch.randn(seqlen, batch_size, H, device="cuda")
model = TransformerLayer(H, 4*H, nheads)

assert locale.getpreferredencoding() == "UTF-8", f"Preferred encoding: {locale.getpreferredencoding()}"
out = model(inp)
assert locale.getpreferredencoding() == "UTF-8", f"Preferred encoding: {locale.getpreferredencoding()}"
out = model(inp)
assert locale.getpreferredencoding() == "UTF-8", f"Preferred encoding: {locale.getpreferredencoding()}"

ONNX

Can ONNX utilize FP8 speedup? If so, how?

Enabling ``sequence_parallel`` slows down training with fp16

I am testing GPT-2 model training using TransformerLayer.

Training slows down significantly with sequence_parallel=True, achieving about 1/5th of the throughput of training without sequence_parallel. I also observe that sequence_parallel=True results in OOM for some batch sizes where sequence_parallel=False runs successfully.

Do you have any recommendation to achieve better throughput with sequence_parallel and fp16?

Model is ~4.3B with 12 layers, tp_size=4, fp16, seq_len=2048, training with 8 A100 GPUs.

transformer_engine.pytorch.TransformerLayer(
    5120,
    20480,
    40,
    layer_number=(l+1),
    self_attn_mask_type="causal",
    tp_group=tp_group(),
    tp_size=tp_size,
    params_dtype=torch.float16,
    output_layernorm=True,
    layer_type="encoder",
    set_parallel_mode=True,
    fuse_qkv_params=True,
    sequence_parallel=True,
    qkv_weight_interleaved=False,
    attention_softmax_in_fp32=False,
)

FP8 format

I am trying the PyTorch example below:
https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/index.html?highlight=e4m3#pytorch
It works fine with the E4M3 data format. However, when I tried E5M2 I got the following error:

Traceback (most recent call last):
File "transformer/transformer.py", line 16, in <module>
fp8_recipe = recipe.DelayedScaling(margin=0, interval=1, fp8_format=recipe.Format.E5M2)
File "pydantic/dataclasses.py", line 286, in pydantic.dataclasses._add_pydantic_validation_attributes.handle_extra_init
f'default={self.default!r},'
File "<string>", line 11, in __init__
File "pydantic/dataclasses.py", line 305, in pydantic.dataclasses._add_pydantic_validation_attributes.new_post_init
def __set_name__(self, owner, name):
File "/usr/local/lib/python3.10/dist-packages/transformer_engine/common/recipe.py", line 135, in __post_init__
assert self.fp8_format != Format.E5M2, "Pure E5M2 training is not supported."
AssertionError: Pure E5M2 training is not supported.

Just wondering how to enable E5M2 format in this case. Thanks!
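
Pure E5M2 training is not supported; the way to get E5M2 into training is the HYBRID format, which keeps E4M3 for the forward pass and uses E5M2 for gradients in the backward pass. A minimal sketch based on the recipe shown in the README examples:

from transformer_engine.common import recipe

# HYBRID: E4M3 for forward tensors, E5M2 for backward (gradient) tensors.
fp8_recipe = recipe.DelayedScaling(margin=0, interval=1, fp8_format=recipe.Format.HYBRID)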

Please consider supporting Windows

Currently, this repository blocks SM8.9 (Lovelace) outright, but AFAIK Lovelace supports the Transformer Engine just like Hopper. Adding Lovelace support would enable more users to use this library.

Another request is to support Windows. By default, it errors with nvcc fatal : A single input file is required for a non-link phase when an outputfile is specified. My attempts to fix it by fiddling with CMake generators (using the MSVC generator, for example) and paths yielded no results. Enabling verbose output also didn't produce any useful error.

If you're interested in supporting Windows, I would be willing to test any changes on my machine.

Flash Attention Support

Hey there,

Does this library support Flash Attention (Hazy Research)? Are there any constraints on either the head-dimension or sequence length? What about the PyTorch 2.0 SDPA implementation? We are mainly interested in using FP8 via Flash Attention. Also want to add a negative bias prior to softmax (ALiBi).

What happens if we try to use the PyTorch 2.0 compiler with this framework? Is everything fully supported? Will the compiler automatically fuse the various layers? Or should we just follow the tutorial and fuse everything manually?

One more thing: we have a pre-trained model that was trained using FP32, and we would like to extend the pre-training process. Is it possible to first convert that model to FP8 and then continue training in FP8? Do you have a process we can follow for performing that down-scaling step?

Thanks!

Upgrade `flash-attn` to 1.0.7?

#254 set the version at 1.0.6. It seems that now the container uses version 1.0.7, so to stay compatible, should we update as well, or is there something stopping us from doing that?

How to use KV-cache with PyTorch

The JAX API for TransformerLayer includes an inference mode optimized for faster autoregressive decoding using a key-value cache:

decode (bool,default = False) – Indicate whether to prepare and use an autoregressive cache in Multi-head attention (MHA).

Is there no such option for PyTorch inference?

Support simulating FP8 on older hardware

It would be great if this library supported simulating FP8 on e.g. Ampere hardware, as you did in the FP8 whitepaper. I'm sure a lot of people are interested in seeing whether their models will work well in FP8 before investing a lot of money in H100s, let alone the fact that they're barely available yet.

I see https://github.com/IntelLabs/FP8-Emulation-Toolkit, but it's poorly documented and it's not clear if it implements the same tensor scaling algorithms that you have here.
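
For a rough emulation of FP8 numerics (not TE's delayed-scaling implementation, and assuming PyTorch 2.1+ with the torch.float8_e4m3fn dtype), a per-tensor quantize/dequantize round trip can be run on any GPU:

import torch

def simulate_fp8(t: torch.Tensor, fp8_dtype=torch.float8_e4m3fn) -> torch.Tensor:
    # Scale into the representable range, round-trip through FP8, undo the scale.
    # Only the cast is FP8; all arithmetic stays in the original precision.
    amax = t.abs().max().clamp(min=1e-12)
    scale = torch.finfo(fp8_dtype).max / amax  # 448.0 for E4M3
    return (t * scale).to(fp8_dtype).to(t.dtype) / scale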

FP8 questions...

Hey there,

I had some quick questions about the FP8 integration: what type of memory/performance improvements should we expect compared to BF16? I know FP8 has two formats, E4M3 and E5M2; is there an additional overhead for switching between the two?

thanks!

Dimension ordering for input sequences

From the examples:

https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/examples/quickstart.html

I noticed that you were ordering the sequence-length first in the synthetic data, as opposed to the batch-size.

I benchmarked both combinations using the speedometer from your quickstart_utils.py file and noticed that the batch-size-first tensor had better performance. Is there a preferred approach? Thanks!

I also noticed the same convention here:

https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/api/pytorch.html#transformer_engine.pytorch.DotProductAttention

"Input tensors query_layer, key_layer, and value_layer must each be of shape (sequence_length, batch_size, num_attention_heads, kv_channels). Output of shape (sequence_length, batch_size, num_attention_heads * kv_channels) is returned."

What is the advantage of using sequence_length first in the tensor? Is this a performance optimization?

Check that the layer number is not 0

This parameter

layer_number: int, default = `None`
layer number of the current `DotProductAttention` when multiple such modules
are concatenated, for instance in consecutive transformer blocks.

should indicate that the number must be 1-indexed. An assertion could be added to enforce this.

Otherwise, if the user accidentally passes 0, then https://github.com/NVIDIA/TransformerEngine/blob/main/transformer_engine/pytorch/attention.py#L206 will raise a division by 0 error after this multiplication: https://github.com/NVIDIA/TransformerEngine/blob/main/transformer_engine/pytorch/attention.py#L198
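
A minimal sketch of the suggested guard (the helper name, its placement inside the module's constructor, and the exact message are assumptions):

def validate_layer_number(layer_number):
    # layer_number is 1-indexed; passing 0 would later trigger the
    # division-by-zero described above.
    if layer_number is not None:
        assert layer_number > 0, f"layer_number must be a positive (1-indexed) integer, got {layer_number}"
    return layer_number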

[Feature Request] Compressed Tiled Matrix Multiply for Autoregressive LLM Inference

Is your feature request related to a problem? Please describe.

I would like to request the implementation of a compressed tiled matrix multiply operator for use in large language model inference.
This feature would open up the path to accelerate inference speeds for autoregressive LLMs that have undergone unstructured pruning.
LLM autoregressive inference is notoriously memory bound, with GPU utilization of less than 1% being commonplace.
The main cause of this bottleneck is that the size of the weight tensors is several hundred gigabytes, while there are very few computations performed on them, especially in the case of single-batch inference, which is unfortunately very common when low latency is required.

The simplest solution for this problem is to use compressed tensors on DRAM and decompress them dynamically while calculating the matrix multiply. As matrix multiplication is already performed using tiles to maximize memory locality, I do not think compressing the tiles beforehand would be difficult to integrate.

Describe the solution you'd like

I would like to propose a new primitive where the inputs are large matrices that have been chunked and compressed beforehand.
The compression algorithm would be implemented by first applying byte shuffling and bit shuffling filters on the tile.
See https://earthscience.stackexchange.com/questions/12527/regarding-compression-shuffle-filter-of-netcdf4 for an explanation of shuffle filtering and https://github.com/kiyo-masui/bitshuffle for the implementation of bit shuffling. An alternative shuffling method may be more appropriate for floating point data.

Decompression of the tiles would then occur on shared memory, from where they could be fed into the tiled matrix multiplication. Also, L2 cache utilization would likely be higher due to the smaller data size.

This repository may be useful for cases where the weights have been quantized to integers. https://github.com/powturbo/TurboPFor-Integer-Compression

See https://github.com/Blosc/c-blosc2 for algorithms and design patterns on compression.

Describe alternatives you've considered

In-memory compression is available in data center GPUs such as the A100 or the H100. However, compressed memory allocation is not accessible from the CUDA runtime API and must be allocated from the driver API, making it difficult to integrate with existing libraries. Not everyone has access to data center GPUs, and a software implementation would make this feature available even on consumer GPUs. Also, hardware in-memory compression does not reduce the assigned memory in HBM or L2 cache, making it ineffective in reducing memory size.

The nvCOMP https://github.com/NVIDIA/nvcomp library provides an implementation of LZ4 and other compression algorithms. However, it is no longer open-source and also does not implement shuffling filters. Moreover, it cannot be integrated with tiled matrix multiplication.

Additional context

Unstructured pruning is the easiest kind of model compression to apply but also the least useful because no calculations can be skipped. However, in the highly memory-constrained case of LLM inference, the main bottleneck is DRAM/HBM memory size and read speed, both of which can be alleviated via tensor compression. Assuming that the models have been sparsified sufficiently, even ten-fold memory reduction and acceleration are feasible. Also, it may become possible to perform LLM inference on consumer GPUs at reasonable latency.

logging.h:39 in function check_cublas_: CUBLAS Error: the requested functionality is not supported

I installed and ran the Mnist example with use-fp8 / use-te, worked fine.

Now I am trying to implement fp8 for OpenNMT-py simply using te.fp8_autocast
I got the error below. Is this an installation issue, or is it related to a specific operation?

Traceback (most recent call last):
  File "/home/vincent/nlp/OpenNMT-py/train.py", line 6, in <module>
    main()
  File "/home/vincent/nlp/OpenNMT-py/onmt/bin/train.py", line 67, in main
    train(opt)
  File "/home/vincent/nlp/OpenNMT-py/onmt/bin/train.py", line 52, in train
    train_process(opt, device_id=0)
  File "/home/vincent/nlp/OpenNMT-py/onmt/train_single.py", line 227, in main
    trainer.train(
  File "/home/vincent/nlp/OpenNMT-py/onmt/trainer.py", line 323, in train
    self._gradient_accumulation(
  File "/home/vincent/nlp/OpenNMT-py/onmt/trainer.py", line 588, in _gradient_accumulation
    raise exc
  File "/home/vincent/nlp/OpenNMT-py/onmt/trainer.py", line 573, in _gradient_accumulation
    self.optim.backward(loss)
  File "/home/vincent/nlp/OpenNMT-py/onmt/utils/optimizers.py", line 347, in backward
    loss.backward()
  File "/home/vincent/miniconda3/envs/pytorch1.14/lib/python3.10/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/home/vincent/miniconda3/envs/pytorch1.14/lib/python3.10/site-packages/torch/autograd/__init__.py", line 204, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/vincent/miniconda3/envs/pytorch1.14/lib/python3.10/site-packages/torch/autograd/function.py", line 274, in apply
    return user_fn(self, *args)
  File "/home/vincent/miniconda3/envs/pytorch1.14/lib/python3.10/site-packages/transformer_engine/pytorch/module/linear.py", line 367, in backward
    wgrad = fp8_gemm(
  File "/home/vincent/miniconda3/envs/pytorch1.14/lib/python3.10/site-packages/transformer_engine/pytorch/cpp_extensions.py", line 829, in fp8_gemm
    _ = fn(*args)
  File "/home/vincent/miniconda3/envs/pytorch1.14/lib/python3.10/site-packages/torch/_ops.py", line 646, in __call__
    return self._op(*args, **kwargs or {})
RuntimeError: /home/vincent/nlp/TransformerEngine/transformer_engine/common/include/transformer_engine/logging.h:39 in function check_cublas_: CUBLAS Error: the requested functionality is not supported

The TF installation fails with `No such file #include <cudnn_frontend.h>`

With the latest code, it seems we can no longer build the code in the TF containers. After executing pip install ., I got:

FAILED: common/CMakeFiles/transformer_engine.dir/fused_attn/fused_attn_fp8.cu.o
      /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -Dtransformer_engine_EXPORTS -I/home/workspace/repo_zoo/TransformerEngine/transformer_engine -I/home/workspace/repo_zoo/TransformerEngine/transformer_engine/common/include
-I/usr/local/cuda/targets/x86_64-linux/include -I/home/workspace/repo_zoo/TransformerEngine/transformer_engine/../3rdparty/cudnn-frontend/include -I/tmp/tmpy_w9ikze/common/string_headers -isystem=/usr/local/cuda/include --threads 4 --exp
t-relaxed-constexpr -O3 -O3 -DNDEBUG --generate-code=arch=compute_70,code=[compute_70,sm_70] --generate-code=arch=compute_80,code=[compute_80,sm_80] --generate-code=arch=compute_89,code=[compute_89,sm_89] --generate-code=arch=compute_90,
code=[compute_90,sm_90] -Xcompiler=-fPIC -std=c++17 -MD -MT common/CMakeFiles/transformer_engine.dir/fused_attn/fused_attn_fp8.cu.o -MF common/CMakeFiles/transformer_engine.dir/fused_attn/fused_attn_fp8.cu.o.d -x cu -c /home/workspace/re
po_zoo/TransformerEngine/transformer_engine/common/fused_attn/fused_attn_fp8.cu -o common/CMakeFiles/transformer_engine.dir/fused_attn/fused_attn_fp8.cu.o
      In file included from /home/workspace/repo_zoo/TransformerEngine/transformer_engine/common/fused_attn/fused_attn_fp8.cu:10:
      /home/workspace/repo_zoo/TransformerEngine/transformer_engine/common/fused_attn/utils.h:14:10: fatal error: cudnn_frontend.h: No such file or directory
         14 | #include <cudnn_frontend.h>
            |          ^~~~~~~~~~~~~~~~~~

It seems we need a setup_tensorflow_extension function in setup.py, similar to the ones for PyTorch and Paddle, where /opt/tensorflow/cudnn-frontend/include/ needs to be added.

@trevor-m Can you take a look when you have bandwidth?

Also curious: is this kind of installation breakage captured by our CI/CD?

Latency/Energy Calculation

Is there a method to calculate performance metrics (latency, energy) of a Transformer model (BERT, ViT) on the H100 Transformer Engine using PyTorch?
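
A minimal sketch of one way to measure both in PyTorch (assuming pynvml/nvidia-ml-py is installed; the energy counter requires Volta or newer, and this is not an official TE utility):

import torch
import pynvml

def measure(model, inp, iters=100):
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    start, end = torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True)
    e0 = pynvml.nvmlDeviceGetTotalEnergyConsumption(handle)  # millijoules since driver load
    start.record()
    for _ in range(iters):
        model(inp)
    end.record()
    torch.cuda.synchronize()
    e1 = pynvml.nvmlDeviceGetTotalEnergyConsumption(handle)
    latency_ms = start.elapsed_time(end) / iters
    energy_j = (e1 - e0) / 1000.0
    return latency_ms, energy_j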

Increased step time when using `transformer_engine.pytorch.Linear`

I am experiencing increased step time when using transformer_engine.pytorch.Linear instead of torch.nn.Linear with fp16 training.

The mean step time (forward+backward) of a single te.Linear is 30% higher than torch.nn.Linear in the following test; the backward pass takes significantly longer.

Set up: TransformerEngine release v0.7, Pytorch 2.0.0, A100 GPU.

Test with single linear layer:

import torch
import transformer_engine.pytorch as te

hidden_size = 4096
sequence_length = 2048
batch_size = 4
dtype = torch.float16

input = torch.rand(sequence_length, batch_size, hidden_size).cuda().to(dtype=dtype)
input_y = torch.rand(sequence_length, batch_size, 3 * hidden_size).cuda().to(dtype=dtype)

linear_layer = te.Linear(
    hidden_size,
    3 * hidden_size,
    bias=True,
    params_dtype=dtype,
)

linear_layer.to(dtype=dtype).cuda()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

number_of_iterations = 50 

start.record()
for i in range(number_of_iterations):
    with te.fp8_autocast(enabled=False):
        loss = linear_layer(input)
    loss.backward(input_y)
end.record()

torch.cuda.synchronize()

print(f"step time: {start.elapsed_time(end)/number_of_iterations}")

Installation errors on Ampere GPUs

Is there a TransformerEngine source version I can use to install on Ampere GPUs?
It seems that the default installation requires CUDA with FP8 support, which is not available on Ampere GPUs. Please correct me if I am wrong.

RuntimeError: _Map_base::at Error when Exporting to ONNX

I am encountering a RuntimeError: _Map_base::at error when attempting to export my network to ONNX format. Here is the relevant code:

import torch
import torch.nn as nn
from transformer_engine import pytorch as te


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = te.Linear(784, 256)
        self.fc2 = te.Linear(256, 128)
        self.fc3 = te.Linear(128, 10)

    def forward(self, x):
        x = x.view(-1, 784)
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        return x

device = torch.device("cuda")
model = Net().to(device)

with te.onnx_export(True):
    torch.onnx.export(
        model,
        torch.randn((1,1,28,28), device="cuda"),
        "./onnx/net_fp8.onnx",
        verbose=True,
        opset_version=13,
        input_names=["input"],
        output_names=["output"],
        do_constant_folding=True)

FP8 vs FP16 performance (seq2seq transformer with te.Linear replacing nn.Linear layers)

Here is what I am getting (see below)

FP8 is slower than FP16.

For FP16, multiples of 16 make things slower than multiples of 8.

Am I missing something?

Batch_size_multiple 16 // Seqlen multiple 16

FP8 (adam)
[2023-05-17 22:20:28,534 INFO] Step 100/300000; acc: 16.1; ppl: 6038.0; xent: 8.7; lr: 0.00002; sents: 31328; bsz: 2145/2545/78; 14043/16656 tok/s; 61 sec;
[2023-05-17 22:21:06,060 INFO] Step 200/300000; acc: 20.6; ppl: 1059.6; xent: 7.0; lr: 0.00005; sents: 26736; bsz: 2164/2561/67; 23063/27297 tok/s; 99 sec;
[2023-05-17 22:21:43,862 INFO] Step 300/300000; acc: 25.3; ppl: 466.3; xent: 6.1; lr: 0.00007; sents: 27760; bsz: 2181/2576/69; 23082/27262 tok/s; 136 sec;
[2023-05-17 22:22:21,180 INFO] Step 400/300000; acc: 27.6; ppl: 315.5; xent: 5.8; lr: 0.00010; sents: 24400; bsz: 2138/2526/61; 22912/27074 tok/s; 174 sec;
[2023-05-17 22:22:58,740 INFO] Step 500/300000; acc: 30.4; ppl: 236.7; xent: 5.5; lr: 0.00012; sents: 26688; bsz: 2148/2535/67; 22880/27001 tok/s; 211 sec;

FP16 (adam)
[2023-05-17 22:24:39,883 INFO] Step 100/300000; acc: 16.2; ppl: 6127.8; xent: 8.7; lr: 0.00002; sents: 31328; bsz: 2145/2545/78; 18771/22265 tok/s; 46 sec;
[2023-05-17 22:25:04,966 INFO] Step 200/300000; acc: 20.6; ppl: 1061.8; xent: 7.0; lr: 0.00005; sents: 26736; bsz: 2164/2561/67; 34504/40838 tok/s; 71 sec;
[2023-05-17 22:25:30,067 INFO] Step 300/300000; acc: 25.3; ppl: 467.8; xent: 6.1; lr: 0.00007; sents: 27760; bsz: 2181/2576/69; 34760/41057 tok/s; 96 sec;
[2023-05-17 22:25:55,069 INFO] Step 400/300000; acc: 27.4; ppl: 320.1; xent: 5.8; lr: 0.00010; sents: 24400; bsz: 2138/2526/61; 34199/40411 tok/s; 121 sec;
[2023-05-17 22:26:19,589 INFO] Step 500/300000; acc: 30.1; ppl: 241.5; xent: 5.5; lr: 0.00012; sents: 26688; bsz: 2148/2535/67; 35048/41359 tok/s; 145 sec;

FP16 (fusedadam)
[2023-05-17 22:28:29,266 INFO] Step 100/300000; acc: 16.1; ppl: 6160.6; xent: 8.7; lr: 0.00002; sents: 31328; bsz: 2145/2545/78; 20312/24092 tok/s; 42 sec;
[2023-05-17 22:28:49,956 INFO] Step 200/300000; acc: 20.6; ppl: 1063.8; xent: 7.0; lr: 0.00005; sents: 26736; bsz: 2164/2561/67; 41830/49509 tok/s; 63 sec;
[2023-05-17 22:29:11,128 INFO] Step 300/300000; acc: 25.3; ppl: 468.3; xent: 6.1; lr: 0.00007; sents: 27760; bsz: 2181/2576/69; 41213/48678 tok/s; 84 sec;
[2023-05-17 22:29:32,063 INFO] Step 400/300000; acc: 27.4; ppl: 320.2; xent: 5.8; lr: 0.00010; sents: 24400; bsz: 2138/2526/61; 40842/48260 tok/s; 105 sec;
[2023-05-17 22:29:52,720 INFO] Step 500/300000; acc: 30.2; ppl: 241.3; xent: 5.5; lr: 0.00012; sents: 26688; bsz: 2148/2535/67; 41603/49095 tok/s; 126 sec;

Batch_size_multiple 8 // Seqlen multiple 8
FP16 (Fusedadam)
[2023-05-17 22:32:08,412 INFO] Step 100/300000; acc: 16.0; ppl: 6256.0; xent: 8.7; lr: 0.00002; sents: 34120; bsz: 2337/2766/85; 22346/26446 tok/s; 42 sec;
[2023-05-17 22:32:29,029 INFO] Step 200/300000; acc: 20.9; ppl: 1047.4; xent: 7.0; lr: 0.00005; sents: 31128; bsz: 2349/2772/78; 45571/53777 tok/s; 62 sec;
[2023-05-17 22:32:49,643 INFO] Step 300/300000; acc: 24.6; ppl: 482.1; xent: 6.2; lr: 0.00007; sents: 26808; bsz: 2346/2776/67; 45523/53867 tok/s; 83 sec;
[2023-05-17 22:33:10,198 INFO] Step 400/300000; acc: 27.0; ppl: 326.7; xent: 5.8; lr: 0.00010; sents: 28448; bsz: 2341/2771/71; 45563/53917 tok/s; 104 sec;
[2023-05-17 22:33:30,629 INFO] Step 500/300000; acc: 30.0; ppl: 242.5; xent: 5.5; lr: 0.00012; sents: 27072; bsz: 2338/2764/68; 45773/54123 tok/s; 124 sec;

Build failure, missing cudnn-frontend

When building from the head of main, the following error occurred:

/opt/transformer-engine/transformer_engine/common/fused_attn/utils.h:11:10: fatal error: cudnn_frontend.h: No such file or directory
#8 455.6            11 | #include <cudnn_frontend.h>

It appears that the cudnn-frontend package is needed, but the installation guide does not specify how this dependency should be installed. Could this info be added to the docs?

Development install fails due to missing `packaging` package.

Hi! #186 introduces packaging as a prerequisite for installing TE, but it is not documented in the installation guide.

I can see that it is declared in setup.py:

class PyTorchBuilder(FrameworkBuilderBase):
    ...
    @staticmethod
    def install_requires():
        return ["flash-attn>=1.0.2", "packaging"]

but in a development installation this does not work, because we cannot import packaging in the first place.

Context: our nightly TE-JAX container build failed due to this error. See the full log here. I was able to fix the build by installing packaging alongside pybind11 and ninja, but I feel it could at least be better documented.

Using torchaudio alongside TransformerEngine

Hi,
I do not know how to use torchaudio alongside TransformerEngine. The NGC docker nvcr.io/nvidia/pytorch:23.04-py3 doesn't come with torchaudio installed. If I do a normal torch install I lose CUDA 12.1, so I'm using torch-nightly with CUDA 12.1 instead: pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu121

But then when I run import transformer_engine.pytorch as te I get this error:
ImportError: /usr/local/lib/python3.8/dist-packages/transformer_engine_extensions.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZN5torch3jit17parseSchemaOrNameERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE

By the way, does TransformerEngine support Python 3.11 or do I have to use Python 3.8?

Thank you

FP8 Memory Usage

How much more GPU memory will the TE layers consume vs a standard Pytorch layer? I'm seeing close to double the consumption for the opt-1.3b model when compared to standard opt-1.3b in fp16.

Cannot import transformer_engine_extensions

Issue

I hit an error when simply importing transformer_engine_extensions:

ImportError: /usr/local/lib/python3.8/dist-packages/transformer_engine_extensions.cpython-38-x86_64-linux-gnu.so: undefined symbol: nvte_layernorm_bwd

It seems some necessary sources are not included:

pytorch_sources = [
    "transformer_engine/pytorch/csrc/extensions.cu",
    "transformer_engine/pytorch/csrc/common.cu",
    "transformer_engine/pytorch/csrc/ts_fp8_op.cpp",
]

Could this be solved? Thanks.

Environment

  • Host: DGX H100 (cuda_12.0.r12.0/compiler.32267302_0)
  • Docker: nvcr.io/nvidia/pytorch:23.02-py3

To reproduce

  • transformer_engine is already installed in the above docker (v0.5.0), just run the import command in Python.
  • Also tried installing the latest version from source via NVTE_FRAMEWORK=pytorch pip install ., but still met this error.

Loading existing weights trained without TransformerEngine

Hi,
I get the following error:

RuntimeError: Error(s) in loading state_dict for Sequential:
	Missing key(s) in state_dict: "0._extra_state". 

I am trying to get a model that has already been trained on an A100 GPU to run faster on an H100 GPU by leveraging TransformerEngine. However, since it was trained using vanilla nn.Linear, there is no _extra_state.

What exactly does _extra_state represent? Is it the amax values for FP8 recipes? Something different? Is there any way to create it retroactively? If yes, can it be done with a forward pass or does it have to be a backward pass? The model already performs well and I do not want to modify its weights, only to make it faster.

I am using nvcr.io/nvidia/pytorch:23.04-py3

The error is easy to replicate like this:

import torch
import torch.nn as nn

# Create a simple model with one linear layer
model = nn.Sequential(nn.Linear(10, 5))
print(model)

# Initialize the weights randomly
for param in model.parameters():
    nn.init.normal_(param, mean=0, std=1)

# Save the model's weights
torch.save(model.state_dict(), 'model_weights.pth')

# Load the same model using TransformerEngine.Linear instead of nn.Linear
del model
import transformer_engine.pytorch as te
model = nn.Sequential(te.Linear(in_features=10, out_features=5, bias=True))
model.load_state_dict(torch.load('model_weights.pth'))

Thanks so much
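
One possible workaround (a sketch, not an official recommendation): _extra_state holds TE's FP8 metadata (amax history and scaling factors), which a vanilla nn.Linear checkpoint simply does not contain, so the checkpoint can be loaded non-strictly and the FP8 state left at its freshly initialized defaults:

import torch
import torch.nn as nn
import transformer_engine.pytorch as te

model = nn.Sequential(te.Linear(in_features=10, out_features=5, bias=True))
# strict=False reports "0._extra_state" as missing instead of raising;
# the weights and bias still load, and TE initializes fresh FP8 state on first use.
missing, unexpected = model.load_state_dict(torch.load('model_weights.pth'), strict=False)
print(missing, unexpected)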

MPT-7b inference

Hello,
I wanted to ask: has anyone been able to make the MPT-7b models work with this repo?

Thank you very much!

It doesn't support the latest RTX 40-series card

Hi, FP8 should be supported on the RTX 40-series as well, since it's based on the AD102 architecture, which has FP8 capabilities. However, running TransformerEngine on the RTX 4090 results in an error: "AssertionError: Device compute capability 9.x required for FP8 execution.". Thus, we are unable to take advantage of FP8.

'Parameter' object has no attribute 'main_grad'

Any idea what could be going wrong with fuse_wgrad_accumulation?

[1,0]<stderr>:│                                                                                                  │
[1,0]<stderr>:│ /opt/conda/lib/python3.10/site-packages/transformer_engine/pytorch/module/layernorm_mlp.py:640   │
[1,0]<stderr>:│ in backward                                                                                      │
[1,0]<stderr>:│                                                                                                  │
[1,0]<stderr>:│    637 │   │   │   │   │   │   grad=True,                                                        │
[1,0]<stderr>:│    638 │   │   │   │   │   │   use_bias=ctx.use_fc2_bias,  [1,0]<stderr>:                                      │
[1,0]<stderr>:│    639 │   │   │   │   │   │   accumulate=accumulate_wgrad_into_param_main_grad,                 │
[1,0]<stderr>:│ ❱  640 │   │   │   │   │   │   out=fc2_weight.main_grad if ctx.fuse_wgrad_accumulation else Non  │
[1,0]<stderr>:│    641 │   │   │   │   │   )                                                                     │
[1,0]<stderr>:│    642 │   │   │   │                                                                             │
[1,0]<stderr>:│    643 │   │   │   │   if ctx.bias_gelu_nvfusion:                                                │
[1,0]<stderr>:╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
[1,0]<stderr>:AttributeError: 'Parameter' object has no attribute 'main_grad'
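
For context, fuse_wgrad_accumulation assumes the training framework (Megatron-LM style trainers, for example) has already attached an FP32 main_grad buffer to each weight, since TE accumulates weight gradients into param.main_grad instead of param.grad. A minimal sketch of that assumption (the module and sizes below are hypothetical):

import torch
import transformer_engine.pytorch as te

model = te.LayerNormMLP(1024, 4096, fuse_wgrad_accumulation=True)
for param in model.parameters():
    if param.requires_grad:
        # Pre-allocate the buffer TE writes weight gradients into.
        param.main_grad = torch.zeros_like(param, dtype=torch.float32)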

Converting `scaled_dot_product_attention` to `DotProductAttention`

Hi!

I am testing out TransformerEngine's attention implementation, but I'm not getting numerical parity.

This means I cannot use this layer in my LLM because the weights I'm loading perform worse with it.

Here's the script I'm using:

import math

import torch

B = 1
nh = 8
T = 128
hs = 64

with torch.device("cuda"):
    q = torch.rand(B, nh, T, hs)
    k = torch.rand(B, nh, T, hs)
    v = torch.rand(B, nh, T, hs)
    mask = torch.tril(torch.ones((T, T), dtype=torch.bool)).unsqueeze(0).unsqueeze(0)
scale = 1 / math.sqrt(hs)

y1 = torch.nn.functional.scaled_dot_product_attention(q, k, v, mask, scale=scale, is_causal=False, dropout_p=0.0)

att = (q @ k.transpose(-2, -1)) * scale
att = torch.masked_fill(att, ~mask, torch.finfo(att.dtype).min)
att = torch.nn.functional.softmax(att, dim=-1)
y2 = att @ v  # (B, nh, T, T) x (B, nh, T, hs) -> (B, nh, T, hs)

print(y1.mean(), y2.mean())
torch.testing.assert_close(y1, y2)

from transformer_engine.pytorch import DotProductAttention

att = DotProductAttention(num_attention_heads=nh, kv_channels=hs, attention_dropout=.0, attn_mask_type="padding")
# requires (T, B, nh, hs)
q = q.permute(2, 0, 1, 3)
k = k.permute(2, 0, 1, 3)
v = v.permute(2, 0, 1, 3)
y3 = att(q, k, v, mask)

y1 = y1.permute(3, 0, 1, 2).reshape(T, B, -1)
print(y1.mean(), y3.mean())
torch.testing.assert_close(y1, y3)

Output:

tensor(0.4994, device='cuda:0') tensor(0.4994, device='cuda:0')
tensor(0.4994, device='cuda:0') tensor(0.4993, device='cuda:0')
Traceback (most recent call last):
  File "kk.py", line 38, in <module>
    torch.testing.assert_close(y1, y3)
  File "/home/carlos/venv/lib/python3.8/site-packages/torch/testing/_comparison.py", line 1511, in assert_close
    raise error_metas[0].to_error(msg)
AssertionError: Tensor-likes are not close!

Mismatched elements: 65525 / 65536 (100.0%)
Greatest absolute difference: 0.6001713275909424 at index (86, 0, 256) (up to 1e-05 allowed)
Greatest relative difference: 2020.320556640625 at index (126, 0, 425) (up to 1.3e-06 allowed)

Thanks!

Make an ISSUE template

It happens frequently that issues don't specify the framework (FW) used.

We should make a template that asks for this, the TE version used, and other environment details, like the CUDA version, containers, etc.

Add: transformerengine.test()

When end users install as documented:

pip install git+https://github.com/NVIDIA/TransformerEngine.git@stable

they can't run a TE test to make sure the software stack works well. We should do as NumPy does and add transformerengine.test() (the equivalent of numpy.test()).

This way they won't need to clone the repo and check out the right branch just to test the installation.
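
A rough sketch of what such an entry point could look like (the test() helper and a bundled pytest-based tests directory are assumptions, mirroring numpy.test()):

# hypothetical addition to transformer_engine/__init__.py
import os

def test(extra_args=None):
    """Run the bundled test suite against the installed package, numpy.test()-style."""
    import pytest
    test_dir = os.path.join(os.path.dirname(__file__), "tests")
    return pytest.main([test_dir] + list(extra_args or []))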

How to run LLaMa inference?

Hello!
How is it possible to run LLaMa models with the great FP8 inference speedup?

Would one need to train a new LLM from scratch or is it possible to convert existing models with the same accuracy?

Thank you very much and thank you for all the awesome work!
