toshas / torch-discounted-cumsum Goto Github PK

Fast Discounted Cumulative Sums in PyTorch

License: Other

Shell 0.85% Python 65.32% C++ 14.83% Cuda 19.00%

pytorch discounted-cumulative-sum rl reinforcement-learning reinforce

torch-discounted-cumsum's Introduction

This repository implements an efficient parallel algorithm for the computation of discounted cumulative sums and a Python package with differentiable bindings to PyTorch. The discounted cumsum operation is frequently seen in data science domains concerned with time series, including Reinforcement Learning (RL).

The traditional sequential algorithm performs the computation of the output elements in a loop. For an input of size N, it requires O(N) operations and takes O(N) time steps to complete.
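
The sequential loop described above can be sketched in plain Python as follows. This is only an illustrative reference, not the package's C++ code: the function name `discounted_cumsum_right_sequential` is made up for this example, and it assumes the right-direction recurrence y[i] = x[i] + gamma * y[i+1].

```python
def discounted_cumsum_right_sequential(x, gamma):
    """Right discounted cumsum of a list of floats in one backward pass."""
    y = [0.0] * len(x)
    running = 0.0
    # Walk from the last element to the first. Each iteration needs the
    # result of the previous one, hence N dependent time steps.
    for i in range(len(x) - 1, -1, -1):
        running = x[i] + gamma * running
        y[i] = running
    return y


print(discounted_cumsum_right_sequential([1.0, 1.0, 1.0], 0.5))  # [1.75, 1.5, 1.0]
```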

The proposed parallel algorithm requires a total of O(N log N) operations but takes only O(log N) time steps, which is a worthwhile trade-off in many applications involving large inputs.

Features of the parallel algorithm:

Running time logarithmic in the input size (given sufficient parallelism)
Better numerical precision than sequential algorithms

Features of the package:

CPU: sequential algorithm in C++
GPU: parallel algorithm in CUDA
Gradient computation for input and gamma
Batch support for input and gamma
Both left and right directions of summation supported
PyTorch bindings

Usage

Installation

pip install torch-discounted-cumsum

See Troubleshooting in case of errors.

API

discounted_cumsum_right: Computes discounted cumulative sums to the right of each position (a standard setting in RL)
discounted_cumsum_left: Computes discounted cumulative sums to the left of each position
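
To make the two directions concrete, here is a brute-force sketch of what each function is assumed to compute for a single 1-D sequence. The `_ref` function names are hypothetical and the formulas are an interpretation of the descriptions above rather than code taken from the package: "right" sums over positions j >= i with weight gamma^(j - i), "left" sums over positions j <= i with weight gamma^(i - j).

```python
def discounted_cumsum_right_ref(x, gamma):
    # y[i] = sum over j >= i of gamma**(j - i) * x[j]
    n = len(x)
    return [sum(gamma ** (j - i) * x[j] for j in range(i, n)) for i in range(n)]


def discounted_cumsum_left_ref(x, gamma):
    # y[i] = sum over j <= i of gamma**(i - j) * x[j]
    n = len(x)
    return [sum(gamma ** (i - j) * x[j] for j in range(i + 1)) for i in range(n)]


x = [1.0] * 8
print([round(v, 4) for v in discounted_cumsum_right_ref(x, 0.99)])
print([round(v, 4) for v in discounted_cumsum_left_ref(x, 0.99)])
```

For eight ones and gamma = 0.99, the first printed list should agree with the README's example output to four decimals, and the second is the same list reversed.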

Example

import torch
from torch_discounted_cumsum import discounted_cumsum_right

N = 8
gamma = 0.99
x = torch.ones(1, N).cuda()
y = discounted_cumsum_right(x, gamma)

print(y)

Output:

tensor([[7.7255, 6.7935, 5.8520, 4.9010, 3.9404, 2.9701, 1.9900, 1.0000]],
       device='cuda:0')

Up to `K` elements

import torch
from torch_discounted_cumsum import discounted_cumsum_right

N = 8
K = 2
gamma = 0.99
x = torch.ones(1, N).cuda()
y_N = discounted_cumsum_right(x, gamma)
y_K = y_N - (gamma ** K) * torch.cat((y_N[:, K:], torch.zeros(1, K).cuda()), dim=1)   

print(y_K)

Output:

tensor([[1.9900, 1.9900, 1.9900, 1.9900, 1.9900, 1.9900, 1.9900, 1.0000]],
       device='cuda:0')
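
The windowed variant above relies on the identity y_K[i] = y_N[i] - gamma^K * y_N[i + K], where y_N[i + K] is treated as zero past the end of the sequence. The following plain-Python sketch checks that identity against a direct sum over at most K elements; the helper names are hypothetical and no part of it uses the package itself.

```python
def cumsum_right(x, gamma):
    y, running = [0.0] * len(x), 0.0
    for i in range(len(x) - 1, -1, -1):
        running = x[i] + gamma * running
        y[i] = running
    return y


def cumsum_right_window(x, gamma, k):
    # Full right cumsum minus the discounted tail starting k positions ahead.
    y = cumsum_right(x, gamma)
    n = len(x)
    return [y[i] - gamma ** k * (y[i + k] if i + k < n else 0.0) for i in range(n)]


def window_brute_force(x, gamma, k):
    # Direct sum of at most k discounted elements starting at position i.
    n = len(x)
    return [sum(gamma ** d * x[i + d] for d in range(k) if i + d < n) for i in range(n)]


x = [1.0] * 8
print([round(v, 4) for v in cumsum_right_window(x, 0.99, 2)])
print([round(v, 4) for v in window_brute_force(x, 0.99, 2)])
```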

Parallel Algorithm

For the sake of simplicity, the algorithm is explained for N=16. Processing is performed in place on the input vector in log2 N stages. Each stage updates N / 2 positions in parallel (that is, in a single time step, given unrestricted parallelism). A stage is characterized by the size of the group of consecutive elements being updated, which is computed as 2 ^ (stage - 1). The group stride is always twice the group size. The elements updated during a stage are highlighted with the respective stage color in the figure below. Input elements are denoted by their position id in hex, and elements tagged with two symbols indicate the range over which the discounted partial sum has been computed upon stage completion.

Each element update is an in-place addition of a discounted element, namely the element that follows the last updated element in the group. The discount factor is computed as gamma raised to the power of the distance between the updated and the discounted elements. In the figure below, this operation is denoted with tilted arrows with a Greek gamma tag. After the last stage completes, the output has been written in place of the input.
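
The staged scheme can be simulated on the CPU to see that it produces the same result as the sequential loop. The sketch below is a plain-Python interpretation of the description for the right direction, not the CUDA kernel: the function names are hypothetical, the inner loops stand in for threads that would run concurrently within a stage, and inputs whose length is not a power of two are handled by skipping groups whose source element falls outside the array.

```python
def cumsum_right_sequential(x, gamma):
    y, running = [0.0] * len(x), 0.0
    for i in range(len(x) - 1, -1, -1):
        running = x[i] + gamma * running
        y[i] = running
    return y


def cumsum_right_staged(x, gamma):
    y = list(x)
    n = len(y)
    group = 1                      # group size: 1, 2, 4, ... = 2 ** (stage - 1)
    while group < n:
        stride = 2 * group         # group stride is twice the group size
        for start in range(0, n, stride):
            src = start + group    # element following the last updated one
            if src >= n:
                continue
            # Updates within a stage are independent: they only read y[src],
            # which is not written during this stage.
            for i in range(start, src):
                y[i] += gamma ** (src - i) * y[src]
        group = stride
    return y


x = [float(i % 5) - 2.0 for i in range(16)]
staged = cumsum_right_staged(x, 0.9)
sequential = cumsum_right_sequential(x, 0.9)
print(max(abs(a - b) for a, b in zip(staged, sequential)))
```

The printed value is the largest discrepancy between the two results, which should be at the level of floating-point rounding.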

In the CUDA implementation, N / 2 CUDA threads are allocated during each stage to update the respective elements. The strict separation of updates into stages via separate kernel invocations guarantees stage-level synchronization and global consistency of updates.

The gradients with respect to the input can be obtained from the gradients with respect to the output by applying the discounted cumsum operation with the reversed direction of summation.
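
This relationship can be checked numerically without autograd. Because the operation is linear in the input, column j of its Jacobian is simply the operation applied to the j-th unit vector. The sketch below, with hypothetical helper names and an arbitrary upstream gradient, compares the Jacobian-based input gradient of the right cumsum with the left cumsum of the upstream gradient.

```python
def cumsum_right(x, gamma):
    y, running = [0.0] * len(x), 0.0
    for i in range(len(x) - 1, -1, -1):
        running = x[i] + gamma * running
        y[i] = running
    return y


def cumsum_left(x, gamma):
    y, running = [0.0] * len(x), 0.0
    for i in range(len(x)):
        running = x[i] + gamma * running
        y[i] = running
    return y


def grad_input_via_jacobian(grad_y, gamma):
    n = len(grad_y)
    grad_x = []
    for j in range(n):
        e_j = [1.0 if t == j else 0.0 for t in range(n)]
        column = cumsum_right(e_j, gamma)   # column[i] = d y[i] / d x[j]
        grad_x.append(sum(grad_y[i] * column[i] for i in range(n)))
    return grad_x


grad_y = [0.5, -1.0, 2.0, 0.25]             # arbitrary upstream gradient dL/dy
print(grad_input_via_jacobian(grad_y, 0.9))
print(cumsum_left(grad_y, 0.9))
```

Both printed lists should coincide up to rounding.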

Numerical Precision

The parallel algorithm produces more numerically stable output than the sequential algorithm when both use the same scalar data type.

The comparison is performed between 3 runs with identical inputs (code). The first run casts inputs to double precision and obtains the output reference using the sequential algorithm. Next, we run both sequential and parallel algorithms with the same inputs cast to single precision and compare the results to the reference. The comparison is performed using the L_inf norm, which is just the maximum of per-element discrepancies.

With a 10000-element non-zero-centered input (for example, all elements equal to 1.0), the errors of the algorithms are 2.8e-4 (sequential) and 9.9e-5 (parallel). With zero-centered inputs (for example, standard Gaussian noise), the errors are 1.8e-5 (sequential) and 1.5e-5 (parallel).
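
A rough emulation of this comparison protocol is sketched below in plain Python. It is not the benchmark code referenced above: single precision is imitated by rounding every intermediate result through `struct`, the value gamma = 0.99 is an assumption (the text does not state which gamma was used), and all function names are hypothetical. The printed errors therefore need not match the figures quoted above.

```python
import struct


def f32(v):
    """Round a Python float (double precision) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", v))[0]


def sequential(x, gamma, rnd):
    y, running = [0.0] * len(x), 0.0
    for i in range(len(x) - 1, -1, -1):
        running = rnd(x[i] + rnd(gamma * running))
        y[i] = running
    return y


def staged(x, gamma, rnd):
    y, n, group = list(x), len(x), 1
    while group < n:
        for start in range(0, n, 2 * group):
            src = start + group
            if src < n:
                for i in range(start, src):
                    y[i] = rnd(y[i] + rnd(rnd(gamma ** (src - i)) * y[src]))
        group *= 2
    return y


n = 10000
gamma = f32(0.99)                 # same rounded gamma in all three runs
x = [1.0] * n
reference = sequential(x, gamma, lambda v: v)      # double-precision run
err_seq = max(abs(a - b) for a, b in zip(sequential(x, gamma, f32), reference))
err_par = max(abs(a - b) for a, b in zip(staged(x, gamma, f32), reference))
print(f"L_inf error, sequential: {err_seq:.1e}, staged: {err_par:.1e}")
```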

Speed-up

We tested 3 implementations of the algorithm with the same 100000-element input (code):

Sequential in PyTorch on CPU (as in REINFORCE) (Intel Xeon CPU, DGX-1)
Sequential in C++ on CPU (Intel Xeon CPU, DGX-1)
Parallel in CUDA (NVIDIA P-100, DGX-1)

The observed speed-ups are as follows:

PyTorch to C++: 387 times
PyTorch to CUDA: 36573 times
C++ to CUDA: 94 times

Ops-Space-Time Complexity

Assumptions:

Raising gamma to a power, multiplying the result by x, and adding y is counted as a single fused operation;
N is a power of two. When it is not, the parallel algorithm's complexity is the same as for N rounded up to the next power of two.

Under these assumptions, the sequential algorithm takes N operations and N time steps to complete. The parallel algorithm takes 0.5 * N * log2 N operations and can be completed in log2 N time steps if the parallelism is unrestricted.
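
The counts above can be tabulated with a few lines of Python. This is only arithmetic on the stated formulas under the stated assumptions; the function names are hypothetical, and a non-power-of-two N is rounded up to the next power of two as described.

```python
def sequential_cost(n):
    # N fused operations, executed one after another.
    return {"ops": n, "steps": n}


def parallel_cost(n):
    stages = (n - 1).bit_length()      # ceil(log2(n)) for n >= 1
    padded = 2 ** stages               # next power of two
    return {"ops": padded // 2 * stages, "steps": stages}


for n in (16, 1024, 100000):
    print(n, sequential_cost(n), parallel_cost(n))
```

For N = 16 this gives 16 operations in 16 steps for the sequential algorithm versus 32 operations in 4 steps for the parallel one.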

Both algorithms can be performed in place; hence their auxiliary space complexity is O(1).

In Other Frameworks

PyTorch

At the time of writing, PyTorch does not provide discounted cumsum functionality in its API. PyTorch RL code samples (e.g., REINFORCE) suggest computing returns in a loop over reward items. Since most RL algorithms do not require differentiating through returns, many code samples resort to the SciPy function listed below.

TensorFlow

TensorFlow provides the tf.scan API, which can be supplied with an appropriate lambda function implementing the discounted cumsum formula. Under the hood, however, tf.scan implements the traditional sequential algorithm.

SciPy

SciPy provides a scipy.signal.lfilter function for computing IIR filter response using the sequential algorithm, which can be used for the task at hand, as suggested in this StackOverflow response.

Troubleshooting

The package relies on custom CUDA code, which is compiled upon installation. The most common source of problems is failure to compile the code, which results in package installation failure. Check the following aspects of your environment:

pip version: needs to be the latest to support the custom installation pipeline defined by pyproject.toml. Update with pip3 install --upgrade pip
Python version: 3.6 is the minimum; however, very recent releases may have compatibility issues with PyTorch packages
PyTorch version: 1.5 is the minimum, the newer the better. Each version of PyTorch supports a range of CUDA driver/toolkit versions, which can be identified from here
CUDA toolkit version: run nvcc --version to find out what you have installed. CUDA toolkits are supported by a certain range of drivers, which can be checked here (Table 1)
CUDA driver: run nvidia-smi; the driver version is shown in the table
OS version (Linux): all of the above may depend on the version of your OS. On Ubuntu, use lsb_release -a to find out the version. Other distributions have their own ways.
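
Some of the checks above can be gathered in one place with a small script. The helper below is a hypothetical convenience, not part of the package: it uses only the Python standard library, reports the PyTorch version and the CUDA version PyTorch was built with if (and only if) torch is importable, and merely looks for the `nvcc` and `nvidia-smi` executables on the PATH without running them.

```python
import importlib.util
import platform
import shutil


def environment_report():
    report = {
        "python": platform.python_version(),
        "os": platform.platform(),
        "nvcc": shutil.which("nvcc"),              # None if not on PATH
        "nvidia-smi": shutil.which("nvidia-smi"),  # None if not on PATH
        "torch": None,
        "torch_cuda": None,
    }
    if importlib.util.find_spec("torch") is not None:
        import torch
        report["torch"] = torch.__version__
        report["torch_cuda"] = torch.version.cuda  # None for CPU-only builds
    return report


for key, value in environment_report().items():
    print(f"{key}: {value}")
```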

Citation

To cite this repository, use the following BibTeX:

@misc{obukhov2021torchdiscountedcumsum,
  author={Anton Obukhov},
  year=2021,
  title={Fast discounted cumulative sums in PyTorch},
  url={https://github.com/toshas/torch-discounted-cumsum},
  publisher={Zenodo},
  version={v1.1.0},
  doi={10.5281/zenodo.5302420},
  note={Version: 1.1.0, DOI: 10.5281/zenodo.5302420}
}


torch-discounted-cumsum's Issues

Output all ones?

I installed from source using
python3 setup.py build
python3 setup.py install
since I wasn't able to install from pip for unclear reasons (it hung,

root@iZj6c8jzt7ort860dx9mqpZ:~# pip3 install torch-discounted-cumsum                                                                                                                                                        
Looking in indexes: http://mirrors.cloud.aliyuncs.com/pypi/simple/                                                                                                                                                          
Collecting torch-discounted-cumsum                                                                                                                                                                                          
  Downloading http://mirrors.cloud.aliyuncs.com/pypi/packages/51/37/756d9808dfbc21291bf554b5874db0d9e5d5ffd6b1b4f976d1ef18305968/torch_discounted_cumsum-1.0.2.tar.gz (184 kB)                                              
     |________________________________| 184 kB 495 kB/s                                                                                                                                                                     
  Installing build dependencies ... -

).... and anyway, I got the output all-ones for your example, which doesn't seem right.

Fri Apr 16 15:52:34 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.32.00    Driver Version: 455.32.00    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:3D:00.0 Off |                    0 |
| N/A   29C    P0    39W / 250W |  26401MiB / 32510MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  On   | 00000000:40:00.0 Off |                    0 |
| N/A   29C    P0    39W / 250W |  19437MiB / 32510MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-PCIE...  On   | 00000000:41:00.0 Off |                    0 |
| N/A   59C    P0   202W / 250W |  28287MiB / 32510MiB |     86%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-PCIE...  On   | 00000000:B1:00.0 Off |                    0 |
| N/A   25C    P0    22W / 250W |      4MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
de-74279-k2-dev-2-0331181900-7b69767657-72fhf:torch-discounted-cumsum: python3
Python 3.8.0 (default, Oct 28 2019, 16:14:01)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> from torch_discounted_cumsum import discounted_cumsum_right
>>> N = 8
>>> gamma = 0.99
>>> x = torch.ones(1, N).cuda()
>>> y = discounted_cumsum_right(x, gamma)
>>> y.cpu()
tensor([[1., 1., 1., 1., 1., 1., 1., 1.]])

nvcc fatal : Unknown option '-generate-dependencies-with-compile'

The build fails with this

Torch 1.8.1
ubuntu 18.04
cuda 10.0

[POLL] Should the package switch install-time to run-time native code compilation?

Currently, the package compiles native code (CUDA kernels) upon package installation. This saves a few seconds at run time, as no compilation happens when the user code starts. However, it hurts when the package is installed from a different environment or machine than the one the code will actually run on. This is the case in most cluster environments, where packages may be installed from a GPU-less login node rather than the actual machine with GPU and CUDA support. In this case, package installation may fail.

Should compilation rather be performed at run time?

👍 - Move compilation to run time
👎 - Keep as is

Documentation

It would be good if you could document the behavior of this function more precisely.
E.g., what is the difference between left and right? (Documentation with equations would be ideal.)
It would also be nice if you could document the requirements of the interface in terms of shapes, dtypes, etc.

Problems installing the package

While trying to install this package, I've had a couple problems:

1. I had PyTorch installed through conda, pip was trying to install this torch version in an isolated env to build the package, couldn't find it and was hanging in an infinite loop. Solved with --no-build-isolation, wonder if there's a way to do this in the repo files instead of relying on the user?
2. This exception is thrown when trying to import the package:

FileNotFoundError                         Traceback (most recent call last)
Input In [2], in <module>
----> 1 import torch_discounted_cumsum

# uninteresting stack trace...

FileNotFoundError: [Errno 2] No such file or directory: '.../lib/python3.9/site-packages/torch_discounted_cumsum/discounted_cumsum_cuda.cpp'

It seems the CUDA source file is not kept in the installed package; I'm not sure why.

Implicitly building with mismatching pytorch version

Hello,

I had the same issue as #8 part (2) when installing with not-latest PyTorch versions (1.8.1, 1.9.1, etc).

Test environments: Ubuntu 20.04 + 18.04, python 3.7 + 3.8, GCC 7.5 + 9.4, PyTorch 1.8.1 + 1.9.1.

After some digging, I found that an
'ImportError: .../lib/python3.7/site-packages/torch_discounted_cumsum_cpu.cpython-37m-x86_64-linux-gnu.so' is suppressed by the try-except statement, which redirects extension loading to nonexistent sources, hence the errors.

A little more digging reveals that the ImportError, which complains about an undefined symbol, is caused by building the extension against a mismatching PyTorch version (the latest release, 1.11, at the time of writing).

When turning on --verbose with pip, the head of logs shows the following:

Collecting torch-discounted-cumsum
Downloading torch_discounted_cumsum-1.1.0.tar.gz (185 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 185.4/185.4 KB 2.0 MB/s eta 0:00:00
Running command pip subprocess to install build dependencies
Collecting setuptools>=38.2.5
Using cached setuptools-62.0.0-py3-none-any.whl (1.1 MB)
Collecting wheel
Using cached wheel-0.37.1-py2.py3-none-any.whl (35 kB)
Collecting torch>=1.5
Using cached torch-1.11.0-cp37-cp37m-manylinux1_x86_64.whl (750.6 MB)
...

It shows pip is using a freshly pulled PyTorch as opposed to the existing PyTorch installation, therefore incurring the undefined symbol issue when importing the so module.

I'd suggest removing the step to automatically pull the latest PyTorch during installation.

A temporary solution: adding the --no-build-isolation flag (and --no-cache-dir if you've failed once) can prevent pip from pulling a new PyTorch package to build against, assuming you already have all requirements properly installed.

(which coincides with what #8 did in part 1 🤣)

learned gamma

@toshas Hi Anton! Thank you for this library! We recently found it to be very useful for improving on attention architectures https://github.com/lucidrains/token-shift-gpt

I was wondering what it would take to add a feature where the gamma could be learned, and beyond just a single scalar (each channel or even across tokens can get their own individual gamma)

Phil

Feature request: specify dimensions

Hello,

I would like to request a dim parameter for the discounted cumsum functions, to specify the dimension along which the sum is taken.

The use case would be, say, a multidimensional tensor where only the last dimension is time, so I would naturally cumsum along dim=-1.
