
glow's Introduction

PyTorch Logo


PyTorch is a Python package that provides two high-level features:

  • Tensor computation (like NumPy) with strong GPU acceleration
  • Deep neural networks built on a tape-based autograd system

You can reuse your favorite Python packages such as NumPy, SciPy, and Cython to extend PyTorch when needed.

Our trunk health (Continuous Integration signals) can be found at hud.pytorch.org.

More About PyTorch

Learn the basics of PyTorch

At a granular level, PyTorch is a library that consists of the following components:

Component              Description
torch                  A Tensor library like NumPy, with strong GPU support
torch.autograd         A tape-based automatic differentiation library that supports all differentiable Tensor operations in torch
torch.jit              A compilation stack (TorchScript) to create serializable and optimizable models from PyTorch code
torch.nn               A neural networks library deeply integrated with autograd, designed for maximum flexibility
torch.multiprocessing  Python multiprocessing, but with magical memory sharing of torch Tensors across processes. Useful for data loading and Hogwild training
torch.utils            DataLoader and other utility functions for convenience

Usually, PyTorch is used either as:

  • A replacement for NumPy to use the power of GPUs.
  • A deep learning research platform that provides maximum flexibility and speed.
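A minimal sketch of both uses (assuming a working PyTorch install; the GPU transfer only happens when CUDA is available):

import numpy as np
import torch

# Use it like NumPy, optionally on the GPU:
a = torch.from_numpy(np.arange(12.0).reshape(3, 4))
device = "cuda" if torch.cuda.is_available() else "cpu"
a = a.to(device)

# Use it as a deep learning research platform:
w = torch.randn(4, 2, device=device, requires_grad=True)
loss = (a.float() @ w).pow(2).sum()
loss.backward()  # gradients are accumulated into w.grad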

Elaborating Further:

A GPU-Ready Tensor Library

If you use NumPy, then you have used Tensors (a.k.a. ndarray).

Tensor illustration

PyTorch provides Tensors that can live either on the CPU or the GPU and accelerates the computation by a huge amount.

We provide a wide variety of tensor routines to accelerate and fit your scientific computation needs, such as slicing, indexing, mathematical operations, linear algebra, and reductions. And they are fast!
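For example, the same routines work on CPU and GPU tensors alike (a small illustrative snippet, not an exhaustive list):

import torch

a = torch.randn(1000, 1000)
if torch.cuda.is_available():
    a = a.to("cuda")         # the calls below are unchanged

b = a[:, :500]               # slicing / indexing
c = b @ b.T                  # linear algebra (matrix multiply)
d = c.relu().mean(dim=0)     # elementwise math and a reduction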

Dynamic Neural Networks: Tape-Based Autograd

PyTorch has a unique way of building neural networks: using and replaying a tape recorder.

Most frameworks such as TensorFlow, Theano, Caffe, and CNTK have a static view of the world. One has to build a neural network and reuse the same structure again and again. Changing the way the network behaves means that one has to start from scratch.

With PyTorch, we use a technique called reverse-mode auto-differentiation, which allows you to change the way your network behaves arbitrarily with zero lag or overhead. Our inspiration comes from several research papers on this topic, as well as current and past work such as torch-autograd, autograd, Chainer, etc.

While this technique is not unique to PyTorch, it's one of the fastest implementations of it to date. You get the best of speed and flexibility for your crazy research.
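As a rough illustration of what "dynamic" means here, the number of layers applied below is decided at run time, and autograd simply records whatever actually executed on this pass (toy example with illustrative shapes):

import torch

w = torch.randn(10, 10, requires_grad=True)
h = torch.randn(10)

# Data-dependent control flow: the graph is rebuilt on every forward pass.
depth = int(torch.randint(1, 5, (1,)).item())
for _ in range(depth):
    h = torch.tanh(w @ h)

h.sum().backward()
print(w.grad.shape)  # torch.Size([10, 10])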

Dynamic graph

Python First

PyTorch is not a Python binding into a monolithic C++ framework. It is built to be deeply integrated into Python. You can use it naturally like you would use NumPy / SciPy / scikit-learn etc. You can write your new neural network layers in Python itself, using your favorite libraries and use packages such as Cython and Numba. Our goal is to not reinvent the wheel where appropriate.

Imperative Experiences

PyTorch is designed to be intuitive, linear in thought, and easy to use. When you execute a line of code, it gets executed. There isn't an asynchronous view of the world. When you drop into a debugger or receive error messages and stack traces, understanding them is straightforward. The stack trace points to exactly where your code was defined. We hope you never spend hours debugging your code because of bad stack traces or asynchronous and opaque execution engines.

Fast and Lean

PyTorch has minimal framework overhead. We integrate acceleration libraries such as Intel MKL and NVIDIA (cuDNN, NCCL) to maximize speed. At the core, its CPU and GPU Tensor and neural network backends are mature and have been tested for years.

Hence, PyTorch is quite fast — whether you run small or large neural networks.

The memory usage in PyTorch is extremely efficient compared to Torch or some of the alternatives. We've written custom memory allocators for the GPU to make sure that your deep learning models are maximally memory efficient. This enables you to train bigger deep learning models than before.

Extensions Without Pain

Writing new neural network modules, or interfacing with PyTorch's Tensor API, is designed to be straightforward, with minimal abstractions.

You can write new neural network layers in Python using the torch API or your favorite NumPy-based libraries such as SciPy.
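For instance, a custom layer can route its forward pass through NumPy and still participate in autograd; the sketch below uses torch.autograd.Function, and NumpyExp is just an illustrative name:

import numpy as np
import torch

class NumpyExp(torch.autograd.Function):
    # A toy layer whose forward pass is computed with NumPy.
    @staticmethod
    def forward(ctx, inp):
        out = torch.from_numpy(np.exp(inp.detach().cpu().numpy())).to(inp.device)
        ctx.save_for_backward(out)
        return out

    @staticmethod
    def backward(ctx, grad_output):
        (out,) = ctx.saved_tensors
        return grad_output * out  # d/dx exp(x) = exp(x)

x = torch.randn(5, dtype=torch.float64, requires_grad=True)
NumpyExp.apply(x).sum().backward()  # x.grad now holds exp(x)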

If you want to write your layers in C/C++, we provide a convenient extension API that is efficient and with minimal boilerplate. No wrapper code needs to be written. You can see a tutorial here and an example here.

Installation

Binaries

Commands to install binaries via Conda or pip wheels are on our website: https://pytorch.org/get-started/locally/

NVIDIA Jetson Platforms

Python wheels for NVIDIA's Jetson Nano, Jetson TX1/TX2, Jetson Xavier NX/AGX, and Jetson AGX Orin are provided here, and the L4T container is published here.

They require JetPack 4.2 and above, and @dusty-nv and @ptrblck are maintaining them.

From Source

Prerequisites

If you are installing from source, you will need:

  • Python 3.8 or later (for Linux, Python 3.8.1+ is needed)
  • A compiler that fully supports C++17, such as clang or gcc (gcc 9.4.0 or newer is required)

We highly recommend installing an Anaconda environment. You will get a high-quality BLAS library (MKL) and you get controlled dependency versions regardless of your Linux distro.

NVIDIA CUDA Support

If you want to compile with CUDA support, select a supported version of CUDA from our support matrix, then install the following:

Note: You can refer to the cuDNN Support Matrix for the cuDNN versions compatible with the various supported CUDA versions, CUDA drivers, and NVIDIA hardware.

If you want to disable CUDA support, export the environment variable USE_CUDA=0. Other potentially useful environment variables may be found in setup.py.

If you are building for NVIDIA's Jetson platforms (Jetson Nano, TX1, TX2, AGX Xavier), instructions to install PyTorch for Jetson Nano are available here.

AMD ROCm Support

If you want to compile with ROCm support, install:

  • AMD ROCm 4.0 or above

Note that ROCm is currently supported only on Linux systems.

If you want to disable ROCm support, export the environment variable USE_ROCM=0. Other potentially useful environment variables may be found in setup.py.

Intel GPU Support

If you want to compile with Intel GPU support, follow these instructions.

If you want to disable Intel GPU support, export the environment variable USE_XPU=0. Other potentially useful environment variables may be found in setup.py.

Install Dependencies

Common

conda install cmake ninja
# Run this command from the PyTorch directory after cloning the source code using the "Get the PyTorch Source" section below
pip install -r requirements.txt

On Linux

conda install intel::mkl-static intel::mkl-include
# CUDA only: Add LAPACK support for the GPU if needed
conda install -c pytorch magma-cuda121  # or the magma-cuda* that matches your CUDA version from https://anaconda.org/pytorch/repo

# (optional) If using torch.compile with inductor/triton, install the matching version of triton
# Run from the pytorch directory after cloning
# For Intel GPU support, please explicitly `export USE_XPU=1` before running command.
make triton

On macOS

# Add this package on intel x86 processor machines only
conda install intel::mkl-static intel::mkl-include
# Add these packages if torch.distributed is needed
conda install pkg-config libuv

On Windows

conda install intel::mkl-static intel::mkl-include
# Add these packages if torch.distributed is needed.
# Distributed package support on Windows is a prototype feature and is subject to changes.
conda install -c conda-forge libuv=1.39

Get the PyTorch Source

git clone --recursive https://github.com/pytorch/pytorch
cd pytorch
# if you are updating an existing checkout
git submodule sync
git submodule update --init --recursive

Install PyTorch

On Linux

If you would like to compile PyTorch with new C++ ABI enabled, then first run this command:

export _GLIBCXX_USE_CXX11_ABI=1

If you're compiling for AMD ROCm then first run this command:

# Only run this if you're compiling for ROCm
python tools/amd_build/build_amd.py

Install PyTorch

export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
python setup.py develop

Aside: If you are using Anaconda, you may experience an error caused by the linker:

build/temp.linux-x86_64-3.7/torch/csrc/stub.o: file not recognized: file format not recognized
collect2: error: ld returned 1 exit status
error: command 'g++' failed with exit status 1

This is caused by ld from the Conda environment shadowing the system ld. You should use a newer version of Python that fixes this issue. The recommended Python version is 3.8.1+.

On macOS

python3 setup.py develop

On Windows

Choose Correct Visual Studio Version.

PyTorch CI uses Visual C++ BuildTools, which come with Visual Studio Enterprise, Professional, or Community Editions. You can also install the build tools from https://visualstudio.microsoft.com/visual-cpp-build-tools/. The build tools do not come with Visual Studio Code by default.

If you want to build legacy Python code, please refer to Building on legacy code and CUDA.

CPU-only builds

In this mode PyTorch computations will run on your CPU, not your GPU.

conda activate
python setup.py develop

Note on OpenMP: The desired OpenMP implementation is Intel OpenMP (iomp). In order to link against iomp, you'll need to manually download the library and set up the building environment by tweaking CMAKE_INCLUDE_PATH and LIB. The instruction here is an example for setting up both MKL and Intel OpenMP. Without these configurations for CMake, Microsoft Visual C OpenMP runtime (vcomp) will be used.

CUDA based build

In this mode PyTorch computations will leverage your GPU via CUDA for faster number crunching.

NVTX is needed to build PyTorch with CUDA. NVTX is part of the CUDA distribution, where it is called "Nsight Compute". To install it onto an already installed CUDA setup, run the CUDA installer again and check the corresponding checkbox. Make sure that CUDA with Nsight Compute is installed after Visual Studio.

Currently, VS 2017 / 2019 and Ninja are supported as CMake generators. If ninja.exe is detected in PATH, then Ninja will be used as the default generator; otherwise, VS 2017 / 2019 will be used.
If Ninja is selected as the generator, the latest MSVC will be selected as the underlying toolchain.

Additional libraries such as Magma, oneDNN (a.k.a. MKLDNN or DNNL), and Sccache are often needed. Please refer to the installation-helper to install them.

You can refer to the build_pytorch.bat script for some other environment variable configurations.


:: Set the environment variables after you have downloaded and unzipped the mkl package,
:: else CMake would throw an error as `Could NOT find OpenMP`.
set CMAKE_INCLUDE_PATH={Your directory}\mkl\include
set LIB={Your directory}\mkl\lib;%LIB%

:: Read the content in the previous section carefully before you proceed.
:: [Optional] If you want to override the underlying toolset used by Ninja and Visual Studio with CUDA, please run the following script block.
:: "Visual Studio 2019 Developer Command Prompt" will be run automatically.
:: Make sure you have CMake >= 3.12 before you do this when you use the Visual Studio generator.
set CMAKE_GENERATOR_TOOLSET_VERSION=14.27
set DISTUTILS_USE_SDK=1
for /f "usebackq tokens=*" %i in (`"%ProgramFiles(x86)%\Microsoft Visual Studio\Installer\vswhere.exe" -version [15^,17^) -products * -latest -property installationPath`) do call "%i\VC\Auxiliary\Build\vcvarsall.bat" x64 -vcvars_ver=%CMAKE_GENERATOR_TOOLSET_VERSION%

:: [Optional] If you want to override the CUDA host compiler
set CUDAHOSTCXX=C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.27.29110\bin\HostX64\x64\cl.exe

python setup.py develop

Adjust Build Options (Optional)

You can optionally adjust the configuration of CMake variables (without building first) by doing the following. For example, adjusting the pre-detected directories for cuDNN or BLAS can be done with such a step.

On Linux

export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
python setup.py build --cmake-only
ccmake build  # or cmake-gui build

On macOS

export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
MACOSX_DEPLOYMENT_TARGET=10.9 CC=clang CXX=clang++ python setup.py build --cmake-only
ccmake build  # or cmake-gui build

Docker Image

Using pre-built images

You can also pull a pre-built docker image from Docker Hub and run it with docker v19.03+.

docker run --gpus all --rm -ti --ipc=host pytorch/pytorch:latest

Please note that PyTorch uses shared memory to share data between processes, so if torch multiprocessing is used (e.g. for multithreaded data loaders), the default shared memory segment size that the container runs with may not be enough; you should increase the shared memory size with either the --ipc=host or --shm-size command line options to nvidia-docker run.

Building the image yourself

NOTE: Must be built with a docker version > 18.06

The Dockerfile is supplied to build images with CUDA 11.1 support and cuDNN v8. You can pass PYTHON_VERSION=x.y make variable to specify which Python version is to be used by Miniconda, or leave it unset to use the default.

make -f docker.Makefile
# images are tagged as docker.io/${your_docker_username}/pytorch

You can also pass the CMAKE_VARS="..." environment variable to specify additional CMake variables to be passed to CMake during the build. See setup.py for the list of available variables.

make -f docker.Makefile

Building the Documentation

To build documentation in various formats, you will need Sphinx and the readthedocs theme.

cd docs/
pip install -r requirements.txt

You can then build the documentation by running make <format> from the docs/ folder. Run make to get a list of all available output formats.

If you get a katex error, run npm install katex. If it persists, try npm install -g katex.

Note: if you installed nodejs with a different package manager (e.g., conda) then npm will probably install a version of katex that is not compatible with your version of nodejs and doc builds will fail. A combination of versions that is known to work is [email protected] and [email protected]. To install the latter with npm you can run npm install -g [email protected]

Previous Versions

Installation instructions and binaries for previous PyTorch versions may be found on our website.

Getting Started

Three pointers to get you started:

Resources

Communication

Releases and Contributing

Typically, PyTorch has three minor releases a year. Please let us know if you encounter a bug by filing an issue.

We appreciate all contributions. If you are planning to contribute back bug-fixes, please do so without any further discussion.

If you plan to contribute new features, utility functions, or extensions to the core, please first open an issue and discuss the feature with us. Sending a PR without discussion might end up resulting in a rejected PR because we might be taking the core in a different direction than you might be aware of.

To learn more about making a contribution to PyTorch, please see our Contribution page. For more information about PyTorch releases, see the Release page.

The Team

PyTorch is a community-driven project with several skillful engineers and researchers contributing to it.

PyTorch is currently maintained by Soumith Chintala, Gregory Chanan, Dmytro Dzhulgakov, Edward Yang, and Nikita Shulga with major contributions coming from hundreds of talented individuals in various forms and means. A non-exhaustive but growing list needs to mention: Trevor Killeen, Sasank Chilamkurthy, Sergey Zagoruyko, Adam Lerer, Francisco Massa, Alykhan Tejani, Luca Antiga, Alban Desmaison, Andreas Koepf, James Bradbury, Zeming Lin, Yuandong Tian, Guillaume Lample, Marat Dukhan, Natalia Gimelshein, Christian Sarofeen, Martin Raison, Edward Yang, Zachary DeVito.

Note: This project is unrelated to hughperkins/pytorch with the same name. Hugh is a valuable contributor to the Torch community and has helped with many things Torch and PyTorch.

License

PyTorch has a BSD-style license, as found in the LICENSE file.

glow's People

Contributors

842974287, artemrakhov-glow, aturetsk, aweis, bertmaher, compnerd, gcatron, hegemanjwh2, jackm321, jfix71, khabinov, lliu315gt, mciprian13, mikekgfb, mjanderson09, mortzur, nadavrot, nickgg, omromano, opti-mix, qcolombet, qxy11, r-barnes, rdzhabarov, shajrawi, tlepley-cadence, tracelogfb, vuzelac-cadence, zrphercule, zrphercule2

glow's Issues

Compilation time and LLVM dependency reduction

LLVM takes quite a while to compile, does the entire library need to be built?

It looks like these are the dependencies, and I'm wondering if it's possible to reduce the compilation time:

grep -r "include \"llvm" include/ src/ | sed 's/.*#/#/g' | sort | uniq
#include "llvm/Support/ErrorHandling.h"
#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/Hashing.h"
#include "llvm/ADT/PointerUnion.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/ADT/StringRef.h"
#include "llvm/Support/Casting.h"
#include "llvm/Support/CommandLine.h"
#include "llvm/Support/Debug.h"
#include "llvm/Support/NativeFormatting.h"
#include "llvm/Support/raw_ostream.h"

shareBuffers fails to merge chain of copies

As found in #1209, our shareBuffers optimization fails to merge chains of copies.
The problem is illustrated with the test case in #1237.

Copy/pasting the analysis made in #1209:
In a nutshell, our buffer optimization only looks at merging one segment at a time. In the case of the lowering of the load node, we have for the first time potentially more than one copy on a chain to a use (e.g., with the lowering of reshape(load)).
Now, to illustrate the problem with chains of copies, let us consider:
1: a = copy b
2: c = copy a
3: = c
We have the following segments:
a = (1, 2); b = (-,1); c = (2, 3)

First we merge c and a:
1: a = copy b
2: a = copy a
3: = a
Now the segments look like:
a = (1, 2), (2, 3); b = (-,1)

When considering merging a with b, we only consider updating the segment of a that encloses the segment of b, thus (1,2).
The updated code looks like:
1: b = copy b
2: a = copy b
3: = a
We don't update the old c segment and end up with a copy in 2 now.

Interestingly, if you delete the useless copies as you go, you actually end up with incorrect code, e.g., for the previous example:
1: b = copy b
---2: a = copy b---- deleted
3: = a <--- a never defined

This is because when we update the instruction, we don't preserve the last definition of the outgoing variable and given we don't update the uses not in the segment, they potentially reference an undefined variable.

It looks like it works just by accident but maybe it was intended this way?

Long story short we need to lift the one-segment-at-a-time limitation (maybe by merging the segments?).

[CI jobs] More tests need to run in CI

Glow contains a variety of validation tests:

  1. Unit tests: these tests are currently run as part of CI and verify inference as well as training
  2. Larger tests that exercise training/backpropagation (cifar10, mnist, ptb) and typically take a very long time to execute. The execution time can be tuned by introducing a dynamic number of iterations if needed.
  3. Inference on well-known pre-trained models: resnet50, vgg19, squeezenet, shufflenet, etc. Model definitions in ONNX or Caffe2 format are required for those tests. You can check how those models are downloaded for ONNX and Caffe2. These tests do not assert current results against expected ones and are used in manual validation. Sample input/output looks like:
     Model: resnet50/predict_net.pb
     File: tests/images/imagenet/cat_285.png Result:285

The goal here is to introduce a script which allows running (3) on CI.
A few things to consider before implementing:

  • Measure how much time it takes to download weights/models and run validation
  • Depending on time, enable validation on every PR or as part of the daily travis cron job
  • Create a script which makes the test fail if the output mismatches the expected output (run this on CI)

Important scenarios include AOT examples, which should be validated on the CI in a similar way as we propose to do for (3). AOT examples also require downloading weights and models.

Integrate IWYU

Putting this up as a "wishlist" issue: Since we already have clang-tidy, let's also integrate Include-What-You-Use. LLVM has a lot of different headers, so having a tool to tell us what to include where will be healthy for the future and avoid things like cyclic dependencies or other badness that can arise from C++'s lack of a module system. It also avoids the kind of thing where you remove one header and the whole card-house of mismanaged includes breaks down. I managed to integrate it with my last LLVM project. It's excruciatingly slow, but would be great to have as a separate build rule to run over night or before lunch once in a while.

Need to canonicalize or optimize high-dim concat to concat+transpose

The picture below depicts a concat node that joins nodes on dimension number 1. The IR that we generate for this code is inefficient for two reasons. First, we can't optimize the operator that writes the result because the result is scattered across the 2nd dimension (dim zero is the first). And second, we emit a sequence of insert_tensor instructions that process the tensor several times, invalidating the cache. A much better way would be to represent this as a dim-0 concat followed by a transpose.

Design question: I am not sure if this should be the canonical representation, the only representation or simply a target specific optimization.

(screenshot of the concat graph omitted)

Refactor Backend interface

The glow::Backend interface is currently storing a lot of state, and it needs to be refactored. We want to clearly separate the hardware abstraction layer from the different kinds of state we're storing.

Over in #1176 we want to compile multiple functions to run on different CPUs. In the future, we also want to support multiple GPUs and other accelerators, so we need a backend interface that separates state related to different functions and different execution units.

I'll elaborate on a design in the comments below.

Wrong scheduling because of the lack of memory dependencies

Our graph does not model memory dependencies explicitly.
As a result, if a variable is written and read on different
data paths, the scheduler may be unlucky and we may read the
wrong version of the variable.

Essentially, if you have:
     A
    / \
write read

The scheduler can produce either:
write A
read A
or
read A
write A

Obviously this does not produce the same output.
Glow's intended semantics are that the write should happen last, but with
our current algorithm, it is actually possible to fool the
scheduler into not doing that.

I'm going to send a PR to show how we can fool the scheduler.

image-classifier behaves differently with one or several images for densenet121

I noticed that the image-classifier returns different results if invoked with one file at a time versus with all the files at once, and I was wondering if this is expected.

To reproduce:
$ ./bin/image-classifier tests/images/imagenet/* -image_mode=0to1 -m=densenet121/model.onnx
$ ./bin/image-classifier tests/images/imagenet/cat_285.png -image_mode=0to1 -m=densenet121/model.onnx

Result:
With all the files at once:
Model: densenet121/model.onnx
File: tests/images/imagenet/cat_285.png Result:281
File: tests/images/imagenet/dog_207.png Result:207
File: tests/images/imagenet/zebra_340.png Result:340

With one at a time:
Model: densenet121/model.onnx
File: tests/images/imagenet/cat_285.png Result:674
Model: densenet121/model.onnx
File: tests/images/imagenet/dog_207.png Result:674
Model: densenet121/model.onnx
File: tests/images/imagenet/zebra_340.png Result:674

Also when there is only one file given, the process takes much longer to produce a result.

Is this expected?

test target dependencies are incomplete

One thing I noticed: the dependencies are not correct in the CMake setup. In effect, if you run ninja test without building glow first, it will fail because the tests don't depend on glow being built. This is problematic, as we may update glow, forget to rebuild, run the tests, and think that everything is okay whereas we would actually be running the old code.

Should lower/optimize behave differently based on the compilation mode?

As of today our API takes the CompilationMode for both the lowering and optimization phase (glow::lower and glow::optimize).
Except for one thing in the lowering, which I will come back to later, this argument does not change the behavior of these functions, and thus I was wondering if we should pass it at all.
My way of thinking is if we don't use it, let's not pass it and if/when we need it we can add it.

Now going back to the only user of this argument in lowering.
The lowering of batch normalization is the only user of that argument, but I believe we could avoid that use and that would be a better design. Basically, we lower BN differently between inference and training, whereas I think the right thing would be to have a different representation when we differentiate BN. In essence, what I am suggesting for this use case is to introduce a new node for mean and variance normalization (i.e., for what we special case with the training mode during lowering.) In other words, differentiation for BN would produce BNGrad(BN(meanVarNorm)) and that would completely eliminate the need for a special lowering.

All in all, as of today, I don't see a reason why we pass this mode.

What do people think?

[Feature request] Tensor attributes

It would be useful to have more attributes on tensor objects.
Specifically, whether a tensor is an input/output for the function and whether it's constant or mutable during inference, etc.
Alternatively this could be added as a utility function that backends can invoke.

Add support for FP16 and bfloat

Working with accelerators, we are going to need support for 16-bit floating point types.

  • FP16 is the binary16 format defined in IEEE 754-2008.
  • bfloat is the TensorFlow floating point type with the same exponent range as a float.
Type                                         binary16  bfloat  binary32
k, storage width in bits                     16        16      32
w, exponent field width in bits              5         8       8
t, trailing significand field width in bits  10        7       23

We should at a minimum add these types to glow::ElemKind so we can represent low-precision tensors in these types. Ideally, also add interpreter support.
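For reference, the numeric properties of the two 16-bit formats can be inspected from Python (shown with PyTorch purely as an illustration; the glow work itself is about adding the corresponding glow::ElemKind entries):

import torch

for dtype in (torch.float16, torch.bfloat16, torch.float32):
    info = torch.finfo(dtype)
    print(dtype, "bits:", info.bits, "max:", info.max, "eps:", info.eps)
# float16: narrow range (max ~65504) but 10 trailing significand bits;
# bfloat16: float32's exponent range but only 7 trailing significand bits.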

Separate HAL, temporary state, and compiled functions

This is part of the Backend refactoring in #1227.

The goal is to have the Backend sub-classes represent the hardware abstraction layer. The instances hold no mutable state, and all actions are performed through const virtual functions.

Proposed interface:

class Backend {
  virtual bool transformPreLowering(Function *F, CompilationMode mode) const;
  virtual bool transformPostLowering(Function *F, CompilationMode mode) const;
  virtual bool isOpSupported(Kinded::Kind opKind, ElemKind elementTy) const;
  virtual bool shouldLower(const Node *N) const;
  virtual bool shouldShareBuffers() const;

  /// Compile an IR function, consuming it in the process.
  /// Returns a new object representing the compiled function.
  virtual unique_ptr<CompiledFunction> compileFunction(unique_ptr<IRFunction>) const;
};

class CompiledFunction {
  virtual void execute() = 0;
};

Refactoring steps:

  • Change transformPreLowering and friends to be const.
  • Create a CompiledFunction sub-class per backend which holds the result of compiling a function.
  • Move remaining temporary compilation state out of the Backend sub-classes into a locally defined class that only exists during compileFunction().
  • Remove the IRFunction pointer from backend instances and the createBackend() function.

At this stage, the CompiledFunction sub-classes will still reference the original module, and their execute method can modify variables in the module.

Build failure on MacOS

Hi folks,
I followed the README and tried to build Glow but got the following error:

FAILED: lib/Graph/CMakeFiles/Graph.dir/Graph.cpp.o
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++ -DGLOW_WITH_OPENCL=1 -I../include -Iinclude -I. -isystem /usr/local/opt/llvm/include -Wall -fno-exceptions -fno-rtti -g -fno-omit-frame-pointer -O0 -fvisibility=hidden -fvisibility-inlines-hidden -std=c++14 -MD -MT lib/Graph/CMakeFiles/Graph.dir/Graph.cpp.o -MF lib/Graph/CMakeFiles/Graph.dir/Graph.cpp.o.d -o lib/Graph/CMakeFiles/Graph.dir/Graph.cpp.o -c ../lib/Graph/Graph.cpp
../lib/Graph/Graph.cpp:871:19: error: invalid operands to binary expression ('llvm::StringRef' and 'const char *')
      (namePrefix + ".initial_state").str(), Variable::VisibilityKind::Public,
       ~~~~~~~~~~ ^ ~~~~~~~~~~~~~~~~

My build command is cmake -G Ninja .. -DCMAKE_PREFIX_PATH=/usr/local/opt/llvm

Any idea what had gone wrong? Thanks!

Use deterministic PRNG seeds for testing

Over in #1065, we have an example of a unit test that fails intermittently depending on the pseudo-random numbers delivered by nextRand(). This function uses a static PRNG, so while it is seeded deterministically, reordering tests will give them different pseudo-random numbers.

The PRNG state is also shared between threads, so in a multithreaded environment the pseudo-random numbers will be non-deterministic.

One way of solving this problem would be to make the PRNG state a member of the Module class and change the initXavier() and randomize() methods of Tensor to take an explicit PRNG state parameter.
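To illustrate the idea (in Python rather than glow's C++, with torch.Generator standing in for the proposed per-Module PRNG state): each caller owns and seeds its own generator instead of sharing hidden global state, so test order and threading no longer affect the numbers drawn.

import torch

gen = torch.Generator().manual_seed(1234)   # explicit, per-test PRNG state
noise = torch.randn(4, 4, generator=gen)    # reproducible regardless of
other = torch.randn(4, 4, generator=gen)    # what other tests ran before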

Add Autoencoder example

An autoencoder is a very basic neural network that takes in some input, encodes it into a lower-dimensional representation, and then decodes it back into something that should ideally be identical to the input. We should try to implement such an autoencoder with Glow. An extension of this would be to make a variational autoencoder.
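For reference, the kind of network meant here, sketched in PyTorch (hypothetical sizes; a Glow version would build an equivalent graph out of its own fully connected and activation nodes):

import torch
import torch.nn as nn

# A minimal fully connected autoencoder: 784 -> 32 -> 784.
class AutoEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(784, 32), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(32, 784), nn.Sigmoid())

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AutoEncoder()
x = torch.rand(16, 784)                      # a batch of flattened images
loss = nn.functional.mse_loss(model(x), x)   # reconstruction loss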

[Broadcasting] Concat inputs scale with the broadcast size.

We represent a broadcast as a series of concats. For example if we want to broadcast a tensor A with dims = {1} to a tensor B with dims = {10}, B is a ConcatNode with A repeatedly input 10 times.

This works nicely because we do not need a BroadcastNode. Additionally, we IRGen this ConcatNode with repeated inputs into a single InsertTensorInst with some count value, i.e. the InsertTensorInst does not have its input count scale with the broadcast size.

However, this scaling issue does occur at the Node level. If the broadcast size is large, for example if the destination tensor B has dims = {2048}, then B will be a ConcatNode with 2048 inputs. Not great.

To improve this, one option I am considering is to add a vector member to ConcatNode that represents how many times to concatenate each input. For example, instead of the above ConcatNode B having A input 2048 times, it has A input once, as well as a new "count" member vector {2048}, signaling that A should be inserted 2048 times.

If we wanted to also Concat some other Node C 10 times at the end, then this "count" member vector would be {2048, 10}, and B would have A and C input once each.

This would mean ConcatNode more closely matches our InsertTensorInst, which ConcatNode is IRGen'd into anyway. However, it adds some complexity to the ConcatNode.

What does everyone think?

Support GlobalAveragePool

ONNX implementation of squeezenet uses GlobalAveragePool. Here's the spec.

GlobalAveragePool could be represented as a simple AveragePool with kernel size equal to the size of the image.

One way to support it is to "lower" it at the Graph creation stage, i.e., do something similar to ChannelShuffle (commits: 1 2).

Alternatively, we could create GlobalAveragePoolNode and lower it at normal lowering stage, but I don't see any benefits of doing so.

Also, AveragePool in ONNX spec supports averaging across any-shaped N-dimensional box. Currently we support only square boxes. This may or may not be an issue for lowering GlobalAveragePool.
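The equivalence is easy to sanity-check numerically (PyTorch used here only as a reference implementation):

import torch
import torch.nn.functional as F

x = torch.randn(2, 3, 7, 7)  # NCHW

# GlobalAveragePool == AveragePool with kernel size equal to the spatial dims.
global_avg = x.mean(dim=(2, 3), keepdim=True)
as_avg_pool = F.avg_pool2d(x, kernel_size=(x.shape[2], x.shape[3]))

assert torch.allclose(global_avg, as_avg_pool)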

Are size_t & uint64_t really necessary in the OpenCL kernels.cl?

First of all, NVIDIA's CUDA OpenCL doesn't support size_t.
Besides, when compiling the executable as a 32-bit application,
uint64_t is about to become another issue,
because in the kernel we have:

typedef struct {
  cl_uint64_t n; // Number of samples
  cl_uint64_t h; // Height
  cl_uint64_t w; // Width
  cl_uint64_t c; // Number of channels
} ShapeNHWC;

but in C++

struct ShapeNCHW {
  size_t n; // Number of samples
  size_t c; // Number of Channels
  size_t h; // Height
  size_t w; // Width
};

There is an inconsistency between the OpenCL kernel and the C++ code regarding size_t/uint64_t.
We may need to define a fixed coding style for size_t & uint64_t.

Thousands of libjit_stacked_kernelXXX doing the same thing are issued

I was looking at the llvm IR produced by the image classifier using tests/images/run.sh and I noticed that we issue thousands of this kernel:
define internal void @libjit_stacked_kernel.1575(float* noalias) {
entry:
br label %loop

loop: ; preds = %loop, %entry
%1 = phi i64 [ , %entry ], [ %nextvar, %loop ]
%2 = call float @libjit_splat_kernel_f(i64 %1, float SomeNumber, float* null, float* null)
%buffer.element.addr = getelementptr float, float* %0, i64 %1
store float %2, float* %buffer.element.addr
%nextvar = add nuw nsw i64 %1, SomeOtherNumber
%loopcond = icmp ult i64 %nextvar, AnotherNumber
br i1 %loopcond, label %loop, label %afterloop, !llvm.loop !

afterloop: ; preds = %loop
ret void
}

Would be nice to reuse them if at all possible (haven't checked but the code seems duplicated quite a lot), so that we reduce our memory footprint and our compile time.

[Feature request] Backend-specific tensors

Backends should be able to implement their own type of tensor objects.
A good example for using this feature would be having tensors backed by OpenCL resources for the OpenCL backend (and similarly for other backends).

In order to support this, the Tensor class can be made more suitable for deriving (adding virtual qualifiers etc.), or a Tensor interface class can be created for other implementations to derive from.
Additional features for this Tensor interface should include support for lock/unlock (map/unmap) or some other forms of synchronization mechanism.

Rename Pool Nodes and Instrs

Our pooling nodes and instructions are named PoolAvg and PoolMax. Everywhere else in ML they are called AvgPool and MaxPool, respectively. It would be great to do a mechanical rename.

Implement Optimal fp32-i32 Multiplication in Integer Instructions

I have a quip doc on this issue, but here is the summary:

Given a 32-bit floating-point number and a 32-bit integer, we need a general-purpose method of multiplying them using only integer operations. This is for the case where we have quantized operations running on hardware without a fast FPU or other floating-point support. The current implementation of this only works well for particular cases of rescaling operations -- it tends to be slightly incorrect when we are performing a multiplication of a floating-point scale with an integer register in which some quantities have been accumulated, for instance as happens during a matrix multiplication (FC) or convolution operation.

The overall idea here is that we want to maximize the precision we can maintain when transforming an fp32-i32 multiply into a product of two i32's. At present, we have a general idea that we want to:

(A) Select parameters for pre- and post-multiplication shifting, in order to account for scaling either up or down by factors of 2.
(B) Select an appropriately-precise integer approximation to the mantissa of the f32.
(C) At runtime, compute the f32-i32 product by multiplying the integer approximation to the mantissa with the i32 input, and then shifting appropriately to convert the value to the correct scale w.r.t. powers of 2.

The above is the general theoretical approach. However, in practice, we run into several technical issues.

========= The following is wrong =========
(i) Especially for right-shifting, which is the common case (because the integer approximation to the f32 mantissa is usually going to be a "scaled-up" version of the actual mantissa, in order to achieve precision), arithmetic right-shifting is not quite equivalent to [integer] division by a power of 2. It is equivalent, as desired, for positive values; however, the truncation of right-shifting inherently rounds downward toward negative infinity, and hence is not symmetric about 0. This is undesirable. One solution to this is to add a certain quantity to negative i32 values prior to [arithmetic] right-shifts. The correct value to add is 1_[input < 0] << (shift - 1), which is effectively equivalent to dividing by the respective power of 2, adding 1/2, and then truncating (when the operation is thought of as taking place over the real numbers). This thus implements "round to nearest" behavior for negative i32 values. (Whereas we still have "round down" behavior for postive i32 values -- however, at least...
========== END WRONG PART ===========

(ii) In the course of multiplications which result from a floating-point value and an integer value which has been accumulated (possibly from other integer products), predicting the correct scale of precision on which to operate is difficult, if not impossible. For example, suppose that we wish to multiply an f32 (whose value is known at compile-time), and an i32 (whose value is not known at compile-time) which is the result of a row-column dot product during a matrix multiply operation. Even if the matrix entries in question are i8's, each such element-wise product can have up to 16 bits, and the dimensions of the matrix in question may also introduce another element of variance in the size of the integer that results (and which we wish to multiply by the f32). For instance:

Generally: f = s * (i - o)
For matrix multiply:
f_d = \sum f_l * f_r = (s_l * s_r) * (\sum (i_l - o_l) * (i_r - o_r))
and hence
i_d = (s_l * s_r / s_d) * (\sum (i_l - o_l) * (i_r - o_r)) + o_d

The floating-point value in question is then s_l * s_r / s_d, which is known at compile-time. However -- even though perhaps the number of elements in the summation may be known ahead of time -- the fact that the individual entries i_l and i_r (and consequently i_l - o_l and i_r - o_r) cannot be known means that the actual order of magnitude, in bits, of the resulting i32 multiplicand can only be bounded within some range, and since we only have 32 bits to work with in the first place, it is easy to see how this could become problematic if assumptions about the actual order of the resultant i32 dot product are made at compile-time. It therefore appears that at least some runtime analysis may be needed.
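For concreteness, here is a rough Python sketch of steps (A)-(C); it is illustrative only and deliberately ignores the rounding and dynamic-range issues described above:

import math

def quantize_scale(scale, bits=31):
    # Approximate a positive fp32 scale as multiplier * 2**(-shift).
    mantissa, exponent = math.frexp(scale)  # scale == mantissa * 2**exponent, mantissa in [0.5, 1)
    multiplier = round(mantissa * (1 << bits))
    shift = bits - exponent
    return multiplier, shift

def scale_i32(x, multiplier, shift):
    # Multiply in integers, then shift back to the right power-of-2 scale.
    # A real kernel would do this in 64 bits and round before shifting.
    return (x * multiplier) >> shift

mult, sh = quantize_scale(0.0123)
print(scale_i32(10000, mult, sh), 10000 * 0.0123)  # ~123 vs 123.0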

icon/logo design proposal

Hi, I am a graphic designer and I want to contribute to your nice project by proposing an icon/logo design.
If you like it and want to use it, I will gladly send you the files as a gift for free. Here is what I came up with:

Design idea: the letter G is very visible as the initial of Glow; it also shows gears, which represent a machine or engine. Another element it shows is a sun that glows, which literally represents glow.

icon
glow-01

logotype
glow-02

I hope you like it and that it makes you want to use it. If there's any modification that you would like, please let me know and I will edit it.

Thanks and best regards!
-Tobaloidee

[OpenCL] clBuildProgram Failed. __kernel function cannot have argument whose type is, or contains, type size_t.

Here is the clinfo output for my system pastebin

I've compiled glow with OpenCL support as follows

$ cmake -G Ninja -DCMAKE_BUILD_TYPE=Debug -DGLOW_WITH_CPU=1 -DGLOW_WITH_OPENCL=1  ../glow
$ ninja all

Whenever I try to run any of the unittests or demos with the -opencl command line option, I get :

<kernel>:282:1: error: unexpected type name 'vtype': expected expression
DEFINE_OPENCL_UNARY_DATA_PARALLEL_KERNEL_WITH_IMM_OPERAND(splat, float, SRC)
^
<kernel>:240:19: note: expanded from macro 'DEFINE_OPENCL_UNARY_DATA_PARALLEL_KERNEL_WITH_IMM_OPERAND'
      vtype SRC = vtype(val);                                                  \
                  ^
<kernel>:283:1: error: unexpected type name 'vtype': expected expression
DEFINE_OPENCL_UNARY_DATA_PARALLEL_KERNEL_WITH_IMM_OPERAND(splat_u, ulong, SRC)
^
<kernel>:240:19: note: expanded from macro 'DEFINE_OPENCL_UNARY_DATA_PARALLEL_KERNEL_WITH_IMM_OPERAND'
      vtype SRC = vtype(val);                                                  \
                  ^
<kernel>:334:59: error: __kernel function cannot have argument whose type is, or contains, type size_t
                                cl_uint32_t batch, size_t numSlice,
                                                          ^
<kernel>:335:40: error: __kernel function cannot have argument whose type is, or contains, type size_t
                                size_t sliceSize) {
                                       ^

/glow/lib/Backends/OpenCL/OpenCL.cpp:189: failed assertion `err == CL_SUCCESS && "clBuildProgram Failed."'

I'm far from conversant in OpenCL but I made the changes that the compiler highlights

diff --git a/lib/Backends/OpenCL/kernels.cl b/lib/Backends/OpenCL/kernels.cl
index f51a1267..17b96644 100644
--- a/lib/Backends/OpenCL/kernels.cl
+++ b/lib/Backends/OpenCL/kernels.cl
@@ -238,7 +238,7 @@ size_t getNCHW(ShapeNCHW s, cl_uint32_t n, cl_uint32_t c, cl_uint32_t h,
       vstore8(VAL, i * 2, dest);                                               \
     }                                                                          \
     {                                                                          \
-      vtype SRC = vtype(val);                                                  \
+      vtype SRC = (vtype)val;                                                  \
       vtype VAL = body;                                                        \
       vstore8(VAL, i * 2 + 1, dest);                                           \
     }                                                                          \
@@ -332,8 +332,8 @@ __kernel void batchedreduceaddK(__global float *dest, __global float *batch,
 }
 
 __kernel void batchedreduceaddW(__global void *mem, cl_uint32_t dest,
-                                cl_uint32_t batch, size_t numSlice,
-                                size_t sliceSize) {
+                                cl_uint32_t batch, cl_uint32_t numSlice,
+                                cl_uint32_t sliceSize) {
   batchedreduceaddK(&mem[dest], &mem[batch], numSlice, sliceSize);
 }

Now, OCLTest, MLTest, ptb all work fine with OpenCL backend but operatorTest, JITTest and mnist all fail certain tests.

[ RUN      ] OpenCL/Operator.simplePredication/0
glow/lib/Backends/OpenCL/OpenCL.cpp:197: failed assertion `err == CL_SUCCESS && "Unable to set parameter"'
[ RUN      ] OpenCL/BackendCorrectnessTest.poolMaxGradTest/0
glow/lib/Backends/OpenCL/OpenCL.cpp:197: failed assertion `err == CL_SUCCESS && "Unable to set parameter"'
$ ./bin/mnist -opencl
Loading the mnist database.
Loaded 50000 images.
Training.
Training - epoch #0
glow/lib/Backends/OpenCL/OpenCL.cpp:197: failed assertion `err == CL_SUCCESS && "Unable to set parameter"'

How can I resolve this?

Performance compared with NNVM/TVM

Besides the performance of Glow on Resnet50 etc. compared with TensorFlow 1.7, will there be more benchmarks, such as comparing Glow with NNVM/TVM on CPU/GPU?

Should we lower QuantizationProfile node?

I was looking at the QuantizationProfile node and the related QuantizationProfile instruction, and it looks like we only provide the interpreter support.
So here are my questions:

  • Do we expect every backend to come up with their own support for this instruction? or,
  • Should we lower the node in something more chewable for the backends?

If we expect every backend to come up with their own support, we should make an extra effort to document the expectations for this operator.

Assumptions baked in the Graph optimizer not checked anywhere

Some of the Graph optimizations assume that some nodes have variables as their inputs (e.g., batch normalization, convolution) and that these variables are used only in those nodes.
Although this currently matches what the importers emit, this creates an unwelcome coupling. In essence, we could spot usage of cast<Variable> sprinkled in the code base.
We should either:

  1. Check these assumptions in the related verify methods, or
  2. Fix the optimizer (and the hand written test cases)

I am tempted to go with #2 because:
A. Only the optimizer, importer and test cases have these assumptions baked in (for instance, the CPU optimization for convolution optimizeCPUConv is fine without those, same for the lowering and the interpreter.)
B. It is not unlikely that constant Variables with the same value would be uniqued and be shared across different uses and that's never checked.
C. With the introduction of the load node, those optimizations will need to be taught how to look through loads anyway.

ASAN build is broken

Steps to reproduce:

  • cmake -G Ninja -DGLOW_USE_SANITIZER=Address ../glow
  • ninja all
[1/83] Linking CXX executable bin/emulator
FAILED: bin/emulator 
: && /Applications/Xcode_9.0.1_fb.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++  -Wall -fno-exceptions -fno-rtti -g -fno-omit-frame-pointer -O0 -Wl,-search_paths_first -Wl,-headerpad_max_install_names  tools/emulator/CMakeFiles/emulator.dir/emulator.cpp.o  -o bin/emulator   && :
Undefined symbols for architecture x86_64:
  "___asan_init", referenced from:
      _asan.module_ctor in emulator.cpp.o
  "___asan_version_mismatch_check_apple_900", referenced from:
      _asan.module_ctor in emulator.cpp.o
ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)

Graph partitioning for multi-device execution

Partitioning a Function into multiple subgraphs breaks a lot of assumptions in our current architecture. PR #1176 shows a proof-of-concept for the mechanics of partitioning, but I want to step back and figure out how this fits into the whole system.

Let's consider the interfaces we currently have:

  • ExecutionEngine is configured with a particular Backend and knows how to compile a Function for it.
  • Once compiled, EE can run a function with particular Variable-to-Tensor bindings.
  • Internally to EE, Backend compiles an IRFunction, creating a CompiledFunction.
  • CompiledFunction can execute itself. It assumes Tensor bindings are performed externally (but see #1227 for an alternate design).

The main difficulty with partitioning is that it complicates the pipeline:
Function->IRFunction->CompiledFunction->execute
With partitioning, we have to decide where to split a Function into a DAG of Functions, and how to appropriately propagate the dependence information.

Furthermore some of the choices we make will be backend specific. The following scenarios could apply:

  • CPU execution in which each part is generated sequentially, and a higher-level Executor manages scheduling of parts onto threads
  • CPU execution in which synchronization is compiled in to the graph
  • Accelerator execution where synchronization generates inter-card DMA operations

[Quantization] Allow running end2end quantization tests on OpenCL backend

Currently, there are two end2end tests for quantization which make sure that we can:

  • Profile and Quantize the graph
  • Execute inference for the quantized graph on a specific backend

Note that profiling is always done with the Interpreter backend, while quantization and inference are run on a specific backend.

See here for the reference.

@ZchiPitt we've been talking about this. Created this issue just to make sure it's not lost.

Quick compilation test on macOS Sierra 10.12.6 failing

[ 28%] Building CXX object src/glow/Base/CMakeFiles/Base.dir/Train.cpp.o
In file included from /Users/bwasti/Glow/src/glow/Base/Train.cpp:3:
In file included from /Users/bwasti/Glow/include/glow/Base/Train.h:4:
/Users/bwasti/Glow/include/glow/Base/Tensor.h:158:26: error: no member named 'drop_front' in
      'llvm::ArrayRef<unsigned long>'
    auto dim = t->dims().drop_front();
               ~~~~~~~~~ ^
/Users/bwasti/Glow/include/glow/Base/Tensor.h:173:35: error: no member named 'drop_front' in
      'llvm::ArrayRef<unsigned long>'
    auto onceSliceDim = t->dims().drop_front();
                        ~~~~~~~~~ ^
/Users/bwasti/Glow/include/glow/Base/Tensor.h:175:35: error: no member named 'drop_front' in
      'llvm::ArrayRef<unsigned long>'
    assert(onceSliceDim == dims().drop_front() && "Invalid slice size");
                           ~~~~~~ ^
/usr/include/assert.h:93:25: note: expanded from macro 'assert'
    (__builtin_expect(!(e), 0) ? __assert_rtn(__func__, __FILE__, __LINE__, #e) : (void)0)
                        ^
In file included from /Users/bwasti/Glow/src/glow/Base/Train.cpp:3:
In file included from /Users/bwasti/Glow/include/glow/Base/Train.h:4:
/Users/bwasti/Glow/include/glow/Base/Tensor.h:357:32: error: no member named 'drop_front' in
      'llvm::ArrayRef<unsigned long>'
    Tensor slice(elemTy, sizes.drop_front());
                         ~~~~~ ^
4 errors generated.
make[2]: *** [src/glow/Base/CMakeFiles/Base.dir/Train.cpp.o] Error 1
make[1]: *** [src/glow/Base/CMakeFiles/Base.dir/all] Error 2
make: *** [all] Error 2

The definition seems to exist in external/llvm/include/llvm/ADT/ArrayRef.h, so I'm not really sure where the error is coming from.

$ strings external/llvm/lib/* 2>/dev/null | c++filt -n | grep "llvm::ArrayRef<unsigned long>" | grep drop_front
llvm::ArrayRef<unsigned long>::drop_front(unsigned long) const

OpenCL backend leaks memory

I ran a build with LeakSanitizer enabled and it pointed to a few leaks.

cmake -DGLOW_USE_SANITIZER=Address ~/src/glow
export ASAN_OPTIONS=detect_leaks=1
ninja graphTest
./tests/graphTest

The leaks are of allocations inside OpenCL, but we should make sure we're not misusing the API.

e.g.,

Indirect leak of 36 byte(s) in 1 object(s) allocated from:
    #0 0x103024c33 in wrap_malloc (libclang_rt.asan_osx_dynamic.dylib:x86_64h+0x56c33)
    #1 0x7fff7e6f38b2 in _Block_copy (libsystem_blocks.dylib:x86_64+0x8b2)
    #2 0x7fff7e6f3b3d in _Block_object_assign (libsystem_blocks.dylib:x86_64+0xb3d)
    #3 0x7fff7e6f38ef in _Block_copy (libsystem_blocks.dylib:x86_64+0x8ef)
    #4 0x7fff7e66bfd3 in _dispatch_Block_copy (libdispatch.dylib:x86_64+0x1fd3)
    #5 0x7fff7e683a07 in _dispatch_source_set_cancel_handler (libdispatch.dylib:x86_64+0x19a07)
    #6 0x103023f9e in wrap_dispatch_source_set_cancel_handler (libclang_rt.asan_osx_dynamic.dylib:x86_64h+0x55f9e)
    #7 0x7fff7867db1a in (anonymous namespace)::RunElsewhere::instance() (SkyLight:x86_64+0x23b1a)
    #8 0x7fff787ba521 in __SLSInitialize_block_invoke (SkyLight:x86_64+0x160521)
    #9 0x7fff7e66bd4f in _dispatch_client_callout (libdispatch.dylib:x86_64+0x1d4f)
    #10 0x7fff7e66bd02 in dispatch_once_f (libdispatch.dylib:x86_64+0x1d02)
    #11 0x7fff78828d08 in SLSMainConnectionID (SkyLight:x86_64+0x1ced08)
    #12 0x7fff604e5c9a in gfxLoadPluginData (libGFXShared.dylib:x86_64+0xc9a)
    #13 0x7fff5f30308a  (OpenCL:x86_64+0x2008a)
    #14 0x7fff7e9309bd in __pthread_once_handler (libsystem_pthread.dylib:x86_64+0x39bd)
    #15 0x7fff7e925eff in _os_once (libsystem_platform.dylib:x86_64+0xeff)
    #16 0x7fff7e930958 in pthread_once (libsystem_pthread.dylib:x86_64+0x3958)
    #17 0x7fff5f302e67  (OpenCL:x86_64+0x1fe67)
    #18 0x7fff5f3034e1 in clGetDeviceIDs (OpenCL:x86_64+0x204e1)
    #19 0x1003bee98 in glow::OCLBackend::OCLBackend() OpenCL.cpp:89
    #20 0x1003bebda in glow::createOCLBackend() OpenCL.cpp:87
    #21 0x10002395b in getConvNodeSize(glow::BackendKind) graphTest.cpp:408
    #22 0x100024477 in Graph_disableUnrollingGroupConv_Test::TestBody() graphTest.cpp:438
    #23 0x100268e1f in testing::Test::Run() gtest.cc
    #24 0x10026c688 in testing::TestInfo::Run() gtest.cc:2654
    #25 0x10026dc94 in testing::TestCase::Run() gtest.cc:2772
    #26 0x10028e1c7 in testing::internal::UnitTestImpl::RunAllTests() gtest.cc:4677
    #27 0x10028cfb5 in testing::UnitTest::Run() gtest.cc
    #28 0x1002c7b3f in main gtest.h:2237
    #29 0x7fff7e6a5114 in start (libdyld.dylib:x86_64+0x1114)

More initialization kinds and possibly refactor initialization mechanism

A thing I noticed a while ago was that the initialization algorithm of a variable is currently embedded in the representation of the Variable class itself. Given that in practice people may want to use quite a variety of initialization kinds, it may be a little unscalable to modify the core Variable class for each new initialization algorithm, and clutter Nodes.cpp with initialization algorithms that may best be put somewhere else. Maybe it's worth thinking of how the initialization code can be kept separate from Variable itself. This is just a thought. I'll look into how other libraries and frameworks deal with this.

Either way adding a few more initialization algorithms may be worthwhile for the future.

[quantization] Add symmetric quantization schema support to Glow

Current state

The latest version of Glow compiler provides a certain level of quantization support.

Quantization process consists of the following steps:

  1. Instrument an original floating-point graph with the special Quantization Profile node

  2. Run inference on the instrumented graph many times with good representative inputs (see dump_profile flag).
    During this step Glow automatically captures the distribution of all activations in the compute graph. The distribution contains the min/max values seen at the given output of a graph node as well as a detailed frequency per floating point range within the [min, max] interval. There are 2000 floating point ranges kept as part of the Quantization Profile node.
    As the result of the profiling procedure, Glow generates Scale and Offset parameters for all activations and dumps them to a file (note, currently Glow uses linear quantization). The Scale is a positive fp32 number while the Offset is an int32. Note, Glow does not use the distribution of floating point numbers but relies only on the values of min and max (this is a separate issue and out of scope here).

  3. Transform the computation graph based on the captured profile and the specific execution backend. Note, not all nodes are quantized, as not all backends support every quantized op. See CPU backend op support here for example.

  4. Perform computation of quantized graph. See Interpreter implementation here.

What needs to be enhanced

There are backends that are tied to a specific quantization schema, e.g., symmetric quantization.
The difference between symmetric and asymmetric quantization is that the "offset" parameter equals 0 in symmetric linear quantization. In this case the dequantization follows the formula: fp32_number = scale * quantized_number (fp32_number = scale * (quantized_number - offset) for the asymmetric case).
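A small numeric sketch of the two schemas for int8 (values chosen only for illustration):

import numpy as np

x = np.array([-1.0, 0.0, 0.5, 2.0], dtype=np.float32)

# Asymmetric: map [min, max] onto [-128, 127] with a non-zero offset.
scale_a = (x.max() - x.min()) / 255.0
offset_a = int(round(-128 - x.min() / scale_a))
q_a = np.clip(np.round(x / scale_a) + offset_a, -128, 127)
x_a = scale_a * (q_a - offset_a)            # dequantized values

# Symmetric: offset is fixed at 0, range is [-max|x|, max|x|].
scale_s = np.abs(x).max() / 127.0
q_s = np.clip(np.round(x / scale_s), -128, 127)
x_s = scale_s * q_s                         # dequantized values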

Requirement for the symmetric quantization

  • Need to make sure that the accuracy loss is comparable with the asymmetric case on Resnet50. Important note, currently, activations are signed int8 and in case of symmetric quantization, outputs of Sigmoid/ReLU will effectively use only 7 bits out of 8 (which is undesirable).

The issue could be tackled in two steps:

  • Introduce the symmetric schema and deal with the int8 activations (accuracy loss)
  • Introduce int8, uint8 activations. Check if specific backend can handle int8/uint8 activations for a given Op (this way we could gradually onboard backends). This is a bigger change and would require a design discussion here.

Please discuss this issue here before implementation.
cc: @qcolombet

[Quantization] Calculate Scale and Offset based on the histogram

Currently, we profile all nodes in a graph and capture both min/max as well as the histogram.

Quantization scale and offset are calculated based on the global min and global max and do not take histogram into consideration.
This does not work very well when there are outliers.

There was logic for calculating scale and offset based on the histogram, but it was removed in b5d9385, as it did not work properly (e.g., for a simple tensor with {1, 2} values).
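For reference, the current min/max-only calculation boils down to roughly the following (a sketch, not the actual Glow code); it makes clear why a single outlier stretches the range and wastes most of the 256 bins:

def choose_scale_offset(min_val, max_val, qmin=-128, qmax=127):
    # Keep zero exactly representable (a common convention, assumed here).
    min_val = min(min_val, 0.0)
    max_val = max(max_val, 0.0)
    scale = (max_val - min_val) / (qmax - qmin)
    offset = round(qmin - min_val / scale) if scale != 0 else 0
    return scale, offset

print(choose_scale_offset(0.0, 1.0))    # fine
print(choose_scale_offset(0.0, 100.0))  # one outlier dominates the range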

Support non-square padding for Convolution operator

Convolution is the most important operator for Computer Vision NNs.

The simplest implementation (our Interpreter backend) could be found here:

void Interpreter::fwdConvolutionInst_FloatImpl(Value *inV, Value *outV,

It has parameters, such as group, pad, stride, filterSize.

Convolution is a 2D operation (it always works with a 2D image and a 2D filter, and produces a 2D result). Right now all our Convolution implementations assume that the pad is the same for both dimensions (image height and width). But for some models (very rarely) this is not true. In the general case, there can be 4 different pads: left side, right side, top side, bottom side. We need to change "pad" to be a vector instead of a single number (for ConvolutionNode and ConvolutionInstr), and change all the implementations: Interpreter (float and quantized), CPU (float and quantized). Later, we also need to support uneven paddings in ConvolutionGradNode (which is the backward pass for Convolution).
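For reference, the intended per-side semantics can be reproduced today with an explicit pad followed by an unpadded convolution (PyTorch used here only as a numerical reference; glow would carry the four pads on the node itself):

import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8)   # NCHW input
w = torch.randn(4, 3, 3, 3)   # 4 output channels, 3x3 filter

# Four independent pads: (left, right, top, bottom).
padded = F.pad(x, (1, 2, 0, 3))
y = F.conv2d(padded, w, padding=0)
print(y.shape)  # torch.Size([1, 4, 9, 9])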

Note that our Convolution for the CPU backend is highly optimized, and in theory complicating its code may lead to losing performance. In practice, I highly doubt that this could happen, but it's worth checking performance after the change just to be sure. One can run tests/images/run.sh -cpu -iterations=100 with and without the change, and make sure that the average performance numbers are within 1%.

Here is ONNX specification of Conv operator:
https://github.com/onnx/onnx/blob/master/docs/Operators.md#conv
One can notice that not only padding is represented as an array there, but also stride and filterSize are represented as arrays. Changing these is lower priority, and there's no immediate need; it's just a good thing to have.

Another direction of future work: support non-square padding in PoolMax and PoolAvg operators.

Resnet 50 example segfault

Hi,

I'm trying to run the Resnet 50 example in examples/bundles. My examples/bundles/resnet50/Makefile bundle has

LOADER?=~/build_Debug/bin/image-classifier
GLOW_SRC?=~/glow

Upon executing make run, I get

File ‘resnet50/predict_net.pbtxt’ already there; not retrieving.

File ‘resnet50/predict_net.pb’ already there; not retrieving.

File ‘resnet50/init_net.pb’ already there; not retrieving.

Model: resnet50_profile
 File: /home/ubuntu/glow/tests/images/imagenet/cat_285.png Result:285
 File: /home/ubuntu/glow/tests/images/imagenet/dog_207.png Result:207
 File: /home/ubuntu/glow/tests/images/imagenet/zebra_340.png Result:340
Segmentation fault (core dumped)
Makefile:37: recipe for target 'build/resnet50.o' failed
make: *** [build/resnet50.o] Error 139

Then, without quantization (QUANTIZE=NO make run), I get

File ‘resnet50/predict_net.pbtxt’ already there; not retrieving.

File ‘resnet50/predict_net.pb’ already there; not retrieving.

File ‘resnet50/init_net.pb’ already there; not retrieving.

Segmentation fault (core dumped)
Makefile:42: recipe for target 'build/resnet50.o' failed
make: *** [build/resnet50.o] Error 139

Device: Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
OS: Ubuntu 16.04.4 LTS
