
TorchFort

An Online Deep Learning Interface for HPC programs on NVIDIA GPUs

Introduction

TorchFort is a DL training and inference interface for HPC programs, implemented using LibTorch, the C++ backend of the PyTorch framework. The goal of this library is to help practitioners and domain scientists seamlessly combine their simulation codes with the deep learning functionality available in PyTorch. The library can be invoked directly from Fortran or C/C++ programs, enabling transparent sharing of data arrays to and from the DL framework, all contained within the simulation process (i.e., no external glue/data-sharing code is required). It can directly load PyTorch model definitions exported to TorchScript and implements a configurable training process that users control via a simple YAML configuration file format. The configuration files let users specify the optimizer, loss, learning rate schedule, and much more.
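For example, a PyTorch model can be converted to TorchScript in Python before the simulation loads it (a minimal sketch; the network and filename are illustrative and not part of TorchFort's API):

```python
import torch

# A small illustrative network; any scripted or traced module works.
class Net(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(16, 16)

    def forward(self, x):
        return torch.relu(self.fc(x))

# Convert to TorchScript and save; TorchFort can load a file like
# this at runtime via the path given in its YAML configuration.
model = torch.jit.script(Net())
model.save("model.pt")
```

The saved `model.pt` is self-contained, so the simulation process needs no Python interpreter to run it.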

Please refer to the documentation for additional information on the library, build instructions, and usage details.

Please refer to the examples to see TorchFort in action.

Contact us or open a GitHub issue if you are interested in using this library in your own solvers and have questions on usage and/or feature requests.

License

This library is released under a BSD 3-clause license, which can be found in LICENSE.


Contributors

azrael417, nloppi, romerojosh, tommelt


Issues

CMake cannot locate MPI Fortran (from NVHPC 23.7)

I am trying to install TorchFort dependencies with spack and then build with cmake.

So far I have installed the following dependencies (with spack and using gcc version 12.3.0):

* [email protected]     ( with options ~doc+ncurses+ownlibs build_system=generic build_type=Release)
* [email protected]      ( with options ~allow-unsupported-compilers~dev build_system=generic)
* [email protected]      ( with options +cxx+fortran+hl~ipo+mpi+shared~szip~threadsafe+tools api=default build_system=cmake build_type=Release generator=make patches=0e20187)
* [email protected]       ( with options +blas+lapack+mpi build_system=generic install_type=single)
* [email protected]   ( with options ~ipo+pic+shared~tests build_system=cmake build_type=Release generator=make)

I have also set up and configured a conda environment containing Python 3.11.4, into which I pip installed pybind11 2.11.1 and the requirements.txt file (pip install -r requirements.txt).
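The environment setup just described can be sketched as (versions copied from the description above; the environment name is illustrative):

```shell
# Create and activate the conda environment described above
conda create -n torchfort python=3.11.4
conda activate torchfort

# Install pybind11 and the project's Python requirements
pip install pybind11==2.11.1
pip install -r requirements.txt
```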

I used the following bash script to configure the build:

#!/usr/bin/env bash

module load cmake/3.26.3/uf63q cuda/11.8.0/dmxqu hdf5/1.8.21/3bxvx nvhpc/23.7/tdmi4 yaml-cpp/0.7.0/l7fcu

source $HOME/miniconda3/bin/activate torchfort

NVHPC_ROOT="/software/spack/opt/spack/linux-ubuntu22.04-skylake/gcc-12.3.0/nvhpc-23.7-tdmi4llgnphtlarpvqggtvjukvvnr42w/Linux_x86_64/23.7/"
NVHPC_CMAKE_DIR="$NVHPC_ROOT/cmake"

rm -rf build
mkdir build && cd build

export CMAKE_PREFIX_PATH="$CMAKE_PREFIX_PATH:$HOME/miniconda3/envs/torchfort/lib/python3.11/site-packages/pybind11"

cmake -DCMAKE_INSTALL_PREFIX="$HOME/.torchfort" \
    -DNVHPC_CUDA_VERSION=11.8 \
    -DCMAKE_PREFIX_PATH="`python -c 'import torch;print(torch.utils.cmake_prefix_path)'`;${NVHPC_CMAKE_DIR}" \
    ..

I get the following error:

./comp.sh 
-- The CXX compiler identification is NVHPC 23.7.0
-- The Fortran compiler identification is NVHPC 23.7.0
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /software/spack/opt/spack/linux-ubuntu22.04-skylake/gcc-12.3.0/nvhpc-23.7-tdmi4llgnphtlarpvqggtvjukvvnr42w/Linux_x86_64/23.7/compilers/bin/nvc++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Detecting Fortran compiler ABI info
-- Detecting Fortran compiler ABI info - done
-- Check for working Fortran compiler: /software/spack/opt/spack/linux-ubuntu22.04-skylake/gcc-12.3.0/nvhpc-23.7-tdmi4llgnphtlarpvqggtvjukvvnr42w/Linux_x86_64/23.7/compilers/bin/nvfortran - skipped
-- Found CUDA: /software/spack/opt/spack/linux-ubuntu22.04-skylake/gcc-12.3.0/cuda-11.8.0-dmxquapj2bbxtifgzf3fwl423bjh3qjd (found version "11.8") 
-- The CUDA compiler identification is NVIDIA 11.8.89
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /software/spack/opt/spack/linux-ubuntu22.04-skylake/gcc-12.3.0/nvhpc-23.7-tdmi4llgnphtlarpvqggtvjukvvnr42w/Linux_x86_64/23.7/compilers/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Caffe2: CUDA detected: 11.8
-- Caffe2: CUDA nvcc is: /software/spack/opt/spack/linux-ubuntu22.04-skylake/gcc-12.3.0/cuda-11.8.0-dmxquapj2bbxtifgzf3fwl423bjh3qjd/bin/nvcc
-- Caffe2: CUDA toolkit directory: /software/spack/opt/spack/linux-ubuntu22.04-skylake/gcc-12.3.0/cuda-11.8.0-dmxquapj2bbxtifgzf3fwl423bjh3qjd
-- Caffe2: Header version is: 11.8
-- /software/spack/opt/spack/linux-ubuntu22.04-skylake/gcc-12.3.0/cuda-11.8.0-dmxquapj2bbxtifgzf3fwl423bjh3qjd/lib64/libnvrtc.so shorthash is 672ee683
-- USE_CUDNN is set to 0. Compiling without cuDNN support
-- Added CUDA NVCC flags for: -gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_80,code=sm_80
CMake Warning at ~/miniconda3/envs/torchfort/lib/python3.11/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:22 (message):
  static library kineto_LIBRARY-NOTFOUND not found.
Call Stack (most recent call first):
  ~/miniconda3/envs/torchfort/lib/python3.11/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:127 (append_torchlib_if_found)
  CMakeLists.txt:23 (find_package)


-- Found Torch: ~/miniconda3/envs/torchfort/lib/python3.11/site-packages/torch/lib/libtorch.so  
-- Found MPI_CXX: /software/spack/opt/spack/linux-ubuntu22.04-skylake/gcc-12.3.0/openmpi-4.1.5-eq5qt6oay5atbk4jff6f5fg6tfmugwsp/lib/libmpi.so (found version "3.1") 
-- Could NOT find MPI_Fortran (missing: MPI_Fortran_WORKS) 
CMake Error at /software/spack/opt/spack/linux-ubuntu22.04-skylake/gcc-12.3.0/cmake-3.26.3-uf63q4ykrr4cv5ppwkygp6hgjacdbt5i/share/cmake-3.26/Modules/FindPackageHandleStandardArgs.cmake:230 (message):
  Could NOT find MPI (missing: MPI_Fortran_FOUND) (found version "3.1")
Call Stack (most recent call first):
  /software/spack/opt/spack/linux-ubuntu22.04-skylake/gcc-12.3.0/cmake-3.26.3-uf63q4ykrr4cv5ppwkygp6hgjacdbt5i/share/cmake-3.26/Modules/FindPackageHandleStandardArgs.cmake:600 (_FPHSA_FAILURE_MESSAGE)
  /software/spack/opt/spack/linux-ubuntu22.04-skylake/gcc-12.3.0/cmake-3.26.3-uf63q4ykrr4cv5ppwkygp6hgjacdbt5i/share/cmake-3.26/Modules/FindMPI.cmake:1837 (find_package_handle_standard_args)
  CMakeLists.txt:24 (find_package)


-- Configuring incomplete, errors occurred!

For some reason, CMake finds MPI_CXX but not MPI_Fortran. Do you have any ideas on how to get this working?
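One diagnostic worth running is CMake's MPI_Fortran_WORKS probe by hand, since that is exactly what failed above (a sketch, not a confirmed fix; wrapper names and paths depend on the installation). A common cause in this situation is that the spack-built OpenMPI was compiled with gfortran, whose mpi.mod module files nvfortran cannot read:

```shell
# Sketch: reproduce CMake's MPI_Fortran_WORKS probe manually.
cat > mpi_test.f90 <<'EOF'
program mpi_test
  use mpi
  implicit none
  integer :: ierr
  call MPI_Init(ierr)
  call MPI_Finalize(ierr)
end program mpi_test
EOF

# Check which compiler the OpenMPI Fortran wrapper actually drives;
# a gfortran-built OpenMPI will not work with nvfortran.
mpifort --version
mpifort mpi_test.f90 -o mpi_test && echo "MPI_Fortran works"
```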

CMakeLists.txt doesn't support CUDA arch 89

My GPU has a CUDA architecture of 89 (RTX 4080). Currently, the CMakeLists.txt is only set up to handle architectures ending in 0.

When building with the default CMakeLists.txt, I get the following error when I try to run the example Fortran binary ./train:

Accelerator Fatal Error: This file was compiled: -acc=gpu -gpu=cc70 -gpu=cc80 -acc=host or -acc=multicore
Rebuild this file with -gpu=cc89 to use NVIDIA Tesla GPU 0
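If the project exposes its target architecture list as a CMake cache variable, cc89 could be added at configure time; otherwise, the compiler flags in CMakeLists.txt need an extra entry (a sketch; the cache variable name is hypothetical and must be checked against the actual CMakeLists.txt):

```shell
# Hypothetical: override the architecture list at configure time
# (the exact cache variable name must be verified in CMakeLists.txt).
cmake -DCUDA_CC_LIST="70;80;89" ..

# Alternatively, add -gpu=cc89 wherever CMakeLists.txt currently
# passes -gpu=cc70 -gpu=cc80 to nvfortran/nvc++.
```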

Docker build uses a large amount of memory when running with more than 4 cores

The following line in docker/Dockerfile allows make to build with all available cores.

make -j install && \

On my laptop I have 12 cores. I ran the Docker build inside a virtual machine with 8 cores and 8 GB RAM, and during the make stage, Docker quickly consumes all available memory and crashes.

I would suggest setting a limit, or advising users to either build serially (make install) or override the default with something sensible like make -j 4, although what counts as sensible depends on available RAM.
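The suggestion above could be made configurable with a Docker build argument (a sketch; the argument name is illustrative):

```shell
# In docker/Dockerfile (sketch), replace the fixed job count with
# a build argument defaulting to something conservative:
#   ARG MAKE_JOBS=4
#   RUN make -j ${MAKE_JOBS} install && ...

# Users with enough RAM can then opt in to more parallelism:
docker build --build-arg MAKE_JOBS=8 -f docker/Dockerfile .
```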

Using the library without CUDA

Hello, I was wondering if it is possible to have a version of this library that runs without CUDA support, since I want to use the library on CPU.

Thanks in advance for the answer
Nico
