
TorchFort

An Online Deep Learning Interface for HPC programs on NVIDIA GPUs

Introduction

TorchFort is a DL training and inference interface for HPC programs, implemented using LibTorch, the C++ backend of the PyTorch framework. The goal of this library is to help practitioners and domain scientists seamlessly combine their simulation codes with the deep learning functionality available in PyTorch. The library can be invoked directly from Fortran or C/C++ programs, enabling transparent sharing of data arrays to and from the DL framework, all contained within the simulation process (i.e., no external glue/data-sharing code is required). It can directly load PyTorch model definitions exported to TorchScript and implements a configurable training process that users control via a simple YAML configuration file format. The configuration files let users specify the optimizer, loss, learning rate schedule, and much more.
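For example, a PyTorch model can be converted to TorchScript in Python before the simulation loads it (a minimal sketch; the network and filename are illustrative and not part of TorchFort's API):

```python
import torch

# A small illustrative network; any scripted or traced module works.
class Net(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(16, 16)

    def forward(self, x):
        return torch.relu(self.fc(x))

# Convert to TorchScript and save; TorchFort can load a file like
# this at runtime via the path given in its YAML configuration.
model = torch.jit.script(Net())
model.save("model.pt")
```

The saved `model.pt` is self-contained, so the simulation process needs no Python interpreter to run it.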

Please refer to the documentation for additional information on the library, build instructions, and usage details.

Please refer to the examples to see TorchFort in action.

Contact us or open a GitHub issue if you are interested in using this library in your own solvers and have questions on usage and/or feature requests.

License

This library is released under a BSD 3-clause license, which can be found in LICENSE.


Contributors

azrael417, nloppi, romerojosh, tommelt


Issues

CMake cannot locate MPI Fortran (from NVHPC 23.7)

I am trying to install TorchFort dependencies with spack and then build with cmake.

So far I have installed the following dependencies (with spack and using gcc version 12.3.0):

* [email protected]     ( with options ~doc+ncurses+ownlibs build_system=generic build_type=Release)
* [email protected]      ( with options ~allow-unsupported-compilers~dev build_system=generic)
* [email protected]      ( with options +cxx+fortran+hl~ipo+mpi+shared~szip~threadsafe+tools api=default build_system=cmake build_type=Release generator=make patches=0e20187)
* [email protected]       ( with options +blas+lapack+mpi build_system=generic install_type=single)
* [email protected]   ( with options ~ipo+pic+shared~tests build_system=cmake build_type=Release generator=make)

I have also set up and configured a conda environment containing Python 3.11.4, into which I pip installed pybind11 2.11.1 and the requirements.txt file (pip install -r requirements.txt).
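The environment setup just described can be sketched as (versions copied from the description above; the environment name is illustrative):

```shell
# Create and activate the conda environment described above
conda create -n torchfort python=3.11.4
conda activate torchfort

# Install pybind11 and the project's Python requirements
pip install pybind11==2.11.1
pip install -r requirements.txt
```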

I used the following bash script to configure the build:

#!/usr/bin/env bash

module load cmake/3.26.3/uf63q cuda/11.8.0/dmxqu hdf5/1.8.21/3bxvx nvhpc/23.7/tdmi4 yaml-cpp/0.7.0/l7fcu

source $HOME/miniconda3/bin/activate torchfort

NVHPC_ROOT="/software/spack/opt/spack/linux-ubuntu22.04-skylake/gcc-12.3.0/nvhpc-23.7-tdmi4llgnphtlarpvqggtvjukvvnr42w/Linux_x86_64/23.7/"
NVHPC_CMAKE_DIR="$NVHPC_ROOT/cmake"

rm -rf build
mkdir build && cd build

export CMAKE_PREFIX_PATH="$CMAKE_PREFIX_PATH:$HOME/miniconda3/envs/torchfort/lib/python3.11/site-packages/pybind11"

cmake -DCMAKE_INSTALL_PREFIX="$HOME/.torchfort" \
    -DNVHPC_CUDA_VERSION=11.8 \
    -DCMAKE_PREFIX_PATH="`python -c 'import torch;print(torch.utils.cmake_prefix_path)'`;${NVHPC_CMAKE_DIR}" \
    ..

I get the following error:

./comp.sh 
-- The CXX compiler identification is NVHPC 23.7.0
-- The Fortran compiler identification is NVHPC 23.7.0
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /software/spack/opt/spack/linux-ubuntu22.04-skylake/gcc-12.3.0/nvhpc-23.7-tdmi4llgnphtlarpvqggtvjukvvnr42w/Linux_x86_64/23.7/compilers/bin/nvc++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Detecting Fortran compiler ABI info
-- Detecting Fortran compiler ABI info - done
-- Check for working Fortran compiler: /software/spack/opt/spack/linux-ubuntu22.04-skylake/gcc-12.3.0/nvhpc-23.7-tdmi4llgnphtlarpvqggtvjukvvnr42w/Linux_x86_64/23.7/compilers/bin/nvfortran - skipped
-- Found CUDA: /software/spack/opt/spack/linux-ubuntu22.04-skylake/gcc-12.3.0/cuda-11.8.0-dmxquapj2bbxtifgzf3fwl423bjh3qjd (found version "11.8") 
-- The CUDA compiler identification is NVIDIA 11.8.89
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /software/spack/opt/spack/linux-ubuntu22.04-skylake/gcc-12.3.0/nvhpc-23.7-tdmi4llgnphtlarpvqggtvjukvvnr42w/Linux_x86_64/23.7/compilers/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Caffe2: CUDA detected: 11.8
-- Caffe2: CUDA nvcc is: /software/spack/opt/spack/linux-ubuntu22.04-skylake/gcc-12.3.0/cuda-11.8.0-dmxquapj2bbxtifgzf3fwl423bjh3qjd/bin/nvcc
-- Caffe2: CUDA toolkit directory: /software/spack/opt/spack/linux-ubuntu22.04-skylake/gcc-12.3.0/cuda-11.8.0-dmxquapj2bbxtifgzf3fwl423bjh3qjd
-- Caffe2: Header version is: 11.8
-- /software/spack/opt/spack/linux-ubuntu22.04-skylake/gcc-12.3.0/cuda-11.8.0-dmxquapj2bbxtifgzf3fwl423bjh3qjd/lib64/libnvrtc.so shorthash is 672ee683
-- USE_CUDNN is set to 0. Compiling without cuDNN support
-- Added CUDA NVCC flags for: -gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_80,code=sm_80
CMake Warning at ~/miniconda3/envs/torchfort/lib/python3.11/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:22 (message):
  static library kineto_LIBRARY-NOTFOUND not found.
Call Stack (most recent call first):
  ~/miniconda3/envs/torchfort/lib/python3.11/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:127 (append_torchlib_if_found)
  CMakeLists.txt:23 (find_package)


-- Found Torch: ~/miniconda3/envs/torchfort/lib/python3.11/site-packages/torch/lib/libtorch.so  
-- Found MPI_CXX: /software/spack/opt/spack/linux-ubuntu22.04-skylake/gcc-12.3.0/openmpi-4.1.5-eq5qt6oay5atbk4jff6f5fg6tfmugwsp/lib/libmpi.so (found version "3.1") 
-- Could NOT find MPI_Fortran (missing: MPI_Fortran_WORKS) 
CMake Error at /software/spack/opt/spack/linux-ubuntu22.04-skylake/gcc-12.3.0/cmake-3.26.3-uf63q4ykrr4cv5ppwkygp6hgjacdbt5i/share/cmake-3.26/Modules/FindPackageHandleStandardArgs.cmake:230 (message):
  Could NOT find MPI (missing: MPI_Fortran_FOUND) (found version "3.1")
Call Stack (most recent call first):
  /software/spack/opt/spack/linux-ubuntu22.04-skylake/gcc-12.3.0/cmake-3.26.3-uf63q4ykrr4cv5ppwkygp6hgjacdbt5i/share/cmake-3.26/Modules/FindPackageHandleStandardArgs.cmake:600 (_FPHSA_FAILURE_MESSAGE)
  /software/spack/opt/spack/linux-ubuntu22.04-skylake/gcc-12.3.0/cmake-3.26.3-uf63q4ykrr4cv5ppwkygp6hgjacdbt5i/share/cmake-3.26/Modules/FindMPI.cmake:1837 (find_package_handle_standard_args)
  CMakeLists.txt:24 (find_package)


-- Configuring incomplete, errors occurred!

For some reason, CMake finds MPI_CXX but not MPI_Fortran. Do you have any ideas on how to get this working?
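One diagnostic worth running is CMake's MPI_Fortran_WORKS probe by hand, since that is exactly what failed above (a sketch, not a confirmed fix; wrapper names and paths depend on the installation). A common cause in this situation is that the spack-built OpenMPI was compiled with gfortran, whose mpi.mod module files nvfortran cannot read:

```shell
# Sketch: reproduce CMake's MPI_Fortran_WORKS probe manually.
cat > mpi_test.f90 <<'EOF'
program mpi_test
  use mpi
  implicit none
  integer :: ierr
  call MPI_Init(ierr)
  call MPI_Finalize(ierr)
end program mpi_test
EOF

# Check which compiler the OpenMPI Fortran wrapper actually drives;
# a gfortran-built OpenMPI will not work with nvfortran.
mpifort --version
mpifort mpi_test.f90 -o mpi_test && echo "MPI_Fortran works"
```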

CMakeLists.txt doesn't support CUDA arch 89

My GPU has a CUDA architecture of 89 (RTX 4080). Currently, the CMakeLists.txt is only set up to handle architectures ending in 0.

When building with the default CMakeLists.txt, I get the following error when I try to run the example Fortran binary ./train:

Accelerator Fatal Error: This file was compiled: -acc=gpu -gpu=cc70 -gpu=cc80 -acc=host or -acc=multicore
Rebuild this file with -gpu=cc89 to use NVIDIA Tesla GPU 0
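If the project exposes its target architecture list as a CMake cache variable, cc89 could be added at configure time; otherwise, the compiler flags in CMakeLists.txt need an extra entry (a sketch; the cache variable name is hypothetical and must be checked against the actual CMakeLists.txt):

```shell
# Hypothetical: override the architecture list at configure time
# (the exact cache variable name must be verified in CMakeLists.txt).
cmake -DCUDA_CC_LIST="70;80;89" ..

# Alternatively, add -gpu=cc89 wherever CMakeLists.txt currently
# passes -gpu=cc70 -gpu=cc80 to nvfortran/nvc++.
```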

Docker build uses a large amount of memory when running with more than 4 cores

The following line in docker/Dockerfile allows make to build with all available cores.

make -j install && \

On my laptop I have 12 cores. I ran the Docker build inside a virtual machine with 8 cores and 8 GB RAM, and during the make stage, Docker quickly consumes all available memory and crashes.

I would suggest setting a limit, or advising users to either build serially (make install) or override the default with something sensible like make -j 4, although what counts as sensible depends on available RAM.
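The suggestion above could be made configurable with a Docker build argument (a sketch; the argument name is illustrative):

```shell
# In docker/Dockerfile (sketch), replace the fixed job count with
# a build argument defaulting to something conservative:
#   ARG MAKE_JOBS=4
#   RUN make -j ${MAKE_JOBS} install && ...

# Users with enough RAM can then opt in to more parallelism:
docker build --build-arg MAKE_JOBS=8 -f docker/Dockerfile .
```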

Using the library without CUDA

Hello, I was wondering if it is possible to have a version of this library that runs without CUDA support, since I want to use the library on CPU.

Thanks in advance for the answer
Nico
