bionemo's Introduction

NVIDIA BioNeMo

Overview

AI is driving advances in chemistry and biology, including 3D protein structure prediction, property prediction, and generation of novel protein sequences and molecules. This progress supports pharmaceutical applications such as antibody design and small-molecule drug discovery, as well as newer approaches like RNA aptamer and peptide-based therapeutics. Generative models in particular are being applied at every stage of drug discovery.

BioNeMo is a cloud service for every stage of AI-powered drug discovery. It gives researchers access to state-of-the-art (SOTA) AI models for protein structure prediction, molecule generation, protein generation, and binding pose generation.

BioNeMo provides API endpoints and a graphical user interface (GUI) for browser-based access to pretrained models, and supports training and fine-tuning of these models through DGX Cloud. Users can generate molecules with MoFlow and MegaMolBART, generate protein sequences with ProtGPT-2, predict protein structures with ESMFold, OpenFold, and AlphaFold, predict properties from MegaMolBART and ESM embeddings, and predict docked poses with DiffDock.

Getting Started

For more information and to sign up for access to BioNeMo Service, please visit https://www.nvidia.com/en-us/gpu-cloud/bionemo/.

To get started with the BioNeMo Service API, you can explore a set of example workflows in the BioNeMo API example notebooks.

bionemo's Issues

NCCL timeout when using default training configs

I am using the default configs, code, and data to train a model with the BioNeMo framework. The timeout occurs in the middle of training. From the error logs, I have no clue how to track down the bug. Any ideas on how to debug this? Thanks!

  
Epoch 0:   6%|██                               | 32040/500150 [6:28:43<94:39:17,  1.37it/s, loss=2.6, v_num=95nc, reduced_train_loss=2.590, global_step=3.2e+4, consumed_samples=2.56e+7]
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96120, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25624887 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96117, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625613 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96125, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625593 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96125, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625593 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64088, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625251 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96117, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625613 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64088, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625251 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96117, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625613 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96120, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25624887 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64088, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625251 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96120, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25624886 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96120, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25624887 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:472] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:478] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:830] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96117, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625613 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96117, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625613 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:472] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:478] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:830] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96125, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625593 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96125, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625593 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:472] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:478] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:830] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96117, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625613 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96117, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625613 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64090, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800741 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64090, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800733 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64090, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800769 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64090, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800781 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64090, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800847 milliseconds before timing out.
^C
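
A log like this is easier to read once the repeated watchdog lines are collapsed per rank and sequence number. The following is a minimal sketch, not part of BioNeMo or PyTorch: the helper name `summarize_timeouts` and the regular expression are assumptions written against the exact line format shown above. In this log, ranks 2, 3, and 7 report `SeqNum=64088` while ranks 0, 1, 4, 5, and 6 report `SeqNum=64090`, which may mean the ranks were not waiting on the same collective. The configured timeout of 1800000 ms is 30 minutes, whereas some operations report running for about 7.1 hours.

```python
import re

# Pattern for the watchdog lines in the log above; other log formats will not match.
WATCHDOG_RE = re.compile(
    r"\[Rank (?P<rank>\d+)\] Watchdog caught collective operation timeout: "
    r"WorkNCCL\(SeqNum=(?P<seq>\d+), OpType=(?P<op>\w+), "
    r"Timeout\(ms\)=(?P<timeout_ms>\d+)\) ran for (?P<ran_ms>\d+) milliseconds"
)


def summarize_timeouts(log_text):
    """Map (rank, seq_num) to the longest reported run time in milliseconds."""
    summary = {}
    for match in WATCHDOG_RE.finditer(log_text):
        key = (int(match["rank"]), int(match["seq"]))
        ran_ms = int(match["ran_ms"])
        # Repeated lines for the same collective collapse to one entry.
        summary[key] = max(ran_ms, summary.get(key, 0))
    return summary


# Two lines copied from the log above.
SAMPLE_LOG = (
    "[E ProcessGroupNCCL.cpp:458] [Rank 7] Watchdog caught collective operation timeout: "
    "WorkNCCL(SeqNum=64088, OpType=ALLREDUCE, Timeout(ms)=1800000) "
    "ran for 25625251 milliseconds before timing out.\n"
    "[E ProcessGroupNCCL.cpp:458] [Rank 4] Watchdog caught collective operation timeout: "
    "WorkNCCL(SeqNum=64090, OpType=ALLREDUCE, Timeout(ms)=1800000) "
    "ran for 1800741 milliseconds before timing out.\n"
)

if __name__ == "__main__":
    for (rank, seq), ran_ms in sorted(summarize_timeouts(SAMPLE_LOG).items()):
        print(f"rank {rank}  SeqNum {seq}  ran {ran_ms / 60000:.1f} min")
```

Feeding the full log through the same function gives one row per rank and sequence number, which makes it easier to spot which ranks fell behind.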

The configs are:

name: esm2nv
do_training: True # set to false if data preprocessing steps must be completed
do_testing: False # set to true to run evaluation on test data after training, requires test_dataset section
restore_from_path: null # used when starting from a .nemo file

trainer:
  devices: 8 # number of GPUs or CPUs
  num_nodes: 1
  accelerator: gpu # gpu or cpu
  precision: 16 # 16 or 32
  logger: False # logger is provided by NeMo exp_manager
  enable_checkpointing: False # checkpointing is done by NeMo exp_manager
  replace_sampler_ddp: False # use NeMo Megatron samplers
  max_epochs: null # use max_steps instead with NeMo Megatron model
  log_every_n_steps: 10 # number of iterations between logging
  val_check_interval: 15e4
  limit_val_batches: 50 # number of batches in validation step, use fraction for fraction of data, 0 to disable
  limit_test_batches: 500 # number of batches in test step, use fraction for fraction of data, 0 to disable
  accumulate_grad_batches: 1
  gradient_clip_val: 1.0
  benchmark: False
  max_steps: 500000
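
One low-risk way to gather more information about a hang like this is to rerun with verbose distributed logging turned on. The sketch below is an assumption about how one might do that, not a fix and not BioNeMo's own tooling: `NCCL_DEBUG` and `TORCH_DISTRIBUTED_DEBUG` are environment variables documented by NCCL and PyTorch respectively, while the helper name `diagnostic_env` and the training command in the comment are hypothetical.

```python
import os


def diagnostic_env(base=None):
    """Return a copy of the environment with verbose distributed logging enabled."""
    env = dict(os.environ if base is None else base)
    # NCCL logs initialization and transport details at INFO level.
    env["NCCL_DEBUG"] = "INFO"
    # PyTorch adds extra consistency checks and logging for collectives.
    env["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"
    return env


if __name__ == "__main__":
    env = diagnostic_env({"PATH": "/usr/bin"})
    print(env["NCCL_DEBUG"], env["TORCH_DISTRIBUTED_DEBUG"])
    # Hypothetical launch; replace the command with the actual training entry point:
    # subprocess.run(["python", "train.py"], env=diagnostic_env(), check=True)
```

Both settings increase log volume and can slow training, so they are best used for a short reproduction run rather than the full 500000 steps.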
