The bionemo from nvidia

bionemo's Introduction

NVIDIA BioNeMo

Overview

AI is achieving incredible breakthroughs in chemistry and biology such as enabling 3D protein structure prediction, property prediction, and generation of novel protein sequences and molecules. This progress is enabling developments in the pharmaceutical industry like antibody design, small-molecule drug discovery, and newer approaches like RNA aptamer and peptide-based therapeutics. Generative models, specifically, are enabling every stage of drug discovery.

BioNeMo is a cloud service for every stage of AI-powered drug discovery. BioNeMo enables researchers with SOTA AI models for protein structure prediction, molecule generation, protein generation, and binding pose generation.

BioNeMo provides API endpoints and a rich graphical user interface (GUI) to enable browser-based access to pretrained models, and enables training and fine-tuning of these models through DGX Cloud. Users can generate molecules with MoFlow and MegaMolBART, generate protein sequences with ProtGPT-2, predict protein structures with ESMFold, OpenFold, and AlphaFold, predict properties from embeddings from MegaMolBART and ESM, and predict docked poses through DiffDock.

BioNeMo is a cloud service for every stage of AI-powered drug discovery.

Getting Started

For more information and to sign up for access to BioNeMo Service, please visit https://www.nvidia.com/en-us/gpu-cloud/bionemo/.

To get started with the BioNeMo Service API, you can explore a set of example workflows in the BioNeMo API example notebooks.

bionemo's People

Contributors

Stargazers

Watchers

bionemo's Issues

nccl timeout when using default training configs

I am using the default configs, code and data to train a model within BioNeMo framework. The timeout occurs at the middle of the training. Based on the error logs, I have no clue how to track down the bug... Any idea on how to debug? Thanks!!!!

  
Epoch 0:   6%|██                               | 32040/500150 [6:28:43<94:39:17,  1.37it/s, loss=2.6, v_num=95nc, reduced_train_loss=2.590, global_step=3.2e+4, consumed_samples=2.56e+7][E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96120, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25624887 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96117, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625613 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96125, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625593 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96125, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625593 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64088, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625251 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96117, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625613 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64088, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625251 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96117, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625613 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96120, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25624887 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64088, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625251 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96120, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25624886 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96120, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25624887 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:472] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:478] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:830] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96117, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625613 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96117, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625613 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:472] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:478] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:830] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96125, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625593 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96125, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625593 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:472] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:478] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:830] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96117, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625613 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96117, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625613 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64090, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800741 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64090, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800733 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64090, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800769 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64090, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800781 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64090, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800847 milliseconds before timing out.
^C

The configs are:

name: esm2nv
do_training: True # set to false if data preprocessing steps must be completed
do_testing: False # set to true to run evaluation on test data after training, requires test_dataset section
restore_from_path: null # used when starting from a .nemo file

trainer:
  devices: 8 # number of GPUs or CPUs
  num_nodes: 1 
  accelerator: gpu #gpu or cpu
  precision: 16 #16 or 32
  logger: False # logger is provided by NeMo exp_manager
  enable_checkpointing: False # checkpointing is done by NeMo exp_manager
  replace_sampler_ddp: False # use NeMo Megatron samplers
  max_epochs: null # # use max_steps instead with NeMo Megatron model
  log_every_n_steps: 10  # number of interations between logging
  val_check_interval: 15e4
  limit_val_batches: 50 # number of batches in validation step, use fraction for fraction of data, 0 to disable
  limit_test_batches: 500 # number of batches in test step, use fraction for fraction of data, 0 to disable
  accumulate_grad_batches: 1
  gradient_clip_val: 1.0
  benchmark: False
  max_steps: 500000

Recommend Projects

nvidia / bionemo Goto Github PK

bionemo's Introduction

NVIDIA BioNeMo

Overview

Getting Started

bionemo's People

Contributors

Stargazers

Watchers

Forkers

bionemo's Issues

nccl timeout when using default training configs

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent