
diffdock's People

Contributors

amorehead, bjing2016, duerrsimon, gcorso, hannesstark, jsilter, satishgaurav


diffdock's Issues

ModuleNotFoundError: No module named 'datasets.pdbbind'

Thanks for the great work.

When I run "python -m evaluate --model_dir workdir/paper_score_model --ckpt best_ema_inference_epoch_model.pt --confidence_ckpt best_model_epoch75.pt --confidence_model_dir workdir/paper_confidence_model --run_name DiffDockInference --inference_steps 20 --split_path data/splits/timesplit_test --samples_per_complex 40 --batch_size 10",
I get the following message:
"ModuleNotFoundError: No module named 'datasets.pdbbind'"

I think we need an empty file named "__init__.py" under the datasets folder.
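For example (a minimal sketch, assuming it is run from the repository root), creating that file makes the package importable:

from pathlib import Path

# Hypothetical one-off fix: an empty __init__.py makes 'datasets' an importable package
Path("datasets/__init__.py").touch()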

I am not sure whether I got the right results after running inference

$python -m inference --protein_ligand_csv data/protein_ligand_example_csv.csv --out_dir results/user_predictions_small --inference_steps 20 --samples_per_complex 40 --batch_size 10 --actual_steps 18 --no_final_step_noise

Precomputing and saving to cache SO(3) distribution table
Precomputing and saving to cache torus distribution table
100%|█████████████████████████████████████████| 201/201 [00:45<00:00, 4.39it/s]
100%|█████████████████████████████████████████| 201/201 [00:55<00:00, 3.59it/s]
/home/xzhang/projects/DiffDock/utils/torus.py:39: RuntimeWarning: invalid value encountered in divide
score_ = grad(x, sigma[:, None], N=100) / p_
Reading molecules and generating local structures with RDKit (unless --keep_local_structures is turned on).
0it [00:00, ?it/s]rdkit coords could not be generated without using random coords. using random coords now.
6it [00:01, 3.74it/s]
Reading language model embeddings.
Generating graphs for ligands and proteins
loading complexes: 100%|██████████████████████████| 6/6 [00:02<00:00, 2.81it/s]
loading data from memory: data/cache_torsion/limit0_INDEX_maxLigSizeNone_H0_recRad15.0_recMax24_esmEmbeddings863911206/heterographs.pkl
Number of complexes: 6
radius protein: mean 33.793853759765625, std 14.15740966796875, max 53.81545639038086
radius molecule: mean 6.548925876617432, std 3.7833714485168457, max 14.822683334350586
distance protein-mol: mean 59.33454513549805, std 22.522035598754883, max 75.89938354492188
rmsd matching: mean 0.0, std 0.0, max 0
HAPPENING | confidence model uses different type of graphs than the score model. Loading (or creating if not existing) the data for the confidence model now.
Reading molecules and generating local structures with RDKit (unless --keep_local_structures is turned on).
0it [00:00, ?it/s]rdkit coords could not be generated without using random coords. using random coords now.
6it [00:01, 3.53it/s]
Reading language model embeddings.
Generating graphs for ligands and proteins
loading complexes: 100%|██████████████████████████| 6/6 [00:02<00:00, 2.82it/s]
loading data from memory: data/cache_torsion_allatoms/limit0_INDEX_maxLigSizeNone_H0_recRad15.0_recMax24_atomRad5_atomMax8_esmEmbeddings863911206/heterographs.pkl
Number of complexes: 6
radius protein: mean 33.793853759765625, std 14.15740966796875, max 53.81545639038086
radius molecule: mean 6.218267917633057, std 2.776658773422241, max 12.076842308044434
distance protein-mol: mean 58.96320724487305, std 22.422767639160156, max 75.24571990966797
rmsd matching: mean 0.0, std 0.0, max 0
common t schedule [1. 0.95 0.9 0.85 0.8 0.75 0.7 0.65 0.6 0.55 0.5 0.45 0.4 0.35
0.3 0.25 0.2 0.15 0.1 0.05]
Size of test dataset: 6
0it [00:00, ?it/s]/home/xzhang/miniconda3/envs/diffdock/lib/python3.9/site-packages/e3nn/o3/_spherical_harmonics.py:82: UserWarning: FALLBACK path has been taken inside: compileCudaFusionGroup. This is an indication that codegen Failed for some reason.
To debug try disable codegen fallback path via setting the env variable export PYTORCH_NVFUSER_DISABLE=fallback
To report the issue, try enable logging via setting the envvariable export PYTORCH_JIT_LOG_LEVEL=manager.cpp
(Triggered internally at /opt/conda/conda-bld/pytorch_1659484809662/work/torch/csrc/jit/codegen/cuda/manager.cpp:237.)
sh = _spherical_harmonics(self._lmax, x[..., 0], x[..., 1], x[..., 2])
5it [03:45, 30.19s/it]Failed on ['data/PDBBind_processed/6mo8/6mo8_protein_processed.pdb____data/PDBBind_processed/6hld/6hld_ligand.mol2'] Invariant Violation
no eligible neighbors for chiral center
Violation occurred on line 213 in file Code/GraphMol/FileParsers/MolFileStereochem.cpp
Failed Expression: nbrScores.size()
RDKIT: 2022.09.1
BOOST: 1_78

6it [04:24, 44.08s/it]
Failed for 1 complexes
Skipped 0 complexes
Results are in results/user_predictions_small

There is only one file in the folder "index5_data-PDBBind_processed-6mo8-6mo8_protein_processed.pdb____data-PDBBind_processed-6hld-6hld_ligand.mol2", but there are 41 files in each of the other 5 folders. Did I run the inference correctly? Thanks

Index typo in the model

Thanks for the great work!

We noticed a potentially major typo in the code.

center_edge_attr = torch.cat([center_edge_attr, lig_node_attr[center_edge_index[0], :self.ns]], -1)

center_edge_index[0] will give us data['ligand'].batch (as defined below)

edge_index = torch.cat([data['ligand'].batch.unsqueeze(0), torch.arange(len(data['ligand'].batch)).to(data['ligand'].x.device).unsqueeze(0)], dim=0)

But this doesn't make sense here, since we don't need features from protein/ligand pair 0 to make a prediction for protein/ligand pair 1. lig_node_attr has dimension n_atom x n_feature, meaning that pair 1 will pick up features belonging to pair 0.

This typo should have a large effect on the results, but somehow it doesn't appear to, so we are confused. Maybe our understanding is not correct.
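A toy example of what we mean (hypothetical tensors, not the actual DiffDock shapes):

import torch

# 5 ligand atoms across 2 complexes: atoms 0-2 belong to graph 0, atoms 3-4 to graph 1
lig_node_attr = torch.arange(5).float().unsqueeze(1)   # one feature per atom, equal to its atom index
batch = torch.tensor([0, 0, 0, 1, 1])                   # plays the role of data['ligand'].batch
node_idx = torch.arange(5)
center_edge_index = torch.stack([batch, node_idx])      # row 0: graph ids, row 1: atom ids

print(lig_node_attr[center_edge_index[0]].squeeze())    # tensor([0., 0., 0., 1., 1.]) -> features looked up by graph id
print(lig_node_attr[center_edge_index[1]].squeeze())    # tensor([0., 1., 2., 3., 4.]) -> features looked up per atom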

Wei

command: /usr/bin/python2.7 -c 'import sys, setuptools, tokenize; sys.argv[0]

command: /usr/bin/python2.7 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/home/user/Documents/CONDA/DiffDock/DiffDock/esm/setup.py'"'"'; __file__='"'"'/home/user/Documents/CONDA/DiffDock/DiffDock/esm/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' develop --no-deps --user --prefix=
cwd: /home/user/Documents/CONDA/DiffDock/DiffDock/esm/
Complete output (6 lines):
usage: setup.py [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
or: setup.py --help [cmd1 cmd2 ...]
or: setup.py --help-commands
or: setup.py cmd --help

error: option --user not recognized
----------------------------------------

ERROR: Command errored out with exit status 1: /usr/bin/python2.7 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/home/user/Documents/CONDA/DiffDock/DiffDock/esm/setup.py'"'"'; __file__='"'"'/home/user/Documents/CONDA/DiffDock/DiffDock/esm/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' develop --no-deps --user --prefix= Check the logs for full command output.

python -m confidence.confidence_train

confidence_train.py: error: unrecognized arguments: --inf_sched_alpha 1 --inf_sched_beta 1 --tr_sigma_min 0.1 --tr_sigma_max 34 --rot_sigma_min 0.03 --rot_sigma_max 1.55

ModuleNotFoundError: No module named 'esm.model'

Hello,

Thank you for the great work. I was trying the example out of the box: I managed to successfully clone the repo and install the conda environment. Then I ran:

python datasets/esm_embedding_preparation.py --protein_ligand_csv data/protein_ligand_example_csv.csv --out_file data/prepared_for_esm.fasta 
git clone https://github.com/facebookresearch/esm
cd esm
pip install -e .
cd ..
HOME=esm/model_weights python esm/scripts/extract.py esm2_t33_650M_UR50D data/prepared_for_esm.fasta data/esm2_output --repr_layers 33 --include per_tok 

However, on the last line of code, I received the error:

(diffdock) akshat@Akshat:~/Downloads/DiffDock$ HOME=esm/model_weights python esm/scripts/extract.py esm2_t33_650M_UR50D data/prepared_for_esm.fasta data/esm2_output --repr_layers 33 --include per_tok
Traceback (most recent call last):
  File "/home/akshat/Downloads/DiffDock/esm/scripts/extract.py", line 12, in <module>
    from esm import Alphabet, FastaBatchedDataset, ProteinBertModel, pretrained, MSATransformer
  File "/home/akshat/Downloads/DiffDock/esm/esm/pretrained.py", line 15, in <module>
    from esm.model.esm2 import ESM2
ModuleNotFoundError: No module named 'esm.model'
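A quick diagnostic I would try (just a sketch, not from the original report) is to check which esm package Python actually resolves, since a pip-installed fair-esm or a stale copy shadowing the cloned repo could explain the missing esm.model submodule:

import importlib.util

# Locate the 'esm' package without importing it (importing would trigger the error above)
spec = importlib.util.find_spec("esm")
print(spec.origin if spec else "esm not found")                 # which esm/__init__.py is on the path
print(list(spec.submodule_search_locations) if spec else None)  # where Python will look for esm.model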

I would appreciate some advice.

Thank you so much! :)

installation errors

Thanks for your repo. When I followed your README and tried to create a conda env using "conda env create", I got:
Pip subprocess error:
error: subprocess-exited-with-error

× python setup.py bdist_wheel did not run successfully.
│ exit code: 1
╰─> [73 lines of output]

failed

CondaEnvException: Pip failed

Error inference with single files and torch_geometric

I installed this in a fresh environment using the provided environment.yml, but it fails on inference. I attached the two files (renamed to .txt for upload).

The esm embedding step works:

(diffdock) $ HOME=esm/model_weights python esm/scripts/extract.py esm2_t33_650M_UR50D data/prepared_for_esm.fasta data/esm2_output --repr_layers 33 --include per_tok
Downloading: "https://dl.fbaipublicfiles.com/fair-esm/models/esm2_t33_650M_UR50D.pt" to esm/model_weights/.cache/torch/hub/checkpoints/esm2_t33_650M_UR50D.pt
Downloading: "https://dl.fbaipublicfiles.com/fair-esm/regression/esm2_t33_650M_UR50D-contact-regression.pt" to esm/model_weights/.cache/torch/hub/checkpoints/esm2_t33_650M_UR50D-contact-regression.pt
Transferred model to GPU
Read data/prepared_for_esm.fasta with 2 sequences
Processing 1 of 1 batches (2 sequences)

I get the following error when executing this command in the root dir of the project. I did not download the dataset and used single .sdf and .pdb files, but according to the README that should work. data/esm2_output contains two .pt files: 1cbr_protein.pdb_chain_0.pt and 1cbr_protein.pdb_chain_1.pt

(diffdock) $ python -m inference --ligand_path examples/1cbr_ligand.sdf --protein_path examples/1cbr_protein.pdb --out_dir results/user_predictions_small --inference_steps 20 --samples_per_complex 40 --batch_size 10
Traceback (most recent call last):
  File "/home/duerr/miniconda3/envs/diffdock/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/duerr/miniconda3/envs/diffdock/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/share/lcbcsrv5/lcbcdata/duerr/PhD/08_Code/DiffDock/inference.py", line 16, in <module>
    from datasets.pdbbind import PDBBind
  File "/share/lcbcsrv5/lcbcdata/duerr/PhD/08_Code/DiffDock/datasets/pdbbind.py", line 22, in <module>
    from utils.utils import read_strings_from_txt
  File "/share/lcbcsrv5/lcbcdata/duerr/PhD/08_Code/DiffDock/utils/utils.py", line 12, in <module>
    from torch_geometric.nn.data_parallel import DataParallel
  File "/home/duerr/miniconda3/envs/diffdock/lib/python3.9/site-packages/torch_geometric/nn/__init__.py", line 3, in <module>
    from .sequential import Sequential
  File "/home/duerr/miniconda3/envs/diffdock/lib/python3.9/site-packages/torch_geometric/nn/sequential.py", line 8, in <module>
    from torch_geometric.nn.conv.utils.jit import class_from_module_repr
  File "/home/duerr/miniconda3/envs/diffdock/lib/python3.9/site-packages/torch_geometric/nn/conv/__init__.py", line 25, in <module>
    from .spline_conv import SplineConv
  File "/home/duerr/miniconda3/envs/diffdock/lib/python3.9/site-packages/torch_geometric/nn/conv/spline_conv.py", line 16, in <module>
    from torch_spline_conv import spline_basis, spline_weighting
  File "/home/duerr/miniconda3/envs/diffdock/lib/python3.9/site-packages/torch_spline_conv/__init__.py", line 11, in <module>
    torch.ops.load_library(importlib.machinery.PathFinder().find_spec(
AttributeError: 'NoneType' object has no attribute 'origin'

1cbr_ligand.txt
1cbr_protein.txt
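For context, the AttributeError means that PathFinder().find_spec(...) returned None, i.e. torch_spline_conv cannot locate its compiled extension (often a PyTorch/CUDA mismatch in the wheel). A minimal diagnostic sketch; the extension name used here is an assumption:

import importlib.machinery
import importlib.util

pkg = importlib.util.find_spec("torch_spline_conv")     # locates the package without importing (and crashing)
ext = importlib.machinery.PathFinder().find_spec(
    "_basis_cuda",                                       # name of the compiled library is an assumption
    list(pkg.submodule_search_locations))
print(ext)                                               # None reproduces the failure in the traceback above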

FileNotFoundError: [Errno 2] No such file or directory: 'data/prepared_for_esm.fasta'

Hi, when I follow this command:
HOME=esm/model_weights python esm/scripts/extract.py esm2_t33_650M_UR50D data/prepared_for_esm.fasta data/esm2_output --repr_layers 33 --include per_tok
I get an error.
Traceback (most recent call last):
  File "/home/icer/my_prj/diffdock/DiffDock/esm/scripts/extract.py", line 137, in <module>
    main(args)
  File "/home/icer/my_prj/diffdock/DiffDock/esm/scripts/extract.py", line 74, in main
    dataset = FastaBatchedDataset.from_file(args.fasta_file)
  File "/home/icer/my_prj/diffdock/DiffDock/esm/esm/data.py", line 39, in from_file
    with open(fasta_file, "r") as infile:
FileNotFoundError: [Errno 2] No such file or directory: 'data/prepared_for_esm.fasta'
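For reference, data/prepared_for_esm.fasta is produced by the embedding preparation step shown elsewhere in the README (here with the example CSV; the arguments depend on your input), so presumably that step has to be run first:

python datasets/esm_embedding_preparation.py --protein_ligand_csv data/protein_ligand_example_csv.csv --out_file data/prepared_for_esm.fasta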

environment.yml

Why can't pip find the required versions for many of the packages?

New script for reproducing paper results directly from files

I followed all the instructions in the README and cloned and evaluated the repo three times using the checkpoint you provided. The evaluation command is the same as in the README, and the dataset I am using is Version 2 on Zenodo. The fraction of RMSDs below 2 Å is always 17%, the filtered fraction below 2 Å is 32%-34%, and the top-5 filtered fraction below 2 Å is 38%-40%. In the paper the results are all higher than these. The full results are below:

run_times_std 21.73
run_times_mean 47.04
steric_clash_fraction 8.37
self_intersect_fraction 0.58
mean_rmsd 19.127588073855097
rmsds_below_2 17.09366391184573
rmsds_below_5 50.874655647382916
rmsds_percentile_25 2.68
rmsds_percentile_50 4.91
rmsds_percentile_75 9.19
mean_centroid 16.57
centroid_below_2 54.66
centroid_below_5 77.22
centroid_percentile_25 0.8
centroid_percentile_50 1.75
centroid_percentile_75 4.3
top5_steric_clash_fraction 6.61
top5_self_intersect_fraction 0.0
top5_rmsds_below_2 31.68
top5_rmsds_below_5 69.97
top5_rmsds_percentile_25 1.68
top5_rmsds_percentile_50 3.17
top5_rmsds_percentile_75 5.73
top5_centroid_below_2 67.77
top5_centroid_below_5 85.4
top5_centroid_percentile_25 0.59
top5_centroid_percentile_50 1.28
top5_centroid_percentile_75 2.59
top10_steric_clash_fraction 6.89
top10_self_intersect_fraction 0.0
top10_rmsds_below_2 38.29
top10_rmsds_below_5 73.83
top10_rmsds_percentile_25 1.52
top10_rmsds_percentile_50 2.74
top10_rmsds_percentile_75 5.21
top10_centroid_below_2 70.52
top10_centroid_below_5 86.78
top10_centroid_percentile_25 0.51
top10_centroid_percentile_50 1.09
top10_centroid_percentile_75 2.37
filtered_self_intersect_fraction 0.55
filtered_steric_clash_fraction 2.48
filtered_rmsds_below_2 32.78
filtered_rmsds_below_5 59.5
filtered_rmsds_percentile_25 1.63
filtered_rmsds_percentile_50 3.48
filtered_rmsds_percentile_75 7.91
filtered_centroid_below_2 62.26
filtered_centroid_below_5 79.89
filtered_centroid_percentile_25 0.55
filtered_centroid_percentile_50 1.26
filtered_centroid_percentile_75 3.3
top5_filtered_self_intersect_fraction 4.68
top5_filtered_steric_clash_fraction 4.68
top5_filtered_rmsds_below_2 39.94
top5_filtered_rmsds_below_5 73.0
top5_filtered_rmsds_percentile_25 1.45
top5_filtered_rmsds_percentile_50 2.55
top5_filtered_rmsds_percentile_75 5.25
top5_filtered_centroid_below_2 69.97
top5_filtered_centroid_below_5 86.5
top5_filtered_centroid_percentile_25 0.46
top5_filtered_centroid_percentile_50 1.05
top5_filtered_centroid_percentile_75 2.43
top10_filtered_self_intersect_fraction 4.13
top10_filtered_steric_clash_fraction 4.13
top10_filtered_rmsds_below_2 42.7
top10_filtered_rmsds_below_5 75.76
top10_filtered_rmsds_percentile_25 1.4
top10_filtered_rmsds_percentile_50 2.43
top10_filtered_rmsds_percentile_75 4.78
top10_filtered_centroid_below_2 72.45
top10_filtered_centroid_below_5 87.6
top10_filtered_centroid_percentile_25 0.45
top10_filtered_centroid_percentile_50 0.96
top10_filtered_centroid_percentile_75 2.15
no_overlap_run_times_std 21.73
no_overlap_run_times_mean 47.04
no_overlap_steric_clash_fraction 12.05
no_overlap_self_intersect_fraction 0.49
no_overlap_mean_rmsd 13.51542289628426
no_overlap_rmsds_below_2 6.09375
no_overlap_rmsds_below_5 32.29166666666667
no_overlap_rmsds_percentile_25 4.09
no_overlap_rmsds_percentile_50 7.82
no_overlap_rmsds_percentile_75 20.88
no_overlap_mean_centroid 11.02
no_overlap_centroid_below_2 33.66
no_overlap_centroid_below_5 56.94
no_overlap_centroid_percentile_25 1.48
no_overlap_centroid_percentile_50 3.52
no_overlap_centroid_percentile_75 19.7
no_overlap_top5_steric_clash_fraction 9.03
no_overlap_top5_self_intersect_fraction 0.0
no_overlap_top5_rmsds_below_2 13.19
no_overlap_top5_rmsds_below_5 52.08
no_overlap_top5_rmsds_percentile_25 2.62
no_overlap_top5_rmsds_percentile_50 4.82
no_overlap_top5_rmsds_percentile_75 8.93
no_overlap_top5_centroid_below_2 47.22
no_overlap_top5_centroid_below_5 70.14
no_overlap_top5_centroid_percentile_25 1.11
no_overlap_top5_centroid_percentile_50 2.24
no_overlap_top5_centroid_percentile_75 5.99
no_overlap_top10_steric_clash_fraction 9.03
no_overlap_top10_self_intersect_fraction 0.0
no_overlap_top10_rmsds_below_2 16.67
no_overlap_top10_rmsds_below_5 55.56
no_overlap_top10_rmsds_percentile_25 2.45
no_overlap_top10_rmsds_percentile_50 4.19
no_overlap_top10_rmsds_percentile_75 8.05
no_overlap_top10_centroid_below_2 49.31
no_overlap_top10_centroid_below_5 72.22
no_overlap_top10_centroid_percentile_25 0.96
no_overlap_top10_centroid_percentile_50 2.06
no_overlap_top10_centroid_percentile_75 5.64
no_overlap_filtered_self_intersect_fraction 0.0
no_overlap_filtered_steric_clash_fraction 4.86
no_overlap_mean_filtered_rmsds 12.111509165359777
no_overlap_filtered_rmsds_below_2 15.28
no_overlap_filtered_rmsds_below_5 38.89
no_overlap_filtered_rmsds_percentile_25 2.73
no_overlap_filtered_rmsds_percentile_50 6.79
no_overlap_filtered_rmsds_percentile_75 16.45
no_overlap_mean_filtered_centroid 9.79313355364492
no_overlap_filtered_centroid_below_2 41.67
no_overlap_filtered_centroid_below_5 61.11
no_overlap_filtered_centroid_percentile_25 0.94
no_overlap_filtered_centroid_percentile_50 2.82
no_overlap_filtered_centroid_percentile_75 14.3
no_overlap_top5_filtered_self_intersect_fraction 6.25
no_overlap_top5_filtered_steric_clash_fraction 6.25
no_overlap_top5_filtered_rmsds_below_2 22.92
no_overlap_top5_filtered_rmsds_below_5 56.94
no_overlap_top5_filtered_rmsds_percentile_25 2.11
no_overlap_top5_filtered_rmsds_percentile_50 4.28
no_overlap_top5_filtered_rmsds_percentile_75 9.13
no_overlap_top5_filtered_centroid_below_2 51.39
no_overlap_top5_filtered_centroid_below_5 72.22
no_overlap_top5_filtered_centroid_percentile_25 0.8
no_overlap_top5_filtered_centroid_percentile_50 1.86
no_overlap_top5_filtered_centroid_percentile_75 6.24
no_overlap_top10_filtered_self_intersect_fraction 6.94
no_overlap_top10_filtered_steric_clash_fraction 6.94
no_overlap_top10_filtered_rmsds_below_2 25.69
no_overlap_top10_filtered_rmsds_below_5 60.42
no_overlap_top10_filtered_rmsds_percentile_25 1.98
no_overlap_top10_filtered_rmsds_percentile_50 3.96
no_overlap_top10_filtered_rmsds_percentile_75 7.83
no_overlap_top10_filtered_centroid_below_2 54.86
no_overlap_top10_filtered_centroid_below_5 73.61
no_overlap_top10_filtered_centroid_percentile_25 0.8
no_overlap_top10_filtered_centroid_percentile_50 1.74
no_overlap_top10_filtered_centroid_percentile_75 5.46

Could you please take a look? I am not sure whether this is caused by changes to the original code.

RuntimeError: torch.cat(): expected a non-empty list of Tensors

Hello,

Thanks for your awesome work.
I ran into the following error when training the model:

loading data from memory:  data/cache_torsion/limit0_INDEXtimesplit_no_lig_overlap_train_maxLigSizeNone_H0_recRad15.0_recMax24_esmEmbeddings/heterographs.pkl
Number of complexes:  16271                                                                          
radius protein: mean 35.324798583984375, std 10.894360542297363, max 140.3852081298828
radius molecule: mean 7.464962482452393, std 3.125143051147461, max 28.649322509765625                                                                                                                     
distance protein-mol: mean 12.960723876953125, std 6.1995849609375, max 70.93856811523438
rmsd matching: mean 0.5280884495331837, std 0.5165252914283075, max 6.0405902977233294                                                                                                                     
loading data from memory:  data/cache_torsion/limit0_INDEXtimesplit_no_lig_overlap_val_maxLigSizeNone_H0_recRad15.0_recMax24_esmEmbeddings/heterographs.pkl
Number of complexes:  955                                                                            
radius protein: mean 35.945213317871094, std 11.460107803344727, max 92.777587890625
radius molecule: mean 7.608007431030273, std 3.1059141159057617, max 21.896770477294922
distance protein-mol: mean 13.32070541381836, std 6.682881832122803, max 54.33257293701172
rmsd matching: mean 0.5830034745709728, std 0.6322264833323166, max 5.280259896200438
Model with 20248214 parameters

Starting training...
Run name:  big_score_model
  0%|                                                                                                                                                                             | 0/1017 [00:13<?, ?it/s]
Traceback (most recent call last):
  File "/home/chenshoufa/anaconda3/envs/diffdock/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/chenshoufa/anaconda3/envs/diffdock/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/chenshoufa/workspace/DiffDock/train.py", line 158, in <module>
    main_function()
  File "/home/chenshoufa/workspace/DiffDock/train.py", line 153, in main_function
    train(args, model, optimizer, scheduler, ema_weights, train_loader, val_loader, t_to_sigma, run_dir)
  File "/home/chenshoufa/workspace/DiffDock/train.py", line 35, in train
    train_losses = train_epoch(model, train_loader, optimizer, device, t_to_sigma, loss_fn, ema_weights)
  File "/home/chenshoufa/workspace/DiffDock/utils/training.py", line 128, in train_epoch
    raise e
  File "/home/chenshoufa/workspace/DiffDock/utils/training.py", line 105, in train_epoch
    tr_pred, rot_pred, tor_pred = model(data)
  File "/home/chenshoufa/anaconda3/envs/diffdock/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/chenshoufa/anaconda3/envs/diffdock/lib/python3.9/site-packages/torch_geometric/nn/data_parallel.py", line 70, in forward
    outputs = self.parallel_apply(replicas, inputs, None)
  File "/home/chenshoufa/anaconda3/envs/diffdock/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/chenshoufa/anaconda3/envs/diffdock/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/home/chenshoufa/anaconda3/envs/diffdock/lib/python3.9/site-packages/torch/_utils.py", line 461, in reraise
    raise exception
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/chenshoufa/anaconda3/envs/diffdock/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/chenshoufa/anaconda3/envs/diffdock/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/chenshoufa/workspace/DiffDock/models/score_model.py", line 299, in forward
    global_pred = self.final_conv(lig_node_attr, center_edge_index, center_edge_attr, center_edge_sh, out_nodes=data.num_graphs)
  File "/home/chenshoufa/anaconda3/envs/diffdock/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/chenshoufa/workspace/DiffDock/models/score_model.py", line 88, in forward
    out = self.batch_norm(out)
  File "/home/chenshoufa/anaconda3/envs/diffdock/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/chenshoufa/anaconda3/envs/diffdock/lib/python3.9/site-packages/e3nn/nn/_batchnorm.py", line 178, in forward
    torch.cat(new_means, out=self.running_mean)
RuntimeError: torch.cat(): expected a non-empty list of Tensors

Suggestions from drug design developers

Hi, very interesting model. But here are a few suggestions from us drug design developers:

  1. If you want to compare your model with traditional docking methods, e.g. Glide, Smina, Vina, etc., you should focus on pockets instead of using global docking, which lowers their performance, since those methods focus entirely on the pocket, not on the full protein;
  2. End-to-end whole-protein docking is not necessary. You could build a dedicated pocket-prediction model and then fully invest in pocket pose prediction. If the pockets are all predicted wrongly, there is no point in calculating RMSD and centroid distance;
  3. From the point of view of the force field, torsion-angle prediction is not arbitrary. I did not see any assessment of conformational plausibility in your paper; even if the initial conformation is generated by RDKit, so bond lengths and angles are reasonable, a plausibility assessment of the torsion angles is still required. If you consider the van der Waals interactions between the atoms forming each torsion angle, your model may be better at predicting native-like poses;
  4. Pose prediction ultimately serves HTS: you do not need to develop a corresponding scoring function, but you should combine your method with a traditional scoring function and report your advantage of high conformational enrichment within 2 Å on a virtual screening task.
    The above are some of my suggestions; I hope to be forgiven if I have said something wrong. We hope that we drug developers will keep communicating with you computer science researchers.

TypeError: get_batch_converter() takes 1 positional argument but 2 were given

I have installed DiffDock from GitHub and fair-esm from pip, and unzipped esm into the DiffDock directory.
When I try to run extract.py as given in the README file, I get the following error.

HOME=esm/model_weights python esm/scripts/extract.py esm2_t33_650M_UR50D data/prepared_for_esm.fasta data/esm2_output --repr_layers 33 --include per_tok

Traceback (most recent call last):
  File "/home/cadd/DiffDock-main/esm/scripts/extract.py", line 136, in <module>
    main(args)
  File "/home/cadd/DiffDock-main/esm/scripts/extract.py", line 77, in main
    dataset, collate_fn=alphabet.get_batch_converter(args.truncation_seq_length), batch_sampler=batches
TypeError: get_batch_converter() takes 1 positional argument but 2 were given

AttributeError: 'NoneType' object has no attribute 'origin'

Whenever I try to use DiffDock (I followed the instructions for creating the conda environment and all of that seemed to work),
I get the error mentioned in the title. To be more specific, it looks like this.

$ python -m inference --protein_ligand_csv data/protein_ligand_example_csv.csv --out_dir results/user_predictions_small --inference_steps 20 --samples_per_complex 40 --batch_size 10

Traceback (most recent call last):
  File "/work/scratch/b_mayer/miniconda3/envs/diffdock/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/work/scratch/b_mayer/miniconda3/envs/diffdock/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/work/scratch/b_mayer/DiffDock/inference.py", line 16, in <module>
    from datasets.pdbbind import PDBBind
  File "/work/scratch/b_mayer/DiffDock/datasets/pdbbind.py", line 22, in <module>
    from utils.utils import read_strings_from_txt
  File "/work/scratch/b_mayer/DiffDock/utils/utils.py", line 12, in <module>
    from torch_geometric.nn.data_parallel import DataParallel
  File "/work/scratch/b_mayer/miniconda3/envs/diffdock/lib/python3.9/site-packages/torch_geometric/nn/__init__.py", line 3, in <module>
    from .sequential import Sequential
  File "/work/scratch/b_mayer/miniconda3/envs/diffdock/lib/python3.9/site-packages/torch_geometric/nn/sequential.py", line 8, in <module>
    from torch_geometric.nn.conv.utils.jit import class_from_module_repr
  File "/work/scratch/b_mayer/miniconda3/envs/diffdock/lib/python3.9/site-packages/torch_geometric/nn/conv/__init__.py", line 25, in <module>
    from .spline_conv import SplineConv
  File "/work/scratch/b_mayer/miniconda3/envs/diffdock/lib/python3.9/site-packages/torch_geometric/nn/conv/spline_conv.py", line 16, in <module>
    from torch_spline_conv import spline_basis, spline_weighting
  File "/work/scratch/b_mayer/miniconda3/envs/diffdock/lib/python3.9/site-packages/torch_spline_conv/__init__.py", line 11, in <module>
    torch.ops.load_library(importlib.machinery.PathFinder().find_spec(
AttributeError: 'NoneType' object has no attribute 'origin'

I found that if I open a Python console in the DiffDock folder and do
from datasets.pdbbind import PDBBind, I get the same error as above:

>>>from datasets.pdbbind import PDBBind
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/work/scratch/b_mayer/DiffDock/datasets/pdbbind.py", line 22, in <module>
    from utils.utils import read_strings_from_txt
  File "/work/scratch/b_mayer/DiffDock/utils/utils.py", line 12, in <module>
    from torch_geometric.nn.data_parallel import DataParallel
  File "/work/scratch/b_mayer/miniconda3/envs/diffdock/lib/python3.9/site-packages/torch_geometric/nn/__init__.py", line 3, in <module>
    from .sequential import Sequential
  File "/work/scratch/b_mayer/miniconda3/envs/diffdock/lib/python3.9/site-packages/torch_geometric/nn/sequential.py", line 8, in <module>
    from torch_geometric.nn.conv.utils.jit import class_from_module_repr
  File "/work/scratch/b_mayer/miniconda3/envs/diffdock/lib/python3.9/site-packages/torch_geometric/nn/conv/__init__.py", line 25, in <module>
    from .spline_conv import SplineConv
  File "/work/scratch/b_mayer/miniconda3/envs/diffdock/lib/python3.9/site-packages/torch_geometric/nn/conv/spline_conv.py", line 16, in <module>
    from torch_spline_conv import spline_basis, spline_weighting
  File "/work/scratch/b_mayer/miniconda3/envs/diffdock/lib/python3.9/site-packages/torch_spline_conv/__init__.py", line 11, in <module>
    torch.ops.load_library(importlib.machinery.PathFinder().find_spec(
AttributeError: 'NoneType' object has no attribute 'origin'

ESM embeddings script missing ligand option

The docs state that you don't need a CSV and can pass a ligand directly, but this is not the case for the ESM embedding preparation script:

optional arguments:
  -h, --help            show this help message and exit
  --out_file OUT_FILE
  --protein_ligand_csv PROTEIN_LIGAND_CSV
                        Path to a .csv specifying the input as described in the main README
  --protein_path PROTEIN_PATH
                        Path to a single PDB file. If this is not None then it will be used instead of the --protein_ligand_csv

PyTorch Internal Assert Failed

Hi, congratulations on your work, it is a very interesting approach and the results are amazing!

I was able to run the PDBbind examples, but I see the following error with other input files:
Failed on ['data/protein.pdb____data/ligands/10005.sdf'] tensor_type->scalarType().has_value() INTERNAL ASSERT FAILED at "/opt/conda/conda-bld/pytorch_1659484809662/work/torch/csrc/jit/codegen/cuda/type_promotion.cpp":111, please report a bug to PyTorch. Missing Scalar Type information

Do you have any idea what might be wrong?

Animation

Hi,

Is there any way we can generate an animation (.gif) of the predicted results? Thanks in advance.

AttributeError: 'NoneType' object has no attribute 'origin'

Hi, I tried running the model for a protein-ligand complex with the following commands:

python datasets/esm_embedding_preparation.py --protein_path /brahma_hd/a7_allosteric/docking/7ekt/diffdock/7ekt.pdb --out_file data/prepared_for_esm.fasta 

git clone https://github.com/facebookresearch/esm
cd esm
pip install -e .
cd ..
HOME=esm/model_weights python esm/scripts/extract.py esm2_t33_650M_UR50D data/prepared_for_esm.fasta data/esm2_output --repr_layers 33 --include per_tok

python -m inference --out_dir /brahma_hd/a7_allosteric/docking/7ekt/diffdock/ --inference_steps 20 --samples_per_complex 40 --batch_size 10 --actual_steps 18 --no_final_step_noise --protein_path /brahma_hd/a7_allosteric/docking/7ekt/diffdock/7ekt.pdb --ligand /brahma_hd/a7_allosteric/docking/7ekt/diffdock/EQ04.mol2

But I get this error when running inference:


Traceback (most recent call last):
  File "/biggin/b196/scro4068/miniconda3/envs/diffdock/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/biggin/b196/scro4068/miniconda3/envs/diffdock/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/biggin/b196/scro4068/opt/DiffDock/inference.py", line 16, in <module>
    from datasets.pdbbind import PDBBind
  File "/biggin/b196/scro4068/opt/DiffDock/datasets/pdbbind.py", line 22, in <module>
    from utils.utils import read_strings_from_txt
  File "/biggin/b196/scro4068/opt/DiffDock/utils/utils.py", line 12, in <module>
    from torch_geometric.nn.data_parallel import DataParallel
  File "/biggin/b196/scro4068/miniconda3/envs/diffdock/lib/python3.9/site-packages/torch_geometric/nn/__init__.py", line 3, in <module>
    from .sequential import Sequential
  File "/biggin/b196/scro4068/miniconda3/envs/diffdock/lib/python3.9/site-packages/torch_geometric/nn/sequential.py", line 8, in <module>
    from torch_geometric.nn.conv.utils.jit import class_from_module_repr
  File "/biggin/b196/scro4068/miniconda3/envs/diffdock/lib/python3.9/site-packages/torch_geometric/nn/conv/__init__.py", line 25, in <module>
    from .spline_conv import SplineConv
  File "/biggin/b196/scro4068/miniconda3/envs/diffdock/lib/python3.9/site-packages/torch_geometric/nn/conv/spline_conv.py", line 16, in <module>
    from torch_spline_conv import spline_basis, spline_weighting
  File "/biggin/b196/scro4068/miniconda3/envs/diffdock/lib/python3.9/site-packages/torch_spline_conv/__init__.py", line 11, in <module>
    torch.ops.load_library(importlib.machinery.PathFinder().find_spec(
AttributeError: 'NoneType' object has no attribute 'origin'

Any ideas?

Best regards,
Franco

Optimizing for time

As the code is written (--inference_steps 20 --samples_per_complex 40 --batch_size 10), it takes about 12 minutes per run on a Mac M1 with 32 GB RAM. I have tried reducing these values but got less accurate results. Can you provide some insight into which parameters can be reduced without losing significant accuracy?

Fail at some input

I followed the README, but after running inference on the whole dataset the log says it failed on 80 complexes. Is something wrong, or is this expected behavior?

Zenodo dataset processing?

The Zenodo dataset proteins are different from the original PDBBind. Can you explain how these were processed? I did not see any details in the paper or code.

Thanks!

0it [00:00, ?it/s]Killed

Hi, when I run inference, the process is killed with no other explanation, so I do not know where to start debugging. Thank you for your time. Here is my log:
~/my_prj/diffdock/DiffDock$ python -m inference --protein_ligand_csv data/protein_ligand_example_csv.csv --out_dir results/user_predictions_small --inference_steps 2 --samples_per_complex 40 --batch_size 10 --actual_steps 18 --no_final_step_noise
loading data from memory: data/cache_torsion/limit0_INDEX_maxLigSizeNone_H0_recRad15.0_recMax24_esmEmbeddings492466589/heterographs.pkl
Number of complexes: 5
radius protein: mean 29.789531707763672, std 12.012960433959961, max 53.81545639038086
radius molecule: mean 6.223214149475098, std 3.113785982131958, max 12.253068923950195
distance protein-mol: mean 65.75000762939453, std 19.175703048706055, max 76.23456573486328
rmsd matching: mean 0.0, std 0.0, max 0
HAPPENING | confidence model uses different type of graphs than the score model. Loading (or creating if not existing) the data for the confidence model now.
loading data from memory: data/cache_torsion_allatoms/limit0_INDEX_maxLigSizeNone_H0_recRad15.0_recMax24_atomRad5_atomMax8_esmEmbeddings492466589/heterographs.pkl
Number of complexes: 5
radius protein: mean 29.789531707763672, std 12.012960433959961, max 53.81545639038086
radius molecule: mean 6.45449686050415, std 3.5746219158172607, max 13.436868667602539
distance protein-mol: mean 65.57249450683594, std 19.06147575378418, max 75.89845275878906
rmsd matching: mean 0.0, std 0.0, max 0
common t schedule [1. 0.5]
Size of test dataset: 5
0it [00:00, ?it/s]Killed

Does DiffDock recognize Chiral Centers?

Hello, I have been trying to use DiffDock with glufosinate specifically. It has a chiral center where one of the substituents is hydrogen. The SMILES codes are below.

R-glufosinate CP(=O)(CC[C@H]N)O
L-glufosinate CP(=O)(CC[C@@H]N)O

When I converted these SMILES codes to SDF and ran DiffDock with the SDFs, the results were all of the same chirality. Is there some way to define the chirality differently? I am now running it with the SMILES code itself rather than the SDF to see if there is a difference.
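As a sanity check (a minimal sketch, not from the original post; the molecule and file name are placeholders), this is how I would verify with RDKit that the chiral tag survives conversion from SMILES to the SDF given to DiffDock:

from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("C[C@@H](N)C(=O)O")        # stand-in chiral molecule
mol = Chem.AddHs(mol)
AllChem.EmbedMolecule(mol, randomSeed=0)             # generate a 3D conformer
Chem.AssignStereochemistryFrom3D(mol)                # re-derive chirality from the 3D coordinates
print(Chem.FindMolChiralCenters(mol))                # should still report the expected (R)/(S) assignment

writer = Chem.SDWriter("ligand_R.sdf")               # placeholder output name
writer.write(Chem.RemoveHs(mol))
writer.close()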

Thanks

ValueError: zero-size array to reduction operation maximum which has no identity

While going through the steps described in the README I am getting an error. See below for the full trace...
[It is on macOS Monterey, Version 12.6 (21G115)]

BTW: It should be

python scripts/extract.py esm2_t33_650M_UR50D ../data/pdbbind_sequences.fasta embeddings_output --repr_layers 33 --include per_tok

instead of

python scripts/extract.py esm2_t33_650M_UR50D data/pdbbind_sequences.fasta embeddings_output --repr_layers 33 --include per_tok

Full trace:

(diffdock) $ python -m inference --protein_ligand_csv data/protein_ligand_example_csv.csv --out_dir results/user_predictions_small --inference_steps 20 --samples_per_complex 40 --batch_size 10
loading data from memory:  data/cache_torsion/limit0_INDEX_maxLigSizeNone_H0_recRad15.0_recMax24_esmEmbeddings863911206/heterographs.pkl
Number of complexes:  0
/Users/ryszard/miniconda3/envs/diffdock/lib/python3.9/site-packages/numpy/core/fromnumeric.py:3432: RuntimeWarning: Mean of empty slice.
  return _methods._mean(a, axis=axis, dtype=dtype,
/Users/ryszard/miniconda3/envs/diffdock/lib/python3.9/site-packages/numpy/core/_methods.py:190: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
/Users/ryszard/miniconda3/envs/diffdock/lib/python3.9/site-packages/numpy/core/_methods.py:265: RuntimeWarning: Degrees of freedom <= 0 for slice
  ret = _var(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
/Users/ryszard/miniconda3/envs/diffdock/lib/python3.9/site-packages/numpy/core/_methods.py:223: RuntimeWarning: invalid value encountered in divide
  arrmean = um.true_divide(arrmean, div, out=arrmean, casting='unsafe',
/Users/ryszard/miniconda3/envs/diffdock/lib/python3.9/site-packages/numpy/core/_methods.py:257: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
Traceback (most recent call last):
  File "/Users/ryszard/miniconda3/envs/diffdock/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Users/ryszard/miniconda3/envs/diffdock/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Users/ryszard/repos/DiffDock/inference.py", line 81, in <module>
    test_dataset = PDBBind(transform=None, root='', protein_path_list=protein_path_list, ligand_descriptions=ligand_descriptions,
  File "/Users/ryszard/repos/DiffDock/datasets/pdbbind.py", line 111, in __init__
    print_statistics(self.complex_graphs)
  File "/Users/ryszard/repos/DiffDock/datasets/pdbbind.py", line 361, in print_statistics
    print(f"{name[i]}: mean {np.mean(array)}, std {np.std(array)}, max {np.max(array)}")
  File "<__array_function__ internals>", line 180, in amax
  File "/Users/ryszard/miniconda3/envs/diffdock/lib/python3.9/site-packages/numpy/core/fromnumeric.py", line 2793, in amax
    return _wrapreduction(a, np.maximum, 'max', axis, None, out,
  File "/Users/ryszard/miniconda3/envs/diffdock/lib/python3.9/site-packages/numpy/core/fromnumeric.py", line 86, in _wrapreduction
    return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
ValueError: zero-size array to reduction operation maximum which has no identity
(diffdock) $
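For what it's worth, the crash itself is just NumPy refusing to reduce an empty array; since the cache above was loaded with "Number of complexes: 0", print_statistics has nothing to take the max of:

import numpy as np

np.max(np.array([]))   # ValueError: zero-size array to reduction operation maximum which has no identity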

ESM Embedding Problem

When I follow the README, at the following step:
(diffdock) icer@ubuntu:~/my_prj/diffdock/DiffDock$ HOME=esm/model_weights python esm/scripts/extract.py esm2_t33_650M_UR50D data/prepared_for_esm.fasta data/esm2_output --repr_layers 33 --include per_tok
Downloading: "https://dl.fbaipublicfiles.com/fair-esm/models/esm2_t33_650M_UR50D.pt" to esm/model_weights/.cache/torch/hub/checkpoints/esm2_t33_650M_UR50D.pt

it stays stuck here and cannot continue.

Using ligand hydrogens

Dear authors,

first of all let me congratulate you on the great work!

I would like to ask you about the use of the ligand hydrogens. In the paper, you say the final model does not use hydrogens for the score model; did removing them bring a significant improvement? And doesn't this improvement come just from the fact that there are fewer atoms to align in the RMSD computation?

Just to make sure I understand where the ligand hydrogens are lost: they are used only in the input node features, but the network does not predict their positions at all, right? So to obtain them the best I can do is to run DiffDock and then run some external protonation tool?
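For instance, I imagine something like the following after docking (just a sketch; the file names are hypothetical and this is not an official DiffDock step):

from rdkit import Chem

mol = Chem.MolFromMolFile("results/user_predictions_small/rank1.sdf")   # a predicted pose (heavy atoms only)
mol_h = Chem.AddHs(mol, addCoords=True)                                 # place explicit hydrogens onto the 3D pose
Chem.MolToMolFile(mol_h, "rank1_with_hydrogens.sdf")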

Thank you very much in advance for any reply!

Petr

Unrecognized argument ligand_path

It seems the instructions in the README are wrong, because --ligand_path does not exist. I tried using --ligand, but that throws a different error.

python -m inference --ligand_path examples/1cbr_ligand.sdf --protein_path examples/1cbr_protein.pdb --out_dir results/user_predictions_small --inference_steps 20 --samples_per_complex 40 --batch_size 10
100%|████████████████████████████████████████████████████████████████████████████████████████| 201/201 [01:33<00:00,  2.14it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████| 201/201 [02:04<00:00,  1.61it/s]
/share/lcbcsrv5/lcbcdata/duerr/PhD/08_Code/DiffDock/utils/torus.py:38: RuntimeWarning: invalid value encountered in divide
  score_ = grad(x, sigma[:, None], N=100) / p_
usage: inference.py [-h] [--config CONFIG] [--protein_ligand_csv PROTEIN_LIGAND_CSV] [--protein_path PROTEIN_PATH]
                    [--ligand LIGAND] [--out_dir OUT_DIR] [--esm_embeddings_path ESM_EMBEDDINGS_PATH] [--save_visualisation]
                    [--samples_per_complex SAMPLES_PER_COMPLEX] [--model_dir MODEL_DIR] [--ckpt CKPT]
                    [--confidence_model_dir CONFIDENCE_MODEL_DIR] [--confidence_ckpt CONFIDENCE_CKPT]
                    [--batch_size BATCH_SIZE] [--cache_path CACHE_PATH] [--no_random] [--no_final_step_noise] [--ode]
                    [--inference_steps INFERENCE_STEPS] [--num_workers NUM_WORKERS] [--sigma_schedule SIGMA_SCHEDULE]
                    [--actual_steps ACTUAL_STEPS] [--keep_local_structures]
inference.py: error: unrecognized arguments: --ligand_path examples/1cbr_ligand.sdf

Error with --ligand

python -m inference --ligand examples/1cbr_ligand.sdf --protein_path examples/1cbr_protein.pdb --out_dir results/user_predictions_small --inference_steps 20 --samples_per_complex 40 --batch_size 10
Reading molecules and generating local structures with RDKit
  0%|                                                                                                    | 0/1 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/home/duerr/miniconda3/envs/diffdock/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/duerr/miniconda3/envs/diffdock/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/share/lcbcsrv5/lcbcdata/duerr/PhD/08_Code/DiffDock/inference.py", line 81, in <module>
    test_dataset = PDBBind(transform=None, root='', protein_path_list=protein_path_list, ligand_descriptions=ligand_descriptions,
  File "/share/lcbcsrv5/lcbcdata/duerr/PhD/08_Code/DiffDock/datasets/pdbbind.py", line 102, in __init__
    self.inference_preprocessing()
  File "/share/lcbcsrv5/lcbcdata/duerr/PhD/08_Code/DiffDock/datasets/pdbbind.py", line 208, in inference_preprocessing
    mol.RemoveAllConformers()
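My guess (an assumption, not confirmed in the report) is that the mol object is None because RDKit could not parse the ligand file, which would produce exactly this kind of AttributeError; a minimal check:

from rdkit import Chem

mol = Chem.MolFromMolFile("examples/1cbr_ligand.sdf", sanitize=True)
print(mol)   # None means RDKit could not read/sanitize the file, and mol.RemoveAllConformers() would then fail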

'NoneType' object has no attribute 'RemoveAllConformers'

Running on the example works fine.
Running on new structures listed in a CSV file with full paths, etc., I get this error (even though I'm not running PDBBind data):

Run cmd:

python -m inference --protein_ligand_csv data/diffdock_paths.csv  --out_dir results/glycans_local --inference_steps 20 --samples_per_complex 40 --batch_size 10
 File "/home/jadolfbr/.conda/envs/diffdock/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/jadolfbr/.conda/envs/diffdock/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/jadolfbr/DiffDock/inference.py", line 81, in <module>
    test_dataset = PDBBind(transform=None, root='', protein_path_list=protein_path_list, ligand_descriptions=ligand_descriptions,
  File "/home/jadolfbr/DiffDock/datasets/pdbbind.py", line 102, in __init__
    self.inference_preprocessing()
  File "/home/jadolfbr/DiffDock/datasets/pdbbind.py", line 208, in inference_preprocessing
    mol.RemoveAllConformers()
AttributeError: 'ValueError' object has no attribute 'RemoveAllConformers'

Existing ligand from PDB

Is there a way to include an existing ligand in a PDB structure and dock a new ligand? This is necessary to block off the pocket occupied by the originally bound ligand in the protein PDB.
An example is an ATP/GTP-bound pocket, which should be inaccessible to a second ligand that we would try to dock using DiffDock.
If not, is it possible to dock more than one ligand simultaneously to the protein?

Segmentation fault

Hi, I followed all the installation steps and tried to run the inference script, but I just get "Segmentation fault".
Can you help me understand what I did wrong or what the issue is?

About confidence score

Hi,

I performed docking with the inference protocol and obtained several result files named like rank1_confidence-0.22.sdf.

Now I want to evaluate the results based on the confidence score, but I couldn't find any cutoff values for the score.

Is there any cutoff or rule-of-thumb value for result selection?

Sincerely,

Unable to get .sdf or .mol2 files to work under Windows

I have been trying to get DiffDock installed on a Windows server so I can test it with our structures and ligands of interest.

I have been running into difficulties. After getting everything installed and all the prerequisite packages working, I get the following errors/failures. They occur when I use either an .sdf or a .mol2 file for the ligand. And even when I include a SMILES code and the program supposedly completes, the results make no sense: the molecule basically blows apart, or is nowhere near the target PDB.

'''
python -m inference --protein_ligand_csv data/protein_ligand_trial3_csv.csv --out_dir results/user_predictions_small3 --inference_steps 20 --samples_per_complex 40 --batch_size 10 --actual_steps 18 --no_final_step_noise
loading data from memory: data/cache_torsion\limit0_INDEX_maxLigSizeNone_H0_recRad15.0_recMax24_esmEmbeddings3467677806\heterographs.pkl
Number of complexes: 1
radius protein: mean 25.799917221069336, std 0.0, max 25.799917221069336
radius molecule: mean 3.531266689300537, std 0.0, max 3.531266689300537
distance protein-mol: mean 11.676636695861816, std 0.0, max 11.676636695861816
rmsd matching: mean 0.0, std 0.0, max 0
HAPPENING | confidence model uses different type of graphs than the score model. Loading (or creating if not existing) the data for the confidence model now.
loading data from memory: data/cache_torsion_allatoms\limit0_INDEX_maxLigSizeNone_H0_recRad15.0_recMax24_atomRad5_atomMax8_esmEmbeddings3467677806\heterographs.pkl
Number of complexes: 1
radius protein: mean 25.799917221069336, std 0.0, max 25.799917221069336
radius molecule: mean 3.7641730308532715, std 0.0, max 3.7641730308532715
distance protein-mol: mean 11.22496223449707, std 0.0, max 11.22496223449707
rmsd matching: mean 0.0, std 0.0, max 0
common t schedule [1. 0.95 0.9 0.85 0.8 0.75 0.7 0.65 0.6 0.55 0.5 0.45 0.4 0.35
0.3 0.25 0.2 0.15 0.1 0.05]
Size of test dataset: 1
0it [00:00, ?it/s]### C:\Users\XXXXXXX\Miniconda3\envs\diffdock4\lib\site-packages\e3nn\o3\_spherical_harmonics.py:82: UserWarning: FALLBACK path has been taken inside: torch::jit::fuser::cuda::compileCudaFusionGroup. This is an indication that codegen Failed for some reason.
To debug try disable codegen fallback path via setting the env variable export PYTORCH_NVFUSER_DISABLE=fallback
To report the issue, try enable logging via setting the envvariable export PYTORCH_JIT_LOG_LEVEL=manager.cpp
(Triggered internally at C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\jit\codegen\cuda\manager.cpp:244.)
sh = _spherical_harmonics(self._lmax, x[..., 0], x[..., 1], x[..., 2])
C:\Users\XXXXXXXX\DiffDock-main\utils\torsion.py:60: RuntimeWarning: invalid value encountered in true_divide
rot_vec = rot_vec * torsion_updates[idx_edge] / np.linalg.norm(rot_vec) # idx_edge!
Failed on ['data/trial-3/Ap_GST_Phi2.pdb____data/trial-3/L_glufosinate.mol2'] linalg.svd: The algorithm failed to converge because the input matrix is ill-conditioned or has too many repeated singular values (error code: 2).
1it [00:07, 7.58s/it]
Failed for 1 complexes
Skipped 0 complexes
Results are in results/user_predictions_small3
'''
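My reading of the RuntimeWarning above (an assumption about the failure chain, not something stated in the output) is that a zero-length rotation axis turns the division into 0/0; the resulting NaN coordinates then reach the SVD, which fails to converge. A tiny sketch:

import numpy as np

rot_vec = np.zeros(3)                                   # degenerate torsion axis
rot_vec = rot_vec * 1.0 / np.linalg.norm(rot_vec)       # RuntimeWarning: invalid value encountered in divide
print(rot_vec)                                          # [nan nan nan] -> NaNs propagate into the pose coordinates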

So, OK, maybe I'll just try using SMILES representations instead. When I use the isomeric SMILES string I get the same thing. I am just showing the warning and error portions.

"""
C:\Users\XXXXXXX\Miniconda3\envs\diffdock4\lib\site-packages\e3nn\o3\_spherical_harmonics.py:82: UserWarning: FALLBACK path has been taken inside: torch::jit::fuser::cuda::compileCudaFusionGroup. This is an indication that codegen Failed for some reason.
To debug try disable codegen fallback path via setting the env variable export PYTORCH_NVFUSER_DISABLE=fallback
To report the issue, try enable logging via setting the envvariable export PYTORCH_JIT_LOG_LEVEL=manager.cpp
(Triggered internally at C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\jit\codegen\cuda\manager.cpp:244.)
sh = _spherical_harmonics(self._lmax, x[..., 0], x[..., 1], x[..., 2])
C:\Users\XXXXXXX\DiffDock-main\utils\torsion.py:60: RuntimeWarning: invalid value encountered in true_divide
rot_vec = rot_vec * torsion_updates[idx_edge] / np.linalg.norm(rot_vec) # idx_edge!
Failed on ['data/trial-1/Ap_GST_Phi2.pdb____C(CC(=O)N[C@@H]C(=O)NCC(=O)O)[C@@H]N'] linalg.svd: The algorithm failed to converge because the input matrix is ill-conditioned or has too many repeated singular values (error code: 2).
1it [00:07, 7.16s/it]
Failed for 1 complexes
Skipped 0 complexes
Results are in results/user_predictions_small1
"""

However, when I list the ligand as a canonical SMILES string:

"""
python -m inference --protein_ligand_csv data/protein_ligand_trial3_csv.csv --out_dir results/user_predictions_small3 --inference_steps 20 --samples_per_complex 40 --batch_size 10 --actual_steps 18 --no_final_step_noise
Reading molecules and generating local structures with RDKit
1it [00:00, 22.30it/s]
Reading language model embeddings.
Generating graphs for ligands and proteins
loading complexes: 100%|█████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 2.97it/s]
loading data from memory: data/cache_torsion\limit0_INDEX_maxLigSizeNone_H0_recRad15.0_recMax24_esmEmbeddings3792232284\heterographs.pkl
Number of complexes: 1
radius protein: mean 25.799917221069336, std 0.0, max 25.799917221069336
radius molecule: mean 5.837835311889648, std 0.0, max 5.837835311889648
distance protein-mol: mean 11.18027114868164, std 0.0, max 11.18027114868164
rmsd matching: mean 0.0, std 0.0, max 0
HAPPENING | confidence model uses different type of graphs than the score model. Loading (or creating if not existing) the data for the confidence model now.
Reading molecules and generating local structures with RDKit
1it [00:00, 27.60it/s]
Reading language model embeddings.
Generating graphs for ligands and proteins
loading complexes: 100%|█████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1.88it/s]
loading data from memory: data/cache_torsion_allatoms\limit0_INDEX_maxLigSizeNone_H0_recRad15.0_recMax24_atomRad5_atomMax8_esmEmbeddings3792232284\heterographs.pkl
Number of complexes: 1
radius protein: mean 25.799917221069336, std 0.0, max 25.799917221069336
radius molecule: mean 6.153195858001709, std 0.0, max 6.153195858001709
distance protein-mol: mean 11.254096031188965, std 0.0, max 11.254096031188965
rmsd matching: mean 0.0, std 0.0, max 0
common t schedule [1. 0.95 0.9 0.85 0.8 0.75 0.7 0.65 0.6 0.55 0.5 0.45 0.4 0.35
0.3 0.25 0.2 0.15 0.1 0.05]
Size of test dataset: 1
0it [00:00, ?it/s]C:\Users\XXXXXXX\Miniconda3\envs\diffdock4\lib\site-packages\e3nn\o3\_spherical_harmonics.py:82: UserWarning: FALLBACK path has been taken inside: torch::jit::fuser::cuda::compileCudaFusionGroup. This is an indication that codegen Failed for some reason.
To debug try disable codegen fallback path via setting the env variable export PYTORCH_NVFUSER_DISABLE=fallback
To report the issue, try enable logging via setting the envvariable export PYTORCH_JIT_LOG_LEVEL=manager.cpp
(Triggered internally at C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\jit\codegen\cuda\manager.cpp:244.)
sh = _spherical_harmonics(self._lmax, x[..., 0], x[..., 1], x[..., 2])
1it [00:57, 57.72s/it]
Failed for 0 complexes
Skipped 0 complexes
Results are in results/user_predictions_small3
"""
You'll notice that even when it successfully completes I still have the FALLBACK warning.

I don't know what's going on. Many of the ligands I am interested in are chiral compounds where one isomer is active and the other is not. I want to investigate the differences between the interactions.

Thanks for any assistance.

About 'WARNING: weird torch_cluster error, skipping batch' during training process

Hi, thank you for your great work!
When I am runing the training code, several batches will output 'WARNING: weird torch_cluster error, skipping batch' , and it takes up around 1/5 of all of steps per epoch. The training data are using the split in the repo, is this within the expectation? Or somethins is wrong with my training process? By the way, I find that according to the README doc, the training epochs number is set to be 850, may I ask how many epochs does the model actually need to train to have the similar performance as the paper's?

Local docking?

Hello! Really great paper and a very nice interface for production use. Much simpler than EquiBind, and much easier to run on individual molecules or sets of molecules.

I was wondering whether there is any way to constrain the docking around a pocket, as well as the overall flexibility of the input ligands, while using the option to keep input structures. Sometimes, for larger molecules, it is necessary to keep the ligand mostly resembling the input structure.

Thanks.
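
As far as I know there is no built-in option to restrict sampling to a pocket, but one workaround is to generate the usual samples and then post-filter the ranked poses by their distance to a known pocket centre. A rough sketch (the output path, file pattern and 10 Å cutoff are placeholders, not verified DiffDock conventions):

import glob
import numpy as np
from rdkit import Chem

pocket_center = np.array([12.0, 5.0, -3.0])  # hypothetical pocket centre, in the receptor's coordinate frame

for sdf in sorted(glob.glob('results/user_predictions_small/*/rank*.sdf')):  # placeholder pattern
    supplier = Chem.SDMolSupplier(sdf, removeHs=False)
    mol = supplier[0] if len(supplier) > 0 else None
    if mol is None:
        continue
    centroid = mol.GetConformer().GetPositions().mean(axis=0)
    if np.linalg.norm(centroid - pocket_center) < 10.0:
        print(sdf, 'lies within 10 A of the pocket centre')

Constraining the internal flexibility of the ligand itself would still need changes to the sampling code, as far as I can tell.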

Exception: Bad Conformer ID

When I run the training code on the small score model, it throws an exception during the preprocessing

loading complexes:   3%|| 493/16379 [43:30<23:21:46,  5.29s/it]
Traceback (most recent call last):
  File "anaconda3/envs/diffdock/lib/python3.10/site-packages/scipy/optimize/_differentialevolution.py", line 1116, in _calculate_population_energies
    calc_energies = list(
  File "anaconda3/envs/diffdock/lib/python3.10/site-packages/scipy/_lib/_util.py", line 407, in __call__
    return self.f(x, *self.args)
  File "diffdock/datasets/conformer_matching.py", line 60, in score_conformation
    SetDihedral(self.mol.GetConformer(self.probe_id), r, values[i])
ValueError: Bad Conformer Id

This is the exception thrown in the conformer matching file. I am not sure whether it is just me or an issue in the code. Could you please take a look? Have you seen this exception before?
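
I have not dug into conformer_matching.py, but the RDKit message itself just means that the molecule has no conformer with the requested id, so GetConformer(probe_id) raises. A tiny illustration of the failure mode and the kind of guard that avoids it (hypothetical, not the authors' code):

from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles('CCO')
# no conformer has been embedded yet, so mol.GetConformer(0) would raise ValueError: Bad Conformer Id
cid = AllChem.EmbedMolecule(mol, useRandomCoords=True)  # returns -1 on failure
if cid >= 0:
    conf = mol.GetConformer(cid)  # safe: this conformer id exists
else:
    print('embedding failed; this molecule should probably be skipped')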

Preprocessing Receptors for Inference

Before using DiffDock for inference, would you suggest running the same obabel and reduce preprocessing steps on the receptor that were used in EquiBind? I didn't see anything mentioned in the documentation, but I still saw some references to the same _protein_obabel_reduce filenames, so I figured I'd ask.
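
For reference, these are the kind of command lines I mean (purely illustrative; the exact flags and order behind the _protein_obabel_reduce files in EquiBind may differ):

reduce receptor.pdb > receptor_reduce.pdb
obabel receptor_reduce.pdb -O receptor_obabel_reduce.pdb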

What is a "good" confidence score?

With AlphaFold, when pLDDT is, say, above 70, you can place some trust in the prediction. For DiffDock, what is the range of confidence scores in which you would "trust" the results?

index 2 is out of bounds for axis 0 with size 2

Hi, this is great work. When I run inference, I get the following error:

radius molecule: mean 7.5976667404174805, std 0.0, max 7.5976667404174805
distance protein-mol: mean 12.849800109863281, std 0.0, max 12.849800109863281
rmsd matching: mean 0.0, std 0.0, max 0
common t schedule [1.  0.5]
Size of test dataset:  1
0it [00:00, ?it/s]Failed on ['data/7rfw_receptor.pdb____data/7rfw_ligand.mol2'] index 2 is out of bounds for axis 0 with size 2
1it [01:09, 70.00s/it]
Failed for 1 complexes
Skipped 0 complexes
Results are in results/user_predictions_small
[2]-  Killed                  python -m inference --protein_path data/7rfw_receptor.pdb --ligand data/7rfw_ligand.mol2 --out_dir results/user_predictions_small --inference_steps 20 --samples_per_complex 40 --batch_size 10 --actual_steps 18 --no_final_step_noise

ValueError: Bad Conformer ID

I saw that issue #13 was very similar, but this doesn't seem to be the same problem. I have the updated version of pdbbind.py.

Reading molecules and generating local structures with RDKit
 75%|█████████████████████████████▎         | 2315/3080 [01:20<00:18, 42.04it/s]rdkit coords could not be generated without using random coords. using random coords now.
 75%|█████████████████████████████▎         | 2317/3080 [02:35<00:51, 14.90it/s]
Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/diffdock/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/ubuntu/miniconda3/envs/diffdock/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/DiffDock/inference.py", line 81, in <module>
    test_dataset = PDBBind(transform=None, root='', protein_path_list=protein_path_list, ligand_descriptions=ligand_descriptions,
  File "/home/ubuntu/DiffDock/datasets/pdbbind.py", line 102, in __init__
    self.inference_preprocessing()
  File "/home/ubuntu/DiffDock/datasets/pdbbind.py", line 203, in inference_preprocessing
    generate_conformer(mol)
  File "/home/ubuntu/DiffDock/datasets/process_mols.py", line 276, in generate_conformer
    AllChem.MMFFOptimizeMolecule(mol, confId=0)
ValueError: Bad Conformer Id
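
From the traceback it looks like EmbedMolecule produced no conformer for this ligand, so MMFFOptimizeMolecule(mol, confId=0) has nothing to optimize. A minimal sketch of the usual guard (my own workaround, not necessarily how generate_conformer is meant to behave):

from rdkit import Chem
from rdkit.Chem import AllChem

def embed_with_fallback(mol):
    molH = Chem.AddHs(mol)
    cid = AllChem.EmbedMolecule(molH)  # returns -1 if the standard embedding fails
    if cid < 0:
        cid = AllChem.EmbedMolecule(molH, useRandomCoords=True)
    if cid < 0:
        return None  # give up instead of letting MMFFOptimizeMolecule raise Bad Conformer Id
    AllChem.MMFFOptimizeMolecule(molH, confId=cid)
    return Chem.RemoveHs(molH)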

Strange ligand's pose output

Thank you for the great project.

I have tried to dock the histamine H1 receptor (protein) with diphenhydramine (ligand), but I got a strange pose for the ligand.
This file is the complex after docking.
H1_diphenhydramine_DiffDock.pdb.zip
After that, I tried to use Schrodinger to visualize the output in more detail (see the attached image).
The problem is that the ligand is completely deformed, and the original ligand is no longer recognizable (as you can see in the image).
Do you have any comments/suggestions to solve this problem?

LM embeddings "did not have the right length"

Hi, could you shed some light on the error "LM embeddings for complex... did not have the right length for the protein"?

I'm trying to run on a single protein-ligand complex, and I'm providing a prepared protein PDB and ligand SDF. I can see in the code where this is generated

print(f'LM embeddings for complex {name} did not have the right length for the protein. Skipping {name}.')
but I'm not sure I understand what is causing the if statement to be true.

Here is the exact error:

python -m inference --protein_path data/XXX.pdb --ligand data/YYY.sdf --out_dir results/user_predictions_small --inference_steps 20 --samples_per_complex 40 --batch_size 10 --actual_steps 18 --no_final_step_noise
Reading molecules and generating local structures with RDKit
1it [00:00, 23.30it/s]
Reading language model embeddings.
Generating graphs for ligands and proteins
loading complexes:   0%|                                                                                                                                                                                        | 0/1 [00:00<?, ?it/s]LM embeddings for complex data/XXX.pdb____data/YYY.sdf did not have the right length for the protein. Skipping data/XXX.pdb____data/YYY.sdf.
loading complexes: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.47s/it]
loading data from memory:  data/cache_torsion/limit0_INDEX_maxLigSizeNone_H0_recRad15.0_recMax24_esmEmbeddings2279325814/heterographs.pkl
Number of complexes:  0
/cluster/home/slochowe/anaconda3/envs/diffdock/lib/python3.9/site-packages/numpy/core/fromnumeric.py:3432: RuntimeWarning: Mean of empty slice.
  return _methods._mean(a, axis=axis, dtype=dtype,
/cluster/home/slochowe/anaconda3/envs/diffdock/lib/python3.9/site-packages/numpy/core/_methods.py:190: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
/cluster/home/slochowe/anaconda3/envs/diffdock/lib/python3.9/site-packages/numpy/core/_methods.py:265: RuntimeWarning: Degrees of freedom <= 0 for slice
  ret = _var(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
/cluster/home/slochowe/anaconda3/envs/diffdock/lib/python3.9/site-packages/numpy/core/_methods.py:223: RuntimeWarning: invalid value encountered in divide
  arrmean = um.true_divide(arrmean, div, out=arrmean, casting='unsafe',
/cluster/home/slochowe/anaconda3/envs/diffdock/lib/python3.9/site-packages/numpy/core/_methods.py:257: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
Traceback (most recent call last):
  File "/cluster/home/slochowe/anaconda3/envs/diffdock/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/cluster/home/slochowe/anaconda3/envs/diffdock/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/cluster/home/slochowe/explorations/DiffDock/inference.py", line 81, in <module>
    test_dataset = PDBBind(transform=None, root='', protein_path_list=protein_path_list, ligand_descriptions=ligand_descriptions,
  File "/cluster/home/slochowe/explorations/DiffDock/datasets/pdbbind.py", line 111, in __init__
    print_statistics(self.complex_graphs)
  File "/cluster/home/slochowe/explorations/DiffDock/datasets/pdbbind.py", line 376, in print_statistics
    print(f"{name[i]}: mean {np.mean(array)}, std {np.std(array)}, max {np.max(array)}")
  File "<__array_function__ internals>", line 180, in amax
  File "/cluster/home/slochowe/anaconda3/envs/diffdock/lib/python3.9/site-packages/numpy/core/fromnumeric.py", line 2793, in amax
    return _wrapreduction(a, np.maximum, 'max', axis, None, out,
  File "/cluster/home/slochowe/anaconda3/envs/diffdock/lib/python3.9/site-packages/numpy/core/fromnumeric.py", line 86, in _wrapreduction
    return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
ValueError: zero-size array to reduction operation maximum which has no identity
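
In case it helps with debugging: as I understand it, that check compares the number of residues DiffDock parses from the receptor PDB with the length of the precomputed ESM embedding, so extra chains, non-standard residues, or leftover HETATM records in the prepared PDB are the usual suspects. A quick diagnostic sketch with Biopython (not part of DiffDock) to see what the PDB actually contains:

from Bio.PDB import PDBParser
from Bio.PDB.Polypeptide import is_aa

structure = PDBParser(QUIET=True).get_structure('rec', 'data/XXX.pdb')  # same path as in the command above
for chain in structure[0]:
    n_aa = sum(1 for res in chain if is_aa(res, standard=True))
    print(chain.id, n_aa, 'standard amino-acid residues')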

Cannot create a consistent method resolution order (MRO) for bases Batch, Batch

Hi,
Thank you for the great project!

I followed the readme to try out the project but ran into a confusing issue:

"Failed on [xxx] Cannot create a consistent method resolution order (MRO) for bases Batch, Batch"
when I run the "Using the provided model weights for evaluation" step, i.e.,
python -m inference --protein_ligand_csv data/testset_csv.csv --out_dir results/user_predictions_testset --inference_steps 20 --samples_per_complex 40 --batch_size 10 --actual_steps 18 --no_final_step_noise

Would you mind giving some comments/suggestions to solve this problem?
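
For what it's worth, an MRO error over two bases that are both named Batch often comes from the torch_geometric Batch class being picked up from an installation whose version differs from the one the repo expects, so it may be worth printing the installed versions and comparing them against the ones pinned in the repo's environment file. A plain diagnostic, nothing DiffDock-specific:

import torch
import torch_geometric
import torch_scatter
import torch_cluster

print('torch            ', torch.__version__)
print('torch_geometric  ', torch_geometric.__version__)
print('torch_scatter    ', torch_scatter.__version__)
print('torch_cluster    ', torch_cluster.__version__)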
