
models' Introduction

This section of the repository contains Caffe model files that are usable with GNINA.

Breakdown of the different sub-directories

  • acs2018 -- The models and atom maps used in the 2018 Fall ACS National Meeting poster. Included here are Default2017, Default2018, HiRes Affinity, and HiRes Pose.
  • affinity -- Legacy models that started the affinity prediction task.
  • crossdocked_paper -- The models and atom maps used in our CrossDocked2020 paper. Included are Default2017, Default2018, HiRes Affinity, HiRes Pose, and Dense.
  • data -- Directory containing the raw data used for training and evaluating models.
  • refmodel3 -- Legacy model. It does not support affinity prediction.

models' People

Contributors

dkoes, drewnutt, francoep, jsunseri, mattragoza, rmeli, ryo112358


models' Issues

How to divide it2_tt_v1.3_completeset into test and train set

Thank you for sharing your excellent work.
I have downloaded the crossdocked2020 v1.3 data.
I would like to know how all data is divided into train and test.
It seems that "it2_tt_v1.3_completeset_test0.types" and "it2_tt_v1.3_completeset_train0.types" are the same file.
I assumed that the it2_tt_v1.3_train[0-2].types files, concatenated together, give it2_tt_v1.3_completeset_train0.types. Is that correct?

What dataset for the built-in models?

Hi Developer,

I see there are 5 default built-in models, including "redock_default2018_2", "general_default2018_3", "crossdock_default2018", and 2 "Dense" models. It looks like the redock model was built from the "redock" subset of the CrossDocked2020 dataset. For "general_default2018", I originally guessed this model was built from the PDBbind2016 General dataset. I tried to build my own default2018 model using PDBbind2016 General only, but its affinity-prediction performance is much poorer than "general_default2018_3" (comparing single models), so I suspect more data was used. What dataset was used to build "general_default2018"?

Another question: I guess the "crossdock_default2018" and "Dense" models are both built from the CrossDocked2020 dataset, right? There are several series of files within the "CrossDocked2020/types" folder, like "it2_tt_v1.3_0_train", "it2_tt_v1.3_10p20n_train", and "mod_it2_tt_v1.3_0_train". Which types file series was used for the "crossdock_default2018" and "Dense" models?

Thanks a lot !

Some potentially problematic CrossDocked examples

I've noticed that in some of the CrossDocked folders there are some possible data-processing errors.

In 1433Z_HUMAN_1_244_pep_0 there is a docked file 5d3f_A_rec_5d3f_fsc_lig_tt_docked.sdf.gz. This seems to suggest the ligand 5d3f_fsc is being docked into the receptor 5d3f_A_rec.pdb, and I think this part is correct. However, if you load the crystal ligand 5d3f_fsc_lig.pdb, it sits in a physically impossible position relative to the receptor. I think this is because there is also a 5d3f_B_rec.pdb file that this ligand was presumably taken from. In summary, I think there need to be two copies of 5d3f_fsc_lig.pdb, one for receptor chain A and one for receptor chain B.

Several versions of the same receptor in CrossDocked2020

Hi,

Many thanks for curating the CrossDocked2020 dataset - it is super useful for our research. I have encountered some issues when exploring the data, namely that several versions of the same receptor (same PDB ID, containing the same chain) seem to be present in the same pocket folder.

For instance, consider the CP3A4_HUMAN_23_503_catalytic_0/4k9t_A_rec.pdb file:

REMARK Selection 'protein or ion and not water'
ATOM      1  N   SER A  29     -12.659  -8.898 -14.390  1.00 73.82         A N
ATOM      2  CA  SER A  29     -13.741  -7.890 -14.636  1.00 72.94         A C
ATOM      3  C   SER A  29     -13.883  -7.497 -16.117  1.00 72.83         A C
ATOM      4  O   SER A  29     -14.937  -6.970 -16.533  1.00 71.08         A O
ATOM      5  CB  SER A  29     -13.479  -6.630 -13.815  1.00 72.53         A C
ATOM      6  OG  SER A  29     -12.680  -5.708 -14.551  1.00 72.01         A O
ATOM      7  N   HIS A  30     -12.824  -7.741 -16.895  1.00 69.18         A N
ATOM      8  CA  HIS A  30     -12.738  -7.273 -18.287  1.00 66.60         A C
...

Compared to the CP3A4_HUMAN_23_503_catalytic_0/4k9t_rec.pdb file:

REMARK Selection '(protein and ch... and not water)'
ATOM      1  N   SER A  29     -26.772   9.141 -14.685  1.00 73.82           N
ATOM      2  CA  SER A  29     -25.634   8.216 -14.991  1.00 72.94           C
ATOM      3  C   SER A  29     -25.499   7.888 -16.489  1.00 72.83           C
ATOM      4  O   SER A  29     -24.423   7.446 -16.946  1.00 71.08           O
ATOM      5  CB  SER A  29     -25.794   6.911 -14.214  1.00 72.53           C
ATOM      6  OG  SER A  29     -26.549   5.967 -14.969  1.00 72.01           O
ATOM      7  N   HIS A  30     -26.590   8.091 -17.236  1.00 69.18           N
ATOM      8  CA  HIS A  30     -26.677   7.670 -18.642  1.00 66.60           C
ATOM      9  C   HIS A  30     -25.903   8.521 -19.613  1.00 68.46           C

Coordinates seem to differ between these two files. Ions also seem to be preserved in the first one, although I do not know whether this was also done in the second, due to the truncation of the REMARK field.
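
To make the comparison concrete, here is the quick throwaway sketch I used to measure how far matching ATOM records move between the two files (the file names are just the ones quoted above; this is not part of the dataset tooling):

import math

def atom_coords(path):
    # map (chain, residue number, atom name) -> (x, y, z) from ATOM records
    coords = {}
    with open(path) as fh:
        for line in fh:
            if line.startswith("ATOM"):
                key = (line[21], line[22:26].strip(), line[12:16].strip())
                coords[key] = tuple(float(line[c:c + 8]) for c in (30, 38, 46))
    return coords

a = atom_coords("4k9t_A_rec.pdb")
b = atom_coords("4k9t_rec.pdb")
shared = sorted(set(a) & set(b))
shifts = [math.dist(a[k], b[k]) for k in shared]
print(len(shared), "matched atoms,",
      "mean shift %.2f A," % (sum(shifts) / len(shifts)),
      "max shift %.2f A" % max(shifts))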

What is the difference between these files, and which one of them was used for docking?

Cheers,

Details of the affinity labels in data

When using *_min poses as part of the training set

I noticed that in the *.types files in PDBbind2016, you use *_min poses as part of the training data. How do you define their affinity labels? Did you just assign the minimized poses the same affinity as the crystal poses, and set the other docked poses to the corresponding negative value?

Why does the second column have positive and negative numbers for the ligand and docked poses?

<label> <pK> <RMSD to crystal> <Receptor filename> <Ligand filename> # <Autodock Vina score>
1 3.28 0.908077 3zsx/3zsx_rec_0.gninatypes 3zsx/3zsx_min_0.gninatypes # -6.89469
0 -3.28 4.7514 3zsx/3zsx_rec_0.gninatypes 3zsx/3zsx_docked_0.gninatypes # -7.84082
0 -3.28 3.89599 3zsx/3zsx_rec_0.gninatypes 3zsx/3zsx_docked_1.gninatypes # -7.43202
0 -3.28 6.06622 3zsx/3zsx_rec_0.gninatypes 3zsx/3zsx_docked_2.gninatypes # -7.10783
0 -3.28 7.9518 3zsx/3zsx_rec_0.gninatypes 3zsx/3zsx_docked_3.gninatypes # -7.03943
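
For context, this is how I am reading one of these lines, with the field meanings taken directly from the header quoted above (just a parsing sketch, not an answer about how the signed values are assigned):

def parse_types_line(line):
    # format: <label> <pK> <RMSD to crystal> <receptor> <ligand> # <Vina score>
    left, _, comment = line.partition("#")
    label, pk, rmsd, rec, lig = left.split()
    return {
        "label": int(label),        # 1 for the low-RMSD pose above, 0 otherwise
        "pK": float(pk),            # negative in the 0-labelled lines above
        "rmsd": float(rmsd),        # RMSD to the crystal pose
        "receptor": rec,
        "ligand": lig,
        "vina": float(comment.split()[0]),
    }

row = parse_types_line("1 3.28 0.908077 3zsx/3zsx_rec_0.gninatypes 3zsx/3zsx_min_0.gninatypes # -6.89469")
print(row)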

Odd data in PDBBind2016

I wanted to flag some oddities with the PDBBind2016 dataset. I've tried to recompute the RMSDs and have noticed a very large fraction do not match the data. One particularly odd example I found was in 5c28 where the docked ligand is a different molecule from the crystal ligand. Is there by any chance a cleaner version of the PDBBind docked dataset that could be used?

difference between Crossdock-default2018 and default-2018

I first used the built-in model --cnn crossdock_default2018 to dock my dataset and get baseline results, and now I want to train this model using my own datasets, but I can only find a default2018.model file in gnina/models/crossdocked_paper.

What should I do? If the architecture of crossdock-default2018 and default-2018 is the same, then I can compare them, right?

Questions on the gninatypes format

I understand that gninatypes files are binary files with atom coordinates and atom types (a small section on the gninatypes format in a README would be super helpful; right now, as far as I know, this information is hidden in a closed GitHub issue :)).

  1. Is there some functionality to convert the gninatypes format back into PDB? If I understand correctly, connectivity and amino acid information are lost in this format, but it would still be nice to be able to visualize the content of gninatypes files, for instance in PyMOL.

  2. The context of my question is that I am interested in knowing whether the protein PDB and gninatypes files in the CrossDocked2020 dataset are aligned, e.g. whether 2bvo_A_rec.pdb and 2bvo_A_rec_0.gninatypes have the same atom coordinates.

Even if the answer to 2 is yes, I would still be interested in 1, as I would like to explore the content of gninatypes files. I loaded them with molgrid.ExampleProvider() and got coordinates and atom type indices, but I did not know which atom type index maps to which atom type.
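
In case it is useful, this is the throwaway reader I have been using to peek at gninatypes files. It assumes each atom record is three 32-bit floats (x, y, z) followed by one 32-bit integer type index, which is my reading of the closed issue mentioned above; please correct me if the actual layout is different:

import struct

def read_gninatypes(path):
    # ASSUMED layout: repeated 16-byte records of (float x, float y, float z, int type)
    record = struct.Struct("fffi")
    with open(path, "rb") as fh:
        data = fh.read()
    return [((x, y, z), t) for x, y, z, t in record.iter_unpack(data)]

atoms = read_gninatypes("2bvo_A_rec_0.gninatypes")
print(len(atoms), "atoms; first record:", atoms[0])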

Thanks a lot for clarification!

ChEMBL and MUV data from JCIM 2017 paper

Hi there, thanks for making available all these resources. I was wondering whether you have made available the independent virtual screening test sets used in "Protein-Ligand Scoring with Convolutional Neural Networks" (JCIM, 2017), as well as the scores produced by each of the five methods in the benchmark (DUD-E, 2:1, Vina, RF-score, NNScore).

Thanks!

Question about files

Can you help clarify what it2_redocked_tt_v1.2_completeset_train0.types corresponds to?

Looking at the documentation, the _redocked_ part of the file suggests it is just the ReDocked set (not CrossDocked). However, it also appears to be generated using counterexamples (it2). Is there a plain PDBBind-only ReDocked train/test set that was used without any added CNN-generated counterexamples?

Thank you!

Question about why ProBiS is used

I see in the paper that ProBiS is used to avoid highly similar targets appearing in both the TRAIN and TEST files. But what is the point of completely preventing the test set from containing targets similar to those in the training set? I would argue that this makes evaluation on the test set harder than it should be.

Also, do other scoring functions use the same method to train their models? If not, how can you fairly compare gnina with them?

Question about training only CNN_affinity for crossdock_default2018

Hello developers!

I saw that the types format for training both CNN_score and CNN_affinity needs an RMSD and an affinity label, but I don't want to train or use CNN_score in my work, so I am looking for how to train only CNN_affinity.

But different papers use different types file formats. For example, in data/PDBBind2016/Refined_types it is:

<label> <pK> <RMSD to crystal> <Receptor filename> <Ligand filename> # <Autodock Vina score>

but it is different in data/refined, like:

0 -6.3979 10gs/10gs_rec.gninatypes 10gs/10gs_ligand_0.gninatypes # 5.31559 -8.06592
0 -6.3979 10gs/10gs_rec.gninatypes 10gs/10gs_ligand_1.gninatypes # 9.14515 -8.0171

I want to know how to write my own types file if I only want to train CNN_affinity.
Also, what should I do when using training.py if I only want to train the --cnn crossdock_default2018 model? Caffe is new to me, so I don't know what kind of file I should use, or where the weights_file should be assigned.

(I have RMSDs for all ligands, so if removing CNN_score is very difficult I can accept keeping it.)

CSAR database

Hello, I'm trying to train a CNN model with the CSAR dataset, but I want to see the structures of these poses. I didn't find the SDF files in the CSAR folders. Can you share the SDF files of these docked poses with me? This is my email: [email protected]. Thank you.

Missing data in CrossDocked2020_v1.1

Thanks a lot for putting up a new version of CrossDocked2020 (v1.1).

I used this type file mod_it2_tt_v1.1_0_train0.types in combination with crossdock2020_1.1_rec.molcache2 and crossdock2020_1.1_lig.molcache2.

There seem to be files missing from the ligand cache file:

  • PVDQ_PSEAE_26_217_0/4wks_A_rec_2wyc_3la_lig_it2_it1_tt_docked_9.gninatypes
  • PVDQ_PSEAE_26_217_0/4k2g_A_rec_4k2g_1oq_lig_it2_it1_tt_docked_7.gninatypes
  • potentially more?

Note: I am using libmolgrid (ExampleProvider); data loading breaks with the following error:
ValueError: Could not read PVDQ_PSEAE_26_217_0/4k2g_A_rec_4k2g_1oq_lig_it2_it1_tt_docked_7.gninatypes

I am assuming this is because the files are missing in the molcache files.

What is your model architecture?

In your paper, you said "five 3 × 3 × 3 convolutional layers with rectified linear activation units alternating with max pooling layers". Is it possible to show a picture of each layer's parameters?

Your model uses 48x48x48 grids with 34 channels, so it is a huge model. How much memory does it require? In total, how many parameters does your model have? I am interested in your model details. Many thanks.
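
As an aside, here is the back-of-envelope arithmetic I tried myself: a 3D conv layer with kernel k and channels c_in to c_out has k^3 * c_in * c_out + c_out parameters. The filter widths below are made-up placeholders just to illustrate the calculation; the real widths are whatever the .model files in this repository define.

def conv3d_params(c_in, c_out, k=3):
    # k^3 weights per (input channel, output channel) pair, plus one bias per output channel
    return k ** 3 * c_in * c_out + c_out

channels = [34, 32, 64, 128, 256, 512]  # HYPOTHETICAL widths; 34 is the input channel count
total = sum(conv3d_params(cin, cout) for cin, cout in zip(channels, channels[1:]))
print("roughly", total, "convolutional parameters for these made-up widths")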

Gnina --cnn_model ERROR

error: completerec.
I am wondering whether this is because the model file is not compatible with the caffemodel file, or whether there is some other problem with my command?

Looking for CSAR dataset and test folds for pose prediction task

As the paper Protein−Ligand Scoring with Convolutional Neural Networks says: The performance of trained CNN models were evaluated by 3-fold cross-validation for both the pose prediction and virtual screening tasks. To avoid evaluating models on targets similar to those in the training set, training and test folds were constructed by clustering data based on target families rather than individual targets.

But I couldn't find the dataset here, and I don't know how you constructed the test folds by target families (I also couldn't find the test folds for pose prediction here).

CrossDocked2020 dataset questions

Thanks a lot for making those datasets available! Very much appreciated.

  1. There is no equation given in the paper on how you got from the Kd(?)/Ki(?) values provided by the PDBBind webpage to the pK values used in the paper and provided in the types files (my working guess is sketched after this list).
  2. Are both Kd and Ki values mapped to the same pK?
  3. To understand the dataset better, I checked the pK values of this types file: it0_tt_0_train0.types (from this directory: http://bits.csb.pitt.edu/files/crossdock2020/CrossDocked2020_types.tar.gz)
  • I find that ~50% of the lines/samples have a pK value of 0. Is this meaningful?
  • I find that ~30% of the pK values in the above types file are negative, while Figure S12 shows experimental pK values in the range 2-12.
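
For question 1, my working assumption (and it is only an assumption about this dataset) is the usual conversion, where both Kd and Ki in molar units map onto the same pK scale:

import math

def pK(k_molar):
    # standard definition: pK = -log10(K), with K a dissociation/inhibition constant in M
    return -math.log10(k_molar)

print(pK(1e-9))  # a 1 nM Kd or Ki corresponds to a pK of 9.0

That still would not explain the zero or negative values I see, which is why I am asking.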

Thanks for clarification!

Updated tarballs

This is with regard to the latest update to data/CrossDocked2020/README.md and the corresponding data.

Has the content of CrossDocked2020_types.tar.gz and CrossDocked2020.tgz changed relative to CrossDocked2020_v1.1_types.tar.gz and CrossDocked2020_v1.1.tgz?

Eyal

Details of generating <PDBid>_nowat.pdb

In the PDBbind2016 data directory, I see the following file, which seems to be used in your training project:
<PDBid>_nowat.pdb -- Receptor structure with all HETATOMS removed

I was wondering how to generate the nowat.pdb files from the pocket.pdb files in the PDBbind 2019 dataset, and ChatGPT told me to use:
for file in *.pdb; do gnina -i "$file" -o "${file%.pdb}_nowat.pdb" --autobox_ligand ""; done

I am not sure whether this matches the way your project does it. Even if it works, the result might differ from your dataset, so I am asking for the details of how you generate the 'nowat.pdb' files.
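
To show what I mean, this is the naive preprocessing I would write myself, based only on the README description above ("all HETATOMS removed"); I am explicitly not claiming this reproduces your pipeline, and the file names are hypothetical:

def strip_hetatm(in_path, out_path):
    # copy a PDB file, dropping HETATM records (waters, ions, cofactors, ligands)
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if not line.startswith("HETATM"):
                dst.write(line)

strip_hetatm("1abc_pocket.pdb", "1abc_nowat.pdb")  # hypothetical file names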

CrossDocked2020 Pose scores

Hi,
I would like to know where I can find the precalculated scores for all of the >20M poses for each ligand-target pair.
Is that in the dump somewhere?
Thanks!

gnina scores on CrossDocked2020 dataset

Hello! Thanks for these very useful datasets. I notice that the Vina scores are included in the types files, so I was wondering whether the gnina scores (or any of the various models' scores from the CrossDocked2020 paper) are available for download anywhere, for the provided poses? It seems a bit computationally heavy to recreate them :')

Question about PDBBind2016 docked poses

In the README I see the following description:
<PDBid>_docked.sdf -- smina docked pose of *_uff.sdf into its cognate receptor.

Does this mean the re-docking was done by initializing the Monte Carlo search at a minimized version of the crystal pose? I had thought that docking would have been initialized from a random conformation such as:
<PDBid>_conf.sdf -- RDkit generated conformer from the ligand SMILES.

about *.types files

Hello,

I tried to make CNN models, but there are some difficulties.

There are various kinds of '.types' files, and each '.types' file has a different number of columns.
In the files which have 4 or 5 columns, I don't know the meaning of some of the columns.

For example,

The file all.types (data/csar/all.types) has 3 columns, like below.
I know the columns mean label, receptor file, and ligand file.
0 set2/102/rec.gninatypes set2/102/docked_17.gninatypes # 6.072250 -5.202670
0 set2/102/rec.gninatypes set2/102/docked_18.gninatypes # 5.235610 -5.080780

The file gaffwmintrain0.types (models/data/general/gaffwmintrain0.types) has 4 columns, like below.
In this file, I don't know the meaning of the 2nd column.
What does the 2nd column mean, and how can I generate its value?
1 8.0000 9abp/9abp_rec.gninatypes 9abp/9abp_min.gninatypes # -7.90308 0.485643
0 -8.3500 9hvp/9hvp_rec.gninatypes 9hvp/9hvp_ligand_1.gninatypes # -9.82969 4.74258

Also, the file ccv_gen_uff_3_test0.types (models/data/PDBBind2016/General_types/ccv_gen_uff_3_test0.types) has 5 columns, like below.
What do the 2nd and 3rd columns mean, and how can I generate them?
1 3.52 0.880169 4eky/4eky_rec_0.gninatypes 4eky/4eky_min_0.gninatypes # -9.76619
1 3.52 0.69435 4eky/4eky_rec_0.gninatypes 4eky/4eky_docked_0.gninatypes # -11.2589

I want to know the meaning of all columns in all "*.types" files.
Could you please explain what they mean?

Thanks.
