sc8668 / genscore

License: MIT License

Python 99.29% Shell 0.71%

genscore's Introduction

GenScore

GenScore is a generalized protein-ligand scoring framework extended from RTMScore, and it exhibits balanced scoring, ranking, docking and screening powers on multiple datasets.

Requirements

mdanalysis==2.0.0
pandas==1.0.3
prody==2.1.0
python==3.8.11
pytorch==1.11.0
torch-geometric==2.0.3
torch-scatter==2.0.9
rdkit==2021.03.5
openbabel==3.1.0
scikit-learn==0.24.2
scipy==1.6.2
seaborn==0.11.2
numpy==1.20.3
pandas==1.3.2
matplotlib==3.4.3
joblib==1.0.1

conda create --prefix xxx --file ./requirements_conda.txt      
pip install -r ./requirements_pip.txt

Datasets

PDBbind
CASF-2016
docking poses for DEKOIS2.0 and DUD-E
CSAR NRC-HiQ benchmark
Merck FEP benchmark
PDBbind-CrossDocked-Core

Examples for using the trained model for prediction

cd example

# input is protein (need to extract the pocket first)

python genscore.py -p ./1qkt_p.pdb -l ./1qkt_decoys.sdf -rl ./1qkt_l.sdf -gen_pocket -c 10.0 -e gt -m ../trained_models/GT_0.0_1.pth

# input is pocket

python genscore.py -p ./1qkt_p_pocket_10.0.pdb -l ./1qkt_decoys.sdf -e gatedgcn -m ../trained_models/GatedGCN_0.5_1.pth

# calculate the atom contributions of the score

python genscore.py -p ./1qkt_p_pocket_10.0.pdb -l ./1qkt_decoys.sdf -e gatedgcn -ac -m ../trained_models/GatedGCN_ft_1.0_1.pth

# calculate the residue contributions of the score

python genscore.py -p ./1qkt_p_pocket_10.0.pdb -l ./1qkt_decoys.sdf -e gatedgcn -rc -m ../trained_models/GatedGCN_ft_1.0_1.pth
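When scoring many targets, it can help to assemble these command lines programmatically. The sketch below only rearranges the flags shown in the examples above (-p, -l, -rl, -gen_pocket, -c, -e, -ac, -rc, -m). The function name build_genscore_command and its keyword arguments are hypothetical, and the requirement that -gen_pocket comes with a reference ligand is an assumption drawn from the first example, not something the script is known to enforce.

```python
def build_genscore_command(protein, ligands, encoder, model,
                           ref_ligand=None, cutoff=None, gen_pocket=False,
                           atom_contribution=False, res_contribution=False):
    """Return the argument list for one genscore.py run (hypothetical helper)."""
    cmd = ["python", "genscore.py", "-p", protein, "-l", ligands]
    if gen_pocket:
        # Assumption: pocket generation is always paired with -rl, as in the README example.
        if ref_ligand is None:
            raise ValueError("gen_pocket=True needs a reference ligand (-rl)")
        cmd += ["-rl", ref_ligand, "-gen_pocket"]
        if cutoff is not None:
            cmd += ["-c", str(cutoff)]
    cmd += ["-e", encoder]
    if atom_contribution:
        cmd.append("-ac")
    if res_contribution:
        cmd.append("-rc")
    cmd += ["-m", model]
    return cmd


# Reproduces the arguments of the first example; pass the list to subprocess.run to execute it.
example = build_genscore_command(
    "./1qkt_p.pdb", "./1qkt_decoys.sdf", "gt", "../trained_models/GT_0.0_1.pth",
    ref_ligand="./1qkt_l.sdf", cutoff=10.0, gen_pocket=True,
)
```

Returning a list rather than a single string avoids shell quoting problems when file paths contain spaces.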

genscore's Issues

Cannot reproduce published data

Hi everyone,
I'm trying to reproduce the numbers I found in the GenScore publication.
I'm interested in the Enrichment Factor. I've started my analysis by downloading CASF-2016 from the official website.
The following is the process I followed to try and reproduce the data.
For each target folder inside CASF-2016/decoys_screening, I scored all the ligands in that folder using the following command:

python3.8 ./example/genscore.py -p CASF-2016/coreset/<target>/<target>_pocket.pdb -l CASF_2016/decoys_screening/<target>/ligands.mol2 -e gatedgcn -m ./trained_models/GatedGCN_0.5_1.pth -o <out_file>

With all the scores computed by GenScore, I then computed the enrichment factor with the following:

python2 CASF-2016/power_screening/forward_screening_power.py -c CoreSet.dat -s <genscore_results_folder> -t TargetInfo.dat -p 'positive' -o 'GenScore'

The problem is the output of this script (attached as casf_output_ef.txt):

Average enrichment factor among top 1% = 8.11
Average enrichment factor among top 5% = 3.36
Average enrichment factor among top 10% = 2.31
The best ligand is found among top 1% candidates for 15 cluster(s); success rate = 26.3%
The best ligand is found among top 5% candidates for 24 cluster(s); success rate = 42.1%
The best ligand is found among top 10% candidates for 28 cluster(s); success rate = 49.1%

These values differ from the numbers I found in the publication.

What am I missing?

Thank you in advance for all the help.
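For cross-checking numbers like these by hand, here is a minimal sketch of one common definition of the enrichment factor: the hit rate among the top-ranked fraction divided by the hit rate over the whole library. It is not the CASF-2016 script, whose exact definition and tie handling may differ. The function name enrichment_factor is hypothetical, and the sketch assumes that a higher score means a better rank.

```python
import math


def enrichment_factor(scores, is_active, fraction):
    """Hit rate in the top `fraction` of the ranking divided by the overall hit rate.

    Assumption: higher scores rank first. Flip the sort if lower is better.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if len(scores) != len(is_active):
        raise ValueError("scores and is_active must have the same length")
    n_total = len(scores)
    n_actives = sum(1 for a in is_active if a)
    if n_total == 0 or n_actives == 0:
        raise ValueError("need at least one compound and one active")
    # Keep at least one compound in the selection.
    n_top = max(1, math.ceil(n_total * fraction))
    ranked = sorted(zip(scores, is_active), key=lambda pair: pair[0], reverse=True)
    hits = sum(1 for _, active in ranked[:n_top] if active)
    return (hits / n_top) / (n_actives / n_total)


# Toy data: 10 compounds, 2 actives, one of them ranked first.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
toy_labels = [1, 0, 0, 0, 0, 1, 0, 0, 0, 0]
toy_ef = enrichment_factor(toy_scores, toy_labels, 0.2)
```

If the sign convention is reversed, or if a different score column is read from the output file, the enrichment factor can change substantially, so the ranking direction is worth checking first.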

Hard time installing

I have a hard time installing all the dependencies.

I am using cuda_11.5.2 toolkit in a docker container. However when installing the dependencies with pip or anaconda (python 3.8.11), I run into some compatibility problems between torch, torch-scatter...

Can someone help me with this? I suspect I am on the wrong CUDA or Python version.

Warm regards,
Wout

Finetuning for retraining GenScore

Hello, thank you for your incredible work!

I would like to do finetuning with some of your pre-trained models, what should I look at?
Could you give me an example of a command line string that would allow me to execute finetuning on the NN?

Best regards,
Vittorio.

Can't reproduce "vina>cross-docking>SR1" in Table 6

Hi,
I ran GenScore on PDBbind-crossdock-core for testing. The model I tested was GT_ft_1.0_1.pth, with default parameters for everything else. However, for the decoys generated by vina, 505 receptor-ligand pairs (out of 1343 in total) reported errors:
"""
Traceback (most recent call last):
  File ".../GenScore/GenScore/example/genscore.py", line 256, in <module>
    main()
  File ".../GenScore/GenScore/example/genscore.py", line 237, in main
    ids, scores = scoring(prot=inargs.prot,
  File ".../GenScore/GenScore/example/genscore.py", line 82, in scoring
    data = VSDataset(ligs=lig,
  File ".../GenScore/GenScore/example/../GenScore/data/data.py", line 178, in __init__
    self.ids, self.gls = zip(*filter(lambda x: x[1] != None, zip(self.idsx, self.gls)))
ValueError: not enough values to unpack (expected 2, got 0)
"""
Over the cases where a score could be generated, I calculated a cross-docking SR1 of only 0.463 (0.59 is reported in the paper) and a redocking SR1 of 0.680.

What problems do I need to correct when running genscore?

Best regards.
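The failing line in the traceback keeps only the entries whose graph is not None and then unpacks the result into two names. If every ligand in the input is dropped, the filter yields nothing and the unpacking raises exactly this ValueError. The sketch below reproduces that behaviour and shows one possible guard. The function name split_valid is hypothetical, and whether all graphs really were None in the failing runs is an assumption based on the traceback alone.

```python
def split_valid(ids, graphs):
    """Keep (id, graph) pairs whose graph is not None; fail with a clear message if none remain."""
    kept = [(i, g) for i, g in zip(ids, graphs) if g is not None]
    if not kept:
        raise ValueError("no ligand could be converted to a graph; check the ligand file")
    kept_ids, kept_graphs = zip(*kept)
    return kept_ids, kept_graphs


# Reproduce the original failure mode: every graph is None, so nothing survives the filter.
try:
    ids, gls = zip(*filter(lambda x: x[1] is not None, zip(["a", "b"], [None, None])))
    message = ""
except ValueError as err:
    message = str(err)
```

Here message holds the same text as the last line of the traceback, which suggests looking at why the ligands of the failing pairs could not be parsed.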

DEKOIS 2.0 data fails to run

I ran example/genscore.py on the DEKOIS 2.0 target '11betahsd1' and got the following error:
    gen_score_file(protein_path=protein_path,ligand_path=ligand_path,cand_sdf_path=cand_decoys_sdf,out_put_path=cand_decoys_score_outpath,args=args)
  File "evalation_with_dockies.py", line 10, in gen_score_file
    ids, scores = genscore.scoring(prot=protein_path,
  File "/home/internal-GenScore/example/genscore.py", line 86, in scoring
    data = VSDataset(ligs=lig,
  File "/home/internal-GenScore/example/../GenScore/data/data.py", line 130, in __init__
    self.gp = prot_to_graph(self.prot, cutoff)
  File "/home/internal-GenScore/example/../GenScore/feats/mol2graph_rdmda_res.py", line 40, in prot_to_graph
    res_coods = th.tensor(np.array([np.concatenate([res.atoms.positions, np.full((RES_MAX_NATOMS-len(res.atoms), 3), np.nan)],axis=0) for res in u.residues]))
  File "/home/internal-GenScore/example/../GenScore/feats/mol2graph_rdmda_res.py", line 40, in <listcomp>
    res_coods = th.tensor(np.array([np.concatenate([res.atoms.positions, np.full((RES_MAX_NATOMS-len(res.atoms), 3), np.nan)],axis=0) for res in u.residues]))
  File "/opt/conda/lib/python3.8/site-packages/numpy/core/numeric.py", line 343, in full
    a = empty(shape, dtype, order)
ValueError: negative dimensions are not allowed

The error occurs because RES_MAX_NATOMS=24 is smaller than res_max_natoms = max([len(res.atoms) for res in u.residues]), so at least one residue has more than 24 atoms.
What is going wrong?
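The padding step in the traceback asks np.full for RES_MAX_NATOMS - len(res.atoms) rows, which is negative for any residue with more than 24 atoms. A minimal sketch of the same padding with an explicit check is shown below. The function name pad_residue_coords is hypothetical, and plain NumPy arrays stand in for the MDAnalysis residue positions. It only reports the offending residues and does not decide how they should be treated.

```python
import numpy as np

RES_MAX_NATOMS = 24  # value quoted in the report above


def pad_residue_coords(residue_coords, max_natoms=RES_MAX_NATOMS):
    """Pad each (n_atoms, 3) coordinate array with NaN rows up to max_natoms rows."""
    too_big = [i for i, c in enumerate(residue_coords) if len(c) > max_natoms]
    if too_big:
        # Without this check np.full receives a negative row count and raises
        # "ValueError: negative dimensions are not allowed".
        raise ValueError(f"residues at indices {too_big} have more than {max_natoms} atoms")
    return np.array([
        np.concatenate([c, np.full((max_natoms - len(c), 3), np.nan)], axis=0)
        for c in residue_coords
    ])


padded = pad_residue_coords([np.zeros((3, 3)), np.zeros((24, 3))])
```

A residue above the limit is often a sign of unusual input, for example explicit hydrogens, alternate locations, or a non-standard residue in the protein file, although that is a guess rather than something the traceback confirms.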

explicit_H for charged ligands and protein

Hi,

I have a few questions about running GenScore:

I noticed there is an option for explicit_H in the genscore.py file. Will setting explicit_H=True change how the program behaves? Many of my ligands are charged molecules.
Will --atom_contribution and --res_contribution make the final score more reliable?
There are three files for GT_ft_0.5 in the trained_models folder. Which one should I use?

Thanks!

Cannot install dependencies

Hi team,

Thanks for the interesting program. I am having trouble installing the dependencies using the provided requirements_conda.txt.

This is the log:

conda install -c conda-forge --file requirements_conda.txt
Channels:
 - conda-forge
 - defaults
Platform: linux-64
Collecting package metadata (repodata.json): done
Solving environment: failed

PackagesNotFoundError: The following packages are not available from current channels:

  - kaleido==0.2.1=pypi_0
  - plotly==5.6.0=pypi_0
  - scripttest==1.3=pypi_0
  - tenacity==8.0.1=pypi_0

Current channels:

  - https://conda.anaconda.org/conda-forge
  - defaults

To search for alternate channels that may provide the conda package you're
looking for, navigate to

    https://anaconda.org

and use the search bar at the top of the page.

Do you know how to solve this problem?

Thanks!
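The four packages in the log carry the build string pypi_0, which conda uses for packages that were installed with pip inside the exported environment, so no conda channel provides them. One possible workaround is to split the file into a conda part and a pip part. The sketch below assumes the file uses name=version=build lines as shown in the log. The function name split_conda_export and the example build string for numpy are hypothetical.

```python
def split_conda_export(lines):
    """Separate conda specs from pip-installed ones (build string starting with 'pypi')."""
    conda_specs, pip_specs = [], []
    for raw in lines:
        line = raw.strip()
        # Skip blanks, comments, and directives such as "@EXPLICIT".
        if not line or line.startswith(("#", "@")):
            continue
        # Accept both "name=version=build" and "name==version=build".
        parts = [p for p in line.split("=") if p]
        name = parts[0]
        version = parts[1] if len(parts) > 1 else None
        build = parts[2] if len(parts) > 2 else ""
        if build.startswith("pypi") and version is not None:
            pip_specs.append(f"{name}=={version}")
        else:
            conda_specs.append(line)
    return conda_specs, pip_specs


conda_part, pip_part = split_conda_export([
    "# exported environment",
    "kaleido==0.2.1=pypi_0",
    "plotly=5.6.0=pypi_0",
    "numpy=1.20.3=hypothetical_build_0",  # made-up build string for illustration
])
```

The first list can be written to a file for conda and the second to a file for pip install -r.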

creating the environment and modifications to genscore.py

Hi,

It took me some time to get a working environment. You should definitely remove this from your README file:

conda create --prefix xxx --file ./requirements_conda.txt
pip install -r ./requirements_pip.txt

Here is how I was able to create the environment:

conda create -n genscore python=3.8.11
conda activate genscore
conda install nvidia/label/cuda-11.5.2::cuda-toolkit
conda install mdanalysis==2.0.0 -c conda-forge
conda install rdkit==2021.03.5 -c conda-forge
pip install prody==2.1.0
conda install numpy==1.20.3 -c conda-forge
pip install torch==1.11.0+cu115 torchvision==0.12.0+cu115 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu115
pip install torch-geometric==2.0.3
pip install torch-scatter==2.0.9 -f https://pytorch-geometric.com/whl/torch-1.11.0+cu115.html
pip install seaborn==0.11.2
pip install pandas==1.3.2
pip install matplotlib==3.4.3
pip install scipy==1.6.2
pip install scikit-learn==0.24.2
conda install conda-forge::pytorch_sparse
pip install torch_sparse==0.6.15 -f https://pytorch-geometric.com/whl/torch-1.11.0+cu115.html
conda install openbabel==3.1.0 -c conda-forge
pip install joblib==1.0.1

In genscore.py, you hard-coded the path to the openbabel libraries, but you did not mention it in your README file:

#you need to set the babel libdir first if you need to generate the pocket
#os.environ["BABEL_LIBDIR"] = "/home/shenchao/.conda/envs/my3/lib/openbabel/3.1.0"

You could replace this code by:

default_babel_libdir = os.path.join(os.getenv("CONDA_PREFIX", ""), "lib", "openbabel", "3.1.0")
os.environ["BABEL_LIBDIR"] = os.getenv("BABEL_LIBDIR", default_babel_libdir)

so it is not hard-coded anymore. However, it will still be linked to a variable... CONDA_PREFIX

I get deprecation warnings from MDAnalysis, Bio.pairwise2, and pkg_resources. Maybe other versions would solve the problem.

Best,
Christian
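As a follow-up to the CONDA_PREFIX caveat, here is a small sketch that makes the lookup order explicit and returns None when neither variable is set, so the caller can decide what to do outside a conda environment. The function name resolve_babel_libdir is hypothetical. The directory layout lib/openbabel/3.1.0 is taken from the hard-coded path quoted above and may differ on other installations.

```python
import os


def resolve_babel_libdir(environ, version="3.1.0"):
    """Return the Open Babel plugin directory to use, or None if it cannot be derived."""
    explicit = environ.get("BABEL_LIBDIR")
    if explicit:
        # A value set by the user always wins.
        return explicit
    prefix = environ.get("CONDA_PREFIX")
    if prefix:
        # Assumed layout, copied from the path hard-coded in genscore.py.
        return os.path.join(prefix, "lib", "openbabel", version)
    return None


libdir = resolve_babel_libdir(os.environ)
if libdir is not None:
    os.environ["BABEL_LIBDIR"] = libdir
```

Passing the environment as an argument keeps the function easy to test with a plain dictionary.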

output meaning?

Can you please explain the output data? From the example data, I cannot find the output scores of the protein-ligand complex. Are higher scores better? Is there a way to convert to affinity?

pocket extraction does not work

Hi,
The pocket extraction step does not generate a pocket file. It seems to be doing something, but no file is produced. If I put back the file 1qkt_p_pocket_10.0.pdb, which was already there before I ran the script run_genscore.sh, the scoring completes, so the only step that does not work is the pocket generation. Another thing that does not seem to work is --parallel: I used "time python" to measure the execution time, and it is exactly the same with or without the argument.

(genscore) christian@Linux00:/media/christian/VS1/VS/Tool_GenScore/example$ python genscore.py -p ./1qkt_p.pdb -l ./1qkt_decoys.sdf -rl ./1qkt_l.sdf -gen_pocket -c 10.0 -e gt -m ../trained_models/GT_0.0_1.pth
/home/christian/anaconda3/envs/genscore/lib/python3.8/site-packages/MDAnalysis/coordinates/chemfiles.py:108: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
MIN_CHEMFILES_VERSION = LooseVersion("0.9")
/home/christian/anaconda3/envs/genscore/lib/python3.8/site-packages/MDAnalysis/analysis/data/filenames.py:82: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
from pkg_resources import resource_filename
/home/christian/anaconda3/envs/genscore/lib/python3.8/site-packages/Bio/pairwise2.py:278: BiopythonDeprecationWarning: Bio.pairwise2 has been deprecated, and we intend to remove it in a future release of Biopython. As an alternative, please consider using Bio.Align.PairwiseAligner as a replacement, and contact the Biopython developers if you still need the Bio.pairwise2 module.
warnings.warn(
@> 8026 atoms and 1 coordinate set(s) were parsed in 0.04s.
@> 44 atoms and 1 coordinate set(s) were parsed in 0.00s.

Dataset preprocessing

Hello! Could you provide the preprocessed datasets (lig.pt, prot.pt) mentioned in "train_model.py", or a download URL for them? Thanks a lot.

DEKOIS 2.0 AUC does not reach the reported 0.76; average is 0.593

I ran the model with GT_0.0_1.pth.
Parameters:
args["batch_size"] = 128
args["dist_threhold"] = 5.0
args['device'] = 'cuda:6' if th.cuda.is_available() else 'cpu'
print(f"""{args['device']}""")
args["num_workers"] = 10
args["num_node_featsp"] = 41
args["num_node_featsl"] = 41
args["num_edge_featsp"] = 5
args["num_edge_featsl"] = 10
args["hidden_dim0"] = 128
args["hidden_dim"] = 128
args["n_gaussians"] = 10
args["dropout_rate"] = 0.15

genscore.scoring parameter: cut=5.0

DEKOIS 2.0 test data:
protein and reference ligand dir: aurka_prot

DEKOIS 2.0 decoys and actives dirs:
aurka_decoys_SP
aurka_actives_SP

The average AUC is 0.593.
Can you help me identify which parameter causes the difference?
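To rule out the evaluation step as the source of the gap, the AUC can be recomputed independently of any plotting or library code. The sketch below uses the pairwise definition of ROC AUC: the fraction of (active, decoy) pairs in which the active has the higher score, with ties counted as one half. The function name roc_auc is hypothetical, and the sketch assumes a higher GenScore value means a more likely binder.

```python
def roc_auc(scores, is_active):
    """ROC AUC as the probability that a random active outscores a random decoy."""
    actives = [s for s, a in zip(scores, is_active) if a]
    decoys = [s for s, a in zip(scores, is_active) if not a]
    if not actives or not decoys:
        raise ValueError("need at least one active and one decoy")
    wins = 0.0
    for active_score in actives:
        for decoy_score in decoys:
            if active_score > decoy_score:
                wins += 1.0
            elif active_score == decoy_score:
                wins += 0.5  # ties count as half a win
    return wins / (len(actives) * len(decoys))


# Toy data: two actives and two decoys, one decoy ranked above one active.
toy_auc = roc_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

The double loop is quadratic in the number of compounds, which is fine for a single target but slow for very large libraries. If this value disagrees with the one from the evaluation script, the difference is in the scoring direction or label assignment rather than in the model parameters.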

details about your ex: 1qkt

Hi,
I am trying to figure out how to use GenScore based on your example: 1qkt. Could you give me some details about the files, please?
I guess you prepared the files (1qkt_p.pdb and 1qkt_l.sdf) from 1qkt, that is, the estrogen nuclear receptor with its estrogen ligand, using Maestro. Did you get 1qkt from the RCSB PDB or from PDB-REDO? Did you apply any minimization to this complex during preparation?

How did you generate 1qkt_decoys.sdf? Why is it used together with -gen_pocket? Is the binding pocket selected based on both 1qkt_l.sdf and 1qkt_decoys.sdf?

Does GenScore accept ligand files only in SDF, or also in other formats such as mol2?

Thanks for your help