zaixizhang / flag

Implementation of ICLR23 paper "Molecule Generation for Target Protein Binding with Structural Motifs"

drug-discovery generative-ai geometric-deep-learning protein-ligand-interactions

flag's Introduction

FLAG ICLR23

Molecule Generation For Target Protein Binding With Structural Motifs

Designing ligand molecules that bind to specific protein binding sites is a fundamental problem in structure-based drug design. Although deep generative models and geometric deep learning have made great progress in drug design, existing works either sample in the 2D graph space or fail to generate valid molecules with realistic substructures. To tackle these problems, we propose a Fragment-based LigAnd Generation framework (FLAG) that generates 3D molecules with valid and realistic substructures fragment by fragment. In FLAG, a motif vocabulary is constructed by extracting common molecular fragments (i.e., motifs) from the dataset. At each generation step, a 3D graph neural network first encodes the intermediate context information. Then, our model selects the focal motif, predicts the next motif type, and attaches the new motif. The bond lengths and angles can be quickly and accurately determined by cheminformatics tools. Finally, the molecular geometry is further adjusted according to the predicted rotation angle and the structure refinement. Our model not only achieves competitive performance on conventional metrics such as binding affinity, QED, and SA, but also outperforms baselines by a large margin in generating molecules with realistic substructures.

📢 News

Please check out our latest work on structure-based drug design: Learning Subpocket Prototypes for Generalizable Structure-based Drug Design (ICML 2023)
- Code: https://github.com/zaixizhang/DrugGPS_ICML23
- Paper: https://arxiv.org/abs/2305.13997

Install conda environment via conda yaml file

conda env create -f flag_env.yaml
conda activate flag_env

Datasets

Please refer to README.md in the data folder.

Dataset Preprocessing and motif vocab construction

python build_vocab.py

Training

python train.py

Sampling

python motif_sample.py

FLAG demo with checkpoints

Demo: https://huggingface.co/spaces/Zaixi/ICLR_FLAG

Checkpoints: https://drive.google.com/drive/folders/1NI-Tl7YzyMsfljEZXaTxbpuiO7lvUBt9?usp=drive_link

Generated Molecules for CrossDocked dataset

The generated molecular structures for 100 protein targets are stored in flag_gen.pt

The index file is test_index.pkl

Reference

@inproceedings{
zhang2023molecule,
title={Molecule Generation For Target Protein Binding with Structural Motifs},
author={ZAIXI ZHANG and Shuxin Zheng and Yaosen Min and Qi Liu},
booktitle={International Conference on Learning Representations},
year={2023},
url={https://openreview.net/forum?id=Rq13idF0F73}
}

flag's Issues

how to solve the processing data problem

Hello author, thank you very much for your great work. When I run train.py, the data processing step frequently skips entries, and in some cases it skips all of the data. Could you suggest a way to fix this?

dataset missing

Hi author, I'm still having trouble getting the trainer to run.
I saw in a closed issue that people reported missing dataset files, but those files are still missing.
I have tried everything I could think of and read through the code, but I still cannot find or generate them. The missing files include './data/cross docked_pocket10/index.pt', './data/pdbbind_pocket10/*', and '/n/holyscratch01/mzitnik_lab/zaixizhang/pdbbind_pocket10/index.pt'.
Could you shed some light on this?

how to build the dataset

Hi, thank you for the nice work.
May I ask how you built dataset files such as pdbbind_pocket10_xxx?
Would you please share the code?

how to solve this training error?

Indexing: 1%|█ | 2200/166398 [00:20<25:42, 106.46it/s]
Traceback (most recent call last):
  File "train.py", line 63, in <module>
    dataset, subsets = get_dataset(config=config.dataset, transform=transform, )
  File "/data/zhouzihan/FLAG-main/utils/datasets/__init__.py", line 11, in get_dataset
    dataset = PocketLigandPairDataset(root, *args, **kwargs)
  File "/data/zhouzihan/FLAG-main/utils/datasets/pl.py", line 64, in __init__
    self._precompute_name2id()
  File "/data/zhouzihan/FLAG-main/utils/datasets/pl.py", line 132, in _precompute_name2id
    data = self.__getitem__(i)
  File "/data/zhouzihan/FLAG-main/utils/datasets/pl.py", line 153, in __getitem__
    data = self.transform(data)
  File "/home/ahmu/ENTER/envs/flag_env/lib/python3.8/site-packages/torch_geometric/transforms/compose.py", line 24, in __call__
    data = transform(data)
  File "/data/zhouzihan/FLAG-main/utils/transforms.py", line 471, in __call__
    bfs_perm, bfs_focal = self.get_bfs_perm_motif(data['moltree'], self.vocab)
  File "/data/zhouzihan/FLAG-main/utils/transforms.py", line 449, in get_bfs_perm_motif
    node.wid = vocab.get_index(node.smiles)
  File "/data/zhouzihan/FLAG-main/utils/mol_tree.py", line 24, in get_index
    return self.vmap[smiles]
KeyError: 'C1CC2CCC(O1)O2'
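
The last two frames of the traceback show a plain dictionary lookup failing: the motif SMILES 'C1CC2CCC(O1)O2' is not a key in the vocabulary map. One possible cause, which is an assumption and not confirmed by the authors, is that the vocabulary file was built from a different set of molecules than the one being indexed. If so, rerunning build_vocab.py on the same data may help. The sketch below uses a minimal stand-in Vocab class, modeled only on the lookup visible in the traceback, and a hypothetical helper find_missing_motifs that lists motifs absent from the vocabulary before training starts.

```python
class Vocab:
    """Minimal stand-in for the vocabulary in utils/mol_tree.py (assumed shape)."""

    def __init__(self, smiles_list):
        self.vocab = list(smiles_list)
        # Map each motif SMILES string to its integer index.
        self.vmap = {s: i for i, s in enumerate(self.vocab)}

    def get_index(self, smiles):
        # Same lookup as in the traceback: raises KeyError for unseen motifs.
        return self.vmap[smiles]


def find_missing_motifs(vocab, motif_smiles):
    """Hypothetical helper: motifs not in the vocabulary, in first-seen order."""
    missing = []
    for s in motif_smiles:
        if s not in vocab.vmap and s not in missing:
            missing.append(s)
    return missing


# Toy data for illustration only; these are not the real vocabulary entries.
vocab = Vocab(["C1=CC=CC=C1", "C=O", "CN"])
motifs_in_dataset = ["C=O", "C1CC2CCC(O1)O2", "CN", "C1CC2CCC(O1)O2"]

missing = find_missing_motifs(vocab, motifs_in_dataset)
print("motifs missing from vocabulary:", missing)
```

Running such a check over every motif in the dataset before indexing would show whether this is a single unexpected fragment or a wholesale mismatch between the vocabulary and the data.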

error during sampling

During sampling, once the UFF exception is triggered, the following error is raised at this line (https://github.com/zaixizhang/FLAG/blob/main/motif_sample.py#L295) and the program exits.

ValueError: Bad Conformer Id

Could you give some explanations or suggestions to address this?
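
RDKit raises "ValueError: Bad Conformer Id" when code asks for a conformer that the molecule does not have, for example after 3D embedding failed. Whether that is what happens at the linked line is an assumption. The sketch below shows one defensive pattern: check that a conformer and UFF parameters exist before optimizing, and skip the molecule otherwise. The helper name safe_uff_optimize is hypothetical and not part of the FLAG code.

```python
from rdkit import Chem
from rdkit.Chem import AllChem


def safe_uff_optimize(mol, max_iters=200):
    """Hypothetical helper: run UFF only when it can succeed.

    Returns True if the optimizer ran, False if the molecule was skipped.
    """
    if mol.GetNumConformers() == 0:
        # No 3D coordinates, so any conformer access would fail.
        return False
    if not AllChem.UFFHasAllMoleculeParams(mol):
        # UFF has no parameters for at least one atom type.
        return False
    AllChem.UFFOptimizeMolecule(mol, maxIters=max_iters)
    return True


mol = Chem.AddHs(Chem.MolFromSmiles("CCO"))
ran_without_conformer = safe_uff_optimize(mol)   # no conformer yet

conf_id = AllChem.EmbedMolecule(mol, randomSeed=42)  # returns -1 on failure
ran_with_conformer = safe_uff_optimize(mol)

print(ran_without_conformer, conf_id, ran_with_conformer)
```

In a sampling loop, a False return could be treated as a failed sample and retried instead of letting the exception end the run.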

motif_sample.py

I downloaded the source code and the checkpoint files.
I want to run motif_sample.py, but it stops with the following error.

$ python motif_sample.py

[2024-03-24 01:11:33,984::sample::INFO] Namespace(config='./configs/sample.yml', data_id=1, device='cuda:0', num_workers=64, outdir='./outputs', vocab_path='vocab.txt')
[2024-03-24 01:11:33,984::sample::INFO] {'dataset': {'name': 'pl', 'path': './data/pdbbind_pocket10', 'split': './data/split_by_name.pt'}, 'model': {'checkpoint': './checkpoints/pretrained.pt', 'hidden_channels': 256, 'random_alpha': False}, 'sample': {'seed': 2024, 'num_samples': 100, 'num_retry': 5, 'max_steps': 12, 'batch_size': 10, 'num_workers': 4, 'n_samples': 5}}
[2024-03-24 01:11:33,984::sample::INFO] Loading data...
Segmentation fault (core dumped)

I checked the code in motif_sample.py, and the crash occurs at line 507:
data = testset[args.data_id]

Do you have any ideas to solve this?
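
A segmentation fault bypasses Python's normal traceback, so the log above does not show which call crashed inside the dataset access. Python's standard faulthandler module can dump the Python stack at the moment of the crash. The sketch below enables it at the top of a script. Placing it at the top of motif_sample.py is a suggestion, not something the repository does.

```python
import faulthandler
import sys

# Dump the Python traceback of all threads if the process receives a
# fatal signal such as SIGSEGV. sys.__stderr__ is the original stderr
# stream, which keeps a real file descriptor even if sys.stderr has
# been redirected.
faulthandler.enable(file=sys.__stderr__, all_threads=True)

handler_active = faulthandler.is_enabled()
print("faulthandler enabled:", handler_active)
```

The same effect is available without editing the script by running python -X faulthandler motif_sample.py. The dumped stack should show which line inside the dataset's item access triggers the crash, which narrows the search to a specific native library call.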

Can you provide your pre-trained model?

Hello author, thank you very much for your great work. May I ask if it is convenient for you to provide your pre-trained model? I saw in the sample.yml that it is the file "./pretrained/model.pt". If you could provide this trained model file, I would be very grateful. Thank you.

enum_assemble not found

Line 91 of mol_tree.py in the utils folder calls "cands = enum_assemble(self, neighbors)", which fails because enum_assemble is not defined. I also could not find an import statement for it. Could you please let me know where I can find this function? Thank you.

Sampling Code Not Working

We are trying to reproduce the results from the original FLAG paper. We have been able to tweak the training code to make it work, but we still bump into some knotty issues during the sampling/generation stage. Following the original instructions from README.md, we start the sampling process by running the following command:

python motif_sample.py

The Python interpreter gives the following error:

Traceback (most recent call last):
  File "motif_sample.py", line 18, in <module>
    from models.maskfill import MaskFillModel
ModuleNotFoundError: No module named 'models.maskfill'

I then realized that there is no ./models/maskfill.py file in the GitHub repo. I searched for the file and found one with the same name in the 3DSBDD repo (https://github.com/luost26/3D-Generative-SBDD/blob/main/models/maskfill.py). However, the arguments of the class's __init__() function do not match. In motif_sample.py:412:

    model = MaskFillModel(
        ckpt['config'].model,
        protein_atom_feature_dim=protein_featurizer.feature_dim,
        ligand_atom_feature_dim=ligand_featurizer.feature_dim,
        vocab=vocab,
        weight=weight,
    ).to(args.device)

The vocab and weight arguments do not exist in the 3DSBDD version of maskfill.py. I assume the FLAG authors made substantial changes to maskfill.py but did not upload it to GitHub. I cannot proceed with reproducing the experiments until the FLAG version of maskfill.py is uploaded.

How to resume train.py from checkpoint files

I ran "python train.py" but it seemed that the training job didn't finish normally. How can I resume the training process with a checkpoint file? Many thanks!
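
The README does not describe a resume option, so the following is only a generic PyTorch sketch of saving and restoring training state. It is not the FLAG implementation. The checkpoint keys "model", "optimizer", and "iteration" and the helper names are hypothetical. The sampling snippet quoted in an earlier issue reads ckpt['config'], so the real checkpoint layout likely differs and should be inspected first.

```python
import os
import tempfile

import torch


def save_checkpoint(path, model, optimizer, iteration):
    """Hypothetical helper: store everything needed to continue training."""
    torch.save(
        {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "iteration": iteration,
        },
        path,
    )


def load_checkpoint(path, model, optimizer, device="cpu"):
    """Hypothetical helper: restore state and return the next iteration."""
    ckpt = torch.load(path, map_location=device)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["iteration"] + 1


# Toy model standing in for the real network.
model = torch.nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
ckpt_path = os.path.join(tempfile.mkdtemp(), "ckpt.pt")
save_checkpoint(ckpt_path, model, optimizer, iteration=100)

# A fresh process would rebuild the model and optimizer, then restore them.
resumed_model = torch.nn.Linear(4, 2)
resumed_optimizer = torch.optim.Adam(resumed_model.parameters(), lr=1e-3)
start_iteration = load_checkpoint(ckpt_path, resumed_model, resumed_optimizer)
print("resume from iteration", start_iteration)
```

Restoring the optimizer state along with the weights matters for optimizers such as Adam, whose running moment estimates would otherwise restart from zero.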