MolGen

Code for the paper "Molecular Language Model as Multi-task Generator".

❗NOTE: We provide an NLP-for-science paper list at https://github.com/zjunlp/NLP4Science_Papers.
❗NOTE: We release our pre-trained model on Hugging Face.

Requirements

To run the code, first install the requirements:

pip install -r requirements.txt

Resource Download

You can download the pre-trained model via this link1, and the fine-tuned models via this link2.

Moreover, the dataset used for downstream tasks can be found here.

The expected structure of files is:

moldata
├── checkpoint 
│   ├── molgen.pkl              # pre-trained model
│   ├── syn_qed_model.pkl       # fine-tuned model for QED optimization on synthetic data
│   ├── syn_plogp_model.pkl     # fine-tuned model for p-logP optimization on synthetic data
│   ├── np_qed_model.pkl        # fine-tuned model for QED optimization on natural product data
│   └── np_plogp_model.pkl      # fine-tuned model for p-logP optimization on natural product data
├── finetune
│   ├── np_test.csv             # natural product test data
│   ├── np_train.csv            # natural product train data
│   ├── plogp_test.csv          # synthetic test data for p-logP optimization
│   ├── qed_test.csv            # synthetic test data for QED optimization
│   └── zinc250k.csv            # synthetic train data
├── generate                    # generate molecules
├── output                      # molecule candidates
└── vocab_list
    └── zinc.npy                # SELFIES alphabet
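
Before running anything, it can help to confirm that the downloads landed where the layout above expects them. The following is a minimal sketch, not part of the repository: `EXPECTED_FILES` and `missing_files` are hypothetical names introduced here, and the default root `moldata` assumes the script is run from the directory that contains that folder.

```python
from pathlib import Path

# Relative paths copied from the layout above; adjust them if your download differs.
EXPECTED_FILES = [
    "checkpoint/molgen.pkl",
    "checkpoint/syn_qed_model.pkl",
    "checkpoint/syn_plogp_model.pkl",
    "checkpoint/np_qed_model.pkl",
    "checkpoint/np_plogp_model.pkl",
    "finetune/np_test.csv",
    "finetune/np_train.csv",
    "finetune/plogp_test.csv",
    "finetune/qed_test.csv",
    "finetune/zinc250k.csv",
    "vocab_list/zinc.npy",
]


def missing_files(root="moldata"):
    """Return the expected files that are not present under `root` (hypothetical helper)."""
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]
```

Calling `missing_files()` returns an empty list once every file is in place; otherwise it lists the relative paths that still need to be downloaded.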

How to run

Fine-tune
- First, preprocess the fine-tuning dataset by generating candidate molecules with our pre-trained model. The preprocessed data is stored in the output folder.
```
    cd MolGen
    bash preprocess.sh
```
- Then run multi-task prefix tuning in combination with the self-feedback paradigm. The fine-tuned model is stored in the checkpoint folder.
```
    bash finetune.sh
```
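
The two steps above depend on each other: fine-tuning reads the candidates that preprocessing writes. A small wrapper can enforce that order and stop if the first step fails. This is only a sketch under stated assumptions: `run_pipeline` and its `runner` parameter are hypothetical, and the default `repo_dir` assumes the script is launched from the parent of the `MolGen` directory.

```python
import subprocess

# Script names are taken from the steps above, in the order they must run.
STEPS = ["preprocess.sh", "finetune.sh"]


def run_pipeline(repo_dir="MolGen", runner=subprocess.run):
    """Run each step with bash inside `repo_dir`, stopping at the first failure.

    `runner` defaults to subprocess.run and exists only so the function
    can be exercised without launching the real scripts.
    """
    for script in STEPS:
        # check=True raises CalledProcessError on a non-zero exit code,
        # so finetune.sh never starts if preprocess.sh failed.
        runner(["bash", script], cwd=repo_dir, check=True)
```

Calling `run_pipeline()` is equivalent to running the two shell commands by hand, with the difference that a failed preprocessing run aborts the sequence instead of letting fine-tuning start on incomplete data.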
Generate

To generate molecules, run the script below. Set checkpoint_path to choose between the pre-trained model and a fine-tuned model.
```
cd MolGen
bash generate.sh
```
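
Choosing a value for checkpoint_path comes down to picking one of the files listed under the checkpoint folder. The sketch below maps a task to its file name. The file names come from the expected layout in this README; the `(data, objective)` keys, the `checkpoint_path` function, and the default `root` are hypothetical and only illustrate the choice. How the value is passed to `generate.sh` is not shown here, so the helper only computes the path.

```python
from pathlib import Path

# File names come from the expected layout; the (data, objective) keys are
# hypothetical labels introduced for this illustration.
FINE_TUNED = {
    ("synthetic", "qed"): "syn_qed_model.pkl",
    ("synthetic", "plogp"): "syn_plogp_model.pkl",
    ("natural_product", "qed"): "np_qed_model.pkl",
    ("natural_product", "plogp"): "np_plogp_model.pkl",
}


def checkpoint_path(data=None, objective=None, root="moldata/checkpoint"):
    """Return the pre-trained checkpoint, or a fine-tuned one if a task is given."""
    if data is None and objective is None:
        return Path(root) / "molgen.pkl"
    try:
        return Path(root) / FINE_TUNED[(data, objective)]
    except KeyError:
        raise ValueError(
            f"no fine-tuned checkpoint for {(data, objective)!r}"
        ) from None
```

For example, `checkpoint_path()` points at the pre-trained model, while `checkpoint_path("natural_product", "qed")` points at the model fine-tuned for QED optimization on natural product data.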

Citation

If you use or extend our work, please cite the paper as follows:

@article{fang2023molecular,
  title={Molecular Language Model as Multi-task Generator},
  author={Fang, Yin and Zhang, Ningyu and Chen, Zhuo and Fan, Xiaohui and Chen, Huajun},
  journal={arXiv preprint arXiv:2301.11259},
  year={2023}
}
