Code for the paper "Molecular Language Model as Multi-task Generator".
- ❗NOTE: We provide an NLP for Science paper list at https://github.com/zjunlp/NLP4Science_Papers.
- ❗NOTE: We release our pre-trained model on Hugging Face.
To run the code, first install the requirements:
pip install -r requirements.txt
You can download the pre-trained model via this link1, and the fine-tuned models via this link2.
Moreover, the dataset used for downstream tasks can be found here.
The expected structure of files is:
moldata
├── checkpoint
│ ├── molgen.pkl # pre-trained model
│ ├── syn_qed_model.pkl # fine-tuned model for QED optimization on synthetic data
│ ├── syn_plogp_model.pkl # fine-tuned model for p-logP optimization on synthetic data
│ ├── np_qed_model.pkl # fine-tuned model for QED optimization on natural product data
│ └── np_plogp_model.pkl # fine-tuned model for p-logP optimization on natural product data
├── finetune
│ ├── np_test.csv # natural product test data
│ ├── np_train.csv # natural product train data
│ ├── plogp_test.csv # synthetic test data for plogp optimization
│ ├── qed_test.csv # synthetic test data for QED optimization
│ └── zinc250k.csv # synthetic train data
├── generate # generate molecules
├── output # molecule candidates
└── vocab_list
    └── zinc.npy # SELFIES alphabet
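Before running any script, it can help to confirm that the downloaded files are where the layout above expects them. The following is a minimal sketch, not part of the repository: the helper name `missing_entries` is hypothetical, and it assumes that `moldata` sits in the current working directory.

```python
from pathlib import Path

# Relative paths copied from the layout listed above.
EXPECTED_FILES = [
    "checkpoint/molgen.pkl",
    "checkpoint/syn_qed_model.pkl",
    "checkpoint/syn_plogp_model.pkl",
    "checkpoint/np_qed_model.pkl",
    "checkpoint/np_plogp_model.pkl",
    "finetune/np_test.csv",
    "finetune/np_train.csv",
    "finetune/plogp_test.csv",
    "finetune/qed_test.csv",
    "finetune/zinc250k.csv",
    "vocab_list/zinc.npy",
]
EXPECTED_DIRS = ["generate", "output"]


def missing_entries(root):
    """Hypothetical helper: list expected files and folders absent under root."""
    root = Path(root)
    missing = [p for p in EXPECTED_FILES if not (root / p).is_file()]
    missing += [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
    return missing


if __name__ == "__main__":
    # Assumption: "moldata" is relative to the current working directory.
    for entry in missing_entries("moldata"):
        print(f"missing: {entry}")
```

If you plan to fine-tune the model yourself, the four fine-tuned checkpoints will be reported as missing until fine-tuning has finished, which is expected.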
- First, preprocess the fine-tuning dataset by generating candidate molecules with our pre-trained model. The preprocessed data will be stored in the output folder.
cd MolGen
bash preprocess.sh
- Then, perform multi-task prefix tuning in combination with the self-feedback paradigm. The fine-tuned model will be stored in the checkpoint folder.
bash finetune.sh
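The two steps must run in order, because fine-tuning reads the candidates written by preprocessing. If you prefer to drive both from Python, a minimal sketch might look like the one below. The function names are hypothetical, and the sketch assumes both scripts are launched from inside the MolGen directory and take no arguments, as in the commands above.

```python
import subprocess

# Script names come from the two steps above.
STEPS = ["preprocess.sh", "finetune.sh"]


def pipeline_commands():
    """Return the shell commands for each step, in execution order."""
    return [["bash", script] for script in STEPS]


def run_pipeline(repo_dir="MolGen", dry_run=False):
    """Run each step inside repo_dir, stopping at the first failure."""
    for cmd in pipeline_commands():
        print("running:", " ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code,
            # so fine-tuning never starts if preprocessing failed.
            subprocess.run(cmd, cwd=repo_dir, check=True)
```

Calling `run_pipeline(dry_run=True)` only prints the commands, which is a cheap way to confirm the order before starting a long run.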
To generate molecules, run the script below. Please specify checkpoint_path to choose between the pre-trained model and a fine-tuned model.
cd MolGen
bash generate.sh
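As an illustration of choosing a value for checkpoint_path, the sketch below maps a setting to one of the checkpoint files named in the layout above. The helper `select_checkpoint`, its keys (`"synthetic"`, `"natural_product"`, `"qed"`, `"plogp"`), and the default root `moldata/checkpoint` are assumptions made for this example; how the chosen path is passed to generate.sh is not covered here.

```python
from pathlib import Path

# File names copied from the checkpoint folder in the layout above.
# The dictionary keys are hypothetical labels chosen for this sketch.
FINETUNED = {
    ("synthetic", "qed"): "syn_qed_model.pkl",
    ("synthetic", "plogp"): "syn_plogp_model.pkl",
    ("natural_product", "qed"): "np_qed_model.pkl",
    ("natural_product", "plogp"): "np_plogp_model.pkl",
}
PRETRAINED = "molgen.pkl"


def select_checkpoint(domain=None, target=None, root="moldata/checkpoint"):
    """Hypothetical helper: return a checkpoint path for the given setting.

    With no domain and no target, the pre-trained model is returned.
    """
    if domain is None and target is None:
        return Path(root) / PRETRAINED
    try:
        return Path(root) / FINETUNED[(domain, target)]
    except KeyError:
        raise ValueError(
            f"no fine-tuned checkpoint for {(domain, target)!r}"
        ) from None
```

For example, `select_checkpoint("natural_product", "qed")` points at np_qed_model.pkl, while `select_checkpoint()` points at the pre-trained molgen.pkl.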
If you use or extend our work, please cite the paper as follows:
@article{fang2023molecular,
title={Molecular Language Model as Multi-task Generator},
author={Fang, Yin and Zhang, Ningyu and Chen, Zhuo and Fan, Xiaohui and Chen, Huajun},
journal={arXiv preprint arXiv:2301.11259},
year={2023}
}