instructplm's Introduction

InstructPLM

Design protein sequences following structure instructions. Read the InstructPLM paper.

Setup

We recommend using Docker for a quick start. You can launch an instance of InstructPLM with the following commands:

docker pull jundesiat/instructplm:mpnn-progen2-xlarge
docker run --gpus all -it -v /path/to/input_output:/workspace/ jundesiat/instructplm:mpnn-progen2-xlarge
cd /root/InstructPLM

Alternatively, you can run InstructPLM from source. Clone this repo and install the dependencies:

git clone --recurse-submodules https://github.com/Eikor/InstructPLM.git
cd InstructPLM
pip install -r requirements.txt

Usage

Code organization:

Important

Make sure you have obtained structure embeddings before running InstructPLM. You can construct preprocessed structure embeddings by running python structure_embeddings/preprocess.py. This script processes the protein PDB files stored in pdbs/ and saves the results in structure_embeddings/.

Protein Design

For protein design, run python run_generate.py --total 10 --save_suffix test. This script automatically reads the embeddings in structure_embeddings/ and saves the results at the path specified by --save_prefix. To generate fixed-length proteins, set --fix_length=True.

Tip

Large language models sometimes suffer from hallucinations, and so do pLMs 🤔. You may need to generate a large set of candidates and apply a selection policy (e.g., TM-score, DEDAL, etc.) to get better results.

InstructPLM requires a GPU with more than 24GB of VRAM to run. If you encounter an OOM issue, try reducing --num_return_sequences.
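
The candidate-selection idea in the tip above can be sketched as a generic "score everything, keep the best k" step. This is a minimal illustration, not part of InstructPLM: `select_top_k` and `toy_score` are hypothetical names, and `toy_score` is only a stand-in for a real metric such as a TM-score computed on predicted structures.

```python
from typing import Callable, List, Tuple


def select_top_k(
    candidates: List[str],
    score_fn: Callable[[str], float],
    k: int,
) -> List[Tuple[str, float]]:
    """Score every candidate and return the k best as (sequence, score) pairs."""
    if k < 0:
        raise ValueError("k must be non-negative")
    scored = [(seq, score_fn(seq)) for seq in candidates]
    # Higher score is assumed to be better; ties keep their input order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def toy_score(seq: str) -> float:
    """Placeholder scorer (assumption): fraction of residues that are not 'X'.

    Replace this with a real selection metric in practice.
    """
    if not seq:
        return 0.0
    return sum(residue != "X" for residue in seq) / len(seq)


if __name__ == "__main__":
    pool = ["ACXX", "ACDE", "AXDE"]
    for seq, score in select_top_k(pool, toy_score, k=2):
        print(seq, score)
```

Keeping the scorer as a plain callable makes it straightforward to swap in whichever metric suits the design target.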

Recovery Rate

recovery_rate.py gives an example of calculating the recovery rate of generated sequences.

1. Calculate the recovery rate of pre-generated sequences by specifying the --sequence_path and --sequence_suffix arguments. The script reads sequence files organized as follows:

sequences_path
   ├── seq1_suffix.fasta
   ...
   ├── seqN_suffix.fasta
structure_embeddings
   ├── ref1.pyd
   ...
   └── refN.pyd

2. Generate sequences and calculate the recovery rate using pre-defined parameters. Set --generate to True and pass an empty sequence path.

python recovery_rate.py --sequence_path recovery_res/ --generate

Note

Recovery rate only supports protein sequences generated with a fixed length. Different seeds can produce different results.
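
For orientation, recovery rate is commonly defined as the fraction of positions at which a generated sequence matches the native one, which is why both sequences need the same length. The sketch below uses that common definition; it is an assumption that recovery_rate.py computes the same quantity, and the function names here are hypothetical.

```python
from typing import Iterable


def recovery_rate(generated: str, native: str) -> float:
    """Fraction of positions where the generated residue equals the native one."""
    if len(generated) != len(native):
        raise ValueError("recovery rate requires sequences of equal length")
    if not native:
        raise ValueError("sequences must not be empty")
    matches = sum(g == n for g, n in zip(generated, native))
    return matches / len(native)


def mean_recovery_rate(generated_seqs: Iterable[str], native: str) -> float:
    """Average recovery rate of several generated sequences against one native."""
    rates = [recovery_rate(seq, native) for seq in generated_seqs]
    if not rates:
        raise ValueError("no generated sequences given")
    return sum(rates) / len(rates)


if __name__ == "__main__":
    native = "ACDF"
    print(recovery_rate("ACDE", native))
    print(mean_recovery_rate(["ACDE", "ACDF"], native))
```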

Results

InstructPLM achieves new SOTA performance on the CATH 4.2 test set:

Acknowledgments

Please cite our paper:

@article {Qiu2024.04.17.589642,
 author = {Jiezhong Qiu and Junde Xu and Jie Hu and Hanqun Cao and Liya Hou and Zijun Gao and Xinyi Zhou and Anni Li and Xiujuan Li and Bin Cui and Fei Yang and Shuang Peng and Ning Sun and Fangyu Wang and Aimin Pan and Jie Tang and Jieping Ye and Junyang Lin and Jin Tang and Xingxu Huang and Pheng Ann Heng and Guangyong Chen},
 title = {InstructPLM: Aligning Protein Language Models to Follow Protein Structure Instructions},
 elocation-id = {2024.04.17.589642},
 year = {2024},
 doi = {10.1101/2024.04.17.589642},
 publisher = {Cold Spring Harbor Laboratory},
 URL = {https://www.biorxiv.org/content/early/2024/04/20/2024.04.17.589642},
 eprint = {https://www.biorxiv.org/content/early/2024/04/20/2024.04.17.589642.full.pdf},
 journal = {bioRxiv}
}


instructplm's Issues

RuntimeError: shape '[-1, 1152]' is invalid for input of size 49896

I'm getting the following RuntimeError when running python structure_embeddings/preprocess.py.

Traceback (most recent call last):
  File "structure_embeddings/preprocess.py", line 316, in <module>
    write_pyd()
  File "structure_embeddings/preprocess.py", line 312, in write_pyd
    record = process_mpnn_embedding_fn(record)
  File "structure_embeddings/preprocess.py", line 167, in process_mpnn_embedding_fn
    sample["mpnn_emb"] = mpnn_emb1.view(-1, 1152).cpu()
RuntimeError: shape '[-1, 1152]' is invalid for input of size 49896

The issue seems to stem from the dimensions of mpnn_emb1. For the file Fast-PETase.pdb, the dimension of mpnn_emb11 is torch.Size([1, 264, 21]). mpnn_emb1 concatenates the output of 9 models, resulting in a dimension of torch.Size([1, 264, 189]). The error arises when executing sample["mpnn_emb"] = mpnn_emb1.view(-1, 1152).cpu(), because the total number of elements (264 × 189 = 49896) is not an integer multiple of 1152.
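
The arithmetic behind this error can be checked without PyTorch: a view to shape (-1, width) only works when the total element count is divisible by width. The helper below is a hypothetical illustration of that check, using the shapes reported in this issue; it is not code from the repository.

```python
from math import prod
from typing import Sequence


def can_view_as_rows(shape: Sequence[int], row_width: int) -> bool:
    """Return True if a tensor of `shape` could be reshaped to (-1, row_width)."""
    if row_width <= 0:
        raise ValueError("row_width must be positive")
    return prod(shape) % row_width == 0


if __name__ == "__main__":
    reported_shape = (1, 264, 189)  # shape of mpnn_emb1 reported in the issue
    total = prod(reported_shape)
    print(total, total % 1152, can_view_as_rows(reported_shape, 1152))
```

A shape whose last dimension is already 1152, such as (1, 264, 1152), passes the same check.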
