
ontoprotein's Introduction

OntoProtein

This is the implementation of the ICLR 2022 paper "OntoProtein: Protein Pretraining With Ontology Embedding". OntoProtein is an effective method that incorporates the structure of GO (Gene Ontology) into a text-enhanced protein pre-training model.

Quick links

Overview

In this work we present OntoProtein, a knowledge-enhanced protein language model that jointly optimizes the KE and MLM objectives, bringing excellent improvements to a wide range of protein tasks. We also introduce ProteinKG25, a new large-scale KG dataset, to promote research on protein language pre-training.

Requirements

To run our code, please install dependency packages for related steps.

Environment for pre-training data generation

python3.8 / biopython 1.37 / goatools

For extracting the definition of each GO term, we modified the code in the goatools library. The changes in goatools.obo_parser are as follows:

# line 132: parse the "def:" field so that each GO term keeps its textual definition
elif line[:5] == "def: ":
    rec_curr.definition = line[5:]

# line 169: initialize the definition attribute on the GO term record
self.definition = ""
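
As a quick check that the patch took effect (a minimal sketch; it assumes the modified goatools is installed and that go.obo has been downloaded as described in the data-preparation section below), the added attribute can be read through the standard parser:

from goatools.obo_parser import GODag

godag = GODag("go.obo")      # load the ontology structure
term = godag["GO:0008150"]   # the biological_process root term
print(term.name)
print(term.definition)       # attribute added by the patch above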

Environment for OntoProtein pre-training

python3.8 / pytorch 1.9 / transformers 4.5.1+ / deepspeed 0.5.1 / lmdb

Environment for protein-related tasks

python3.8 / pytorch 1.9 / transformers 4.5.1+ / lmdb / tape_proteins

Specifically, the tape_proteins library only implements the P@L metric for the contact prediction task. To report P@K metrics for different values of K, where P@K is the precision of the top K predicted contacts, we made some changes to the library. Detailed changes can be seen in [issue #8].
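
For reference, the sketch below illustrates the P@K idea only (it is not the tape_proteins implementation and ignores details such as sequence-separation filtering): precision over the K highest-scoring candidate contacts.

import numpy as np

def precision_at_k(contact_scores, contact_labels, k):
    # contact_scores: predicted scores for candidate residue pairs
    # contact_labels: 1 if the pair is a true contact, 0 otherwise
    top_k = np.argsort(contact_scores)[::-1][:k]
    return float(np.mean(np.asarray(contact_labels)[top_k]))

# e.g. P@L uses k = L (sequence length), P@L/2 uses k = L // 2, and so on.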

Note: for the environment configuration of some baseline models and methods used in our experiments, e.g. BLAST and DeepGraphGO, we provide the related links below:

BLAST / Interproscan / DeepGraphGO / GNN-PPI

Data preparation

For pre-training OntoProtein, fine-tuning on protein-related tasks, and inference, we describe below how to acquire the related data.

Pre-training data

To incorporate Gene Ontology knowledge into language models and train OntoProtein, we construct ProteinKG25, a large-scale KG dataset with descriptions and protein sequences aligned to GO terms and protein entities, respectively. There are two ways to acquire the pre-training data: 1) download our prepared ProteinKG25, or 2) generate your own pre-training data.

Download released data

We have released our prepared data ProteinKG25 in Google Drive.

The compressed package includes the following files:

  • go_def.txt: GO term definitions (text data). We concatenate each GO term name and its definition with a colon.
  • go_type.txt: The ontology type to which each GO term belongs. The index corresponds to the GO ID in the go2id.txt file.
  • go2id.txt: The ID mapping of GO terms.
  • go_go_triplet.txt: GO-GO triplet data. These triplets constitute the interior structure of the Gene Ontology. The data format is <h r t>, where h and t are respectively the head and tail entities, both GO term nodes, and r is the relation between the two GO terms, e.g. is_a and part_of.
  • protein_seq.txt: Protein sequence data. The protein sequences are used as inputs to the MLM module and as protein representations in the KE module.
  • protein2id.txt: The ID mapping of proteins.
  • protein_go_train_triplet.txt: Protein-GO triplet data. These triplets constitute the exterior structure of the Gene Ontology, i.e. gene annotations. The data format is <h r t>, where h and t are respectively the head and tail entities. Unlike a GO-GO triplet, each Protein-GO triplet represents a specific gene annotation: the head entity is a specific protein, the tail entity is the corresponding GO term (e.g. the protein binding function), and r is the relation between the protein and the GO term.
  • relation2id.txt: The ID mapping of relations. Relations from both triplet types share this single mapping (a minimal loading sketch for these files follows this list).
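
The sketch below shows one way these files could be loaded; it is illustrative only and assumes whitespace-separated fields with one record per line, matching the <h r t> format described above (file names follow the released package).

def load_id_map(path):
    # go2id.txt / protein2id.txt / relation2id.txt: "<name> <id>" per line (assumed)
    mapping = {}
    with open(path) as f:
        for line in f:
            name, idx = line.split()
            mapping[name] = int(idx)
    return mapping

def load_triplets(path):
    # go_go_triplet.txt / protein_go_train_triplet.txt: "<h> <r> <t>" per line
    with open(path) as f:
        return [tuple(line.split()) for line in f if line.strip()]

go2id = load_id_map("go2id.txt")
protein_go = load_triplets("protein_go_train_triplet.txt")
print(len(go2id), "GO terms,", len(protein_go), "protein-GO triplets")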

Generate your own pre-training data

For generating your own pre-training data, you need to download the following raw data:

  • go.obo: the structure data of the Gene Ontology. See the Gene Ontology website for the download link and detailed format.
  • uniprot_sprot.dat: protein Swiss-Prot database. [link]
  • goa_uniprot_all.gaf: Gene Annotation data. [link]

Once these raw data are downloaded, you can execute the following script to generate the pre-training data:

python tools/gen_onto_protein_data.py

Downstream task data

Our experiments involve several protein-related downstream tasks. [Download datasets]

Protein pre-training model

You can pre-train your own OntoProtein on the pre-training dataset described above. Before pre-training OntoProtein, you need to download two pretrained models, ProtBERT and PubMedBERT, and save them in data/model_data/ProtBERT and data/model_data/PubMedBERT respectively. We provide the script bash script/run_pretrain.sh to run pre-training. The detailed arguments are all listed in src/training_args.py, so you can set the pre-training hyperparameters to your needs.

Usage for protein-related tasks

We have released the checkpoint of the pretrained model on the Hugging Face model hub. [Download model].
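
A minimal usage sketch for the released checkpoint is shown below; the model id zjunlp/OntoProtein follows the Hugging Face link referenced in the issues further down, and the space-separated amino-acid input follows the ProtBert convention the protein encoder is based on.

from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("zjunlp/OntoProtein", do_lower_case=False)
model = BertModel.from_pretrained("zjunlp/OntoProtein")

sequence = "M K T A Y I A K Q R"   # amino acids separated by spaces
inputs = tokenizer(sequence, return_tensors="pt")
hidden_states = model(**inputs).last_hidden_state   # (1, seq_len + 2, hidden_size)
print(hidden_states.shape)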

Running examples

The shell files for training and evaluation of every task are provided in script/ and can be run directly. Alternatively, you can use the training code run_downstream.py and write your own shell files according to your needs:

  • run_downstream.py: support {ss3, ss8, contact, remote_homology, fluorescence, stability} tasks;

Training models

Run the shell files with bash script/run_{task}.sh; the contents of the shell files are as follows:

bash run_main.sh \
    --model model_data/ProtBertModel \
    --output_file ss3-ProtBert \
    --task_name ss3 \
    --do_train True \
    --epoch 5 \
    --optimizer AdamW \
    --per_device_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --eval_step 100 \
    --eval_batchsize 4 \
    --warmup_ratio 0.08 \
    --frozen_bert False

Arguments for the training and evaluation script are as follows:

  • --task_name: Specify which task to evaluate on, and now the script supports {ss3, ss8, contact, remote_homology, fluorescence, stability} tasks;
  • --model: The name or path of a protein pre-trained checkpoint.
  • --output_file: The path where the fine-tuned checkpoint is saved.
  • --do_train: Specify if you want to finetune the pretrained model on downstream tasks.
  • --epoch: Epochs for training model.
  • --optimizer: The optimizer to use, e.g., AdamW.
  • --per_device_batch_size: Batch size per GPU.
  • --gradient_accumulation_steps: The number of gradient accumulation steps.
  • --warmup_ratio: Ratio of total training steps used for a linear warmup from 0 to learning_rate.
  • --frozen_bert: Specify if you want to freeze the encoder in the pretrained model.

Additionally, you can set more detailed parameters in run_main.sh.
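
For reference, under the standard Hugging Face Trainer semantics the example flags above imply an effective batch size of per_device_batch_size × gradient_accumulation_steps × number of GPUs (the single-GPU count below is our assumption, not something fixed by the script):

per_device_batch_size = 2
gradient_accumulation_steps = 8
n_gpus = 1   # assumption for illustration
effective_batch_size = per_device_batch_size * gradient_accumulation_steps * n_gpus
print(effective_batch_size)   # 16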

Notice: the best checkpoint is saved in OUTPUT_DIR/.

How to Cite

@inproceedings{
zhang2022ontoprotein,
title={OntoProtein: Protein Pretraining With Gene Ontology Embedding},
author={Ningyu Zhang and Zhen Bi and Xiaozhuan Liang and Siyuan Cheng and Haosen Hong and Shumin Deng and Qiang Zhang and Jiazhang Lian and Huajun Chen},
booktitle={International Conference on Learning Representations},
year={2022},
url={https://openreview.net/forum?id=yfe1VMYAXa4}
}

ontoprotein's People

Contributors

alexzhuan, cheng-siyuan, zxlzr


ontoprotein's Issues

Rationale for choosing this loss function

Regarding your KE loss function, could you kindly provide some intuitions on why this specific loss function was chosen (given there are so many metric learning losses on KG)? A few relevant pieces of literature that you referenced would be appreciated.

Question: What is the evaluation metric for protein function classification?

Hello,
I want to know what evaluation metric is used for protein function classification.
I scrutinized the paper and code, but could not find how it was evaluated.
In the paper I found the description "4% improvement with transductive setting ..." but still cannot understand what the percentage refers to.
I'd appreciate your reply.

ImportError in deepspeed.py

Hi,

Thanks for open-sourcing this really cool model.

I'm trying to play around with pretraining it myself, but I run into this ImportError when I run the run_pretrain.sh script. I would greatly appreciate any guidance, thanks!

(onto_env) tomcobley@compute-g-17-147:~/OntoProtein $ bash script/run_pretrain.sh 

[2022-11-26 21:06:55,053] [WARNING] [runner.py:179:fetch_hostfile] Unable to find hostfile, will proceed with training with loc
al resources only.
[2022-11-26 21:06:56,224] [INFO] [runner.py:508:main] cmd = /home/tomcobley/.conda/envs/onto_env/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 run_pretrain.py --do_train --output_dir data/output_data/filtered_ke_text --pretrain_data_dir path/todata/ProteinKG25 --protein_seq_data_file_name swiss_seq --in_memory true --max_protein_seq_length 1024 --model_protein_seq_data true --model_protein_go_data true --model_go_go_data true --use_desc true --max_text_seq_length 128 --dataloader_protein_go_num_workers 1 --dataloader_go_go_num_workers 1 --dataloader_protein_seq_num_workers 1 --num_protein_go_neg_sample 128 --num_go_go_neg_sample 128 --negative_sampling_fn simple_random --protein_go_sample_head false --protein_go_sample_tail true --go_go_sample_head true --go_go_sample_tail true --protein_model_file_name data/model_data/ProtBERT --text_model_file_name data/model_data/PubMedBERT --go_encoder_cls bert --protein_encoder_cls bert --ke_embedding_size 512 --double_entity_embedding_size false --max_steps 60000 --per_device_train_batch_size 4 --weight_decay 0.01 --optimize_memory true --gradient_accumulation_steps 256 --lr_scheduler_type linear --mlm_lambda 1.0 --lm_learning_rate 1e-5 --lm_warmup_steps 50000 --ke_warmup_steps 50000 --ke_lambda 1.0 --ke_learning_rate 2e-5 --ke_max_score 12.0 --ke_score_fn transE --ke_warmup_ratio --seed 2021 --deepspeed dp_config.json --fp16 --dataloader_pin_memory


Traceback (most recent call last):
  File "/home/tomcobley/.conda/envs/onto_env/lib/python3.8/runpy.py", line 185, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "/home/tomcobley/.conda/envs/onto_env/lib/python3.8/runpy.py", line 111, in _get_module_details
    __import__(pkg_name)
  File "/home/tomcobley/OntoProtein/deepspeed.py", line 25, in <module>
    from .dependency_versions_check import dep_version_check
ImportError: attempted relative import with no known parent package

Trainer problem for pre-training

Hello, researchers
Thanks for your research. I have some problems with the pre-training phases.

  1. Does the pre-training code only support training by steps? When I replace the parameter "--max_steps" with "--num_train_epochs", I get an exception, so I am not sure whether trainer.py supports training by epochs.
  2. If so, I got the following results per step; is the loss at each step meaningful? Another question: why does the "mlm" loss keep oscillating after a certain number of steps? Could you give me any advice about this situation?
    {'mlm': 1.3232421875, 'protein_go_ke': 0.66796875, 'go_go_ke': 1.9619140625, 'global_step': 180, 'learning_rate': [9.965326633165831e-06, 9.965326633165831e-06, 1.9930653266331662e-05]} {'mlm': 0.77783203125, 'protein_go_ke': 0.66650390625, 'go_go_ke': 1.857421875, 'global_step': 181, 'learning_rate': [9.964824120603016e-06, 9.964824120603016e-06, 1.9929648241206033e-05]} {'mlm': 0.7373046875, 'protein_go_ke': 0.64111328125, 'go_go_ke': 1.984375, 'global_step': 182, 'learning_rate': [9.964321608040202e-06, 9.964321608040202e-06, 1.9928643216080404e-05]} {'mlm': 0.447509765625, 'protein_go_ke': 2.140625, 'go_go_ke': 2.029296875, 'global_step': 183, 'learning_rate': [9.963819095477387e-06, 9.963819095477387e-06, 1.9927638190954775e-05]} {'mlm': 1.3056640625, 'protein_go_ke': 0.64990234375, 'go_go_ke': 1.91015625, 'global_step': 184, 'learning_rate': [9.963316582914575e-06, 9.963316582914575e-06, 1.992663316582915e-05]} {'mlm': 2.1015625, 'protein_go_ke': 0.6806640625, 'go_go_ke': 1.8505859375, 'global_step': 185, 'learning_rate': [9.96281407035176e-06, 9.96281407035176e-06, 1.992562814070352e-05]} {'mlm': 1.146484375, 'protein_go_ke': 0.6494140625, 'go_go_ke': 1.9150390625, 'global_step': 186, 'learning_rate': [9.962311557788946e-06, 9.962311557788946e-06, 1.992462311557789e-05]} {'mlm': 1.3505859375, 'protein_go_ke': 0.666015625, 'go_go_ke': 1.8994140625, 'global_step': 187, 'learning_rate': [9.961809045226131e-06, 9.961809045226131e-06, 1.9923618090452263e-05]} {'mlm': 1.359375, 'protein_go_ke': 2.775390625, 'go_go_ke': 1.8330078125, 'global_step': 188, 'learning_rate': [9.961306532663317e-06, 9.961306532663317e-06, 1.9922613065326634e-05]} {'mlm': 1.0927734375, 'protein_go_ke': 0.65087890625, 'go_go_ke': 1.8271484375, 'global_step': 189, 'learning_rate': [9.960804020100502e-06, 9.960804020100502e-06, 1.9921608040201005e-05]} [2022-10-13 09:32:21,562] [INFO] [logging.py:68:log_dist] [Rank 0] step=190, skipped=11, lr=[9.96030150753769e-06, 9.96030150753769e-06, 1.992060301507538e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] [2022-10-13 09:32:21,992] [INFO] [timer.py:157:stop] 0/190, SamplesPerSec=4.138257351966589 {'mlm': 1.7353515625, 'protein_go_ke': 0.6669921875, 'go_go_ke': 1.7998046875, 'global_step': 190, 'learning_rate': [9.96030150753769e-06, 9.96030150753769e-06, 1.992060301507538e-05]} {'mlm': 1.2763671875, 'protein_go_ke': 0.6787109375, 'go_go_ke': 1.8154296875, 'global_step': 191, 'learning_rate': [9.959798994974875e-06, 9.959798994974875e-06, 1.991959798994975e-05]} {'mlm': 0.80712890625, 'protein_go_ke': 0.6708984375, 'go_go_ke': 1.876953125, 'global_step': 192, 'learning_rate': [9.95929648241206e-06, 9.95929648241206e-06, 1.991859296482412e-05]} {'mlm': 0.59716796875, 'protein_go_ke': 0.6787109375, 'go_go_ke': 1.7919921875, 'global_step': 193, 'learning_rate': [9.958793969849248e-06, 9.958793969849248e-06, 1.9917587939698496e-05]} {'mlm': 0.7734375, 'protein_go_ke': 0.6611328125, 'go_go_ke': 1.90625, 'global_step': 194, 'learning_rate': [9.958291457286433e-06, 9.958291457286433e-06, 1.9916582914572867e-05]} {'mlm': 0.77587890625, 'protein_go_ke': 0.6865234375, 'go_go_ke': 1.76171875, 'global_step': 195, 'learning_rate': [9.957788944723619e-06, 9.957788944723619e-06, 1.9915577889447238e-05]} {'mlm': 0.89404296875, 'protein_go_ke': 0.6533203125, 'go_go_ke': 1.91015625, 'global_step': 196, 'learning_rate': [9.957286432160806e-06, 9.957286432160806e-06, 1.9914572864321612e-05]} {'mlm': 1.1416015625, 'protein_go_ke': 0.654296875, 'go_go_ke': 1.78125, 'global_step': 197, 
'learning_rate': [9.956783919597992e-06, 9.956783919597992e-06, 1.9913567839195983e-05]} {'mlm': 1.0224609375, 'protein_go_ke': 0.66162109375, 'go_go_ke': 1.7841796875, 'global_step': 198, 'learning_rate': [9.956281407035177e-06, 9.956281407035177e-06, 1.9912562814070354e-05]} {'mlm': 0.56005859375, 'protein_go_ke': 0.65966796875, 'go_go_ke': 1.806640625, 'global_step': 199, 'learning_rate': [9.955778894472363e-06, 9.955778894472363e-06, 1.9911557788944725e-05]} [2022-10-13 09:33:02,238] [INFO] [logging.py:68:log_dist] [Rank 0] step=200, skipped=11, lr=[9.955276381909548e-06, 9.955276381909548e-06, 1.9910552763819096e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] [2022-10-13 09:33:02,671] [INFO] [timer.py:157:stop] 0/200, SamplesPerSec=4.141814535981229

Best regards,
Xinghao

Graph knowledge creation

Hello,
I have been looking at this repo for the last couple of days.
I'm interested in knowledge graph generation from the GO features.
How did you manage to create the graph?
Is there any snippet of code or detailed documentation?
Thank you!

Missing pre-training data

Hello, I am trying to run the pre-training code, but I found that the ProteinKG25 you provide does not contain the swiss_seq folder or the mdb data files under it, and these data are used later in pre-training. Can this part of the data be generated from ProteinKG25, or can it only be generated from the raw data?

ontoProtein pretrained model

Hello, I want to use OntoProtein to compute protein embeddings. I downloaded the model from https://huggingface.co/zjukg/OntoProtein/tree/main
and saved it locally, but the embeddings computed for different proteins are identical. Is this normal?

  • Files downloaded locally
    (four files: config.json, pytorch_model.bin, tokenizer_config.json, vocab.txt)
  • Script used to compute the embeddings:
import logging
import os
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple
from torch.utils.data.dataloader import DataLoader
import yaml
import os
import numpy as np
import torch
from tqdm import tqdm
from transformers import (
    AutoConfig,
    AutoTokenizer,
    AutoModel,
)
import argparse
import torch
import pandas as pd
from tqdm import tqdm
tqdm.pandas()
logger = logging.getLogger(__name__)
device = torch.device("cuda:1" if torch.cuda.is_available() else "cpu")
model_name_or_path = '/data/wenyuhao/55/model/ontology'
config = AutoConfig.from_pretrained(model_name_or_path,)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path,use_fast=False,)
model = AutoModel.from_pretrained(model_name_or_path,config=config,).to(device)
def getArray(seq):
    input_ids = torch.tensor(tokenizer.encode(seq)).unsqueeze(0).to(device)  # Batch size 1
    with  torch.no_grad():
        outputs = model(input_ids)
    return outputs[1].cpu().numpy()
  • Result:
In [14]: a = getArray('VFYLKMKGDYYRYLAEVASGEKKNSVVEASEAAYKEAFEISKEQMQPTHPIRLGLALNFS')
In [15]: b = getArray('YYKMKGDYHRYLAEFATGNDRKEAAENSLVAYKAASDIAMTELPPTHPIRLGLALNFSVF')
In [16]: a
Out[16]: 
array([[-0.11852779,  0.1262154 , -0.11203501, ...,  0.11941278,
         0.11056887, -0.12232994]], dtype=float32)
In [17]: b
Out[17]: 
array([[-0.11852779,  0.1262154 , -0.11203501, ...,  0.11941278,
         0.11056887, -0.12232994]], dtype=float32)
In [18]: Counter(a[0]==b[0])
Out[18]: Counter({True: 1024})

I computed embeddings for all Swiss-Prot proteins and found that they are all identical.

In [29]: s
Out[29]: 
array([[-0.11852774,  0.12621534, -0.11203495, ...,  0.11941272,
         0.11056883, -0.12232988],
       [-0.11852774,  0.12621534, -0.11203495, ...,  0.11941272,
         0.11056883, -0.12232988],
       [-0.11852774,  0.12621534, -0.11203495, ...,  0.11941272,
         0.11056883, -0.12232988],
       ...,
       [-0.11852774,  0.12621534, -0.11203495, ...,  0.11941272,
         0.11056883, -0.12232988],
       [-0.11852774,  0.12621534, -0.11203495, ...,  0.11941272,
         0.11056883, -0.12232988],
       [-0.11852774,  0.12621534, -0.11203495, ...,  0.11941272,
         0.11056883, -0.12232988]], dtype=float32)

In [30]: s.shape
Out[30]: (20083, 1024)

In [31]: (s==s).all()
Out[31]: True

No module named 'transformers.deepspeed'

When I tried to run the sample command sh run_main.sh ......

Traceback (most recent call last):
  File "run_downstream.py", line 8, in <module>
    from src.models import model_mapping, load_adam_optimizer_and_scheduler
  File "/mnt/SSD2/pmtnet_proj/code/github/OntoProtein/src/models.py", line 16, in <module>
    from transformers.deepspeed import is_deepspeed_zero3_enabled
ModuleNotFoundError: No module named 'transformers.deepspeed'

Issue in creating an environment for OntoProtein pretraining

Hello Researchers,
I am running into bugs when installing deepspeed version 0.5.1. I have already installed python 3.8.13, pytorch 1.12.0 with torchvision 0.13.0, torchaudio 0.12.0, cudatoolkit 11.3.1, transformers 4.9.2, and lmdb 1.3.0. But when I install deepspeed 0.5.1, the dependencies are not installed correctly. Can you please tell me the exact versions you used for pytorch, python, and deepspeed?
Below is the error which I found:
Traceback (most recent call last):
File "", line 1, in
File "//mnt/user1/.conda/envs/pretraining/lib/python3.8/site-packages/deepspeed/init.py", line 15, in
from .runtime.engine import DeepSpeedEngine, DeepSpeedOptimizerCallable, DeepSpeedSchedulerCallable
File "//mnt/user1/.conda/envs/pretraining/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 20, in
from tensorboardX import SummaryWriter
File "//mnt/user1/.conda/envs/pretraining/lib/python3.8/site-packages/tensorboardX/init.py", line 5, in
from .torchvis import TorchVis
File "//mnt/user1/.conda/envs/pretraining/lib/python3.8/site-packages/tensorboardX/torchvis.py", line 11, in
from .writer import SummaryWriter
File "//mnt/user1/.conda/envs/pretraining/lib/python3.8/site-packages/tensorboardX/writer.py", line 15, in
from .event_file_writer import EventFileWriter
File "//mnt/user1/.conda/envs/pretraining/lib/python3.8/site-packages/tensorboardX/event_file_writer.py", line 28, in
from .proto import event_pb2
File "//mnt/user1/.conda/envs/pretraining/lib/python3.8/site-packages/tensorboardX/proto/event_pb2.py", line 15, in
from tensorboardX.proto import summary_pb2 as tensorboardX_dot_proto_dot_summary__pb2
File "//mnt/user1/.conda/envs/pretraining/lib/python3.8/site-packages/tensorboardX/proto/summary_pb2.py", line 15, in
from tensorboardX.proto import tensor_pb2 as tensorboardX_dot_proto_dot_tensor__pb2
File "//mnt/user1/.conda/envs/pretraining/lib/python3.8/site-packages/tensorboardX/proto/tensor_pb2.py", line 15, in
from tensorboardX.proto import resource_handle_pb2 as tensorboardX_dot_proto_dot_resource__handle__pb2
File "//mnt/user1/envs/pretraining/lib/python3.8/site-packages/tensorboardX/proto/resource_handle_pb2.py", line 35, in
_descriptor.FieldDescriptor(
File "//mnt/user1/.conda/envs/pretraining/lib/python3.8/site-packages/google/protobuf/descriptor.py", line 560, in new
_message.Message._CheckCalledFromGeneratedFile()
TypeError: Descriptors cannot not be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:

  1. Downgrade the protobuf package to 3.20.x or lower.
  2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).
    Can you tell the exact versions which you have used?

Problem with the goatools package

Hi, when running the create_go_data part of gen_onto_protein_data.py, did you run into problems like the following?
With goatools 1.2.3, go_term.definition raises an error: there is no .definition attribute.
With goatools 1.0.11, I get RecursionError: maximum recursion depth exceeded while calling a Python object.

Running Code in Google Colab

Hi,

I am really interested in your work on OntoProtein. Can your code run on Google Colab? It would be helpful if you had a Google Colab tutorial for running your code.

Computational Resources and Time

Can you recommend the computational resources needed to run one of the downstream tasks, such as run_contact with fine-tuning of the model (i.e. do_train = True): the suggested number of cores and amount of memory, and how long it is expected to take? Also, what resources were used in your experiments, and how long did they take?

I am trying to run the protein contact prediction task on 16 cores and 120 GB of memory, with an estimated week required to get the results, but the process keeps getting killed due to insufficient memory.

Exploding KE loss when training a new `protein-go` set

Hi! I was trying to re-train your model with a new set of protein-GO relations, and it looks like the KE losses for positive and negative samples are exploding unexpectedly. After a few steps, the positive one increases to several tens of thousands and the negative one decreases to minus several tens of thousands. It seems they would continue to explode. Have you encountered this issue in your experiments? I'm thinking of confining the score to [0, 1] before inputting into the KE loss. Thanks in advance!

Are there benchmark results against ESM-1b?

Thanks for sharing the code of this very interesting paper! Do you have comparison results against ESM-1b on these tasks? Also, are the metrics reported on TAPE all fine-tuned (supervised) results rather than unsupervised ones?

[Confirmation] Optimal Hyperparameters and Reproducibility

Hi there,

Thanks for providing the nice codebase. I'm trying to reproduce the results for downstream tasks, and I have the following questions.

  • I'm wondering if the scripts under this folder are only samples? For the optimal hyperparameters for OntoProtein, we should follow Table 6 in the paper?
  • For ProtBert, are you using the same optimal hyperparameters for each downstream task?
  • Table 6 doesn't cover the optimal values for gradient_accumulation_steps and eval_step. Can you help clarify this?

Any help is appreciated.

run_contact.sh error

Hi,
I set up a fresh environment for running the script, and when I run [run_contact.sh] I get the following error in "contact-ontoprotein.out":

***** Running Prediction *****
Num examples = 40
Batch size = 1
Traceback (most recent call last):
File "run_downstream.py", line 286, in
main()
File "run_downstream.py", line 281, in main
predictions_family, input_ids_family, metrics_family = trainer.predict(test_dataset)
File "/home/sakher/miniconda3/envs/onto2/lib/python3.8/site-packages/transformers/trainer.py", line 2358, in predict
output = eval_loop(
File "/data3/sakher/onto2/OntoProtein/src/benchmark/trainer.py", line 217, in evaluation_loop
loss, logits, labels, prediction_score = self.prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys)
File "/data3/sakher/onto2/OntoProtein/src/benchmark/trainer.py", line 50, in prediction_step
prediction_score['precision_at_l2'] = logits[3]['precision_at_l2']
KeyError: 'precision_at_l2'

What is the difference between goa_uniprot_all.gaf and goa_uniprot_all.gat?

Hello, in your README you use goa_uniprot_all.gat, but create_goa_triplet in gen_onto_protein_data.py uses goa_uniprot_all.gaf to build the triplets, so I would like to confirm which file should be used.

What is the `OntoModel` referenced in `run_pretrain.sh`

Hi!

I am trying to pretrain the model using the datasets and pretrained models mentioned in the README and the run_pretrain.sh script.

However, I have run into a few problems which seem to be due to my choice of TEXT_MODEL_PATH in run_pretrain.sh on this line.

What should this model be?

I assumed this meant the PubMedBERT model mentioned in the README, but execution fails if I use this - am I missing something?

Many thanks in advance!

Backward Error during Pre-training

Dear authors:
Thank you for your inspiring work !
When we try to follow your work and run the following scripts for pre-training

python -m torch.distributed.launch --nproc_per_node=4 run_pretrain.py \
  --output_dir 'output' \
  --do_train \
  --in_memory $IN_MEMORY \
  --max_protein_seq_length $MAX_PROTEIN_SEQ_LENGTH \
  --model_protein_seq_data true \
  --model_protein_go_data true \
  --model_go_go_data true \
  --use_desc $USE_DESC \
  --max_text_seq_length $MAX_TEXT_SEQ_LENGTH \
  --dataloader_protein_go_num_workers $PROTEIN_GO_NUM_WORKERS \
  --dataloader_go_go_num_workers $GO_GO_NUM_WORKERS \
  --dataloader_protein_seq_num_workers $PROTEIN_SEQ_NUM_WORKERS \
  --num_protein_go_neg_sample $NUM_PROTEIN_GO_NEG_SAMPLE \
  --num_go_go_neg_sample $NUM_GO_GO_NEG_SAMPLE \
  --negative_sampling_fn $NEGTIVE_SAMPLING_FN \
  --protein_go_sample_head $PROTEIN_GO_SAMPLE_HEAD \
  --protein_go_sample_tail $PROTEIN_GO_SAMPLE_TAIL \
  --go_go_sample_head $GO_GO_SAMPLE_HEAD \
  --go_go_sample_tail $GO_GO_SAMPLE_TAIL \
  --protein_model_file_name $PROTEIN_MODEL_PATH \
  --text_model_file_name $TEXT_MODEL_PATH \
  --go_encoder_cls $GO_ENCODER_CLS \
  --protein_encoder_cls $PROTEIN_ENCODER_CLS \
  --ke_embedding_size $KE_EMBEDDING_SIZE \
  --double_entity_embedding_size $DOUBLE_ENTITY_EMBEDDING_SIZE \
  --max_steps $MAX_STEPS \
  --per_device_train_batch_size $BATCH_SIZE \
  --weight_decay $WEIGHT_DECAY \
  --optimize_memory $OPTIMIZE_MEMORY \
  --gradient_accumulation_steps $ACCUMULATION_STEPS \
  --lr_scheduler_type $SCHEDULER_TYPE \
  --mlm_lambda $MLM_LAMBDA \
  --lm_learning_rate $MLM_LEARNING_RATE \
  --lm_warmup_ratio $LM_WARMUP_RATIO \
  --ke_warmup_ratio $KE_WARMUP_RATIO \
  --ke_lambda $KE_LAMBDA \
  --ke_learning_rate $KE_LEARNING_RATE \
  --ke_max_score $KE_MAX_SCORE \
  --ke_score_fn $KE_SCORE_FN \
  --ke_warmup_ratio $KE_WARMUP_RATIO \
  --seed 2021 \
  --fp16 \
  --dataloader_pin_memory \

we get error :

RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop.2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
Parameter at index 486 with name protein_lm.bert.encoder.layer.29.output.LayerNorm.weight has been marked as ready twice. This means that multiple autograd engine  hooks have fired for this particular parameter during this iteration.RuntimeError

It seems that this error only appears when performing DDP training. We have also tried DeepSpeed, but an error occurs while loading the optimizer at line 168 in trainer.py.

Wonder how I can solve this. Thank you very much!

Question about computing resource and batch size

Hi,

Thanks for sharing the code. I noticed in your run_pretrain.sh, the batch size of protein-GO and protein MLM is 8, while the batch size of GO-GO is 64. Meanwhile, the num of negative samples for each positive sample is 128, or 256 for GO-GO.

(1) Does this mean in each GO-GO pass, at most (64*2+64*256) samples of length at most 128 are fed into the GO encoder (in one batch)?

(2) How many V100s did you use for this pretraining?

Also, I noticed that you didn't permutate proteins for protein-GO relations.

(3) Is this due to computing resource limit (i.e. 8*128 is just too large a number for proteins)?

(4) Did you experiment with a lower number of negative samples while considering such protein permutation?

Thanks in advance!

On GO prediction performance

Thanks again to the authors for sharing the code. The GO prediction numbers in the paper seem to show a somewhat different trend (in the changes across BP, MF, CC) from another recent paper (see Table 3 of that paper). Is this mainly because of a different GO cutoff level? Also, the reported CCO value seems to have one decimal place too many (if I am reading the numbers correctly).

the problem of biopython version?

Hello, the README lists biopython version 1.37, but the official pip packages no longer provide 1.37 (only versions below 1.30, 1.30 itself, and then 1.40 and above), so I installed the latest 1.78. Now, when I run the create_uniprot_data function in gen_onto_protein_data.py, I get the error shown in the attached screenshot. Is this a biopython version problem? If so, which version should I choose?

Go-Go relations are the same as Protein-Go relations?

Hello, researchers.

I have a question about your pre-training datasets. When I checked the list of pre-training files, I found only one relation-related file, relation2id.txt, and your code uses the same file for both GO-GO relations and Protein-GO relations. I am confused by this: are the relation vocabularies for Protein-GO and GO-GO really the same? If so, why are they the same, or how did you decide that the relations for Protein-GO and GO-GO should be shared?

Thanks.
Best regards,
Xinghao

Question: Is it possible to get the embeddings of the relations and entities separately using OntoProtein?

Hi,

I am curious whether it is possible to get the embeddings of proteins and relations in https://www.zjukg.org/project/ProteinKG25/ separately using the Hugging Face OntoProtein model (https://huggingface.co/zjunlp/OntoProtein). Let us say

Protein_1 Enables Protein_2 is a triple in the knowledge graph.

Is it possible to get the embeddings of Protein_1, Enables, and Protein_2 separately using https://huggingface.co/zjunlp/OntoProtein?

Error when running run_pretrain.sh

I set up the deepspeed environment and ran run_pretrain.sh, but got the following error:

File "run_pretrain.py", line 135, in
main()
File "run_pretrain.py", line 131, in main
trainer.train()
File "OntoProtein/src/trainer.py", line 167, in train
deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
File "anaconda3/envs/deepspeed/lib/python3.8/site-packages/transformers/deepspeed.py", line 405, in deepspeed_init
hf_deepspeed_config.trainer_config_finalize(args, model, num_training_steps)
File "anaconda3/envs/deepspeed/lib/python3.8/site-packages/transformers/deepspeed.py", line 267, in trainer_config_finalize
hidden_size = model.config.hidden_size
File "anaconda3/envs/deepspeed/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1207, in getattr
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'OntoProteinPreTrainedModel' object has no attribute 'config'

I then pointed the config attribute to protein_model_config and enabled the commented-out part of training_args.py, after which the following error appeared:

File "run_pretrain.py", line 135, in
main()
File "run_pretrain.py", line 131, in main
trainer.train()
File "OntoProtein/src/trainer.py", line 167, in train
deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
File "anaconda3/envs/deepspeed/lib/python3.8/site-packages/transformers/deepspeed.py", line 437, in deepspeed_init
deepspeed_engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
File "anaconda3/envs/deepspeed/lib/python3.8/site-packages/deepspeed/init.py", line 120, in initialize
engine = DeepSpeedEngine(args=args,
File "anaconda3/envs/deepspeed/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 239, in init
self._configure_with_arguments(args, mpu)
File "anaconda3/envs/deepspeed/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 872, in _configure_with_arguments
self._config = DeepSpeedConfig(self.config, mpu)
File "anaconda3/envs/deepspeed/lib/python3.8/site-packages/deepspeed/runtime/config.py", line 875, in init
self._configure_train_batch_size()
File "anaconda3/envs/deepspeed/lib/python3.8/site-packages/deepspeed/runtime/config.py", line 1051, in _configure_train_batch_size
self._batch_assertion()
File "anaconda3/envs/deepspeed/lib/python3.8/site-packages/deepspeed/runtime/config.py", line 987, in _batch_assertion
train_batch > 0
TypeError: '>' not supported between instances of 'str' and 'int'

What could be causing this?

Data files

Hi, how can I obtain the protein_go_triplet and protein_seq_map.txt files used in the code/dataset? I could not seem to find them in the code.

Important relation for the protein sequence

Hi,

I see the following relations in the knowledge graph:
['enables_nucleotide_binding', 'enables_metal_ion_binding', 'enables_transferase_activity', 'enables', 'involved_in_signal_transduction', 'involved_in_regulation_of_transcription,_DNA-templated', 'involved_in_phosphorylation', 'involved_in', 'part_of_nucleus', 'part_of_cytoplasm', 'part_of', 'part_of_cytosol', 'part_of_membrane', 'colocalizes_with', 'involved_in_proteolysis', 'NOT|involved_in', 'part_of_integral_component_of_membrane', 'involved_in_cation_transport', 'involved_in_cellular_response_to_DNA_damage_stimulus', 'part_of_mitochondrion', 'involved_in_metabolic_process', 'involved_in_cell_cycle', 'involved_in_cell_division', 'involved_in_lipid_metabolic_process', 'enables_RNA_binding', 'acts_upstream_of_or_within', 'enables_catalytic_activity', 'enables_hydrolase_activity', 'enables_DNA_binding', 'contributes_to', 'involved_in_carbohydrate_metabolic_process', 'involved_in_translation', 'part_of_extracellular_region', 'acts_upstream_of_or_within_positive_effect', 'involved_in_protein_transport', 'NOT|enables', 'acts_upstream_of', 'part_of_ribosome', 'involved_in_transmembrane_transport', 'NOT|part_of', 'NOT|involved_in_tRNA_processing', 'is_active_in', 'located_in', 'NOT|located_in', 'acts_upstream_of_positive_effect']

which relation is the most important for protein sequence?
