
clean's People

Contributors

bodom0015, canallee, samarthgupta1011, tttianhao, zhoubay


clean's Issues

Error:

The online website does not work, and an error is reported in local training, which always says there is no such file or directory: './data/esm_data/P76077_4.pt'.

    Traceback (most recent call last):
      File "/home/yangshihui/CLEAN/./train-triplet.py", line 138, in <module>
        main()
      File "/home/yangshihui/CLEAN/./train-triplet.py", line 117, in main
        train_loss = train(model, args, epoch, train_loader,
      File "/home/yangshihui/CLEAN/./train-triplet.py", line 44, in train
        for batch, data in enumerate(train_loader):
      File "/home/yangshihui/miniconda3/envs/clean/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 530, in __next__
        data = self._next_data()
      File "/home/yangshihui/miniconda3/envs/clean/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 570, in _next_data
        data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
      File "/home/yangshihui/miniconda3/envs/clean/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
        data = [self.dataset[idx] for idx in possibly_batched_index]
      File "/home/yangshihui/miniconda3/envs/clean/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
        data = [self.dataset[idx] for idx in possibly_batched_index]
      File "/home/yangshihui/miniconda3/envs/clean/lib/python3.10/site-packages/CLEAN-0.1-py3.10.egg/CLEAN/dataloader.py", line 76, in __getitem__
      File "/home/yangshihui/miniconda3/envs/clean/lib/python3.10/site-packages/torch/serialization.py", line 699, in load
        with _open_file_like(f, 'rb') as opened_file:
      File "/home/yangshihui/miniconda3/envs/clean/lib/python3.10/site-packages/torch/serialization.py", line 231, in _open_file_like
        return _open_file(name_or_buffer, mode)
      File "/home/yangshihui/miniconda3/envs/clean/lib/python3.10/site-packages/torch/serialization.py", line 212, in __init__
        super(_open_file, self).__init__(open(name, mode))
    FileNotFoundError: [Errno 2] No such file or directory: './data/esm_data/P76077_4.pt'

    Traceback (most recent call last):
      File "/home/yangshihui/miniconda3/envs/clean/lib/python3.10/site-packages/CLEAN-0.1-py3.10.egg/CLEAN/infer.py", line 90, in infer_maxsep
      File "/home/yangshihui/miniconda3/envs/clean/lib/python3.10/site-packages/torch/serialization.py", line 699, in load
        with _open_file_like(f, 'rb') as opened_file:
      File "/home/yangshihui/miniconda3/envs/clean/lib/python3.10/site-packages/torch/serialization.py", line 231, in _open_file_like
        return _open_file(name_or_buffer, mode)
      File "/home/yangshihui/miniconda3/envs/clean/lib/python3.10/site-packages/torch/serialization.py", line 212, in __init__
        super(_open_file, self).__init__(open(name, mode))
    FileNotFoundError: [Errno 2] No such file or directory: './data/pretrained/split100.pth'

No such file or directory: './gmm_test/GMM_100_500_0.pkl'

I encountered this error when running gmm.py. I noticed that the model needs to be retrained with split100 beforehand to generate the necessary distance map. I would like to generate confidence levels for each CLEAN inference. What else is necessary after running gmm.py? Thanks
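In case it is useful while waiting for an answer, here is a minimal sketch of how a two-component GMM can turn distances into confidence scores. This is only an illustration with synthetic distances, not the repository's actual gmm.py logic:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # Synthetic stand-ins for query-to-EC-cluster-center distances
    rng = np.random.default_rng(0)
    dists = np.concatenate([rng.normal(2.0, 0.5, 500),   # "true match" distances
                            rng.normal(8.0, 1.0, 500)])  # "non-match" distances

    gmm = GaussianMixture(n_components=2, random_state=0).fit(dists.reshape(-1, 1))
    low = int(np.argmin(gmm.means_))  # component with the smaller mean distance
    # Posterior probability that a new distance belongs to the "true match" component
    confidence = gmm.predict_proba([[2.5]])[0, low]
    print(f"confidence for a distance of 2.5: {confidence:.3f}")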

Question to protein language model

Hi, congratulations! Protein language models have played a key role in protein feature extraction. Since none of the methods you compared against use a language model, is that the main reason your method performs well? I think supervised contrastive learning should not differ that much from supervised learning. Our lab has also developed a lightweight protein language model (ProtFlash), and we would like to collaborate to test this approach if given the opportunity.

Questions about querying unclassified proteins

Hi - I have a few questions about inferring EC numbers for unclassified proteins or those not found in UniProt/Swiss-Prot. Can I pass in all protein FASTAs from my data, or can I only pass in enzymes? In other words, has CLEAN been designed not to return results for proteins that are not enzymes? I also noticed that the "Enzyme IDs" in split100.csv all correspond to accession numbers. Can I make up a unique ID for each uncategorized protein, or does CLEAN only work on proteins found in UniProt? If so, would passing in KEGG accessions (or those from other databases) as the enzyme IDs work?
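For concreteness, here is a hedged sketch of building a query CSV with made-up IDs (the tab-separated Entry/EC number/Sequence layout is inferred from split100.csv; whether arbitrary IDs are accepted is exactly the question above):

    import pandas as pd

    # Hypothetical query table with made-up IDs
    df = pd.DataFrame({
        'Entry': ['my_protein_1', 'my_protein_2'],  # arbitrary unique identifiers
        'EC number': ['0.0.0.0', '0.0.0.0'],        # placeholder labels for inference-only use
        'Sequence': ['MKTAYIAKQR', 'MVLSPADKTN'],   # amino acid sequences
    })
    df.to_csv('data/my_query.csv', sep='\t', index=False)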

offline tool issue with the training

Hello,
I am currently trying to run your tool with the docker container and I am facing an issue at the training step.
Here is the command line that I use:
    /shared/projects/seabioz/softwares/CLEAN/clean-1.0.1.sif python ./scripts/train-supconH.py --training_data split100 --model_name split100_supconH --epoch 4100 --n_pos 9 --n_neg 30 -T 0.1

And here is the error I get:
    Traceback (most recent call last):
      File "/shared/projects/seabioz/softwares/CLEAN/./scripts/train-supconH.py", line 139, in <module>
        main()
      File "/shared/projects/seabioz/softwares/CLEAN/./scripts/train-supconH.py", line 118, in main
        train_loss = train(model, args, epoch, train_loader,
      File "/shared/projects/seabioz/softwares/CLEAN/./scripts/train-supconH.py", line 50, in train
        for batch, data in enumerate(train_loader):
      File "/usr/local/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 634, in __next__
        data = self._next_data()
      File "/usr/local/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 678, in _next_data
        data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
      File "/usr/local/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
        data = [self.dataset[idx] for idx in possibly_batched_index]
      File "/usr/local/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
        data = [self.dataset[idx] for idx in possibly_batched_index]
      File "/usr/local/lib/python3.10/site-packages/CLEAN-0.1-py3.10.egg/CLEAN/dataloader.py", line 105, in __getitem__
      File "/usr/local/lib/python3.10/site-packages/torch/serialization.py", line 791, in load
        with _open_file_like(f, 'rb') as opened_file:
      File "/usr/local/lib/python3.10/site-packages/torch/serialization.py", line 271, in _open_file_like
        return _open_file(name_or_buffer, mode)
      File "/usr/local/lib/python3.10/site-packages/torch/serialization.py", line 252, in __init__
        super().__init__(open(name, mode))
    FileNotFoundError: [Errno 2] No such file or directory: './data/esm_data/P32143_6.pt'

It looks like a .pt file is missing, but I downloaded them with your scripts, so I don't understand what could be missing.

bad gateway in webserver

Hi, I was trying to use the webserver but it gives a bad gateway error.
Do you know if it is down for good, or if this problem will be fixed in the meantime? Thank you very much.

ValueError: The number of weights does not match the population

Hi! Thanks for the great work. When I train the CLEAN model using the code provided, I encountered this error:

    File "/Users/Zachary/opt/anaconda3/envs/clean/lib/python3.10/site-packages/CLEAN-0.1-py3.10.egg/CLEAN/dataloader.py", line 39, in mine_negative
    File "/Users/Zachary/opt/anaconda3/envs/clean/lib/python3.10/random.py", line 537, in choices
      raise ValueError('Total of weights must be finite')

So I added the following snippet to the mine_hard_negative function in dataloader.py:

    import math

    valid_freq = []
    for value in freq:
        if math.isfinite(value) and not math.isnan(value):
            valid_freq.append(value)
        else:
            # Replace invalid value with a very small value to minimize its effect
            valid_freq.append(1e-8)

    normalized_freq = [i / sum(valid_freq) for i in valid_freq]

to replace the original line

    normalized_freq = [i / sum(freq) for i in freq]

I just want to ask: is this choice reasonable? Many thanks.
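As a side note, math.isfinite already returns False for NaN, so the isnan check is redundant; a more compact equivalent of the snippet above (with a made-up freq list for illustration) would be:

    import math

    freq = [0.2, float('inf'), 0.5, float('nan')]  # example weights as mined in the dataloader
    # isfinite(v) is False for both inf and NaN, so a single check suffices
    valid_freq = [v if math.isfinite(v) else 1e-8 for v in freq]
    normalized_freq = [v / sum(valid_freq) for v in valid_freq]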

data clustering and split

  1. I clustered 'split100.csv' with MMseqs using a 0.5 identity condition, and there were 32,283 clusters,
    whereas 'split50.csv' resulted in 29,942, showing a difference.
    (mmseqs cluster --min-seq-id 0.5)
    Was there any additional data removal after the clustering?
    If the clustering conditions were different, please let me know how it was done.

  2. Can you provide specific details on how you split the data for cross-validation?
    If you did a random split, could you inform me about the random state used?

  3. What is the reason for not selecting models using a separate validation set or
    utilizing models from cross-validation for benchmark testing?

on single sequence query: ValueError: Only one class present in y_true. ROC AUC score is not defined in that case.

Hello,

We are attempting to determine the lowest P-value range of a single protein sequence using the conda CLEAN install, i.e. the input CSV is one sequence with an EC number and identifier. When we run this using infer_pvalue with default parameters, it calculates results but gives the following error/warning and does not print the model fit statistics (recall, precision, etc.):

ValueError: Only one class present in y_true. ROC AUC score is not defined in that case.

It seems that the AUC cannot be calculated on only one input sequence. Does this affect the P value cutoff or model interpretation? Does the infer_pvalue function depend on multiple queries in an input file? If multiple sequences are required, how should we interpret the predictions for the one query sequence of interest?
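For illustration, the error is reproducible outside CLEAN with scikit-learn alone, since ROC AUC requires both a positive and a negative class in y_true:

    import numpy as np
    from sklearn.metrics import roc_auc_score

    y_true = np.array([1])    # a single query -> only one class present
    y_score = np.array([0.9])
    try:
        roc_auc_score(y_true, y_score)
    except ValueError as e:
        print(e)  # Only one class present in y_true. ROC AUC score is not defined in that case.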

Thanks so much for your help.

training & validation details

Hello! How did you train the model with the SupCon-Hard loss? Epoch numbers are suggested in the README; how did you arrive at them? I also couldn't find your setup for validation. Thanks!

Annotating functions of proteins having more than 1022 amino acid residues

Hi,
I tried using the web version (https://clean.platform.moleculemaker.org/configuration) to predict EC numbers from protein sequences, but it throws an error whenever the protein sequence is longer than 1022 residues. I am planning to install the GitHub package and am wondering whether it can predict the function of proteins longer than 1022 residues.

Please let me know,

Thanks,
Sourav Dutta

Attempting to deserialize object on a CUDA

Hello I have recently come across the following message when using the infer_pvalue option:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/guillermo/miniconda3/envs/clean/lib/python3.10/site-packages/CLEAN-0.1-py3.10.egg/CLEAN/infer.py", line 28, in infer_pvalue
      File "/home/guillermo/miniconda3/envs/clean/lib/python3.10/site-packages/torch/serialization.py", line 712, in load
        return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
      File "/home/guillermo/miniconda3/envs/clean/lib/python3.10/site-packages/torch/serialization.py", line 1046, in _load
        result = unpickler.load()
      File "/home/guillermo/miniconda3/envs/clean/lib/python3.10/site-packages/torch/serialization.py", line 1016, in persistent_load
        load_tensor(dtype, nbytes, key, _maybe_decode_ascii(location))
      File "/home/guillermo/miniconda3/envs/clean/lib/python3.10/site-packages/torch/serialization.py", line 1001, in load_tensor
        wrap_storage=restore_location(storage, location),
      File "/home/guillermo/miniconda3/envs/clean/lib/python3.10/site-packages/torch/serialization.py", line 176, in default_restore_location
        result = fn(storage, location)
      File "/home/guillermo/miniconda3/envs/clean/lib/python3.10/site-packages/torch/serialization.py", line 152, in _cuda_deserialize
        device = validate_cuda_device(location)
      File "/home/guillermo/miniconda3/envs/clean/lib/python3.10/site-packages/torch/serialization.py", line 136, in validate_cuda_device
        raise RuntimeError('Attempting to deserialize object on a CUDA '
    RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.

I have been trying to fix this but so far have not been able to. Could you please help with this?
Cheers
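A commonly suggested workaround (see also the CPU-only p-value issue further down in this list) is to pass an explicit map_location when loading the checkpoint. A minimal sketch, with the pretrained path assumed from the issues above:

    import torch

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    # map_location remaps tensors saved on CUDA onto whatever device is available
    checkpoint = torch.load('./data/pretrained/split100.pth', map_location=device)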

a bug on `random_positive` in `dataloader.py`

There is a bug in the random_positive function in dataloader.py. When there is only one protein ID under an EC number, like EC 3.4.22.54, which contains only ['Q9TTH8'], the function will return Q9TTH8_x (where x is a random integer between 0 and 9) as the positive sequence. However, since this sequence does not exist in the data, the dataloader will subsequently fail with an error saying the sequence cannot be read.

(Note: I use the split10 dataset, so some sequences may be missing from this EC, but I don't think that matters for this bug.)

So I think the problem is here:

    def random_positive(id, id_ec, ec_id):
        pos_ec = random.choice(id_ec[id])
        pos = id
        if len(ec_id[pos_ec]) == 1:  # this is where the error comes from; when there is only one ID in the EC, I think a different fallback is a better idea
            return pos + '_' + str(random.randint(0, 9))
        while pos == id:  # this could also use a new list that excludes the anchor (the id variable here), so a single random.choice on it would be more efficient
            pos = random.choice(ec_id[pos_ec])
        return pos

Everything in the comments is my opinion; I will raise a PR soon.
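A minimal sketch of the fix proposed in the comments above: exclude the anchor up front, and fall back only when the EC truly has a single member. Falling back to the anchor itself assumes the suffixed '<id>_<n>.pt' augmentation files may be absent on disk, as described for split10; this is the commenter's suggestion, not the repository's official patch.

    import random

    def random_positive(anchor, id_ec, ec_id):
        pos_ec = random.choice(id_ec[anchor])
        # Build the candidate list without the anchor, so no retry loop is needed
        candidates = [p for p in ec_id[pos_ec] if p != anchor]
        if not candidates:
            # Single-member EC: return the anchor itself rather than
            # fabricating a '<id>_<n>' filename that may not exist on disk
            return anchor
        return random.choice(candidates)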

FileNotFoundError: [Errno 2] No such file or directory: './data/esm_data/D4AXL1.pt'

Thank you for your work! It has really helped me a lot. I tried to train the model following the manual, but I encountered an error: when I execute compute_esm_distance(train_file), it raises "FileNotFoundError: [Errno 2] No such file or directory: './data/esm_data/D4AXL1.pt'", presumably because only the mutated *.pt files of the orphan sequences are generated in esm_data, and the *.pt files of the non-orphan sequences are missing. How can I solve this problem? Thank you very much.
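A quick diagnostic sketch for this class of error, listing which entries in the training CSV have no embedding on disk (split100.csv and its Entry column are used as an example here; substitute your own training file):

    import os
    import pandas as pd

    df = pd.read_csv('data/split100.csv', sep='\t')
    missing = [e for e in df['Entry']
               if not os.path.exists(f'./data/esm_data/{e}.pt')]
    print(f'{len(missing)} entries without embeddings, e.g. {missing[:5]}')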

Some confusion about the creation of positive and negative sample pairs

Hi! As you described in the article, the anchor sequence and the positive sequence are derived from the same EC number. However, when using the torch.utils.data.Dataset class in your code to create the dataset, the positive sequence is randomly selected from one of the EC numbers that belong to the anchor sequence, which may lead to inconsistent EC numbers between the anchor and the positive (due to the function random_positive(); see the small illustration after the code below). I wonder if there is a problem with my understanding?

class Triplet_dataset_with_mine_EC(torch.utils.data.Dataset):

    def __init__(self, id_ec, ec_id, mine_neg):
        self.id_ec = id_ec
        self.ec_id = ec_id
        self.full_list = []
        self.mine_neg = mine_neg
        for ec in ec_id.keys():
            if '-' not in ec:
                self.full_list.append(ec)

    def __len__(self):
        return len(self.full_list)

    def __getitem__(self, index):
        anchor_ec = self.full_list[index]
        anchor = random.choice(self.ec_id[anchor_ec])
        pos = random_positive(anchor, self.id_ec, self.ec_id)
        neg = mine_negative(anchor, self.id_ec, self.ec_id, self.mine_neg)
        a = torch.load('./data/esm_data/' + anchor + '.pt')
        p = torch.load('./data/esm_data/' + pos + '.pt')
        n = torch.load('./data/esm_data/' + neg + '.pt')
        return format_esm(a), format_esm(p), format_esm(n)

def random_positive(id, id_ec, ec_id):
    pos_ec = random.choice(id_ec[id])
    pos = id
    if len(ec_id[pos_ec]) == 1:
        return pos + '_' + str(random.randint(0, 9))
    while pos == id:
        pos = random.choice(ec_id[pos_ec])
    return pos
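A tiny, self-contained illustration of the concern, with hypothetical IDs and ECs: the anchor is sampled for one EC, but random_positive re-draws an EC from the anchor's own EC list, which can be a different one.

    import random

    # Hypothetical toy mappings: P1 carries two EC numbers
    id_ec = {'P1': ['1.1.1.1', '2.2.2.2'], 'P2': ['2.2.2.2'], 'P3': ['2.2.2.2']}
    ec_id = {'1.1.1.1': ['P1'], '2.2.2.2': ['P1', 'P2', 'P3']}

    anchor_ec = '1.1.1.1'  # suppose __getitem__ sampled this EC ...
    anchor = 'P1'          # ... and this anchor for it
    pos_ec = random.choice(id_ec[anchor])
    # pos_ec may come out as '2.2.2.2', so the positive shares *an* EC with
    # the anchor, but not necessarily the EC the anchor was sampled for.
    print(pos_ec)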

No such file or directory: 'results/inputs/init_maxsep.csv'

Hi,
Interesting work!
I am trying to install and use this software. I have created all the embeddings and am testing it with some sample data that came with the software (init).
However, I get the FileNotFoundError above. Could you please help me with this?
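From the error message alone, one hedged guess is that the output directory simply does not exist yet; creating it before inference costs nothing to try:

    import os

    # Hedged guess: inference writes its CSV under results/inputs/
    os.makedirs('results/inputs', exist_ok=True)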

Cannot reproduce results in README

Hi,

Thanks so much for this work, and for making the repo super nice and straightforward!

Before evaluating the model on a separate use case I have, I wanted to make sure I didn't do anything wrong when setting up the project, so I've been trying to evaluate the model per the README to ensure I get consistent results. However, I obtain very poor performance when calling inference.py on the test sets provided (price, new), so it would be great to understand what I've done wrong.

#### results on new

The embedding sizes for train and test: torch.Size([241025, 128]) torch.Size([392, 128])
100%|███████████████████████████████████████████████| 5242/5242 [00:00<00:00, 7713.09it/s]
Calculating eval distance map, between 392 test ids and 5242 train EC cluster centers
392it [00:04, 93.33it/s] 
############ EC calling results using maximum separation ############
---------------------------------------------------------------------------
>>> total samples: 392 | total ec: 177 
>>> precision: 0.0135 | recall: 0.0139| F1: 0.0136 | AUC: 0.507 
---------------------------------------------------------------------------



#### results on price

The embedding sizes for train and test: torch.Size([241025, 128]) torch.Size([149, 128])
100%|███████████████████████████████████████████████| 5242/5242 [00:00<00:00, 8758.51it/s]
Calculating eval distance map, between 149 test ids and 5242 train EC cluster centers
149it [00:00, 155.21it/s]
############ EC calling results using maximum separation ############
---------------------------------------------------------------------------
>>> total samples: 149 | total ec: 56 
>>> precision: 0.0 | recall: 0.0| F1: 0.0 | AUC: 0.5 
---------------------------------------------------------------------------

From my attempts to debug, this doesn't seem to be an issue of models/data processing. For instance, I compared the embedding of the first cluster (EC 2.7.10.2) from data/pretrained/100.pt with ones I manually recomputed, and obtained the same values (up to some numerical error). Specifically, I made a FASTA file using the sequences that are in EC 2.7.10.2, extracted their embeddings, then passed them through the pretrained model (data/pretrained/split100.pth). I compared these with what we get from calling get_cluster_center on the precomputed tensor. These appeared to be consistent. So, if the embeddings are calculated in a consistent manner, I'm not sure why the predictions are turning out to be wrong.

Python 3.10.4 
#### recalculate the first EC cluster embeddings

>>> import os, torch, pandas
>>> from CLEAN.utils import *
>>> from CLEAN.distance_map import *
>>> from CLEAN.model import LayerNormNet

>>> train_data = "split100"
>>> train_csv = pandas.read_csv('data/split100.csv', delimiter='\t')
>>> id_ec_train, ec_id_dict_train = get_ec_id_dict('data/split100.csv')
>>> list(ec_id_dict_train.keys())[0]
'2.7.10.2'

#### make fasta of sequences in 2.7.10.2
>>> with open("data/ec_2.7.10.2.fasta", "w") as f:
...     for u in ec_id_dict_train['2.7.10.2']:
...         sequence = train_csv[train_csv['Entry'] == u].iloc[0].Sequence
...         f.write(f">{u}\n")
...         f.write(f"{sequence}\n")

#### calculate ESM embeddings
>>> retrive_esm1b_embedding('ec_2.7.10.2')                 

#### load the split100 model weights
>>> device = torch.device("cpu")
>>> dtype = torch.float32 
>>> model = LayerNormNet(512, 128, device, dtype)
>>> checkpoint = torch.load('./data/pretrained/'+ train_data +'.pth', map_location="cpu")                                                                                                                                
>>> model.load_state_dict(checkpoint) 
<All keys matched successfully>
>>> model.eval()   

#### calculate model embeddings
>>> esm_to_cat = [load_esm(id) for id in ec_id_dict_train['2.7.10.2']]
>>> esm_emb = torch.cat(esm_to_cat)
>>> model_emb = model(esm_emb)
>>> model_emb.mean(0)
tensor([ 0.5685, -0.2730,  1.3413, -0.0456,  0.5519, -0.5602,  0.4451,  0.3555,
        -0.3991,  0.8149, -0.7487,  0.8769, -0.0774, -1.2195, -0.3510,  0.3407,
         0.6934, -0.4897, -0.6785,  0.4822, -0.4403,  0.1503,  0.6215, -0.2650,
         1.0949,  0.4402,  0.4229,  1.4833,  0.2911, -2.0526, -1.0108,  0.8270,
         0.0103, -0.4964,  0.4265,  0.6308, -0.5499, -1.2762, -0.9738,  0.3144,
        -0.9146,  0.4415,  0.2395,  0.2096,  0.0948, -0.6719,  0.1269, -0.6432,
         1.3322,  0.8958,  0.2907,  1.5833,  1.6047,  0.0428, -0.1019, -0.1428,
         0.6814, -0.9868,  0.4500,  0.1788, -0.3415,  1.0227,  0.2723,  0.2320,
         0.5672, -0.8140, -0.4842,  0.3829, -1.4036, -0.3750, -2.0640, -0.9057,
        -1.1886,  0.3434, -1.0756, -1.4245,  1.1374, -0.1440, -0.1107, -2.4469,
         0.1129, -0.2940,  0.3541,  0.9514, -0.1509, -1.1097, -0.3776,  0.0645,
         0.1615, -0.3648,  0.8489, -0.1049,  0.1044, -0.9301,  0.1868,  0.8924,
         0.1700, -1.5468,  0.9586, -1.1084,  1.4576,  1.4288,  0.3229,  0.3504,
        -0.1556, -0.0749,  0.1157,  0.2287, -0.2752,  1.2659,  0.7747,  0.2845,
         0.5852, -0.9135,  1.0046,  1.1457, -0.8711,  0.5439,  0.4540,  0.0190,
        -0.2778,  1.8937, -1.7569, -1.3366, -0.5689, -1.9689,  0.2271, -0.3354],
       device='cuda:0', grad_fn=<MeanBackward1>)

#### compare with precomputed embeddings
>>> emb_train = torch.load('./data/pretrained/100.pt', map_location=device)
>>> cluster_center_model = get_cluster_center(emb_train, ec_id_dict_train)
>>> cluster_center_model['2.7.10.2']
tensor([ 0.5684, -0.2730,  1.3413, -0.0455,  0.5519, -0.5602,  0.4452,  0.3555,
        -0.3990,  0.8149, -0.7487,  0.8769, -0.0774, -1.2195, -0.3510,  0.3407,
         0.6935, -0.4897, -0.6786,  0.4822, -0.4402,  0.1503,  0.6215, -0.2650,
         1.0949,  0.4402,  0.4229,  1.4833,  0.2911, -2.0526, -1.0108,  0.8270,
         0.0103, -0.4963,  0.4265,  0.6308, -0.5499, -1.2763, -0.9737,  0.3145,
        -0.9146,  0.4415,  0.2394,  0.2096,  0.0948, -0.6719,  0.1269, -0.6432,
         1.3322,  0.8958,  0.2907,  1.5833,  1.6047,  0.0427, -0.1019, -0.1428,
         0.6814, -0.9868,  0.4500,  0.1787, -0.3415,  1.0227,  0.2722,  0.2320,
         0.5672, -0.8140, -0.4843,  0.3830, -1.4036, -0.3750, -2.0639, -0.9057,
        -1.1886,  0.3434, -1.0756, -1.4245,  1.1373, -0.1440, -0.1108, -2.4469,
         0.1129, -0.2940,  0.3541,  0.9514, -0.1508, -1.1097, -0.3776,  0.0645,
         0.1616, -0.3648,  0.8489, -0.1049,  0.1043, -0.9301,  0.1868,  0.8923,
         0.1699, -1.5468,  0.9585, -1.1083,  1.4576,  1.4288,  0.3229,  0.3504,
        -0.1556, -0.0749,  0.1157,  0.2288, -0.2752,  1.2659,  0.7747,  0.2845,
         0.5853, -0.9134,  1.0046,  1.1458, -0.8711,  0.5439,  0.4540,  0.0190,
        -0.2778,  1.8937, -1.7569, -1.3366, -0.5689, -1.9689,  0.2270, -0.3354])
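A quick programmatic check of the agreement above, continuing the same session (the recomputed mean is detached and moved to CPU first; True here just reflects the closeness already visible in the printed tensors):

#### compare recomputed vs precomputed cluster center
>>> torch.allclose(model_emb.mean(0).detach().cpu(), cluster_center_model['2.7.10.2'], atol=1e-3)
True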

Would greatly appreciate any help wherever you think I made a mistake.

Thank you!!

Just one EC

Thank you for providing this excellent tool. Could you guide us on how to specifically obtain annotations for halogenases from metagenomes using it?

Some confusion about Fig 2.F in the paper

Hello,

I have been studying the original paper and came across Fig 2. F, which presents the comparison of CLEAN results with other tools, including BLASTp and ProtInfer. I found it intriguing that the prediction accuracy of BLASTp for EC X.X.X.X is higher than that of EC X.X.X.-, and a similar pattern is observed for ProtInfer. Intuitively, one might expect that if the tool can predict EC X.X.X.X correctly, it should also be able to predict EC X.X.X.- accurately.

Could you please provide some insights or clarification on this observation?

Request to clarify license

Hi,
Could you please clarify the license for this project, e.g. whether it is available for commercial use? Thank you for sharing this fascinating research.

Set Up Questions

Reading through this protocol, I am extremely excited about this tool and am working on getting it set up. First, I am looking for some clarification on the required steps in Part 1 of the README. If we are doing the quickstart in 1.2, is section 1.3 also required? Some of the code overlaps, so I assume these are independent of each other and it is an either/or.

Additionally, I need this to be fully set up on a Linux system in conda with CPU-only PyTorch, but I can't determine from the protocol whether local installation is required for some steps or whether the functionality can exist entirely in a conda environment.

Finally, Part 2 is inference and Part 3 is training. If all we are looking to do is inference, similar to what is available on the web-server version of CLEAN, is Part 3 required to continue using the inference tools? Part 3 comes after inference, and I only need the functionality of the web server, just scaled up for a large dataset.

Thank you very much in advance. very excited to get this set up. Appreciate any help you can offer.

A problem when we run the command "python CLEAN_infer_fasta.py --fasta_data price"

    Traceback (most recent call last):
      File "/share/database/CLEAN/CLEAN_infer_fasta.py", line 30, in <module>
        main()
      File "/share/database/CLEAN/CLEAN_infer_fasta.py", line 24, in main
        infer_maxsep(train_data, test_data, report_metrics=False, pretrained=True)
      File "/share/soft/miniconda3/envs/clean/lib/python3.10/site-packages/CLEAN-0.1-py3.10.egg/CLEAN/infer.py", line 92, in infer_maxsep
    Exception: No pretrained weights for this training data

We could not find the pretrained weights; can you provide a link to download them?

Changes in dataset

Hi, thanks to your work, we are also conducting research on EC-number prediction using the dataset you provided in the data folder.

However, we've noticed that a commit made in January changed the datasets to an older version.

Can you explain what was changed in the dataset in that commit, and why?

Cheers,
Doyeong Hwang

Duplicated amino acid sequences in datasets

There exist duplicated amino acid sequences in split100.csv and new.csv.


The feature extraction script of ESM provided by Facebook does not allow duplicated sequences, yet the authors of this repository use that script, which cannot work properly when duplicates exist. Please provide a detailed procedure for how the features of the sequences in the NEW-392 and Split-100 datasets were obtained.

There are almost 30k duplicates in split100.csv, but none in split70.csv, which is also weird. If the sequences were duplicated intentionally for contrastive learning, why are there no duplicates in split70.csv?

Details about data split

Hi,

Thanks for your great work and nice code.

I'm interested in your data splits, e.g. 'split10.csv' and 'split100.csv'. There are few details in either your paper or code about how the split data were obtained. I guess you preprocessed them by comparing data from Swiss-Prot with data from your two test sets.

I'd appreciate it if you could give more details about the data split, either in a description or in code.

The cross-validation process details.

Hi,

Thanks for contributing the code! I was wondering about the details of your cross-validation process.

I conducted a standard K-fold split on split10.csv for cross-validation, then replicated the training process and trained a CLEAN model with triplet loss on the (CV) training set. The final F1 score on the (CV) validation set is around 0.5.

Therefore, I am curious how the results shown in Figure S1 were obtained. Did you use a standard random cross-validation split, or some tricks (like ensuring the EC numbers in the CV validation set have at least one sample in the CV training set)?

Meanwhile, I am also curious how you split the "understudied validation dataset" shown in Figure S2. How did you maintain the "no more than 5 times" constraint? Are there samples used neither in training nor in inference (to get the result in Figure S2)?

Thanks for your answers in advance; I look forward to your response!

Testing 1.2 Quickstart

  1. There is no data/pretrained directory. I recommend creating the pretrained folder and just dropping a .gitkeep in it.
    • Maybe this is unnecessary if point 2 is clarified.
  2. When you download the data you get a folder CLEAN_pretrained, which then needs to be renamed to pretrained. I think changing CLEAN_pretrained to pretrained would be easiest, since that is what the src references.

How to set parameters?

Hello, I have a few questions about parameter settings. I would like to know how your parameters were chosen, such as n_pos, n_neg, and batch_size. Do I need to consider the per-category sample sizes in the dataset when setting them? If my dataset has 60 categories, with the largest category containing 50,000 samples and the smallest containing 20 samples, how should n_pos, n_neg, and batch_size be set for such a dataset?

Error when inference with p-value in CPU only mode

When I tried to run infer_pvalue as suggested in section 2.2.1 of the README with a CPU-only installation, I got the prompt:

    RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.

Fixed by changing

    checkpoint = torch.load('./data/pretrained/'+ train_data +'.pth')

to

    checkpoint = torch.load('./data/pretrained/'+ train_data +'.pth', map_location=device)

The same change may also need to be applied to the lines listed below:

    checkpoint = torch.load('./data/model/'+ model_name +'.pth')
    emb_train = torch.load('./data/pretrained/70.pt')
    emb_train = torch.load('./data/pretrained/100.pt')
    checkpoint = torch.load('./data/model/'+ model_name +'.pth')

FileNotFoundError: [Errno 2] No such file or directory: './data/esm_data/P24665_0.pt'

When running train-triplet.py or train-supconH.py I get the following error:

FileNotFoundError: [Errno 2] No such file or directory: './data/esm_data/P24665_0.pt'

I've already retrieved all the ESM embeddings; './data/esm_data/P24665.pt' exists but './data/esm_data/P24665_0.pt' doesn't.

I checked the code and I found this

def random_positive(id, id_ec, ec_id):
    pos_ec = random.choice(id_ec[id])
    pos = id
    if len(ec_id[pos_ec]) == 1:
        return pos + '_' + str(random.randint(0, 9))
    while pos == id:
        pos = random.choice(ec_id[pos_ec])
    return pos

If I understand correctly, the bug comes from `return pos + '_' + str(random.randint(0, 9))`; I do not understand why the `+ '_' + str(random.randint(0, 9))` is needed. It seems to just create a filename that doesn't exist. Could you explain?

problem with job at web server

Thanks for this excellent tool for structure-based function prediction. I submitted a job at the web server and received an email with the result; however, the web page cannot be refreshed. Can you help solve this problem?
Thanks a lot.

No such file or directory: './data/esm_data/WP_063460136.pt'

I've followed the installation steps, but when running the `python CLEAN_infer_fasta.py --fasta_data price` command I get the following error:
FileNotFoundError: [Errno 2] No such file or directory: './data/esm_data/WP_063460136.pt'

From what I checked, WP_063460136 is the name of the first sequence in the FASTA file.

What should I do?

Using local clean for large-scale data prediction

Hello, thank you for your excellent work. I want to predict EC numbers for large-scale data locally, approximately 100 million sequences. How can I improve the speed? I ran it on a cluster and tested with 10,000 sequences, which took one hour on a single CPU. I tried to split the file into 200,000 lines per file, but except for the first file, prediction on the remaining files slowed down significantly.
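For reference, a minimal sketch of the chunking approach described above, in plain Python (chunk size and file names are placeholders; chunking by sequences rather than raw lines keeps FASTA records intact):

    def split_fasta(path, chunk_size, prefix):
        """Split a FASTA file into chunks of chunk_size sequences each."""
        records, entry = [], []
        with open(path) as f:
            for line in f:
                if line.startswith('>') and entry:
                    records.append(''.join(entry))  # close the previous record
                    entry = []
                entry.append(line)
            if entry:
                records.append(''.join(entry))      # last record
        for i in range(0, len(records), chunk_size):
            with open(f'{prefix}_{i // chunk_size}.fasta', 'w') as out:
                out.write(''.join(records[i:i + chunk_size]))

    # e.g. split_fasta('all_sequences.fasta', 200_000, 'chunk')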

Best,
SJY
