changwenxu98 / transpolymer

Implementation of "TransPolymer: a Transformer-based language model for polymer property predictions" in PyTorch

License: MIT License

Python 100.00%

deep-learning polymer pretrained-language-model pytorch self-supervised-learning transformer

transpolymer's Issues

OSError: Can't load tokenizer for 'roberta-base'.

I always get the following error when I run the .py file. Can you please help me to solve this problem?

Some weights of RobertaModel were not initialized from the model checkpoint at ckpt/pretrain.pt and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Traceback (most recent call last):
  File "/home/liwei/下载/TransPolymer-master/Attention_vis.py", line 137, in <module>
    tokenizer = PolymerSmilesTokenizer.from_pretrained("roberta-base", max_len=attention_config['blocksize'])
  File "/home/liwei/anaconda3/envs/TransPolymer/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2013, in from_pretrained
    raise EnvironmentError(
OSError: Can't load tokenizer for 'roberta-base'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'roberta-base' is the correct path to a directory containing all relevant files for a PolymerSmilesTokenizer tokenizer.
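
The error message itself names one checkable cause: a local directory called roberta-base that shadows the Hub model ID. The sketch below is a hypothetical helper (not part of TransPolymer or transformers) that reports such a directory and which tokenizer files it lacks. The file names vocab.json and merges.txt are the ones a RoBERTa-style BPE tokenizer normally reads; the working directory passed in is an assumption about where the script is launched.

```python
from pathlib import Path

# Files a RoBERTa-style BPE tokenizer normally reads from a local directory.
REQUIRED_FILES = ("vocab.json", "merges.txt")


def shadowing_dir_report(model_id, cwd="."):
    """Hypothetical helper: describe a local directory that shadows `model_id`.

    Returns None when no such directory exists, otherwise the list of
    required tokenizer files that are missing from it (possibly empty).
    """
    candidate = Path(cwd) / model_id
    if not candidate.is_dir():
        return None
    return [name for name in REQUIRED_FILES if not (candidate / name).is_file()]
```

If the helper returns None, no local directory is in the way and the failure more likely lies in reaching the Hub (for example, no network access).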

Question about Tokenizer

Why does the tokenizer tokenize Cl as just C?

sequence = "[O-][Cl+3]([O-])([O-])[O-]"
print(tokenizer.tokenize(sequence))
['[', 'O', '-', ']', '[', 'C', '+', '3', ']', '(', '[', 'O', '-', ']', ')', '(', '[', 'O', '-', ']', ')', '[', 'O', '-', ']']

However, the tokenizer performs differently for Li and Br.

sequence = "[Li+].[Br-]"
print(tokenizer.tokenize(sequence))
['[', 'Li', '+', ']', '.', '[', 'Br', '-', ']']

Thank you for your help!
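
For comparison, here is a minimal sketch of a splitter that keeps two-letter element symbols whole. It is not PolymerSmilesTokenizer and says nothing about why that tokenizer splits Cl; the symbol list is a small illustrative subset chosen for this example.

```python
import re

# Illustrative subset of two-letter element symbols; a real list would be longer.
TWO_LETTER = ("Cl", "Br", "Li", "Na", "Si")

# Try the two-letter symbols first, then fall back to any single character.
_PATTERN = re.compile("|".join(TWO_LETTER) + r"|.")


def split_smiles(smiles):
    """Split a SMILES string into symbols, keeping two-letter elements whole."""
    return _PATTERN.findall(smiles)


print(split_smiles("[O-][Cl+3]([O-])([O-])[O-]"))
print(split_smiles("[Li+].[Br-]"))
```

Comparing this output with tokenizer.tokenize(sequence) shows which symbols the trained vocabulary treats differently from a plain symbol-level split.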

Issues with running Downstream.py

Hello, I successfully prepared the environment, and the pre-training went well. However, when I ran the Downstream file, I encountered this issue (attached in output.txt).

output.txt

There could be some issues with line 364 of the code:
train_data.iloc[:, 1] = scaler.fit_transform(train_data.iloc[:, 1].values.reshape(-1, 1))

I printed the shape of train_data and test_data; it is (2, 1).
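
A shape of (2, 1) means the frame has a single column, so iloc[:, 1] (the second column) has nothing to select. The sketch below shows the same check-then-scale step in pure Python. `standardize_column` is a hypothetical stand-in, and it assumes the scaler in Downstream.py is a z-score scaler, which this thread does not confirm.

```python
from statistics import fmean, pstdev


def standardize_column(rows, col=1):
    """Z-score one column of a row-major table, after checking it exists."""
    for i, row in enumerate(rows):
        if len(row) <= col:
            raise ValueError(
                f"row {i} has {len(row)} column(s); column index {col} is missing"
            )
    values = [float(row[col]) for row in rows]
    # Population standard deviation (divide by n), as z-score scalers commonly do.
    mean, std = fmean(values), pstdev(values)
    if std == 0:
        raise ValueError("column is constant; cannot standardize")
    return [(v - mean) / std for v in values]
```

A single-column table usually points at the input file (for example, a delimiter that was not recognized), so checking the column count right after loading gives a clearer error than the scaler does.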

TypeError when using Downstream.py

Traceback (most recent call last):
  File "D:\project\file\TransPolymer\test.py", line 4, in <module>
    tokenizer = PolymerSmilesTokenizer.from_pretrained("roberta-base", max_len=411)
  File "D:\anaconda3\envs\TransPolymer\lib\site-packages\transformers\tokenization_utils_base.py", line 1783, in from_pretrained
    return cls._from_pretrained(
  File "D:\anaconda3\envs\TransPolymer\lib\site-packages\transformers\tokenization_utils_base.py", line 1928, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "D:\project\file\TransPolymer\PolymerSmilesTokenization.py", line 216, in __init__
    with open(merges_file, encoding="utf-8") as merges_handle:
TypeError: expected str, bytes or os.PathLike object, not NoneType

I hit this bug on both Windows and Ubuntu 22.04.
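
The traceback ends in open(merges_file, ...) with merges_file equal to None, which means no merges file had been resolved by the time the tokenizer's __init__ ran. A hedged sketch of a guard that turns this into a readable message follows; `require_file` is a hypothetical helper, not part of the repository or of transformers.

```python
import os


def require_file(path, label):
    """Return `path` if it names an existing file, else raise a clear error."""
    if path is None:
        raise FileNotFoundError(f"{label} was not resolved (got None)")
    if not os.path.isfile(path):
        raise FileNotFoundError(f"{label} not found at {path!r}")
    return path
```

Calling require_file(merges_file, "merges_file") before the open() call would report which tokenizer file is missing instead of a bare TypeError.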

Finetuned model

Hello, thank you for the great work.
I am wondering whether you have a trained (fine-tuned) model available to run inference on. I want to test the model for property prediction on a test dataset.
I tried training it myself, but it is very slow on my local machine!

Different block_size for pretrain and finetune

Dear @ChangwenXu98,

I found the block_size of pretraining and finetuning different in the config files. The block_size for pretraining is 175, but that for fine-tuning is 411.

Since block_size would influence the size of the pretrained model, I'm wondering whether this parameter should be the same for the two tasks in order to load the pretrained model for finetuning.
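
In Hugging Face's RoBERTa implementation, what is fixed in a checkpoint is the size of the position-embedding table (max_position_embeddings in the model config), not the tokenization length. RoBERTa assigns position ids starting at padding_idx + 1, so a sequence of n tokens needs a table with at least n + padding_idx + 1 rows. The sketch below checks both block sizes against a table of 514 rows, which is the roberta-base default and only an assumption about this repository's pretraining config.

```python
def fits_position_table(block_size, max_position_embeddings=514, padding_idx=1):
    """True if sequences of `block_size` tokens fit the position-embedding table.

    RoBERTa assigns position ids padding_idx + 1 ... padding_idx + block_size,
    so the largest id must be smaller than the table size.
    """
    return padding_idx + block_size < max_position_embeddings


for block_size in (175, 411):
    print(block_size, fits_position_table(block_size))
```

Under that assumption, differing block sizes for pretraining and fine-tuning only change how many rows of the table are used, not the shape of any weight.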

Finetuning attention maps

I was wondering how to determine which tokens have a higher attention score than the others. In short, how do you arrive at the red highlights in Figure 6 of the paper? How do you aggregate the attention scores from all the 12 attention heads in order to come to this conclusion?
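
The thread does not state how Figure 6 was produced. One common convention, shown here purely as an assumption, is to average the attention matrices over the heads and then score each token by the total attention it receives. The input is a nested list shaped [heads][query][key]; both function names are hypothetical.

```python
def mean_over_heads(attn):
    """Average a [heads][query][key] attention tensor over the head axis."""
    n_heads = len(attn)
    n_q, n_k = len(attn[0]), len(attn[0][0])
    return [
        [sum(attn[h][q][k] for h in range(n_heads)) / n_heads for k in range(n_k)]
        for q in range(n_q)
    ]


def attention_received(avg):
    """Sum each key column: how much attention every token receives in total."""
    return [sum(row[k] for row in avg) for k in range(len(avg[0]))]


# Two toy heads over three tokens; each row sums to 1 like a softmax output.
heads = [
    [[0.5, 0.5, 0.0], [0.0, 1.0, 0.0], [0.0, 0.5, 0.5]],
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 1.0, 0.0]],
]
scores = attention_received(mean_over_heads(heads))
print(scores)
```

Other aggregations (maximum over heads, or inspecting a single head or layer) are equally plausible, so highlighted tokens can differ depending on the choice.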

Cloning Issue

Downloading ckpt/pretrain.pt/pytorch_model.bin (329 MB)
Error downloading object: ckpt/pretrain.pt/pytorch_model.bin (5e93519): Smudge error: Error downloading ckpt/pretrain.pt/pytorch_model.bin (5e93519a38a9f3ab51477591daf9ab443f5bf6a0c1bbfe6b5f5a5609fc95e767): batch response: This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access.

Errors logged to /home/trios/TransPolymer/.git/lfs/logs/20230717T142851.171781074.log
Use git lfs logs last to view the log.
error: external filter 'git-lfs filter-process' failed
fatal: ckpt/pretrain.pt/pytorch_model.bin: smudge filter lfs failed
warning: Clone succeeded, but checkout failed.
You can inspect what was checked out with 'git status'
and retry with 'git restore --source=HEAD :/'

Smiles Tokenization for Pretraining

class LoadPretrainData(Dataset):

    def __init__(self, tokenizer, dataset, blocksize):
        self.tokenizer = tokenizer
        self.blocksize = blocksize
        self.dataset = dataset

    def __len__(self):
        self.len = len(self.dataset)
        return self.len

    def __getitem__(self, i):
        smiles = self.dataset[i]  # original
        # smiles = self.dataset[i][0]  # updated version

        encoding = self.tokenizer(
            str(smiles),
            add_special_tokens=True,
            max_length=self.blocksize,
            return_token_type_ids=False,
            padding="max_length",
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt',
        )

        return dict(
            input_ids=encoding["input_ids"].flatten(),
            attention_mask=encoding["attention_mask"].flatten(),
        )

This is a code snippet from the Dataset.py module.
I believe there is a small mistake here. Accessing the SMILES with self.dataset[i] returns the whole NumPy array rather than the string, so after str() the text contains not only the SMILES but also the '[' and ']' characters.

For example, take the SMILES *NC(=O)c1ccc(C(=O)OCCCCCCCCCCCc2ccc(*)cc2)cc1.
dataset[i] for this SMILES returns something like array(['*NC(=O)c1ccc(C(=O)OCCCCCCCCCCCc2ccc(*)cc2)cc1'], dtype=object),
and once it is converted to a string inside the encoding call, the tokens look something like ['[', '*', 'N', 'C', '(', '=', 'O', ')', 'c', '1', 'c', 'c', 'c', '(', 'C', '(', '=', 'O', ')', 'O', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'c', '2', 'c', 'c', 'c', '(', '*', ')', 'c', 'c', '2', ')', 'c', 'c', '1', ']']. So the encoder receives this instead of ['*', 'N', 'C', '(', '=', 'O', ')', 'c', '1', 'c', 'c', 'c', '(', 'C', '(', '=', 'O', ')', 'O', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'c', '2', 'c', 'c', 'c', '(', '*', ')', 'c', 'c', '2', ')', 'c', 'c', '1']; feeding the latter might improve the performance of the model.
To resolve this, you only need to take the first element (index 0) of the NumPy array, which I have added as a comment in the code snippet above.
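
A minimal sketch of the effect described above. A one-element Python list stands in for the NumPy row here (a one-element object array stringifies the same way), and `as_smiles` is a hypothetical helper, not code from Dataset.py.

```python
def as_smiles(item):
    """Return the SMILES string whether `item` is a string or a 1-element row."""
    if isinstance(item, str):
        return item
    if len(item) != 1:
        raise ValueError(f"expected one SMILES per row, got {len(item)} values")
    return str(item[0])


row = ["*CC(*)C"]      # stand-in for a one-element NumPy row
print(str(row))        # the brackets and quotes become part of the text
print(as_smiles(row))  # the bare SMILES
```

Unwrapping once in __getitem__ keeps the tokenizer input identical regardless of whether the dataset was loaded as a 1-D or 2-D array.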

Supplementary Vocab File

I may be missing something and am not quite sure what the supplementary vocab file is for. Can someone give me a rough idea or resources to look into?

Thanks!

Regarding Egc Dataset

Hi there!
The Egc dataset you provided seems to differ from the other related datasets (e.g. Egb, Ei, Xc). The SMILES in the Egc dataset have square brackets '[]' around the * symbol, whereas in the other datasets the * symbol is not enclosed in brackets. Also, were the Egc results reported in the paper obtained with this same Egc dataset, or with one without the square brackets? If it was this one, I was wondering why square brackets were used only for the Egc dataset, since the RoBERTa model is pretrained on the PI1M dataset, which does not contain square brackets around the * symbol in its SMILES.
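
To compare the two notations side by side, a one-line normalization is enough. This is only a sketch for making the datasets textually consistent; whether the model treats '[*]' and '*' equivalently is exactly the open question here, and `strip_star_brackets` is a hypothetical name.

```python
def strip_star_brackets(smiles):
    """Rewrite bracketed attachment points '[*]' as bare '*'."""
    return smiles.replace("[*]", "*")


print(strip_star_brackets("[*]CC([*])C"))
```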

RuntimeError: Error(s) in loading state_dict for DownstreamRegression:

Hello, I have successfully fine-tuned the pre-trained model and generated the attention visualization for the pre-trained model. However, when I run Attention_vis.py to generate the attention map of the fine-tuned model, the following error occurs. Can you please help me?

Traceback (most recent call last):
  File "/home/liwei/桌面/TransPolymer-master/Attention_vis.py", line 141, in <module>
    main(attention_config)
  File "/home/liwei/桌面/TransPolymer-master/Attention_vis.py", line 89, in main
    model.load_state_dict(checkpoint['model'])
  File "/home/liwei/anaconda3/envs/TransPolymer/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1604, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for DownstreamRegression:
    size mismatch for PretrainedModel.embeddings.word_embeddings.weight: copying a param with shape torch.Size([50908, 768]) from checkpoint, the shape in current model is torch.Size([50265, 768]).

This is the config_attention.yaml file I modified in order to generate the attention map of the fine-tuned model:
task: 'finetune' # the task to visualize the attention scores
smiles: 'CCO' # the SMILES used for visualization when task=='pretrain'
layer: 0 # the hidden layer for visualization when task=='pretrain'
index: 8 # the index of the sequence used for visualization when task=='finetune'
add_vocab_flag: False # whether to add supplementary vocab

file_path: 'data/PE_I.csv' # train file path
vocab_sup_file: 'data/vocab_sup_PE_I.csv' # supplementary vocab file path
model_path: 'ckpt/PE_I_best_model.pt' # finetuned model path
pretrain_path: 'ckpt/pretrain.pt' # pretrained model path
save_path: 'figs/attention_vis.png' # figure save path
blocksize: 7 # max length of sequences after tokenization

figsize_x: 30 # the size of figure in x
figsize_y: 18 # the size of figure in y
fontsize: 20 # fontsize
labelsize: 15 # label size
rotation: 45 # rotation of figure
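
50265 is the vocabulary size of the stock roberta-base tokenizer, so the checkpoint's 50908 rows imply that 643 extra tokens were present when the model was fine-tuned. One plausible reading, not confirmed in this thread, is that fine-tuning used the supplementary vocabulary while add_vocab_flag: False above builds the model without it. The sketch below is a hypothetical helper that lists shape mismatches between two name-to-shape mappings before load_state_dict is called.

```python
def shape_mismatches(checkpoint_shapes, model_shapes):
    """List (name, checkpoint_shape, model_shape) for parameters that differ."""
    return [
        (name, shape, model_shapes[name])
        for name, shape in checkpoint_shapes.items()
        if name in model_shapes and model_shapes[name] != shape
    ]


key = "PretrainedModel.embeddings.word_embeddings.weight"
found = shape_mismatches({key: (50908, 768)}, {key: (50265, 768)})
for name, ckpt, current in found:
    print(name, ckpt, current, "extra rows:", ckpt[0] - current[0])
```

With PyTorch, the two mappings can be built as {k: tuple(v.shape) for k, v in state_dict.items()} from the checkpoint and from model.state_dict().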

Regarding Validation Error and Testing Error

Hi there!
In your code (Downstream.py), you report the validation error from K-fold cross-validation. I am curious whether the values reported in the paper are validation errors or test errors, specifically for the Egc, Egb, Eea, Ei, Xc, EPS, and Nc datasets. To be specific, your data directory provides the whole dataset for each of these properties, and during fine-tuning with cross-validation the whole dataset is divided into folds, but no test set is created. So did the paper report the validation error, or did you create a separate unseen test set and report the error on that?
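
The distinction being asked about can be sketched as follows: hold out a test set first, then build the K folds from the remaining samples only. This is not the procedure in Downstream.py or the paper; the function name, the 20% test fraction, k = 5, and the seed are all illustrative assumptions.

```python
import random


def holdout_then_kfold(n_samples, test_fraction=0.2, k=5, seed=0):
    """Split indices into a held-out test set plus k train/validation folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(n_samples * test_fraction)
    test, rest = indices[:n_test], indices[n_test:]
    folds = []
    for i in range(k):
        val = rest[i::k]  # every k-th remaining index, starting at offset i
        val_set = set(val)
        train = [j for j in rest if j not in val_set]
        folds.append((train, val))
    return test, folds
```

With such a split, the cross-validation error is computed on the folds and the test error on indices that no fold ever saw, so the two numbers answer different questions.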