
bert-loves-chemistry's Introduction

ChemBERTa

ChemBERTa: A collection of BERT-like models applied to chemical SMILES data for drug design, chemical modelling, and property prediction. To be presented at Baylearn and the Royal Society of Chemistry's Chemical Science Symposium.

Tutorial
arXiv ChemBERTa-2 Paper
arXiv ChemBERTa Paper
Poster
Abstract
BibTeX

License: MIT License

Right now the notebooks are all for the RoBERTa model (a variant of BERT) trained on the task of masked-language modelling (MLM). Training was run for 10 epochs, until the loss converged to around 0.26 on the ZINC 250k dataset. The model weights for ChemBERTa pre-trained on various datasets (ZINC 100k, ZINC 250k, PubChem 100k, PubChem 250k, PubChem 1M, PubChem 10M) are available on HuggingFace. We expect to continue releasing larger models pre-trained on even larger subsets of ZINC, ChEMBL, and PubChem in the near future.

This library currently consists primarily of notebooks with our pre-training and fine-tuning setup, and will be updated soon with the model implementation and attention-visualization code, likely after the arXiv publication. Stay tuned!

I hope this is of use to developers, students and researchers exploring the use of transformers and the attention mechanism for chemistry!

Citing Our Work

Please cite ChemBERTa-2's arXiv paper if you have used these models, notebooks, or examples in any way. The link to the BibTeX is available here.

Example

You can load the tokenizer + model for MLM prediction tasks using the following code:

from transformers import AutoModelWithLMHead, AutoTokenizer, pipeline

# any model weights from the link above will work here
model = AutoModelWithLMHead.from_pretrained("seyonec/ChemBERTa-zinc-base-v1")
tokenizer = AutoTokenizer.from_pretrained("seyonec/ChemBERTa-zinc-base-v1")

fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)
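
The resulting pipeline fills in a single masked token in a SMILES string, for example:

# returns the top-scoring completions for the masked position
fill_mask("CCCO<mask>C")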

Todo:

  • Official DeepChem implementation of ChemBERTa using model API (In progress)
  • Open-source attention visualization suite used in paper (After formal publication - Beginning of September).
  • Release larger pre-trained models and support for a wider array of property-prediction tasks (BBBP, etc.) - see HuggingFace
  • Finish writing notebook to train model
  • Finish notebook to preload and run predictions on a single molecule -> test if HuggingFace works
  • Train RoBERTa model until convergence
  • Upload weights onto HuggingFace
  • Create tutorial using evaluation + fine-tuning notebook.
  • Create documentation, write-ups, and visualizations for the notebook.
  • Set up PR into DeepChem

bert-loves-chemistry's People

Contributors

elanapearl, gabegrand, rbharath, seyonechithrananda, walid0925


bert-loves-chemistry's Issues

Details about curating pubchem dataset

Thank you for publishing this great work! I have a question about the PubChem dataset used as a pretraining set.

In the arXiv paper, it is briefly mentioned that the 77M PubChem dataset was curated down to the 10M PubChem dataset.

Could you explain in a bit more detail how the 77M PubChem dataset was curated?

e.g., SMILES with non-bonded fragments removed
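
For illustration only (this is an assumption, not the authors' documented procedure), a filter of that kind might drop multi-fragment SMILES:

# Hypothetical filter sketch: a '.' in a SMILES string separates
# disconnected fragments, so this keeps only single-fragment molecules.
def keep_single_fragment(smiles_list):
    return [s for s in smiles_list if "." not in s]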

Accuracy metric in the pre-training stage

This is a question about SMILES pretraining. Do you have any metric for determining whether pretraining is going well, without any problems? In the short paper in the README description, only the results after fine-tuning on Tox21 are shown. Do you have an accuracy measure of how many masked tokens are matched during the unsupervised (pre-training) stage? If so, how does the accuracy vary with the representation type (SMILES-BPE, SMILES, SELFIES-BPE, etc.)?
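
For reference, a minimal sketch of the kind of metric being asked about, i.e. top-1 accuracy over masked positions (an assumption about how it could be measured, not the authors' reported setup):

import torch

def masked_token_accuracy(logits, labels, ignore_index=-100):
    # logits: [batch, seq_len, vocab_size]; labels: [batch, seq_len] with
    # non-masked positions set to ignore_index (the HuggingFace MLM convention).
    mask = labels != ignore_index
    preds = logits.argmax(dim=-1)
    return (preds[mask] == labels[mask]).float().mean().item()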

vocab special token - bos_token_id: 0 , overlapping padding index

Looking at the ChemBERTa vocab's special ids, I found that bos_token_id=0. In PyTorch, sentences of variable length are collated, and shorter sentences are generally padded with 0. In my opinion, it would be better to start special tokens such as bos_token from 1, so is there a reason for assigning it to 0?

config.json: seyonec/ChemBERTa-zinc-base-v1

{
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 0.00001,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 6,
  "output_past": true,
  "pad_token_id": 1,
  "type_vocab_size": 1,
  "vocab_size": 767
}
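
A quick way to inspect those ids directly from the published tokenizer (a small sketch assuming the HuggingFace weights linked above):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("seyonec/ChemBERTa-zinc-base-v1")
# Per the config above, this should print 0, 1 and 2 respectively.
print(tok.bos_token_id, tok.pad_token_id, tok.eos_token_id)

Note that the config sets pad_token_id to 1, so padding does not actually reuse id 0, and padded positions are excluded via the attention mask in any case.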

how to use SELFIES with ChemBERTa

Hi, I am Keanu and very much interested in AI-driven drug discovery. I found you on the HuggingFace model hub and am very impressed by what you have done. I'd like to use your pretrained network with the SELFIES parser. I have looked around the configuration and vocab, but I am a little bit confused.

This is a question about how to use the SELFIES parser. In the case of the HuggingFace WordPiece tokenizer, text is first split by the basic tokenizer and then divided into word pieces by BPE. In the case of the SMILES tokenizer that you implemented in DeepChem, SMILES is first parsed into atom units by the basic tokenizer, and it does not seem to be split further into word pieces by BPE. On the other hand, from a rough look at seyonec/BPE_SELFIES_PubChem_shard00_120k, I guess the SELFIES parser parses first and then creates sub-tokens with BPE. Is that right? If yes, is there any reason for doing it that way?

When will we get an example of using ChemBERTa with SELFIES in tutorial format? I tried to use the SELFIES pretrained model mentioned above by referring to DeepChem's SMILES tokenizer, but it keeps failing :-)
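
As a small sketch of the preprocessing being described (my assumption, using the selfies package; this is not a confirmed part of the ChemBERTa pipeline), a SMILES string can be converted into SELFIES symbols before any BPE step:

import selfies as sf

smiles = "CCO"                                  # ethanol, as a toy example
selfies_str = sf.encoder(smiles)                # e.g. "[C][C][O]"
symbols = list(sf.split_selfies(selfies_str))   # symbol-level SELFIES tokens
print(selfies_str, symbols)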

Incorrect tokenization when extracting pretrained features

Hi ChemBERTa team 🤗, I ran into a problem when tokenizing a SMILES sequence. I noticed that your example SMILES also contains a Cl atom, so I want to know whether you have run into the same issue.

The input sequence is COC1=C(C=C2C(=C1)CCN=C2C3=CC(=C(C=C3)Cl)Cl)Cl
However, the output of the tokenizer incorrectly labels the Cl atoms as C, and I don't know how to fix it. I also found that the ChemBERTa token table does indeed contain a 'Cl' token.

The output of tokenize() is as follows:
['C', 'O', 'C', '1', '=', 'C', '(', 'C', '=', 'C', '2', 'C', '(', '=', 'C', '1', ')', 'C', 'C', 'N', '=', 'C', '2', 'C', '3', '=', 'C', 'C', '(', '=', 'C', '(', 'C', '=', 'C', '3', ')', 'C', ')', 'C', ')', 'C']

Originally posted by @Chris-Tang6 in #58 (comment)
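
For comparison, a minimal regex-based atom-level tokenizer (a sketch of the commonly used SMILES splitting pattern, not the repo's DeepChem SmilesTokenizer) keeps two-character atoms such as Cl and Br intact:

import re

# Widely used SMILES tokenization pattern: multi-character atoms such as
# Cl and Br are matched before the single-character fallbacks.
SMILES_REGEX = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def atom_tokenize(smiles):
    return SMILES_REGEX.findall(smiles)

print(atom_tokenize("COC1=C(C=C2C(=C1)CCN=C2C3=CC(=C(C=C3)Cl)Cl)Cl"))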

What is the major difference between ChemBERTa_zinc250k_v2_40k and v1 (or others)

Hi seyonec,

I am using several versions of the published ChemBERTa models, and I have tried several versions of the ChemBERTa tokenizer/pretrained model. I found the results using "ChemBERTa_zinc250k_v2_40k" to be quite a bit better than "seyonec/ChemBERTa-zinc-base-v1" or "seyonec/ChemBERTa-zinc250k-v1", so I am curious what the difference is. Can I ask for an explanation of how "v2" was upgraded relative to "base v1" or other previous versions?

Using rawtext

Loving the work here!

I've been trying to use the classification RoBERTa with pubchem_1k_smiles.txt via

if __name__ == "__main__":
    smiles_data = "pubchem_1k_smiles.txt"
    smiles_token = prebuilt_smiles_tokenizer("vocab.txt")

    example_dataset = RawTextDataset(smiles_token, smiles_data, block_size=512)

and I get the following error

Traceback (most recent call last):
  File "/Users/rorygarland/Work/bert-loves-chemistry/chemberta/utils/raw_text_dataset.py", line 234, in <module>
    example_dataset = RawTextDataset(smiles_token, smiles_data, block_size=512)
  File "/Users/rorygarland/Work/bert-loves-chemistry/chemberta/utils/raw_text_dataset.py", line 37, in __init__
    self.dataset = load_dataset("text", data_files=data_files)["train"]
  File "/Users/rorygarland/Library/Caches/pypoetry/virtualenvs/chemberta-l_Axq8W4-py3.8/lib/python3.8/site-packages/nlp/load.py", line 548, in load_dataset
    builder_instance.download_and_prepare(
  File "/Users/rorygarland/Library/Caches/pypoetry/virtualenvs/chemberta-l_Axq8W4-py3.8/lib/python3.8/site-packages/nlp/builder.py", line 462, in download_and_prepare
    self._download_and_prepare(
  File "/Users/rorygarland/Library/Caches/pypoetry/virtualenvs/chemberta-l_Axq8W4-py3.8/lib/python3.8/site-packages/nlp/builder.py", line 537, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/Users/rorygarland/Library/Caches/pypoetry/virtualenvs/chemberta-l_Axq8W4-py3.8/lib/python3.8/site-packages/nlp/builder.py", line 865, in _prepare_split
    for key, table in utils.tqdm(generator, unit=" tables", leave=False):
  File "/Users/rorygarland/Library/Caches/pypoetry/virtualenvs/mmdeacon-l_Axq8W4-py3.8/lib/python3.8/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/Users/rorygarland/Library/Caches/pypoetry/virtualenvs/chemberta-l_Axq8W4-py3.8/lib/python3.8/site-packages/nlp/datasets/text/c3b177069f0fad4da737a020bb39bbdb7aa16992e1f401e4347568618c906e28/text.py", line 95, in _generate_tables
    pa_table = pac.read_csv(
  File "pyarrow/_csv.pyx", line 1217, in pyarrow._csv.read_csv
  File "pyarrow/_csv.pyx", line 1221, in pyarrow._csv.read_csv
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: ParseOptions: delimiter cannot be \r or \n

My versions:

nlp: 0.4.0
pyarrow: 8.0.0

Is this a version issue or am I doing something very silly?
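
For what it's worth, here is a minimal sketch (my assumption about the cause, not a documented fix) of loading the same file with the datasets library, which superseded the deprecated nlp package that appears in the traceback:

from datasets import load_dataset

# Each line of the text file becomes one training example.
dataset = load_dataset("text", data_files={"train": "pubchem_1k_smiles.txt"})["train"]
print(dataset[0])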

Question about chirality and model results from paper

Hi,

ChemBERTa was trained on achiral canonicalized molecules, as evidenced by the achiral canonicalized dataset.

The MoleculeNet fine-tuning datasets in the paper contain chiral molecules. How was this addressed?

Did you canonicalize the chiral molecules to make them achiral and canonicalized? What about duplicates (since stereoisomers would then collapse to the same string)?

Or did you just run inference on the chiral strings? That would certainly cause some problems.

I didn't see any mention of this in the paper.

Please let me know,
Clay
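
For context, here is a small RDKit sketch of what making a chiral SMILES "achiral canonicalized" could look like (an illustration of the question only, not a statement of what the authors actually did):

from rdkit import Chem

smiles = "C[C@H](N)C(=O)O"          # L-alanine, a chiral example
mol = Chem.MolFromSmiles(smiles)
Chem.RemoveStereochemistry(mol)     # strip chiral tags and bond stereo in place
achiral = Chem.MolToSmiles(mol)     # canonical SMILES without stereochemistry
print(achiral)                      # e.g. "CC(N)C(=O)O"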

Getting SMILES Vector in Pretrained ChemBERTa Model

Firstly, thanks for open-sourcing the model. I want to cluster SMILES compounds using their pretrained model vectors. I have implemented the following code for this, but it doesn't work for more than 1000 rows of data. I don't know if I am doing this correctly, since I haven't seen any official documentation for it. The only related code in the repo is the viz_utils.gen_embeddings function; unfortunately, it also struggles with larger SMILES datasets.

I am open to suggestions. Thanks in advance.

from transformers import RobertaTokenizerFast, RobertaModel
from sklearn.cluster import KMeans
import torch

model_name = "DeepChem/ChemBERTa-77M-MLM"
tokenizer = RobertaTokenizerFast.from_pretrained(model_name)
model = RobertaModel.from_pretrained(model_name, output_hidden_states = True)
model.eval()

smiles_compounds = [
  "O=C(Cc1cccc2ccccc12)Nc1n[nH]c2ccc(N3CCCS3(=O)=O)cc12",
  "COC(=O)NC[C@@H](NC(=O)c1ccc(-c2nc(C3CCOCC3)cnc2N)cc1F)c1cccc(Br)c1",
  "COc1ccccc1Nc1cc(Oc2cc(C)c(C)nc2-c2ccccn2)ccn1",
  "O=C(/C=C/CN1CCCC1)N1CCOc2cc3ncnc(Nc4ccc(F)c(Cl)c4)c3cc21",
]

inputs = tokenizer(smiles_compounds, return_tensors='pt', padding=True, truncation=True)

with torch.no_grad():
  out = model(**inputs)

# Shape is: [len(smiles_compounds), 65, 384]
states = out.hidden_states[-1].squeeze()

# Average the token vectors for each sample, which will give you a single 384-dimensional vector for each sample.
states_2d = states.mean(dim=1).numpy()
states_2d.shape

kmeans = KMeans(n_clusters=2, random_state=0)
kmeans.fit(states_2d)

clusters = kmeans.predict(states_2d)
clusters
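
For datasets that are too large for a single forward pass, one option (just a batching sketch, not an official ChemBERTa utility) is to embed the SMILES in chunks, reusing the tokenizer and model loaded above:

import numpy as np

def embed_in_batches(smiles_list, batch_size=64):
    # Process the SMILES in small batches so the activations for the whole
    # dataset never need to fit in memory at once.
    chunks = []
    for i in range(0, len(smiles_list), batch_size):
        batch = smiles_list[i:i + batch_size]
        enc = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
        with torch.no_grad():
            out = model(**enc)
        # Mean-pool the final hidden states over the token dimension.
        chunks.append(out.last_hidden_state.mean(dim=1).numpy())
    return np.concatenate(chunks, axis=0)

embeddings = embed_in_batches(smiles_compounds)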

Masking multiple tokens at a time

Hi,
is it possible to mask multiple tokens at a time?

E.g. fill_mask('CCCO<mask>C') works fine. But writing fill_mask('CC<mask>CO<mask>C') I obtain:

~/miniconda3/envs/paccmann/lib/python3.7/site-packages/transformers/pipelines.py in __call__(self, *args, **kwargs)
    553                 values, predictions = topk.values.numpy(), topk.indices.numpy()
    554             else:
--> 555                 masked_index = (input_ids == self.tokenizer.mask_token_id).nonzero().item()
    556                 logits = outputs[i, masked_index, :]
    557                 probs = logits.softmax(dim=0)

ValueError: only one element tensors can be converted to Python scalars

Am I doing something wrong, or is this feature not supported, @seyonechithrananda?
Many thanks for a reply!
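
One way around the single-mask limitation of the pipeline (a sketch that uses the MLM head directly; this is not a supported feature of these notebooks) is to score each masked position yourself:

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("seyonec/ChemBERTa-zinc-base-v1")
model = AutoModelForMaskedLM.from_pretrained("seyonec/ChemBERTa-zinc-base-v1")

inputs = tokenizer("CC<mask>CO<mask>C", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Locate every masked position and print its top-5 candidate tokens.
mask_positions = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
for pos in mask_positions:
    top_ids = logits[0, pos].topk(5).indices.tolist()
    print(pos.item(), tokenizer.convert_ids_to_tokens(top_ids))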
