
bert-loves-chemistry's Introduction

ChemBERTa

ChemBERTa: A collection of BERT-like models applied to chemical SMILES data for drug design, chemical modelling, and property prediction. To be presented at Baylearn and the Royal Society of Chemistry's Chemical Science Symposium.

Tutorial
arXiv ChemBERTa-2 Paper
arXiv ChemBERTa Paper
Poster
Abstract
BibTeX

License: MIT License

Right now the notebooks are all for the RoBERTa model (a variant of BERT) trained on the task of masked-language modelling (MLM). Training was run for 10 epochs, until the loss converged to around 0.26 on the ZINC 250k dataset. The model weights for ChemBERTa pre-trained on various datasets (ZINC 100k, ZINC 250k, PubChem 100k, PubChem 250k, PubChem 1M, PubChem 10M) are available on HuggingFace. We expect to continue releasing larger models pre-trained on even larger subsets of ZINC, ChEMBL, and PubChem in the near future.

This library currently consists primarily of notebooks with our pre-training and fine-tuning setup, and will be updated soon with the model implementation and attention-visualization code, likely after the arXiv publication. Stay tuned!

I hope this is of use to developers, students and researchers exploring the use of transformers and the attention mechanism for chemistry!

Citing Our Work

Please cite ChemBERTa-2's arXiv paper if you have used these models, notebooks, or examples in any way. The link to the BibTeX is available here.

Example

You can load the tokenizer + model for MLM prediction tasks using the following code:

from transformers import AutoModelWithLMHead, AutoTokenizer, pipeline

# any model weights from the link above will work here
model = AutoModelWithLMHead.from_pretrained("seyonec/ChemBERTa-zinc-base-v1")
tokenizer = AutoTokenizer.from_pretrained("seyonec/ChemBERTa-zinc-base-v1")

fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)
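
The resulting pipeline fills in a single masked token in a SMILES string, for example:

# returns the top-scoring completions for the masked position
fill_mask("CCCO<mask>C")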

Todo:

  • Official DeepChem implementation of ChemBERTa using model API (In progress)
  • Open-source attention visualization suite used in paper (After formal publication - Beginning of September).
  • Release larger pre-trained models and support for a wider array of property-prediction tasks (BBBP, etc.) - see HuggingFace
  • Finish writing notebook to train model
  • Finish notebook to preload and run predictions on a single molecule -> test if HuggingFace works
  • Train RoBERTa model until convergence
  • Upload weights onto HuggingFace
  • Create tutorial using evaluation + fine-tuning notebook.
  • Create documentation, write-ups, and visualizations for the notebook.
  • Set up PR into DeepChem

bert-loves-chemistry's People

Contributors

elanapearl, gabegrand, rbharath, seyonechithrananda, walid0925


bert-loves-chemistry's Issues

Details about curating pubchem dataset

Thank you for publishing this great work! I have a question about the PubChem dataset used as a pretraining set.

In the arXiv paper, it is briefly mentioned that the 77M PubChem dataset was curated down to the 10M PubChem dataset.

Could you explain in a bit more detail how the 77M PubChem dataset was curated?

e.g., SMILES with non-bonded fragments removed
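
For illustration only (this is an assumption, not the authors' documented procedure), a filter of that kind might drop multi-fragment SMILES:

# Hypothetical filter sketch: a '.' in a SMILES string separates
# disconnected fragments, so this keeps only single-fragment molecules.
def keep_single_fragment(smiles_list):
    return [s for s in smiles_list if "." not in s]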

Accuracy metric in the pre-training stage

This is a question about SMILES pretraining. Do you have any metric for determining whether pretraining is going well, without any problems? In the short paper in the README description, only the results after fine-tuning on Tox21 are shown. Do you have an accuracy measure of how many masked tokens are matched during the unsupervised (pre-training) stage? If so, how does the accuracy vary with the representation type (SMILES-BPE, SMILES, SELFIES-BPE, etc.)?
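
For reference, a minimal sketch of the kind of metric being asked about, i.e. top-1 accuracy over masked positions (an assumption about how it could be measured, not the authors' reported setup):

import torch

def masked_token_accuracy(logits, labels, ignore_index=-100):
    # logits: [batch, seq_len, vocab_size]; labels: [batch, seq_len] with
    # non-masked positions set to ignore_index (the HuggingFace MLM convention).
    mask = labels != ignore_index
    preds = logits.argmax(dim=-1)
    return (preds[mask] == labels[mask]).float().mean().item()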

vocab special token - bos_token_id: 0 , overlapping padding index

Looking at the ChemBERTa vocab's special ids, I found that bos_token_id=0. In PyTorch, sentences of variable length are collated, and shorter sentences are generally padded with 0. In my opinion, it would be better to start special tokens such as bos_token from 1, so is there a reason for assigning it to 0?

config.json: seyonec/ChemBERTa-zinc-base-v1

{
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 0.00001,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 6,
  "output_past": true,
  "pad_token_id": 1,
  "type_vocab_size": 1,
  "vocab_size": 767
}
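
A quick way to inspect those ids directly from the published tokenizer (a small sketch assuming the HuggingFace weights linked above):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("seyonec/ChemBERTa-zinc-base-v1")
# Per the config above, this should print 0, 1 and 2 respectively.
print(tok.bos_token_id, tok.pad_token_id, tok.eos_token_id)

Note that the config sets pad_token_id to 1, so padding does not actually reuse id 0, and padded positions are excluded via the attention mask in any case.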

how to use SELFIES with ChemBERTa

Hi, I am Keanu and very much interested in AI-driven drug discovery. I found you on the HuggingFace model hub and am very impressed by what you have done. I'd like to use your pretrained network with the SELFIES parser. I have looked around the configuration and vocab, but I am a little bit confused.

This is a question about how to use the SELFIES parser. In the case of the HuggingFace WordPiece tokenizer, text is first split by the basic tokenizer and then divided into word pieces by BPE. In the case of the SMILES tokenizer that you implemented in DeepChem, SMILES is first parsed into atom units by the basic tokenizer, and it does not seem to be split further into word pieces by BPE. On the other hand, from a rough look at seyonec/BPE_SELFIES_PubChem_shard00_120k, I guess the SELFIES parser parses first and then creates sub-tokens with BPE. Is that right? If yes, is there any reason for doing it that way?

When will we get an example of using ChemBERTa with SELFIES in tutorial format? I tried to use the SELFIES pretrained model mentioned above by referring to DeepChem's SMILES tokenizer, but it keeps failing :-)
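
As a small sketch of the preprocessing being described (my assumption, using the selfies package; this is not a confirmed part of the ChemBERTa pipeline), a SMILES string can be converted into SELFIES symbols before any BPE step:

import selfies as sf

smiles = "CCO"                                  # ethanol, as a toy example
selfies_str = sf.encoder(smiles)                # e.g. "[C][C][O]"
symbols = list(sf.split_selfies(selfies_str))   # symbol-level SELFIES tokens
print(selfies_str, symbols)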

Incorrect tokenization when extracting pretrained features

Hi ChemBERTa team 🤗, I ran into a problem when tokenizing a SMILES sequence. I noticed that your example SMILES also contains a Cl atom, so I want to know whether you have run into the same issue.

The input sequence is COC1=C(C=C2C(=C1)CCN=C2C3=CC(=C(C=C3)Cl)Cl)Cl
However, the output of the tokenizer incorrectly labels the Cl atoms as C, and I don't know how to fix it. I also found that the ChemBERTa token table does indeed contain a 'Cl' token.

The output of tokenize() is as follows:
['C', 'O', 'C', '1', '=', 'C', '(', 'C', '=', 'C', '2', 'C', '(', '=', 'C', '1', ')', 'C', 'C', 'N', '=', 'C', '2', 'C', '3', '=', 'C', 'C', '(', '=', 'C', '(', 'C', '=', 'C', '3', ')', 'C', ')', 'C', ')', 'C']

Originally posted by @Chris-Tang6 in #58 (comment)
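
For comparison, a minimal regex-based atom-level tokenizer (a sketch of the commonly used SMILES splitting pattern, not the repo's DeepChem SmilesTokenizer) keeps two-character atoms such as Cl and Br intact:

import re

# Widely used SMILES tokenization pattern: multi-character atoms such as
# Cl and Br are matched before the single-character fallbacks.
SMILES_REGEX = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def atom_tokenize(smiles):
    return SMILES_REGEX.findall(smiles)

print(atom_tokenize("COC1=C(C=C2C(=C1)CCN=C2C3=CC(=C(C=C3)Cl)Cl)Cl"))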

What is the major difference between ChemBERTa_zinc250k_v2_40k and v1 (or others)

Hi seyonec,

I am using several versions of the published ChemBERTa models, and I have tried several versions of the ChemBERTa tokenizer/pretrained model. I found the results using "ChemBERTa_zinc250k_v2_40k" to be quite a bit better than "seyonec/ChemBERTa-zinc-base-v1" or "seyonec/ChemBERTa-zinc250k-v1", so I am curious what the difference is. Can I ask for an explanation of how "v2" was upgraded relative to "base v1" or other previous versions?

Using rawtext

Loving the work here!

I've been trying to use the classification RoBERTa with pubchem_1k_smiles.txt via

if __name__ == "__main__":
    smiles_data = "pubchem_1k_smiles.txt"
    smiles_token = prebuilt_smiles_tokenizer("vocab.txt")

    example_dataset = RawTextDataset(smiles_token, smiles_data, block_size=512)

and I get the following error

Traceback (most recent call last):
  File "/Users/rorygarland/Work/bert-loves-chemistry/chemberta/utils/raw_text_dataset.py", line 234, in <module>
    example_dataset = RawTextDataset(smiles_token, smiles_data, block_size=512)
  File "/Users/rorygarland/Work/bert-loves-chemistry/chemberta/utils/raw_text_dataset.py", line 37, in __init__
    self.dataset = load_dataset("text", data_files=data_files)["train"]
  File "/Users/rorygarland/Library/Caches/pypoetry/virtualenvs/chemberta-l_Axq8W4-py3.8/lib/python3.8/site-packages/nlp/load.py", line 548, in load_dataset
    builder_instance.download_and_prepare(
  File "/Users/rorygarland/Library/Caches/pypoetry/virtualenvs/chemberta-l_Axq8W4-py3.8/lib/python3.8/site-packages/nlp/builder.py", line 462, in download_and_prepare
    self._download_and_prepare(
  File "/Users/rorygarland/Library/Caches/pypoetry/virtualenvs/chemberta-l_Axq8W4-py3.8/lib/python3.8/site-packages/nlp/builder.py", line 537, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/Users/rorygarland/Library/Caches/pypoetry/virtualenvs/chemberta-l_Axq8W4-py3.8/lib/python3.8/site-packages/nlp/builder.py", line 865, in _prepare_split
    for key, table in utils.tqdm(generator, unit=" tables", leave=False):
  File "/Users/rorygarland/Library/Caches/pypoetry/virtualenvs/mmdeacon-l_Axq8W4-py3.8/lib/python3.8/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/Users/rorygarland/Library/Caches/pypoetry/virtualenvs/chemberta-l_Axq8W4-py3.8/lib/python3.8/site-packages/nlp/datasets/text/c3b177069f0fad4da737a020bb39bbdb7aa16992e1f401e4347568618c906e28/text.py", line 95, in _generate_tables
    pa_table = pac.read_csv(
  File "pyarrow/_csv.pyx", line 1217, in pyarrow._csv.read_csv
  File "pyarrow/_csv.pyx", line 1221, in pyarrow._csv.read_csv
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: ParseOptions: delimiter cannot be \r or \n

My versions:

nlp: 0.4.0
pyarrow: 8.0.0

Is this a version issue or am I doing something very silly?
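
For what it's worth, here is a minimal sketch (my assumption about the cause, not a documented fix) of loading the same file with the datasets library, which superseded the deprecated nlp package that appears in the traceback:

from datasets import load_dataset

# Each line of the text file becomes one training example.
dataset = load_dataset("text", data_files={"train": "pubchem_1k_smiles.txt"})["train"]
print(dataset[0])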

Question about chirality and model results from paper

Hi,

ChemBERTa was trained on achiral canonicalized molecules, as evidenced by the achiral canonicalized dataset.

The MoleculeNet fine-tuning datasets in the paper contain chiral molecules. How was this addressed?

Did you canonicalize the chiral molecules to make them achiral and canonicalized? What about duplicates (since stereoisomers would then collapse to the same string)?

Or did you just run inference on the chiral strings? That would certainly cause some problems.

I didn't see any mention of this in the paper.

Please let me know,
Clay
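
For context, here is a small RDKit sketch of what making a chiral SMILES "achiral canonicalized" could look like (an illustration of the question only, not a statement of what the authors actually did):

from rdkit import Chem

smiles = "C[C@H](N)C(=O)O"          # L-alanine, a chiral example
mol = Chem.MolFromSmiles(smiles)
Chem.RemoveStereochemistry(mol)     # strip chiral tags and bond stereo in place
achiral = Chem.MolToSmiles(mol)     # canonical SMILES without stereochemistry
print(achiral)                      # e.g. "CC(N)C(=O)O"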

Getting SMILES Vector in Pretrained ChemBERTa Model

Firstly, thanks for open-sourcing the model. I want to cluster SMILES compounds using their pretrained model vectors. I have implemented the following code for this, but it doesn't work for more than 1000 rows of data. I don't know if I am doing this correctly, since I haven't seen any official documentation for it. The only related code in the repo is the viz_utils.gen_embeddings function; unfortunately, it also struggles with larger SMILES datasets.

I am open to suggestions. Thanks in advance.

from transformers import RobertaTokenizerFast, RobertaModel
from sklearn.cluster import KMeans
import torch

model_name = "DeepChem/ChemBERTa-77M-MLM"
tokenizer = RobertaTokenizerFast.from_pretrained(model_name)
model = RobertaModel.from_pretrained(model_name, output_hidden_states = True)
model.eval()

smiles_compounds = [
  "O=C(Cc1cccc2ccccc12)Nc1n[nH]c2ccc(N3CCCS3(=O)=O)cc12",
  "COC(=O)NC[C@@H](NC(=O)c1ccc(-c2nc(C3CCOCC3)cnc2N)cc1F)c1cccc(Br)c1",
  "COc1ccccc1Nc1cc(Oc2cc(C)c(C)nc2-c2ccccn2)ccn1",
  "O=C(/C=C/CN1CCCC1)N1CCOc2cc3ncnc(Nc4ccc(F)c(Cl)c4)c3cc21",
]

inputs = tokenizer(smiles_compounds, return_tensors='pt', padding=True, truncation=True)

with torch.no_grad():
  out = model(**inputs)

# Shape is: [len(smiles_compounds), 65, 384]
states = out.hidden_states[-1].squeeze()

# Average the token vectors for each sample, which will give you a single 384-dimensional vector for each sample.
states_2d = states.mean(dim=1).numpy()
states_2d.shape

kmeans = KMeans(n_clusters=2, random_state=0)
kmeans.fit(states_2d)

clusters = kmeans.predict(states_2d)
clusters
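
For datasets that are too large for a single forward pass, one option (just a batching sketch, not an official ChemBERTa utility) is to embed the SMILES in chunks, reusing the tokenizer and model loaded above:

import numpy as np

def embed_in_batches(smiles_list, batch_size=64):
    # Process the SMILES in small batches so the activations for the whole
    # dataset never need to fit in memory at once.
    chunks = []
    for i in range(0, len(smiles_list), batch_size):
        batch = smiles_list[i:i + batch_size]
        enc = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
        with torch.no_grad():
            out = model(**enc)
        # Mean-pool the final hidden states over the token dimension.
        chunks.append(out.last_hidden_state.mean(dim=1).numpy())
    return np.concatenate(chunks, axis=0)

embeddings = embed_in_batches(smiles_compounds)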

Masking multiple tokens at a time

Hi,
is it possible to mask multiple tokens at a time?

E.g. fill_mask('CCCO<mask>C') works fine. But writing fill_mask('CC<mask>CO<mask>C') I obtain:

~/miniconda3/envs/paccmann/lib/python3.7/site-packages/transformers/pipelines.py in __call__(self, *args, **kwargs)
    553                 values, predictions = topk.values.numpy(), topk.indices.numpy()
    554             else:
--> 555                 masked_index = (input_ids == self.tokenizer.mask_token_id).nonzero().item()
    556                 logits = outputs[i, masked_index, :]
    557                 probs = logits.softmax(dim=0)

ValueError: only one element tensors can be converted to Python scalars

Am I doing something wrong, or is this feature not supported, @seyonechithrananda?
Many thanks for a reply!
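
One way around the single-mask limitation of the pipeline (a sketch that uses the MLM head directly; this is not a supported feature of these notebooks) is to score each masked position yourself:

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("seyonec/ChemBERTa-zinc-base-v1")
model = AutoModelForMaskedLM.from_pretrained("seyonec/ChemBERTa-zinc-base-v1")

inputs = tokenizer("CC<mask>CO<mask>C", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Locate every masked position and print its top-5 candidate tokens.
mask_positions = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
for pos in mask_positions:
    top_ids = logits[0, pos].topk(5).indices.tolist()
    print(pos.item(), tokenizer.convert_ids_to_tokens(top_ids))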
