benevolentai / guacamol
Benchmarks for generative chemistry
License: MIT License
I installed guacamol in a fresh conda environment with only rdkit and pytorch preinstalled, so guacamol pulled in scipy itself. However, the scipy version it installs no longer has the imread function (removed since scipy 1.2; guacamol installs 1.4.1).
Simply removing the import of imread in FCD.py line 24 seems to fix the problem, as the function is not used anywhere in the file.
When I run the command:
python -m guacamol.data.get_data -o "/home/zh/桌面/project/git2/从头设计的分子基准模型测试/guacamol/data" --chembl
I get a different hash value:
Traceback (most recent call last):
File "/home/zh/sda3/Anaconda3/envs/guac/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/zh/sda3/Anaconda3/envs/guac/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/project/git2/从头设计的分子基准模型测试/guacamol/guacamol/data/get_data.py", line 263, in <module>
main()
File "/project/git2/从头设计的分子基准模型测试/guacamol/guacamol/data/get_data.py", line 253, in main
compare_hash(train_path, TRAIN_HASH)
File "/project/git2/从头设计的分子基准模型测试/guacamol/guacamol/data/get_data.py", line 149, in compare_hash
raise ValueError(f'{output_file} file has different hash {output_hash} than expected {correct_hash}!')
ValueError: /home/zh/桌面/project/git2/从头设计的分子基准模型测试/guacamol/guacamol/data/chembl24_canon_train.smiles file has different hash 75a644a29fdd347687f96aa65f1dbbce than expected 05ad85d871958a05c02ab51a4fde8530!
What is the cause of this?
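One way to narrow this down is to recompute the checksum of the generated file by hand and compare it across machines or RDKit versions. The sketch below assumes the expected value is an MD5 hex digest (both values in the error have 32 hex characters); the helper name file_md5 is hypothetical and not part of guacamol.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, reading it in chunks.

    Assumption: the hash compared by get_data.py is an MD5 digest of the
    file contents; this helper is illustrative, not guacamol's own code.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

If the digest differs between two environments that downloaded the same raw data, the difference most likely comes from the processing step rather than from the download.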
Unit testing suite
You can test your installation of the guacamol benchmarking library by running the unit tests from this directory:
pytest .
But how do I actually use it?
fcd has been ported to PyTorch at https://github.com/insilicomedicine/fcd_torch
How would you feel about supporting both FCD implementations?
You could use fcd by default, fall back to fcd_torch if fcd is missing, and provide an opt-in option to select one of the two at runtime.
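The fallback described above could look roughly like this. It is only a sketch: select_fcd_backend is a hypothetical helper, and it only picks which module to import. The two packages do not necessarily expose the same functions, so guacamol would still need a thin adapter around whichever module is returned.

```python
import importlib
import importlib.util


def select_fcd_backend(preferred=None, candidates=("fcd", "fcd_torch")):
    """Import and return the first available FCD backend module.

    preferred: optional module name to force (the opt-in runtime choice).
    candidates: module names tried in order when nothing is forced.
    """
    if preferred is not None:
        if preferred not in candidates:
            raise ValueError(f"unknown backend {preferred!r}, expected one of {candidates}")
        # An explicit choice should fail loudly if the package is missing.
        return importlib.import_module(preferred)

    for name in candidates:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is not None:
            return importlib.import_module(name)

    raise ImportError(f"none of the FCD backends are installed: {candidates}")
```

Calling select_fcd_backend() gives the default behaviour (fcd first, then fcd_torch), while select_fcd_backend(preferred="fcd_torch") is the opt-in path.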
When using guacamol with PytorchLightning==1.6.5 and PyTorch==1.12.0, I get a mysterious segfault when running the following code:
import pytorch_lightning as pl
from guacamol import standard_benchmarks as sb
sb.valsartan_smarts()
However, when using PyTorch==1.11.0 this segfault does not occur. It is unclear what is causing this issue.
For reproducibility I've attached the exports of my Conda environments for both configurations: broken.yml is the environment that segfaults, while working.yml is the environment that works.
Environments.zip
guacamol is now available on conda-forge
https://github.com/conda-forge/guacamol-feedstock
conda install -c conda-forge guacamol
Hi,
The ChemNet file name changed in FCD version 1.2, which causes a bug when this metric is evaluated in assess_distribution_learning. The new name is 'ChemNet_v0.13_pretrained.pt' (see here).
Downgrading to FCD 1.1 fixes the bug. Could you please update the dependencies or change the file name in your code?
Cheers
Hi all,
I tried using the assess_distribution_learning() function to calculate the benchmark metrics for one of my models with a custom training dataset. I created a subclass of DistributionMatchingGenerator and wrote the sampling code to obtain any number of molecules from my pre-trained model, as instructed. The code runs fine for a while, then fails during the FCD metric calculation with the following stack trace:
Traceback (most recent call last):
File "benchmark_model_with_guacamol_v2.py", line 475, in <module>
assess_distribution_learning(vae_model, chembl_training_file=training_data, json_output_file=json_file_path, benchmark_version="v1")
File "/home/sowmya/anaconda3/envs/ddenv_new/lib/python3.6/site-packages/guacamol/assess_distribution_learning.py", line 34, in assess_distribution_learning
number_samples=10000)
File "/home/sowmya/anaconda3/envs/ddenv_new/lib/python3.6/site-packages/guacamol/assess_distribution_learning.py", line 51, in _assess_distribution_learning
results = _evaluate_distribution_learning_benchmarks(model=model, benchmarks=benchmarks)
File "/home/sowmya/anaconda3/envs/ddenv_new/lib/python3.6/site-packages/guacamol/assess_distribution_learning.py", line 83, in _evaluate_distribution_learning_benchmarks
result = benchmark.assess_model(model)
File "/home/sowmya/anaconda3/envs/ddenv_new/lib/python3.6/site-packages/guacamol/frechet_benchmark.py", line 53, in assess_model
mu_ref, cov_ref = self._calculate_distribution_statistics(chemnet, self.reference_molecules)
File "/home/sowmya/anaconda3/envs/ddenv_new/lib/python3.6/site-packages/guacamol/frechet_benchmark.py", line 94, in _calculate_distribution_statistics
gen_mol_act = fcd.get_predictions(model, sample_std)
File "/home/sowmya/anaconda3/envs/ddenv_new/lib/python3.6/site-packages/fcd/FCD.py", line 196, in get_predictions
steps=np.ceil(len(gen_mol) / 128))
File "/home/sowmya/anaconda3/envs/ddenv_new/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py", line 1915, in predict_generator
callbacks=callbacks)
File "/home/sowmya/anaconda3/envs/ddenv_new/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py", line 1629, in predict
tmp_batch_outputs = self.predict_function(iterator)
File "/home/sowmya/anaconda3/envs/ddenv_new/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
result = self._call(*args, **kwds)
File "/home/sowmya/anaconda3/envs/ddenv_new/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 862, in _call
results = self._stateful_fn(*args, **kwds)
File "/home/sowmya/anaconda3/envs/ddenv_new/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 2943, in __call__
filtered_flat_args, captured_inputs=graph_function.captured_inputs) # pylint: disable=protected-access
File "/home/sowmya/anaconda3/envs/ddenv_new/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 1919, in _call_flat
ctx, args, cancellation_manager=cancellation_manager))
File "/home/sowmya/anaconda3/envs/ddenv_new/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 560, in call
ctx=ctx)
File "/home/sowmya/anaconda3/envs/ddenv_new/lib/python3.6/site-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.InvalidArgumentError: TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'
Traceback (most recent call last):
File "/home/sowmya/anaconda3/envs/ddenv_new/lib/python3.6/site-packages/tensorflow/python/ops/script_ops.py", line 249, in __call__
ret = func(*args)
File "/home/sowmya/anaconda3/envs/ddenv_new/lib/python3.6/site-packages/tensorflow/python/autograph/impl/api.py", line 620, in wrapper
return func(*args, **kwargs)
File "/home/sowmya/anaconda3/envs/ddenv_new/lib/python3.6/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 891, in generator_py_func
values = next(generator_state.get_iterator(iterator_id))
File "/home/sowmya/anaconda3/envs/ddenv_new/lib/python3.6/site-packages/tensorflow/python/keras/engine/data_adapter.py", line 807, in wrapped_generator
for data in generator_fn():
File "/home/sowmya/anaconda3/envs/ddenv_new/lib/python3.6/site-packages/fcd/FCD.py", line 156, in myGenerator_predict
smiEnc = get_one_hot(currentSmiles, pad_len=nn)
File "/home/sowmya/anaconda3/envs/ddenv_new/lib/python3.6/site-packages/fcd/FCD.py", line 118, in get_one_hot
smiles = smiles + '.'
TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'
[[{{node PyFunc}}]]
[[IteratorGetNext]] [Op:__inference_predict_function_2225]
Function call stack:
predict_function
I am unable to identify why this error pops up at this stage. Any suggestions for resolving it would be really helpful. I used the same code with the ChEMBL dataset a couple of months ago to benchmark another model and it worked fine at the time. I am not sure whether some of the package versions are no longer compatible, so here are the versions I am using:
Tensorflow: v2.4.0
Keras: v2.4.3
GuacaMol: v0.5.2
Python: v3.6.13
Thanks in advance!
Sowmya
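For what it's worth, the final TypeError comes from smiles + '.' in get_one_hot, so at least one entry handed to FCD is None, and the traceback shows this happening while the statistics of the reference molecules are computed. A possible cause (not confirmed here) is an empty or unparseable line in the custom training file. The sketch below is a hypothetical pre-check, not guacamol code, that separates usable SMILES strings from missing ones so the offending positions can be inspected.

```python
def split_missing(smiles_list):
    """Separate usable SMILES strings from missing entries.

    Returns (kept, missing_indices), where missing_indices are the
    positions holding None or an empty/whitespace-only string.
    """
    kept, missing_indices = [], []
    for index, smiles in enumerate(smiles_list):
        if smiles is None or not smiles.strip():
            missing_indices.append(index)
        else:
            kept.append(smiles)
    return kept, missing_indices


if __name__ == "__main__":
    kept, missing = split_missing(["CCO", None, "", "c1ccccc1"])
    print(f"{len(kept)} usable entries, missing at positions {missing}")
```

Running this on both the training molecules and a batch of generated molecules should show whether any None values exist before they reach the FCD code.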
The description says 'Make start_pop_ranolazine more polar and add a fluorine',
but the code is:
logP_under_4 = RdkitScoringFunction(descriptor=logP, score_modifier=MaxGaussianModifier(mu=7, sigma=1))
I assume the name logP_under_4 stands for 'minimize logP until it is under 4', but the function uses MaxGaussianModifier with mu=7. Shouldn't that be MinGaussianModifier with mu=4?
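To make the difference concrete, here is a standalone sketch of the two half-Gaussian shapes as I understand them: the "max" variant gives full score at or above mu and decays below it, while the "min" variant gives full score at or below mu and decays above it. These are re-implementations for illustration under that assumption, not the library's own classes.

```python
import math


def max_gaussian(x, mu, sigma):
    """Score 1.0 for x >= mu, Gaussian decay for x below mu (assumed semantics)."""
    clipped = min(x, mu)  # values above mu are treated as exactly mu
    return math.exp(-0.5 * ((clipped - mu) / sigma) ** 2)


def min_gaussian(x, mu, sigma):
    """Score 1.0 for x <= mu, Gaussian decay for x above mu (assumed semantics)."""
    clipped = max(x, mu)  # values below mu are treated as exactly mu
    return math.exp(-0.5 * ((clipped - mu) / sigma) ** 2)


if __name__ == "__main__":
    for logp in (3.0, 4.0, 5.5, 7.0):
        print(logp, max_gaussian(logp, mu=7, sigma=1), min_gaussian(logp, mu=4, sigma=1))
```

Under these assumptions, a molecule with logP = 3 scores close to zero with the max variant at mu=7 but gets the full score with the min variant at mu=4, which is the discrepancy the question is about.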
From what I understand, you set isomericSmiles = False in your preprocessing (the filter_and_canonicalize function).
This means isomeric information is not taken into account. Do you think this might be an issue, especially since isomers don't necessarily have similar chemical or physical properties?
In utils.chemistry, histogram is imported from scipy, which no longer provides it.
I replaced the import with from numpy import histogram and it seems to work.
Is it possible to update guacamol to support ?
This is not really an issue, more of a "might be useful to know". This graph shows the effect of two variables on the FCD value: the sample size of the molecule reference set (GuacaMol uses 10,000 afaik) and the padding length of the molecules before they go into the ChemNet model (fcd uses 350). A bit more background is in this repo: https://github.com/hogru/GuacaMolEval
Main result/diagram: https://github.com/hogru/GuacaMolEval/blob/main/figures/fcd_values.jpg
I would like to use Guacamol to benchmark 3rd party products for generative chemistry. I realize that some default Guacamol benchmarks may be unsuitable for this, such as those that measure training data distributions (which we cannot see) against generated molecule distributions. However, we’d still like to do our best evaluating these tools in the Guacamol framework.
Do you have any advice around this? I have explored usage of Guacamol as a Python library that integrates with my generative code, but these 3rd party tools instead typically yield molecules via web browser interfaces or minimal web APIs. Would it be best for me to create Python subroutines that can mock molecule generation for Guacamol, but are really reading from a file containing molecules generated by these tools? Or are there other options you suggest? Many thanks in advance!
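The file-reading approach mentioned above could be sketched as follows. FileBackedGenerator is a hypothetical class; for real use it would presumably subclass DistributionMatchingGenerator (from guacamol.distribution_matching_generator) so that it can be passed to assess_distribution_learning. The base class is left out here so the example runs without guacamol installed. It assumes one SMILES string per line in the exported file.

```python
import random


class FileBackedGenerator:
    """Serves molecules that a third-party tool exported to a text file.

    Assumption: the file holds one SMILES string per line. In real use this
    class would inherit from guacamol's DistributionMatchingGenerator.
    """

    def __init__(self, smiles_file, seed=0):
        with open(smiles_file, encoding="utf-8") as handle:
            self.smiles = [line.strip() for line in handle if line.strip()]
        if not self.smiles:
            raise ValueError(f"no molecules found in {smiles_file}")
        self.rng = random.Random(seed)

    def generate(self, number_samples):
        """Return number_samples SMILES strings drawn from the file."""
        if number_samples <= len(self.smiles):
            # Enough molecules on file: sample without replacement.
            return self.rng.sample(self.smiles, number_samples)
        # Too few molecules: fall back to sampling with replacement.
        return [self.rng.choice(self.smiles) for _ in range(number_samples)]
```

Note that the with-replacement fallback produces duplicates, which would lower any uniqueness-style score, so exporting at least as many molecules as the benchmarks request is the safer option.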
After subclassing DistributionMatchingGenerator, I tried to use assess_distribution_learning to assess my model and got an error I can't figure out. Could you please give me some advice? Thanks. The error is as follows:
File "D:\Anaconcada3\envs\my-rdkit-env\Lib\site-packages\guacamol\main_analysis.py", line 20, in <module>
benchmark_version='v1')
File "D:\Anaconcada3\envs\my-rdkit-env\Lib\site-packages\guacamol\assess_distribution_learning.py", line 34, in assess_distribution_learning
number_samples=10000)
File "D:\Anaconcada3\envs\my-rdkit-env\Lib\site-packages\guacamol\assess_distribution_learning.py", line 51, in _assess_distribution_learning
results = _evaluate_distribution_learning_benchmarks(model=model, benchmarks=benchmarks)
File "D:\Anaconcada3\envs\my-rdkit-env\Lib\site-packages\guacamol\assess_distribution_learning.py", line 83, in _evaluate_distribution_learning_benchmarks
result = benchmark.assess_model(model)
File "D:\Anaconcada3\envs\my-rdkit-env\lib\site-packages\guacamol\distribution_learning_benchmark.py", line 69, in assess_model
molecules = model.generate(number_samples=self.number_samples)
TypeError: generate() missing 1 required positional argument: 'self'
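A likely explanation (an assumption, since the calling code is not shown) is that the class itself was passed as the model instead of an instance of it. The minimal sketch below reproduces the same TypeError with hypothetical names: MyGenerator stands in for the user's subclass and assess for the benchmark call.

```python
class MyGenerator:
    """Stand-in for a DistributionMatchingGenerator subclass."""

    def generate(self, number_samples):
        return ["CCO"] * number_samples


def assess(model):
    """Mimics how the benchmark calls the model."""
    return model.generate(number_samples=3)


if __name__ == "__main__":
    # Passing an instance works: self is bound automatically.
    print(assess(MyGenerator()))

    # Passing the class leaves self unbound and raises the TypeError.
    try:
        assess(MyGenerator)
    except TypeError as error:
        print(error)
```

If that is the cause, creating the object first (with parentheses and whatever constructor arguments the class needs) and passing that object to assess_distribution_learning should resolve it.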
I would like to test a molecule generator, but I only have the generated SMILES and the training SMILES, and I am mostly interested in the distribution-learning evaluation (not the goal-directed one). Is there a function I can use or modify for this?