benevolentai / guacamol

Benchmarks for generative chemistry

License: MIT License

guacamol's Issues

scipy imread

I installed guacamol in a fresh conda environment with only rdkit and pytorch preinstalled, so guacamol pulled in scipy as a dependency. However, the scipy version it installs (1.4.1) no longer has the imread function, which was removed in scipy 1.2.
Simply removing the import of imread in FCD.py, line 24, seems to fix the problem, as the function is not used anywhere in the file.

The hash value of the file is inconsistent with the hash value given in the code

When I run the command:
python -m guacamol.data.get_data -o "/home/zh/桌面/project/git2/从头设计的分子基准模型测试/guacamol/data" --chembl

I get a different hash value:

Traceback (most recent call last):
  File "/home/zh/sda3/Anaconda3/envs/guac/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/zh/sda3/Anaconda3/envs/guac/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/project/git2/从头设计的分子基准模型测试/guacamol/guacamol/data/get_data.py", line 263, in <module>
    main()
  File "/project/git2/从头设计的分子基准模型测试/guacamol/guacamol/data/get_data.py", line 253, in main
    compare_hash(train_path, TRAIN_HASH)
  File "/project/git2/从头设计的分子基准模型测试/guacamol/guacamol/data/get_data.py", line 149, in compare_hash
    raise ValueError(f'{output_file} file has different hash {output_hash} than expected {correct_hash}!')
ValueError: /home/zh/桌面/project/git2/从头设计的分子基准模型测试/guacamol/guacamol/data/chembl24_canon_train.smiles file has different hash 75a644a29fdd347687f96aa65f1dbbce than expected 05ad85d871958a05c02ab51a4fde8530!

What causes this mismatch?
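To check independently whether the file on disk really differs, its digest can be recomputed by hand. The sketch below assumes the hashes in the error are MD5 digests (they are 32 hexadecimal characters long, which is consistent with MD5); the helper names `md5_of_file` and `matches_expected` are hypothetical and not part of guacamol.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"" (end of file).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hash):
    """Compare the file's digest with an expected hex string (assumed to be MD5)."""
    return md5_of_file(path) == expected_hash.lower()
```

Running `md5_of_file` on chembl24_canon_train.smiles and comparing the result with the two values in the error message shows whether the generated file itself is different, as opposed to the comparison step misbehaving.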

How to test my distribution-learning benchmarks?

Unit testing suite
You can test your installation of the guacamol benchmarking library by running the unit tests from this directory:
pytest .

But how do I use it?

Support fcd_torch

fcd has been ported to pytorch at https://github.com/insilicomedicine/fcd_torch

How do you feel about supporting both fcd methods?

You could use fcd by default, fall back to fcd_torch if fcd is missing, and also provide an opt-in option to select one of the two at runtime.
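The default-plus-fallback idea could look roughly like the sketch below. It is only an illustration of the selection logic: the function name `select_fcd_backend` and its parameters are hypothetical, and since the two packages may not expose identical APIs, a real integration would also need a thin adapter around whichever module is returned.

```python
import importlib
import importlib.util


def select_fcd_backend(preferred=None, candidates=("fcd", "fcd_torch")):
    """Import and return an FCD backend module.

    If `preferred` is given (the opt-in choice), only that module is tried.
    Otherwise the candidates are tried in order, so the first one acts as the
    default and the rest as fallbacks.
    """
    order = [preferred] if preferred is not None else list(candidates)
    for name in order:
        # find_spec returns None for a top-level module that is not installed.
        if importlib.util.find_spec(name) is not None:
            return importlib.import_module(name)
    raise ImportError(f"None of the requested FCD backends are installed: {order}")
```

Calling `select_fcd_backend()` would return fcd when it is installed and fcd_torch otherwise, while `select_fcd_backend(preferred="fcd_torch")` forces the PyTorch port.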

Unknown segfault

When using guacamol with PytorchLightning==1.6.5 and PyTorch==1.12.0 I get a mysterious segfault when running the following code:

import pytorch_lightning as pl
from guacamol import standard_benchmarks as sb
sb.valsartan_smarts()

However, when using PyTorch==1.11.0 this segfault does not occur. Unclear what is causing this issue.

For reproducibility I've attached the exports of my Conda environments for both the working configuration and the broken configuration. broken.yml is the environment that will segfault while working.yml is the environment that works.
Environments.zip

Now available on conda-forge

guacamol is now available on conda-forge

https://github.com/conda-forge/guacamol-feedstock

conda install -c conda-forge guacamol

ChemNet file name has changed in FCD version 1.2

Hi,

The ChemNet file name has changed in FCD version 1.2, causing a bug when evaluating this metric in assess_distribution_learning. The new name is 'ChemNet_v0.13_pretrained.pt' (see here).

The bug can be worked around by downgrading to FCD 1.1. Could you please update the dependencies or change the file name in your code?

Cheers

Error while assessing distribution learning benchmarks - FCD metric

Hi all,

I tried using the assess_distribution_learning() function to calculate the benchmark metrics for one of my models with a custom training dataset. I have created a subclass of DistributionMatchingGenerator and written the sampling code to obtain any number of molecules from my pre-trained model, as instructed. The code runs fine for a while, but during the FCD metric calculation it fails with the following stack trace:

Traceback (most recent call last):
  File "benchmark_model_with_guacamol_v2.py", line 475, in <module>
    assess_distribution_learning(vae_model, chembl_training_file=training_data, json_output_file=json_file_path, benchmark_version="v1")
  File "/home/sowmya/anaconda3/envs/ddenv_new/lib/python3.6/site-packages/guacamol/assess_distribution_learning.py", line 34, in assess_distribution_learning
    number_samples=10000)
  File "/home/sowmya/anaconda3/envs/ddenv_new/lib/python3.6/site-packages/guacamol/assess_distribution_learning.py", line 51, in _assess_distribution_learning
    results = _evaluate_distribution_learning_benchmarks(model=model, benchmarks=benchmarks)
  File "/home/sowmya/anaconda3/envs/ddenv_new/lib/python3.6/site-packages/guacamol/assess_distribution_learning.py", line 83, in _evaluate_distribution_learning_benchmarks
    result = benchmark.assess_model(model)
  File "/home/sowmya/anaconda3/envs/ddenv_new/lib/python3.6/site-packages/guacamol/frechet_benchmark.py", line 53, in assess_model
    mu_ref, cov_ref = self._calculate_distribution_statistics(chemnet, self.reference_molecules)
  File "/home/sowmya/anaconda3/envs/ddenv_new/lib/python3.6/site-packages/guacamol/frechet_benchmark.py", line 94, in _calculate_distribution_statistics
    gen_mol_act = fcd.get_predictions(model, sample_std)
  File "/home/sowmya/anaconda3/envs/ddenv_new/lib/python3.6/site-packages/fcd/FCD.py", line 196, in get_predictions
    steps=np.ceil(len(gen_mol) / 128))
  File "/home/sowmya/anaconda3/envs/ddenv_new/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py", line 1915, in predict_generator
    callbacks=callbacks)
  File "/home/sowmya/anaconda3/envs/ddenv_new/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py", line 1629, in predict
    tmp_batch_outputs = self.predict_function(iterator)
  File "/home/sowmya/anaconda3/envs/ddenv_new/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
    result = self._call(*args, **kwds)
  File "/home/sowmya/anaconda3/envs/ddenv_new/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 862, in _call
    results = self._stateful_fn(*args, **kwds)
  File "/home/sowmya/anaconda3/envs/ddenv_new/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 2943, in __call__
    filtered_flat_args, captured_inputs=graph_function.captured_inputs)  # pylint: disable=protected-access
  File "/home/sowmya/anaconda3/envs/ddenv_new/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 1919, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/home/sowmya/anaconda3/envs/ddenv_new/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 560, in call
    ctx=ctx)
  File "/home/sowmya/anaconda3/envs/ddenv_new/lib/python3.6/site-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.InvalidArgumentError:  TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'
Traceback (most recent call last):

  File "/home/sowmya/anaconda3/envs/ddenv_new/lib/python3.6/site-packages/tensorflow/python/ops/script_ops.py", line 249, in __call__
    ret = func(*args)

  File "/home/sowmya/anaconda3/envs/ddenv_new/lib/python3.6/site-packages/tensorflow/python/autograph/impl/api.py", line 620, in wrapper
    return func(*args, **kwargs)

  File "/home/sowmya/anaconda3/envs/ddenv_new/lib/python3.6/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 891, in generator_py_func
    values = next(generator_state.get_iterator(iterator_id))

  File "/home/sowmya/anaconda3/envs/ddenv_new/lib/python3.6/site-packages/tensorflow/python/keras/engine/data_adapter.py", line 807, in wrapped_generator
    for data in generator_fn():

  File "/home/sowmya/anaconda3/envs/ddenv_new/lib/python3.6/site-packages/fcd/FCD.py", line 156, in myGenerator_predict
    smiEnc = get_one_hot(currentSmiles, pad_len=nn)

  File "/home/sowmya/anaconda3/envs/ddenv_new/lib/python3.6/site-packages/fcd/FCD.py", line 118, in get_one_hot
    smiles = smiles + '.'

TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'


	 [[{{node PyFunc}}]]
	 [[IteratorGetNext]] [Op:__inference_predict_function_2225]

Function call stack:
predict_function

I am unable to identify why this error pops up at this stage. Any suggestions to resolve this will be really helpful. I have previously used the same code with ChEMBL dataset a couple months ago to benchmark another model and it worked fine at that time. Not sure if any of the package versions are not compatible anymore. So I am giving the specs of the packages below:

Tensorflow: v2.4.0
Keras: v2.4.3
GuacaMol: v0.5.2
Python: v3.6.13

Thanks in advance!
Sowmya

Something wrong in ranolazine_mpo() ?

The description says 'Make start_pop_ranolazine more polar and add a fluorine',
but the code is:
logP_under_4 = RdkitScoringFunction(descriptor=logP, score_modifier=MaxGaussianModifier(mu=7, sigma=1))
I guess the name logP_under_4 is meant to stand for 'minimize logP until it is under 4', but the function uses MaxGaussianModifier with mu=7. Shouldn't that be MinGaussianModifier with mu=4?
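To make the difference between the two choices concrete, here is a small sketch of half-Gaussian modifiers. It is an assumption about the semantics, not code copied from guacamol: the "max" variant is taken to give full score at or above mu and decay below it, and the "min" variant the reverse. The function names are hypothetical, and the actual behavior should be checked against guacamol's score modifier source.

```python
import math


def gaussian(x, mu, sigma):
    """Unnormalized Gaussian bell with peak value 1.0 at x == mu."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2)


def max_gaussian(x, mu, sigma):
    """Assumed 'max' semantics: score 1.0 for x >= mu, Gaussian decay for x < mu."""
    return gaussian(min(x, mu), mu, sigma)


def min_gaussian(x, mu, sigma):
    """Assumed 'min' semantics: score 1.0 for x <= mu, Gaussian decay for x > mu."""
    return gaussian(max(x, mu), mu, sigma)


# Compare the modifier used in the code (max, mu=7) with the one the name suggests (min, mu=4).
for logp in (2.0, 4.0, 7.0):
    print(logp, max_gaussian(logp, 7, 1), min_gaussian(logp, 4, 1))
```

Under these assumed semantics, the modifier in the code rewards logP values of 7 or more, whereas a min-style modifier with mu=4 would reward logP values of 4 or less, which is what the variable name suggests.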

isomericSmiles = False

From what I understand you set isomericSmiles = False in your preprocessing (filter_and_canonicalize function).

This means you don't take into account any isomeric information. Do you think this might be an issue, especially since isomers don't necessarily have similar chemical or physical properties?

Latest scipy does not support histogram

utils.chemistry imports histogram from scipy, where it no longer exists.

I replaced it with from numpy import histogram and it seems to work.

Is it possible to update guacamol to support this?

How the FCD value changes with the sample size of the reference molecule set and the padding length

This is actually not an issue, but a type of "might be useful to know". This graph shows the effect of two variables on the FCD value: the sample size of the molecule reference set (GuacaMol uses 10,000 afaik) and the padding length of the molecules before they go into the ChemNet model (fcd uses 350). A bit more background is in this repo: https://github.com/hogru/GuacaMolEval

Main result/diagram: https://github.com/hogru/GuacaMolEval/blob/main/figures/fcd_values.jpg

Support for input files of generated molecules

I would like to use Guacamol to benchmark 3rd party products for generative chemistry. I realize that some default Guacamol benchmarks may be unsuitable for this, such as those that measure training data distributions (which we cannot see) against generated molecule distributions. However, we’d still like to do our best evaluating these tools in the Guacamol framework.

Do you have any advice around this? I have explored usage of Guacamol as a Python library that integrates with my generative code, but these 3rd party tools instead typically yield molecules via web browser interfaces or minimal web APIs. Would it be best for me to create Python subroutines that can mock molecule generation for Guacamol, but are really reading from a file containing molecules generated by these tools? Or are there other options you suggest? Many thanks in advance!
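The file-reading approach described above could be sketched as follows. The class name `FileBackedGenerator` is hypothetical, and one SMILES string per line is assumed. To stay runnable without guacamol installed, the sketch is duck-typed; in real use it would subclass `DistributionMatchingGenerator` from `guacamol.distribution_matching_generator` and keep the same `generate(number_samples)` method.

```python
from typing import List


class FileBackedGenerator:
    """Serves molecules that were generated elsewhere and saved to a text file."""

    def __init__(self, smiles_path: str) -> None:
        with open(smiles_path, encoding="utf-8") as handle:
            # Keep non-empty lines only; one SMILES string per line is assumed.
            self.smiles = [line.strip() for line in handle if line.strip()]

    def generate(self, number_samples: int) -> List[str]:
        # Fail loudly rather than silently returning fewer molecules than requested.
        if number_samples > len(self.smiles):
            raise ValueError(
                f"Requested {number_samples} molecules, "
                f"but the file holds only {len(self.smiles)}"
            )
        return self.smiles[:number_samples]
```

An instance of such a class would then be passed as the model argument of assess_distribution_learning. Note that this simple version returns the same leading slice on every call, so the file has to contain at least as many molecules as the largest sample a benchmark requests.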

TypeError: generate() missing 1 required positional argument: 'self'

After subclassing "DistributionMatchingGenerator", I tried to use assess_distribution_learning to assess my model, but I got an error that I can't figure out. Could you please give me some advice? Thanks. The error is as follows:

  File "D:\Anaconcada3\envs\my-rdkit-env\Lib\site-packages\guacamol\main_analysis.py", line 20, in
    benchmark_version='v1')
  File "D:\Anaconcada3\envs\my-rdkit-env\Lib\site-packages\guacamol\assess_distribution_learning.py", line 34, in assess_distribution_learning
    number_samples=10000)
  File "D:\Anaconcada3\envs\my-rdkit-env\Lib\site-packages\guacamol\assess_distribution_learning.py", line 51, in _assess_distribution_learning
    results = _evaluate_distribution_learning_benchmarks(model=model, benchmarks=benchmarks)
  File "D:\Anaconcada3\envs\my-rdkit-env\Lib\site-packages\guacamol\assess_distribution_learning.py", line 83, in _evaluate_distribution_learning_benchmarks
    result = benchmark.assess_model(model)
  File "D:\Anaconcada3\envs\my-rdkit-env\lib\site-packages\guacamol\distribution_learning_benchmark.py", line 69, in assess_model
    molecules = model.generate(number_samples=self.number_samples)
TypeError: generate() missing 1 required positional argument: 'self'
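One common way to get this message is to pass the generator class itself, rather than an instance of it, as the model. The traceback alone cannot confirm that this is what happened here, so the sketch below only reproduces the symptom with a hypothetical stand-in class `MyGenerator`.

```python
class MyGenerator:
    """Hypothetical stand-in for a DistributionMatchingGenerator subclass."""

    def generate(self, number_samples):
        return ["CCO"] * number_samples


# Calling the method on the class leaves `self` unbound and raises TypeError.
try:
    MyGenerator.generate(number_samples=3)
except TypeError as error:
    message = str(error)
    print(message)

# Calling it on an instance works as expected.
samples = MyGenerator().generate(number_samples=3)
print(samples)
```

If this is the cause, the fix is to construct the object first and hand that object to assess_distribution_learning, for example `model = MyGenerator()` instead of `model = MyGenerator`.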

Is it possible to evaluate a distribution without the model?

I would like to test a molecule generator, but I only have access to the generated SMILES and the training SMILES. I am mostly interested in the distribution-learning evaluation (not the goal-directed one). Is there any function I can use or modify?
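For a first look at two plain lists of SMILES strings, set-based statistics can be computed without any model object. The sketch below is a simplified illustration and not GuacaMol's implementation: the function names are hypothetical, the strings are compared as-is, and it assumes both lists were already canonicalized in the same way (for example with RDKit).

```python
def uniqueness(generated):
    """Fraction of generated strings that are distinct."""
    if not generated:
        raise ValueError("No generated molecules were provided")
    return len(set(generated)) / len(generated)


def novelty(generated, training):
    """Fraction of distinct generated strings that do not appear in the training set."""
    unique_generated = set(generated)
    if not unique_generated:
        raise ValueError("No generated molecules were provided")
    return len(unique_generated - set(training)) / len(unique_generated)
```

For example, `uniqueness(generated_smiles)` and `novelty(generated_smiles, training_smiles)` each return a value between 0 and 1. Metrics that need a learned representation, such as FCD, cannot be reproduced this way.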