moses's Introduction

Molecular Sets (MOSES): A benchmarking platform for molecular generation models

Deep generative models are rapidly becoming popular for the discovery of new molecules and materials. Such models learn on a large collection of molecular structures and produce novel compounds. In this work, we introduce Molecular Sets (MOSES), a benchmarking platform to support research on machine learning for drug discovery. MOSES implements several popular molecular generation models and provides a set of metrics to evaluate the quality and diversity of generated molecules. With MOSES, we aim to standardize the research on molecular generation and facilitate the sharing and comparison of new models.

For more details, please refer to the paper.

If you are using MOSES in your research paper, please cite us as

@article{10.3389/fphar.2020.565644,
  title={{M}olecular {S}ets ({MOSES}): {A} {B}enchmarking {P}latform for {M}olecular {G}eneration {M}odels},
  author={Polykovskiy, Daniil and Zhebrak, Alexander and Sanchez-Lengeling, Benjamin and Golovanov, Sergey and Tatanov, Oktai and Belyaev, Stanislav and Kurbanov, Rauf and Artamonov, Aleksey and Aladinskiy, Vladimir and Veselov, Mark and Kadurin, Artur and Johansson, Simon and  Chen, Hongming and Nikolenko, Sergey and Aspuru-Guzik, Alan and Zhavoronkov, Alex},
  journal={Frontiers in Pharmacology},
  year={2020}
}

[Figure: pipeline]

Dataset

We propose a benchmarking dataset refined from the ZINC database.

The set is based on the ZINC Clean Leads collection. It contains 4,591,276 molecules in total, filtered by molecular weight in the range from 250 to 350 Daltons, a number of rotatable bonds not greater than 7, and XlogP less than or equal to 3.5. We removed molecules containing charged atoms or atoms besides C, N, S, O, F, Cl, Br, H or cycles longer than 8 atoms. The molecules were filtered via medicinal chemistry filters (MCFs) and PAINS filters.
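
The property cut-offs listed above can be written as a single predicate. The sketch below is illustrative only and is not the code used to build the dataset: it assumes the descriptors were computed beforehand (for example with RDKit) and stored in a hypothetical `MolRecord` structure, it treats the 250 to 350 Dalton range as inclusive, and it does not cover the MCF and PAINS filters.

```python
from dataclasses import dataclass
from typing import FrozenSet

# Elements allowed by the dataset description.
ALLOWED_ATOMS = frozenset({"C", "N", "S", "O", "F", "Cl", "Br", "H"})


@dataclass(frozen=True)
class MolRecord:
    """Hypothetical container for precomputed descriptors of one molecule."""
    weight: float            # molecular weight in Daltons
    rotatable_bonds: int
    xlogp: float
    atoms: FrozenSet[str]    # element symbols present in the molecule
    has_charged_atoms: bool
    largest_ring_size: int   # 0 for acyclic molecules


def passes_property_filters(mol: MolRecord) -> bool:
    """Return True if the record satisfies the property cut-offs in the text."""
    return (
        250 <= mol.weight <= 350          # assumed inclusive on both ends
        and mol.rotatable_bonds <= 7
        and mol.xlogp <= 3.5
        and not mol.has_charged_atoms
        and mol.atoms <= ALLOWED_ATOMS    # subset test: no other elements
        and mol.largest_ring_size <= 8
    )
```

Keeping the thresholds in one function makes it easy to see which rule rejected a molecule when the conditions are checked one at a time.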

The dataset contains 1,936,962 molecular structures. For experiments, we split the dataset into training, test, and scaffold test sets containing around 1.6M, 176k, and 176k molecules, respectively. The scaffold test set contains unique Bemis-Murcko scaffolds that are not present in the training and test sets. We use this set to assess how well the model can generate previously unobserved scaffolds.
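
The idea of a scaffold hold-out can be sketched in a few lines. This is a minimal illustration, not the procedure MOSES used to produce its splits: it assumes each molecule already comes paired with a scaffold string (computing Bemis-Murcko scaffolds would need a cheminformatics toolkit), and the function name and `holdout_fraction` parameter are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def scaffold_holdout_split(
    molecules: Sequence[Tuple[str, str]],
    holdout_fraction: float = 0.1,
    seed: int = 0,
) -> Tuple[List[str], List[str]]:
    """Split (smiles, scaffold) pairs so held-out scaffolds never appear in the rest.

    Returns (remaining_smiles, scaffold_test_smiles).
    """
    # Group molecules by their scaffold.
    by_scaffold: Dict[str, List[str]] = defaultdict(list)
    for smiles, scaffold in molecules:
        by_scaffold[scaffold].append(smiles)

    # Choose a random subset of scaffolds to hold out (sorted for reproducibility).
    scaffolds = sorted(by_scaffold)
    random.Random(seed).shuffle(scaffolds)
    n_holdout = int(len(scaffolds) * holdout_fraction)
    held_out = set(scaffolds[:n_holdout])

    # Every molecule follows its scaffold, so the two sets share no scaffold.
    remaining: List[str] = []
    scaffold_test: List[str] = []
    for scaffold, members in by_scaffold.items():
        target = scaffold_test if scaffold in held_out else remaining
        target.extend(members)
    return remaining, scaffold_test
```

The remaining molecules can then be divided randomly into training and test sets, since none of them contains a held-out scaffold.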

Models

Metrics

Besides standard uniqueness and validity metrics, MOSES provides other metrics to assess the overall quality of generated molecules. Fragment similarity (Frag) and Scaffold similarity (Scaf) are cosine similarities between vectors of fragment or scaffold frequencies, respectively, of the generated and test sets. Nearest neighbor similarity (SNN) is the average similarity of generated molecules to the nearest molecule from the test set. Internal diversity (IntDiv) is one minus the average pairwise similarity of generated molecules. Fréchet ChemNet Distance (FCD) measures the difference in distributions of last layer activations of ChemNet. Novelty is the fraction of unique valid generated molecules not present in the training set.

| Model | Valid (↑) | Unique@1k (↑) | Unique@10k (↑) | FCD/Test (↓) | FCD/TestSF (↓) | SNN/Test (↑) | SNN/TestSF (↑) | Frag/Test (↑) | Frag/TestSF (↑) | Scaf/Test (↑) | Scaf/TestSF (↑) | IntDiv (↑) | IntDiv2 (↑) | Filters (↑) | Novelty (↑) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Train | 1.0 | 1.0 | 1.0 | 0.008 | 0.4755 | 0.6419 | 0.5859 | 1.0 | 0.9986 | 0.9907 | 0.0 | 0.8567 | 0.8508 | 1.0 | 1.0 |
| HMM | 0.076±0.0322 | 0.623±0.1224 | 0.5671±0.1424 | 24.4661±2.5251 | 25.4312±2.5599 | 0.3876±0.0107 | 0.3795±0.0107 | 0.5754±0.1224 | 0.5681±0.1218 | 0.2065±0.0481 | 0.049±0.018 | 0.8466±0.0403 | 0.8104±0.0507 | 0.9024±0.0489 | 0.9994±0.001 |
| NGram | 0.2376±0.0025 | 0.974±0.0108 | 0.9217±0.0019 | 5.5069±0.1027 | 6.2306±0.0966 | 0.5209±0.001 | 0.4997±0.0005 | 0.9846±0.0012 | 0.9815±0.0012 | 0.5302±0.0163 | 0.0977±0.0142 | 0.8738±0.0002 | 0.8644±0.0002 | 0.9582±0.001 | 0.9694±0.001 |
| Combinatorial | 1.0±0.0 | 0.9983±0.0015 | 0.9909±0.0009 | 4.2375±0.037 | 4.5113±0.0274 | 0.4514±0.0003 | 0.4388±0.0002 | 0.9912±0.0004 | 0.9904±0.0003 | 0.4445±0.0056 | 0.0865±0.0027 | 0.8732±0.0002 | 0.8666±0.0002 | 0.9557±0.0018 | 0.9878±0.0008 |
| CharRNN | 0.9748±0.0264 | 1.0±0.0 | 0.9994±0.0003 | 0.0732±0.0247 | 0.5204±0.0379 | 0.6015±0.0206 | 0.5649±0.0142 | 0.9998±0.0002 | 0.9983±0.0003 | 0.9242±0.0058 | 0.1101±0.0081 | 0.8562±0.0005 | 0.8503±0.0005 | 0.9943±0.0034 | 0.8419±0.0509 |
| AAE | 0.9368±0.0341 | 1.0±0.0 | 0.9973±0.002 | 0.5555±0.2033 | 1.0572±0.2375 | 0.6081±0.0043 | 0.5677±0.0045 | 0.991±0.0051 | 0.9905±0.0039 | 0.9022±0.0375 | 0.0789±0.009 | 0.8557±0.0031 | 0.8499±0.003 | 0.996±0.0006 | 0.7931±0.0285 |
| VAE | 0.9767±0.0012 | 1.0±0.0 | 0.9984±0.0005 | 0.099±0.0125 | 0.567±0.0338 | 0.6257±0.0005 | 0.5783±0.0008 | 0.9994±0.0001 | 0.9984±0.0003 | 0.9386±0.0021 | 0.0588±0.0095 | 0.8558±0.0004 | 0.8498±0.0004 | 0.997±0.0002 | 0.6949±0.0069 |
| JTN-VAE | 1.0±0.0 | 1.0±0.0 | 0.9996±0.0003 | 0.3954±0.0234 | 0.9382±0.0531 | 0.5477±0.0076 | 0.5194±0.007 | 0.9965±0.0003 | 0.9947±0.0002 | 0.8964±0.0039 | 0.1009±0.0105 | 0.8551±0.0034 | 0.8493±0.0035 | 0.976±0.0016 | 0.9143±0.0058 |
| LatentGAN | 0.8966±0.0029 | 1.0±0.0 | 0.9968±0.0002 | 0.2968±0.0087 | 0.8281±0.0117 | 0.5371±0.0004 | 0.5132±0.0002 | 0.9986±0.0004 | 0.9972±0.0007 | 0.8867±0.0009 | 0.1072±0.0098 | 0.8565±0.0007 | 0.8505±0.0006 | 0.9735±0.0006 | 0.9498±0.0006 |
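
To make the simpler set-based metrics concrete, here is a minimal pure-Python sketch of Unique@k, Novelty, and the cosine comparison of frequency vectors behind Frag and Scaf. It is not the MOSES implementation: it assumes the inputs are non-empty lists of valid, canonicalized SMILES strings and that fragment or scaffold counts were computed elsewhere, and all function names are hypothetical.

```python
import math
from collections import Counter
from typing import Iterable, Sequence


def unique_at_k(valid_smiles: Sequence[str], k: int) -> float:
    """Fraction of distinct strings among the first k valid molecules."""
    head = valid_smiles[:k]
    return len(set(head)) / len(head)


def novelty(valid_smiles: Iterable[str], train_smiles: Iterable[str]) -> float:
    """Fraction of unique generated molecules that are absent from the training set."""
    generated = set(valid_smiles)
    return len(generated - set(train_smiles)) / len(generated)


def cosine_similarity(ref_counts: Counter, gen_counts: Counter) -> float:
    """Cosine similarity between two frequency vectors stored as Counters."""
    keys = set(ref_counts) | set(gen_counts)
    # Counter returns 0 for missing keys, so absent fragments contribute nothing.
    dot = sum(ref_counts[k] * gen_counts[k] for k in keys)
    norm_ref = math.sqrt(sum(v * v for v in ref_counts.values()))
    norm_gen = math.sqrt(sum(v * v for v in gen_counts.values()))
    return dot / (norm_ref * norm_gen)
```

For example, two sets with identical fragment proportions score 1 regardless of their sizes, and two sets with no fragment in common score 0.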

For comparison of molecular properties, we computed the Wasserstein-1 distance between distributions of molecules in the generated and test sets. Below, we provide plots for lipophilicity (logP), Synthetic Accessibility (SA), Quantitative Estimation of Drug-likeness (QED) and molecular weight.
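
For one-dimensional samples such as logP or molecular weight values, the Wasserstein-1 distance equals the area between the two empirical cumulative distribution functions. The sketch below is a plain-Python illustration of that definition, not the code MOSES uses; `scipy.stats.wasserstein_distance` computes the same quantity and is the more practical choice for large samples.

```python
from typing import Sequence


def wasserstein_1(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Wasserstein-1 distance between two 1-D empirical distributions.

    Integrates |F_x(t) - F_y(t)| over t, where F is the empirical CDF.
    """
    if len(xs) == 0 or len(ys) == 0:
        raise ValueError("both samples must be non-empty")
    xs_sorted, ys_sorted = sorted(xs), sorted(ys)
    points = sorted(set(xs_sorted) | set(ys_sorted))
    distance = 0.0
    i = j = 0
    for left, right in zip(points, points[1:]):
        # Advance the counters so they include every value <= left.
        while i < len(xs_sorted) and xs_sorted[i] <= left:
            i += 1
        while j < len(ys_sorted) and ys_sorted[j] <= left:
            j += 1
        # Both CDFs are constant on [left, right), so the area is a rectangle.
        cdf_gap = abs(i / len(xs_sorted) - j / len(ys_sorted))
        distance += cdf_gap * (right - left)
    return distance
```

Shifting every value of a sample by a constant c yields a distance of |c|, which is a quick sanity check for any implementation.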

[Figures: distribution plots for logP, SA, weight, and QED]

Installation

PyPI

The simplest way to install MOSES (models and metrics) is to first install RDKit (conda install -yq -c rdkit rdkit) and then install MOSES (molsets) from pip (pip install molsets). If you want to use LatentGAN, also install its additional dependencies with bash install_latentgan_dependencies.sh.

If you are using Ubuntu, you should also install the libraries that RDKit needs: sudo apt-get install libxrender1 libxext6.

Docker

  1. Install docker and nvidia-docker.

  2. Pull an existing image (4.1Gb to download) from DockerHub:

docker pull molecularsets/moses

or clone the repository and build it manually:

git clone https://github.com/molecularsets/moses.git
nvidia-docker image build --tag molecularsets/moses moses/
  3. Create a container:
nvidia-docker run -it --name moses --network="host" --shm-size 10G molecularsets/moses
  4. The dataset and source code are available inside the docker container at /moses. Open a shell in the container created above (named moses):
docker exec -it moses bash

Manually

Alternatively, install dependencies and MOSES manually.

  1. Clone the repository:
git lfs install
git clone https://github.com/molecularsets/moses.git
  2. Install RDKit for metrics calculation.

  3. Install MOSES:

python setup.py install
  4. (Optional) Install dependencies for LatentGAN:
bash install_latentgan_dependencies.sh

Benchmarking your models

  • Install MOSES as described in the previous section.

  • Get train, test and test_scaffolds datasets using the following code:

import moses

train = moses.get_dataset('train')
test = moses.get_dataset('test')
test_scaffolds = moses.get_dataset('test_scaffolds')
  • You can use a standard torch DataLoader in your models. We provide a simple StringDataset class for convenience:
import moses
from torch.utils.data import DataLoader
from moses import CharVocab, StringDataset

train = moses.get_dataset('train')
vocab = CharVocab.from_data(train)
train_dataset = StringDataset(vocab, train)
train_dataloader = DataLoader(
    train_dataset, batch_size=512,
    shuffle=True, collate_fn=train_dataset.default_collate
)

for with_bos, with_eos, lengths in train_dataloader:
    ...
  • Calculate metrics from your model's samples. We recommend sampling at least 30,000 molecules:
import moses
metrics = moses.get_all_metrics(list_of_generated_smiles)
  • Add generated samples and metrics to your repository. Run the experiment multiple times to estimate the variance of the metrics.
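
One way to summarize repeated experiments is to report the mean and standard deviation of each metric across runs. The sketch below assumes each run produced a dictionary of metric values, such as the one returned by moses.get_all_metrics; the helper name `summarize_runs` is hypothetical, and the choice of the sample standard deviation is an assumption rather than something the text specifies.

```python
from statistics import mean, stdev
from typing import Dict, List, Tuple


def summarize_runs(runs: List[Dict[str, float]]) -> Dict[str, Tuple[float, float]]:
    """Return {metric: (mean, sample standard deviation)} over repeated runs."""
    if len(runs) < 2:
        raise ValueError("need at least two runs to estimate variance")
    summary = {}
    for name in runs[0]:
        # Collect the same metric from every run.
        values = [run[name] for run in runs]
        summary[name] = (mean(values), stdev(values))
    return summary


# Usage with made-up numbers for three runs.
runs = [{"valid": 0.90}, {"valid": 1.00}, {"valid": 0.95}]
for name, (avg, std) in summarize_runs(runs).items():
    print(f"{name}: {avg:.4f}±{std:.4f}")
```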

Reproducing the baselines

End-to-End launch

You can run pretty much everything with:

python scripts/run.py

This will split the dataset, train the models, generate new molecules, and calculate the metrics. Evaluation results will be saved in metrics.csv.

You can specify the GPU device index as cuda:n (or cpu for CPU) and/or model by running:

python scripts/run.py --device cuda:1 --model aae

For more details run python scripts/run.py --help.

You can reproduce evaluation of all models with several seeds by running:

sh scripts/run_all_models.sh

Training

python scripts/train.py <model name> \
       --train_load <train dataset> \
       --model_save <path to model> \
       --config_save <path to config> \
       --vocab_save <path to vocabulary>

To get a list of supported models, run python scripts/train.py --help.

For more details on a specific model, run python scripts/train.py <model name> --help.

Generation

python scripts/sample.py <model name> \
       --model_load <path to model> \
       --vocab_load <path to vocabulary> \
       --config_load <path to config> \
       --n_samples <number of samples> \
       --gen_save <path to generated dataset>

To get a list of supported models, run python scripts/sample.py --help.

For more details on a specific model, run python scripts/sample.py <model name> --help.

Evaluation

python scripts/eval.py \
       --ref_path <reference dataset> \
       --gen_path <generated dataset>

For more details run python scripts/eval.py --help.

moses's People

Contributors

andreyfilimonov, beangoben, daniil-polykovskiy-insilico, danpol, golovanovsrg, hbq1, lilleswing, oktai15, pcko1, rauf-kurbanov, rtriangle, seemonj, shayakhmetov, spoilt333, stasbel, truskovskiyk, unixjunkie, zhebrak

moses's Issues

#2: Request for exact steps to run each of the 5 models and required packages with versions

Following up on my issue from 4 days ago (Dec 14, 2019): no luck for me. The instructions from "Benchmarking your models" onward are not clear to me. What exactly needs to be done, step by step, to run each of the 5 models? For example, what exactly are the paths (such as --gen_path) for the aae, charRNN, organ, latentgan and vae models under Training, Generation and Evaluation?
What special steps are needed for latentgan, e.g. regarding ddc_pub v3, molvecgen, ...?
How exactly should sh scripts/run_all_models.sh be executed?
I am using Python 3.7.3 and Jupyter Notebook 6.0.2, and have installed rdkit 2019.03.4.0, molsets 0.2, and many other packages. Is there an exact list of the packages required to run MOSES?
I would appreciate your clarifications. Thanks,

ChemVAE support

Hi,

Thanks for your wonderful software. Since you cite ChemVAE among your supported models, I was wondering whether you have implemented it. In the VAE class, I didn't see any GRU layer followed by conv1d layers as described in the ChemVAE paper.

Thanks,
Mohsen

Insights on VAE's KL annealing scheme

Hello,

In most implementations of VAEs for molecular generation on the web, there seems to be a trend to downweight the KL penalty or suppress the reparametrization during VAE training, basically degenerating the model into a standard AE. The reason given is that a vanilla VAE seems unable to learn from a SMILES dataset otherwise.

In the MOSES VAE training scheme, it seems to me that the KL penalty has a weighting factor that starts at 0 and grows linearly towards 1 as the number of epochs increases.

Do you find that this annealing of the KL term gives the best of both worlds, i.e. not training just a plain AE while still being able to achieve low reconstruction error and high log-likelihood?
Do you have any insights on the impact of this scheme, and of the KL penalty in general, on training a VAE on a SMILES dataset?

Thanks for your amazing repo!
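
For readers unfamiliar with the scheme discussed in this issue, a KL weight that starts at 0 and grows linearly towards 1 over the epochs can be sketched as follows. This is a generic illustration with hypothetical names and defaults, not code taken from the MOSES VAE trainer, whose exact schedule may differ.

```python
def linear_kl_weight(epoch: int, n_epochs: int,
                     w_start: float = 0.0, w_end: float = 1.0) -> float:
    """KL weight for a zero-based epoch under a linear ramp from w_start to w_end."""
    if n_epochs <= 1:
        return w_end
    # Clamp the epoch into [0, n_epochs - 1] and map it to [0, 1].
    progress = min(max(epoch, 0), n_epochs - 1) / (n_epochs - 1)
    return w_start + (w_end - w_start) * progress


def vae_loss(reconstruction_loss: float, kl_loss: float, kl_weight: float) -> float:
    """Weighted VAE objective: reconstruction term plus the annealed KL penalty."""
    return reconstruction_loss + kl_weight * kl_loss
```

With a weight of 0 the objective reduces to that of a plain autoencoder, and with a weight of 1 it is the standard VAE objective; the ramp moves training gradually from the first regime to the second.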

Training ORGAN on cuda error

Hey,

when I train ORGAN via
python scripts/run.py --model organ --train_path ./my_organ/data/hce_smiles.smi --checkpoint_dir ./my_orgapt --device cuda:0 --metrics ./my_organ/metrics
it throws the error

Traceback (most recent call last):
  File "scripts/run.py", line 219, in <module>
    main(config)
  File "scripts/run.py", line 199, in main
    train_model(config, model, train_path, test_path)
  File "scripts/run.py", line 127, in train_model
    trainer_script.main(model, trainer_config)
  File "/home/luca/Projects/benchmark_level_up/local/moses/scripts/train.py", line 62, in main
    trainer.fit(model, train_data, val_data)
  File "/home/luca/anaconda3/envs/moses/lib/python3.6/site-packages/molsets-1.0-py3.6.egg/moses/organ/trainer.py", line 370, in fit
    gen_val_loader, logger)
  File "/home/luca/anaconda3/envs/moses/lib/python3.6/site-packages/molsets-1.0-py3.6.egg/moses/organ/trainer.py", line 104, in _pretrain_generator
    criterion, optimizer)
  File "/home/luca/anaconda3/envs/moses/lib/python3.6/site-packages/molsets-1.0-py3.6.egg/moses/organ/trainer.py", line 70, in _pretrain_generator_epoch
    outputs, _, _ = model.generator_forward(prevs, lens)
  File "/home/luca/anaconda3/envs/moses/lib/python3.6/site-packages/molsets-1.0-py3.6.egg/moses/organ/model.py", line 94, in generator_forward
    return self.generator(*args, **kwargs)
  File "/home/luca/anaconda3/envs/moses/lib/python3.6/site-packages/torch-1.7.1-py3.6-linux-x86_64.egg/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/luca/anaconda3/envs/moses/lib/python3.6/site-packages/molsets-1.0-py3.6.egg/moses/organ/model.py", line 23, in forward
    x = pack_padded_sequence(x, lengths, batch_first=True)
  File "/home/luca/anaconda3/envs/moses/lib/python3.6/site-packages/torch-1.7.1-py3.6-linux-x86_64.egg/torch/nn/utils/rnn.py", line 244, in pack_padded_sequence
    _VF._pack_padded_sequence(input, lengths, batch_first)
RuntimeError: 'lengths' argument should be a 1D CPU int64 tensor, but got 1D cuda:0 Long tensor

It seems that changing line 23 in moses/organ/model.py from
x = pack_padded_sequence(x, lengths, batch_first=True)
to
x = pack_padded_sequence(x, lengths.cpu(), batch_first=True)
fixes the problem, though.

Have a look of Organ sample/model generated smiles file!

This is what I did:
(moses) [trial6@xps8500 moses]$ python scripts/run.py --gpu 0 --model organ
the metrics generated looks OK.
(moses) [trial6@xps8500 moses]$ more metrics.csv
FCD/Test,FCD/TestSF,Filters,Frag/Test,Frag/TestSF,IntDiv,IntDiv2,NP,QED,SA,SNN/Test,SNN/TestSF,Scaf/Test,Scaf/TestSF,logP,model,unique@1000,unique@10000,valid,weight
47.03170334364395,48.812974148127694,0.9452942921310225,0.7442927316911858,0.7363127970096939,0.6648230388481462,0.6524179217292552,1.860596282806586,0.5782202256571298,2.5691228002718356,0.37832045248554563,0.329009914928975,0.5812598568222165,0.011773956277315145,139.47112549646172,organ,0.981,0.9175,0.7866333333333333,524984.1029653098
However, if you visually check the generated smiles:
(moses) [trial6@xps8500 moses]$ head data/organ_generated.csv
SMILES
CCOC(=O)CNC(=O)C(C)NCCCCC(=O)NCCCCC(=O)NCCCCC(=O)NCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
COCCNC(=O)CNC(=O)C(C)N(C)C(=O)C(C)N(C)C(=O)C(C)NC(=O)C(C)N(C)C(=O)C(C)N(C)C(=O)c1ccccc1NC(=O)C(C)OCC
CC(C)C(=O)Nc1ccccc1NC(=O)C(C)N(C)C(=O)C(C)N(C)C(=O)C(C)N(C)C(=O)C(C)N(C)C(=O)C(C)N(C)C(=O)C(C)N(C)C(
CCOCCCCNC(=O)C(C)NC(=O)C(C)NC(=O)C(C)N(C)C(=O)C(C)N(C)C(=O)C(C)NCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
O=C(CCCO)NCCNC(=O)COC(=O)C(C)OC(=O)CCCCC(=O)NCCCCC(=O)NCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
Cc1ccccc1NC(=O)C(C)C(=O)NCCCCC(=O)NCCCCC(=O)NCCCCCC(=O)NCCCC(=O)NCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
CC(C)NC(=O)C(NC(=O)C(C)OCCNC(=O)c1ccccc1)C(=O)NCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCNC(=O)C(
COCCOC(=O)CCCCCCCCCCCCCNC(=O)C(C)C(=O)NCCCCCC(=O)NCCCCCC(=O)NCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
CC(C)CCNC(=O)C(C)NC(=O)C(C)N(C)C(=O)C(C)N(C)C(=O)C(C)NC(=O)C(C)N(C)C(=O)C(C)NC(=O)C(C)N(C)C(=O)C(C)N
Any idea?
btw, aae/char_rnn generated smiles look much more reasonable.

why does the variable 'vocab' there have the property - 'vectors'?

image

Hi, I am confused about the variable 'vocab' there. I'm a beginner in this field, so these may be silly questions, but hopefully you can give me some suggestions. Thanks!

First, was the variable 'vocab' generated by the CharVocab.from_data function mentioned in README.md? If so, I couldn't find a property called 'vectors'.
What exactly is the type of the variable 'vocab' there? Did I miss other key points about 'vocab'?

The --ks option for script/eval.py should add "type=int"?

The code:
parser.add_argument('--ks',
                    nargs='+', default=[1000, 10000],
                    help='Prefixes to calc uniqueness at')
should be:
parser.add_argument('--ks',
                    type=int, nargs='+', default=[1000, 10000],
                    help='Prefixes to calc uniqueness at')
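
The effect of the missing type=int can be reproduced with argparse alone. The small demonstration below uses a hypothetical `build_parser` helper: without a type, values passed on the command line arrive as strings, while the integer defaults are left untouched, so the option behaves differently depending on whether it is given explicitly.

```python
import argparse


def build_parser(with_type: bool) -> argparse.ArgumentParser:
    """Build a parser with a --ks option, optionally converting values to int."""
    parser = argparse.ArgumentParser()
    kwargs = {"type": int} if with_type else {}
    parser.add_argument("--ks", nargs="+", default=[1000, 10000],
                        help="Prefixes to calc uniqueness at", **kwargs)
    return parser


untyped = build_parser(with_type=False).parse_args(["--ks", "500", "5000"])
typed = build_parser(with_type=True).parse_args(["--ks", "500", "5000"])
print(untyped.ks)  # ['500', '5000']
print(typed.ks)    # [500, 5000]
```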

Comment on this database.

Among 1.9 million molecules, only 0.5k have no rings.
Ring topologies may work well for drug design.
Yet this is not a general database for designing molecules with simple topologies.

Reduce docker image size

Consider creating a separate image for metrics for those who are not interested in running implemented models.

Aging, Epigenetic Drift and Gene Expression.

Sinclair suggests aging is caused by loss of epigenetic information [1]. The MOSES dataset contains chemical information for small molecules, but no information on how small molecules alter the epigenetics of human cells. This information is available in the CMap L1000 dataset [2]. In particular, the CMap dataset describes how gene expression (and thus indirectly epigenetics) changes as a result of treating human cells with molecules.

Question. If you want to develop drugs that intervene in the aging process, and believe epigenetics and aging are related, why not include epigenetic information through gene expression? Are you already doing this internally?

I'd be happy to write a script that attempts to combine the two datasets and retrains some of the baselines you have. But I first want to be sure there's no obvious reason for not using the gene expression data.

[1] https://genetics.med.harvard.edu/sinclair/research.php
[2] https://clue.io/GEO-guide

Regarding Vocabulary

Sir, for the aae model, what type of vocabulary should we use? I have a text file. How should I convert it into a vocabulary that can be used with the aae model?

Posterior collapse for the VAE

Hi everyone,

After training the VAE for 50 epochs, when sampling from the prior I get a quite nice distribution of molecules and similar metrics as reported in the paper/the readme.
On the other hand, the reconstruction loss is quite high (approx. 0.5) while the KL loss is approximately zero, and the VAE fails to reconstruct molecules accurately.
This looks like posterior collapse to me (indeed, all SMILES are encoded into roughly the same means/variances), and while this still yields a nice generator, it removes the main interest of the VAE (obtaining a meaningful latent space) compared with a plain charRNN, for instance.
Any thoughts, advice, or comments on the matter?

Thanks

Any recent update?

Hi,

I was wondering whether you are going to maintain or further develop the MOSES benchmark any time soon. Thanks in advance for your response.

multi gpu experiments

Hi,

Has there been any effort to scale the models to multi-GPU systems for training? I am not a molecular domain expert. I am curious whether there are situations where using multiple GPUs would help for the models used in the benchmark suite.

Best,
Trinayan

Suggesting Internal Diversity Metric

Just a suggestion from other people in industry doing this kind of work.

The internal diversity metric currently calculates the mean absolute error of the distance matrix.
[Image: selection_445]

I would suggest taking the RMSE instead.
[Image: selection_444]

This will more heavily punish skewed datasets where all but a few compounds are highly similar with one compound being very different. We are using the second metric internally.
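
The difference between the two aggregations in this suggestion can be illustrated numerically. The sketch below assumes they are the plain mean and the root-mean-square of the pairwise distances; the function names are hypothetical and neither is taken from the MOSES code.

```python
import math
from typing import Sequence


def mean_distance(distances: Sequence[float]) -> float:
    """Plain average of pairwise distances."""
    return sum(distances) / len(distances)


def rms_distance(distances: Sequence[float]) -> float:
    """Root-mean-square of pairwise distances; large values weigh more."""
    return math.sqrt(sum(d * d for d in distances) / len(distances))


# Nine near-duplicate pairs and one very distant pair (made-up values).
skewed = [0.1] * 9 + [1.0]
print(mean_distance(skewed), rms_distance(skewed))
```

The root-mean-square is never smaller than the mean, and the gap widens as the distances become more uneven, so the two aggregations rank skewed sets differently.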

Add computation time

Please add the time it takes for each model to generate a new molecule. It would be very useful.

Not a gzipped file when get_dataset()

Hello,

I met with following error while get_dataset(). Any solutions?

  File "/home/user/anaconda3/envs/sgpt-env/lib/python3.6/site-packages/molsets-1.0-py3.6.egg/moses/metrics/metrics.py", line 66, in get_all_metrics
    test = get_dataset('test')
  File "/home/xiaopengxu/user/envs/sgpt-env/lib/python3.6/site-packages/molsets-1.0-py3.6.egg/moses/dataset/dataset.py", line 31, in get_dataset
    smiles = pd.read_csv(path)['SMILES'].values
  File "/home/user/anaconda3/envs/sgpt-env/lib/python3.6/site-packages/pandas/io/parsers.py", line 688, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/user/anaconda3/envs/sgpt-env/lib/python3.6/site-packages/pandas/io/parsers.py", line 454, in _read
    parser = TextFileReader(fp_or_buf, **kwds)
  File "/home/user/anaconda3/envs/sgpt-env/lib/python3.6/site-packages/pandas/io/parsers.py", line 948, in __init__
    self._make_engine(self.engine)
  File "/home/user/anaconda3/envs/sgpt-env/lib/python3.6/site-packages/pandas/io/parsers.py", line 1180, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/home/user/anaconda3/envs/sgpt-env/lib/python3.6/site-packages/pandas/io/parsers.py", line 2010, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 537, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas/_libs/parsers.pyx", line 711, in pandas._libs.parsers.TextReader._get_header
  File "pandas/_libs/parsers.pyx", line 905, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2034, in pandas._libs.parsers.raise_parser_error
  File "/home/user/anaconda3/envs/sgpt-env/lib/python3.6/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/home/user/anaconda3/envs/sgpt-env/lib/python3.6/gzip.py", line 463, in read
    if not self._read_gzip_header():
  File "/home/user/anaconda3/envs/sgpt-env/lib/python3.6/gzip.py", line 411, in _read_gzip_header
    raise OSError('Not a gzipped file (%r)' % magic)
OSError: Not a gzipped file (b've')
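
The last line of the traceback shows that the file begins with the bytes b've' rather than the gzip magic number 1f 8b, so the file on disk is not actually gzip-compressed. A quick check is sketched below with a hypothetical helper name. One possible cause, which is an assumption and not confirmed in this thread, is that the data file is a Git LFS pointer (a small text file beginning with "version") instead of the real archive.

```python
import gzip

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def looks_gzipped(path: str) -> bool:
    """Return True if the file starts with the two-byte gzip magic number."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC
```

If the check fails for a dataset file, opening it in a text editor usually reveals what was downloaded in its place.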

RuntimeError: 'lengths' argument should be a 1D CPU int64 tensor, but got 1D cuda:1 Long tensor

I ran the code with CUDA and hit the following problem. How can I solve it?

Traceback (most recent call last):
File "scripts/run.py", line 219, in <module>
main(config)
File "scripts/run.py", line 199, in main
train_model(config, model, train_path, test_path)
File "scripts/run.py", line 127, in train_model
trainer_script.main(model, trainer_config)
File "/home/cyy/Code/moses/scripts/train.py", line 62, in main
trainer.fit(model, train_data, val_data)
File "/home/cyy/anaconda3/envs/moses/lib/python3.7/site-packages/moses/aae/trainer.py", line 284, in fit
self._train(model, train_loader, val_loader, logger)
File "/home/cyy/anaconda3/envs/moses/lib/python3.7/site-packages/moses/aae/trainer.py", line 219, in _train
tqdm_data, criterions, optimizers)
File "/home/cyy/anaconda3/envs/moses/lib/python3.7/site-packages/moses/aae/trainer.py", line 116, in _train_epoch
latent_codes = model.encoder_forward(*encoder_inputs)
File "/home/cyy/anaconda3/envs/moses/lib/python3.7/site-packages/moses/aae/model.py", line 109, in encoder_forward
return self.encoder(*args, **kwargs)
File "/home/cyy/anaconda3/envs/moses/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/cyy/anaconda3/envs/moses/lib/python3.7/site-packages/moses/aae/model.py", line 26, in forward
x = pack_padded_sequence(x, lengths, batch_first=True)
File "/home/cyy/anaconda3/envs/moses/lib/python3.7/site-packages/torch/nn/utils/rnn.py", line 249, in pack_padded_sequence
_VF._pack_padded_sequence(input, lengths, batch_first)
RuntimeError: 'lengths' argument should be a 1D CPU int64 tensor, but got 1D cuda:1 Long tensor

License?

What is the license of this repo?

how to use "--addition_rewards" to run organ

I was trying to use different rewards for organ, such as QED. One way I found is to use --add_rewards, which needs to call FrechetMetric and pass the molecules into the RDKit function, but there seem to be some issues there. Could you make it work? Or is there another way to do it?

ImportError: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found in CentOS after using FCD module

This is the error message:

(moses-env) [trial0@xps8500 moses]$ python scripts/train.py organ --help
Traceback (most recent call last):
File "scripts/train.py", line 5, in <module>
import rdkit
File "/home/trial0/anaconda3/envs/moses-env/lib/python3.6/site-packages/rdkit/__init__.py", line 2, in <module>
from .rdBase import rdkitVersion as __version__
ImportError: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by /home/trial0/anaconda3/envs/moses-env/lib/python3.6/site-packages/rdkit/rdBase.so)

However, rdkit itself is working:

(moses-env) [trial0@xps8500 ~]$ python
Python 3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:22:34)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> from __future__ import print_function
>>> from rdkit import Chem
>>> m = Chem.MolFromSmiles('Cc1ccccc1')
>>> m
<rdkit.Chem.rdchem.Mol object at 0x7f1e30b8bf80>


This is under CentOS 7.6 after introducing the FCD module with pytorch 1.01. It looks like this is not an rdkit problem, because ORGAN still runs in this environment without any problem.
Any idea? Thanks!

Pre-trained models or model checkpoint

Hi,

I am trying to reproduce the different models, and they take a large amount of time to run, even for one epoch. I have an Nvidia GTX 1050 GPU with 4GB RAM.

Would it be possible to provide checkpoints or pre-trained models, so that they can be used to generate samples?

Stay safe!
Thanks.

TypeError: get_all_metrics() got an unexpected keyword argument 'train'

any modification from last update?
I got an error message from:
###########
python scripts/eval.py --n_jobs 4 --device cuda:0 --test_path data/test.csv --ptest_path data/test_stats.npz --test_scaffolds_path data/test_scaffolds.csv --ptest_scaffolds_path data/test_scaffolds_stats.npz --gen_path checkpoints/organ_generated.csv
Traceback (most recent call last):
File "scripts/eval.py", line 87, in <module>
main(config)
File "scripts/eval.py", line 42, in main
train=train)
TypeError: get_all_metrics() got an unexpected keyword argument 'train'
############
It was OK before.
Oh Happy 1024!

JTN-VAE model implementation

Hi! Thanks a lot for your efforts in integrating these models.
In the README, you show the results for JTN-VAE. However, it seems that there is no actual implementation of JTN-VAE. Could you show me a way to use JTN-VAE? Thanks!

where is ddc_pub?

Hi,
Did you forget to tell us how to specify "ddc_pub"?
It is imported in moses/latentgan/model.py, line 4 (reached through from .model import LatentGAN):
from ddc_pub import ddc_v3 as ddc

Is the validity check of smiles in moses the same as RDKit?

I have this function to check the validity of smiles that is based on RDKit :

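from rdkit import Chem


def checksmi(smi):
    """Return 1 if `smi` parses and sanitizes with RDKit, else 0."""
    # Parse without sanitization first, so only the SMILES syntax is checked.
    m = Chem.MolFromSmiles(smi, sanitize=False)
    if m is None:
        return 0  # invalid SMILES
    try:
        # Sanitization is where chemically invalid structures are rejected.
        Chem.SanitizeMol(m)
    except Exception:
        return 0  # invalid chemistry
    return 1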

Is this the same as the validity reported by MOSES when running the code below? Does the MOSES validity metric check only for valid SMILES (grammar), or also for valid chemistry?

import moses
metrics = moses.get_all_metrics(list_to_evaluate)
print(metrics)
{'valid': 0.8571428571428572,
 'unique@1000': 1.0,
 'unique@10000': 1.0,
 'FCD/Test': 52.710485508527654,
 'SNN/Test': 0.2737954681118329,
 'Frag/Test': 0.26661035762724716,
 'Scaf/Test': nan,
 'FCD/TestSF': 54.30007202575979,
 'SNN/TestSF': 0.23380156854788461,
 'Frag/TestSF': 0.3777161249748274,
 'Scaf/TestSF': nan,
 'IntDiv': 0.7468284898334079,
 'IntDiv2': 0.578859979666881,
 'Filters': 0.6666666666666666,
 'logP': 3.048971913797608,
 'SA': 0.8860967344024802,
 'QED': 0.3780715536953819,
 'weight': 184.10358258270205,
 'Novelty': 1.0}

Regarding of tests/test_metrics.py

Hi!

I ran the test code in this repo.

[Screenshot 2020-11-12 9:21:24 AM]

Could you check this?
Maybe the version of code related to SA function is not correct?

Thank you in advance!

Long time for training

Sir, training is taking a lot of time and I can see no progress. I am training on just 100 samples for 1 epoch in the pretrain and train functions of the aae model. Can you please help me with this? What might be the problem?

RuntimeError when executing the run.py

Following the instructions on the readme, I did an End-to-End launch, by running the run.py in the scripts folder.

Traceback (most recent call last):
File "run.py", line 208, in <module>
main(config)
File "run.py", line 190, in main
sample_from_model(config, model)
File "run.py", line 127, in sample_from_model
sampler_script.main(model, sampler_config)
File "/content/moses/scripts/moses/scripts/sample.py", line 47, in main
min(n, config.n_batch), config.max_len
File "/usr/local/lib/python3.7/site-packages/molsets-0.1.4-py3.7.egg/moses/vae/model.py", line 218, in sample
i_eos_mask = ~eos_mask & (w == self.eos)
RuntimeError: Expected object of scalar type Byte but got scalar type Bool for argument #2 'other'

How could I resolve this?

value of FCD Test and FCD Test SF

I trained the char-rnn model on my PC and drew 30,000 samples from it. When I compared my evaluation results with those reported by MOSES, something odd turned up: my FCD Test and FCD TestSF values are much smaller than yours.
Why is that?

| char-rnn | MOSES | My own result |
|---|---|---|
| FCD (↓) Test | 0.355 | 02616 |
| FCD (↓) TestSF | 0.8995 | 0.7881 |

Error installing molsets due to dependency pomegranate==0.12.0

pip install molsets on Ubuntu 20.04.4 LTS with Python 3.8.13 and GCC 7.5.0 fails due to an error in installing a dependency.

Building wheel for pomegranate (setup.py) resulted in the following error:

building 'pomegranate.distributions.NeuralNetworkWrapper' extension

  gcc -pthread -B /home/dhanajayb/anaconda3/envs/DeepChem/compiler_compat -Wl,--sysroot=/ -Wsign-compare 
  -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC 
  -I/home/dhanajayb/anaconda3/envs/DeepChem/include/python3.8 
  -I/home/dhanajayb/anaconda3/envs/DeepChem/lib/python3.8/site-packages/numpy/core/include 
  -c pomegranate/distributions/NeuralNetworkWrapper.c 
  -o build/temp.linux-x86_643.8/pomegranate/distributions/NeuralNetworkWrapper.o
  
  gcc: error: pomegranate/distributions/NeuralNetworkWrapper.c: No such file or directory
  error: command '/usr/bin/gcc' failed with exit code 1
  ----------------------------------------
  ERROR: Failed building wheel for pomegranate

pip install pomegranate successfully installs pomegranate-0.14.8 on this machine. Has anyone else experienced this issue?

Moses on Knime

Hi all,
Has someone tried to implement some of the models of MOSES in Knime?
Thanks in advance,
Lionel

Request exact steps and the order to run the 5 models

Since there are many .py files across several folders, could you please share the steps to execute the 5 models? I have been able to install most libraries, including torch 1.1.0, but I am running into the errors below. Clearly I am not running it the way it is supposed to be run. Could you kindly shed some light on this?
MOSES: Not knowing which .py file to run in Jupyter Notebook 6.0.3, I tried to execute the individual model files. I could run config.py, model.py, and train.py for 4 of the 5 models (all except latentgan). In latentgan, I ran into issues with (i) ddc_pub v3 (https://github.com/pcko1/Deep-Drug-Coder/blob/master/ddc_pub/ddc_v3.py), (ii) molvecgen, and (iii) Metrics (ModuleNotFoundError: No module named '__main__.utils'; '__main__' is not a package) and Utils (NameError: name '__file__' is not defined). I also ran into this error: ModuleNotFoundError    Traceback (most recent call last) in
     35
     36 # Custom dependencies
---> 37 from .vectorizers import SmilesVectorizer
     38 from .generators import CodeGenerator as DescriptorGenerator
     39 from .generators import HetSmilesGenerator

Stereochemistry

I'm wondering if anyone can help me with this: SMILES strings with stereochemistry do not seem to be processable by this model. These include strings containing "/", "\", or "@".

Thank you

Joshua
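
As a starting point for investigating this, the snippet below flags SMILES strings that contain the stereochemistry markers mentioned above, so they can be inspected or handled separately. It is a plain string check with a hypothetical function name; it does not parse the SMILES and says nothing about how the MOSES models treat such strings.

```python
# Characters used in SMILES for double-bond geometry ("/", "\") and chirality ("@").
STEREO_CHARS = frozenset("/\\@")


def has_stereo_markers(smiles: str) -> bool:
    """Return True if the string contains any SMILES stereochemistry marker."""
    return any(ch in STEREO_CHARS for ch in smiles)


examples = ["CCO", "F/C=C/F", "C[C@H](N)C(=O)O"]
flagged = [s for s in examples if has_stereo_markers(s)]
print(flagged)  # ['F/C=C/F', 'C[C@H](N)C(=O)O']
```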
