
lambo's Introduction

🏎️🏎️🏎️🏎️

Abstract

Bayesian optimization (BayesOpt) is a gold standard for query-efficient continuous optimization. However, its adoption for drug and antibody sequence design has been hindered by the discrete, high-dimensional nature of the decision variables. We develop a new approach (LaMBO) which jointly trains a denoising autoencoder with a discriminative multi-task Gaussian process head, allowing gradient-based optimization of multi-objective acquisition functions in the latent space of the autoencoder. These acquisition functions allow LaMBO to balance the explore-exploit tradeoff over multiple design rounds, and to balance objective tradeoffs by optimizing sequences at many different points on the Pareto frontier. We evaluate LaMBO on a small-molecule task based on the ZINC dataset and introduce a new large-molecule task targeting fluorescent proteins. In our experiments LaMBO outperforms genetic optimizers and does not require a large pretraining corpus, demonstrating that BayesOpt is practical and effective for biological sequence design.

Key Results

BayesOpt can be used to maximize the simulated folding stability (-dG) and solvent-accessible surface area (SASA) of red-spectrum fluorescent proteins. Higher is better for both objectives. The starting proteins are shown as colored circles, with corresponding optimized offspring shown as crosses. Stability correlates with protein function (e.g. how long the protein can fluoresce) while SASA is a proxy for fluorescent intensity.

Figure 1

On all three tasks (described in Section 5.1 of the paper), LaMBO outperforms genetic algorithm baselines, specifically NSGA-2 and a model-based genetic optimizer with the same surrogate architecture (MTGP + NEHVI + GA). Performance is quantified by the hypervolume bounded by the optimized Pareto frontier. The midpoint, lower, and upper bounds of each curve depict the 50%, 20%, and 80% quantiles, estimated from 10 trials. See Section 5.2 in the paper for more discussion.

Figure 3
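For readers unfamiliar with the metric, below is a minimal sketch of how a Pareto-frontier hypervolume can be computed with BoTorch (already a dependency of this project). The objective values and reference point are illustrative only, not taken from the paper.

import torch
from botorch.utils.multi_objective.hypervolume import Hypervolume
from botorch.utils.multi_objective.pareto import is_non_dominated

# Illustrative objective values for a handful of candidates (higher is better for both).
Y = torch.tensor([[1.2, 0.4], [0.8, 0.9], [0.3, 1.1], [0.5, 0.5]])
ref_point = torch.tensor([0.0, 0.0])  # must be dominated by every Pareto-optimal point

pareto_Y = Y[is_non_dominated(Y)]      # keep only the non-dominated candidates
hv = Hypervolume(ref_point=ref_point)
print(hv.compute(pareto_Y))            # volume enclosed by the frontier and the reference point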

UPDATE 04/20/2024

An open-source contribution identified some subtle bugs that substantially hurt the performance of all methods on some tasks. The proposed fix has been merged, so the current master commit will now produce better results than originally reported. If you wish to reproduce the original curves in the paper, check out the following commit:

git checkout 431b052

Installation

FoldX

FoldX is available under a free academic license. After creating an account you will be emailed a link to download the FoldX executable and supporting assets. Copy the contents of the downloaded archive to ~/foldx. You may also need to rename the FoldX executable (e.g. mv -v ~/foldx/foldx_20221231 ~/foldx/foldx).

RDKit

RDKit is easiest to install if you're using Conda as your package manager (shown below).

TDC

TDC is required to run the DRD3 docking task. See the linked README for installation instructions.

git clone https://github.com/samuelstanton/lambo && cd lambo
conda create --name lambo-env python=3.8 -y && conda activate lambo-env
conda install -c conda-forge rdkit -y
conda install -c conda-forge pytdc pdbfixer openbabel -y
pip install -r requirements.txt --upgrade
pip install -e .

Reproducing the figures

This project uses Weights & Biases for logging. The experimental data used to produce the plots in our paper is available here.

See ./notebooks/plot_pareto_front for a demonstration of how to reproduce Figure 1.

See ./notebooks/plot_hypervolume for a demonstration of how to reproduce Figures 3 and 4.

Running the code

See ./notebooks/rfp_preprocessing.ipynb for a demonstration of how to download PDB files from the RCSB Protein Data Bank and prepare them for use with FoldX.
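The notebook walks through the full pipeline; as a rough standalone illustration of the idea (not the notebook's exact code), a structure can be fetched from RCSB and cleaned with PDBFixer (installed above) along these lines. The entry ID is illustrative.

import urllib.request
from pdbfixer import PDBFixer
from openmm.app import PDBFile  # older OpenMM installs expose this as simtk.openmm.app

pdb_id = "2VVH"  # illustrative entry
urllib.request.urlretrieve(f"https://files.rcsb.org/download/{pdb_id}.pdb", f"{pdb_id}.pdb")

fixer = PDBFixer(filename=f"{pdb_id}.pdb")
fixer.findMissingResidues()
fixer.findMissingAtoms()
fixer.addMissingAtoms()
fixer.removeHeterogens(keepWater=False)  # strip ligands and solvent before FoldX
PDBFile.writeFile(fixer.topology, fixer.positions, open(f"{pdb_id}_fixed.pdb", "w"))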

See ./notebooks/foldx_demo.ipynb for a demonstration of how to use our Python bindings for FoldX, given a starting sequence with known structure.
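The bindings ultimately shell out to the FoldX executable; as a rough illustration (not the repo's API), the executable installed at ~/foldx can also be invoked directly. The RepairPDB command and flags below are assumed from the FoldX 5 command line, so consult the FoldX manual for your version.

import subprocess
from pathlib import Path

foldx = Path.home() / "foldx" / "foldx"  # executable renamed as described in the Installation section
# Flags assumed from the FoldX 5 manual, not taken from the repo's bindings.
subprocess.run(
    [str(foldx), "--command=RepairPDB", "--pdb=2VVH_fixed.pdb", "--pdb-dir=.", "--output-dir=."],
    check=True,
)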

This project uses Hydra for configuration when running from the command line.

We recommend running NSGA-2 first to test your installation:

python scripts/black_box_opt.py optimizer=mf_genetic optimizer/algorithm=nsga2 task=regex tokenizer=protein

For the model-based genetic baseline, run

python scripts/black_box_opt.py optimizer=mb_genetic optimizer/algorithm=soga optimizer.encoder_obj=mll task=regex tokenizer=protein surrogate=multi_task_exact_gp acquisition=nehvi

For the full LaMBO algorithm, run

python scripts/black_box_opt.py optimizer=lambo optimizer.encoder_obj=mlm task=regex tokenizer=protein surrogate=multi_task_exact_gp acquisition=nehvi

To evaluate on the multi-objective RFP (large-molecule) or ZINC (small-molecule) tasks, use task=proxy_rfp tokenizer=protein and task=chem tokenizer=selfies, respectively.

To evaluate on the single-objective ZINC task used in papers like Tripp et al (2020), run

python scripts/black_box_opt.py optimizer=lambo optimizer.encoder_obj=lanmt task=chem_lsbo tokenizer=selfies surrogate=single_task_svgp acquisition=ei encoder=lanmt_cnn surrogate.holdout_ratio=0.1 surrogate.bs=256 surrogate.eval_bs=256 optimizer.resampling_weight=0.5 optimizer.window_size=8

Below we list significant configuration options. See the config files in ./hydra_config for all configurable parameters. Note that any config field can be overridden from the command line, and some configurations are not supported.

Acquisition options

  • nehvi (default, multi-objective)
  • ehvi (multi-objective)
  • ei (single-objective)
  • greedy (single and multi-objective)

Encoder options

  • mlm_cnn (default, substitutions only)
  • mlm_transformer (substitutions only)
  • lanmt_cnn (substitutions, insertions, deletions)
  • lanmt_transformer (substitutions, insertions, deletions)

Optimizer options

  • lambo (default)
  • mb_genetic (Genetic baseline with model-based compound screening)
  • mf_genetic (Model-free genetic baseline)

Algorithm options

  • soga (default, single-objective)
  • nsga2 (multi-objective)

Surrogate options

  • multi_task_exact_gp (default, DKL MTGP regression)
  • single_task_svgp (DKL SVGP regression)
  • single_task_exact_gp (DKL GP regression)
  • string_kernel_exact_gp (not recommended, SSK GP regression)
  • deep_ensemble (MLE regression)

Task options

  • regex (default, maximize counts of 3 bigrams)
  • regex_easy (maximize counts of 2 tokens)
  • chem (ZINC small molecules, maximize LogP and QED)
  • chem_lsbo (ZINC small molecules, maximize penalized LogP)
  • tdc_docking (ZINC small molecules, minimize DRD3 docking affinity and synthetic accessibility)
  • proxy_rfp (FPBase large molecules, maximize stability and SASA)

Tokenizer options

  • protein (default, amino acid vocab for large molecules)
  • selfies (ZINC-derived SELFIES vocab for small molecules)
  • smiles (not recommended, ZINC-derived SMILES vocab for small molecules)

Tests

pytest tests

This project currently has very limited test coverage.

Citation

If you use any part of this code for your own work, please cite

@article{stanton2022accelerating,
  title={Accelerating Bayesian Optimization for Biological Sequence Design with Denoising Autoencoders},
  author={Stanton, Samuel and Maddox, Wesley and Gruver, Nate and Maffettone, Phillip and Delaney, Emily and Greenside, Peyton and Wilson, Andrew Gordon},
  journal={arXiv preprint arXiv:2203.12742},
  year={2022}
}

lambo's People

Contributors

badeok0716, ngruver, samuelstanton


lambo's Issues

Issue with Acquisition Function Calculation in LAMBO Implementation

I'm currently working on biological sequence design experiments based on the LaMBO paper and am considering LaMBO as one of the main baselines.
I thank the authors for providing well-reproducible experiments; I could easily reproduce Figure 3 from this repository.

However, the implementation in this repo seems to have an issue in the calculation of the NoisyEHVI acquisition function.

My concern is about lambo/optimizers/lambo.py, line 335:

batch_acq_val = acq_fn(best_seqs[None, :]).mean().item()

Here best_seqs[None, :] is a numpy array of strings with shape [1, batch_size].

The function call of acq_fn is based on https://github.com/samuelstanton/lambo/blob/main/lambo/acquisitions/monte_carlo.py#L69-L89.

def forward(self, X: array) -> Tensor:
     if isinstance(X, Tensor):
         baseline_X = self._X_baseline
         baseline_X = baseline_X.expand(*X.shape[:-2], -1, -1)
         X_full = torch.cat([baseline_X, X], dim=-2)
     else: 
         baseline_X = copy(self.X_baseline_string) # ensure contiguity
         baseline_X.resize(
             baseline_X.shape[:-(X.ndim)] + X.shape[:-1] + baseline_X.shape[-1:]
         )
         X_full = concatenate([baseline_X, X], axis=-1)
     # Note: it is important to compute the full posterior over `(X_baseline, X)``
     # to ensure that we properly sample `f(X)` from the joint distribution `
     # `f(X_baseline, X) ~ P(f | D)` given that we can already fixed the sampled
     # function values for `f(X_baseline)`
     posterior = self.model.posterior(X_full)
     q = X.shape[-2]
     self._set_sampler(q=q, posterior=posterior)
     samples = self.sampler(posterior)[..., -q:, :]
     # add previous nehvi from pending points
     return self._compute_qehvi(samples=samples) + self._prev_nehvi

Note that:

  • X: numpy array of strings with shape [1, batch_size]
  • self.X_baseline_string: numpy array of sequences with shape [n_base]
  • X_full: numpy array of strings with shape [1, batch_size + n_base]

Hence q = X.shape[-2] equals 1, and

        self._set_sampler(q=q, posterior=posterior)
        samples = self.sampler(posterior)[..., -q:, :] 
        # add previous nehvi from pending points
        return self._compute_qehvi(samples=samples) + self._prev_nehvi

this part only calculates the 1-NEHVI of the last sequence in X (= X[0, -1]). In other words, the other sequences in X (X[0, :-1]) are ignored during the calculation.
(Note that samples is a torch Tensor of shape [n_samples, 1, q, fdim].)
This problem occurs whenever X is a numpy array of sequences with X.ndim == 2 and X.shape[-1] > 1.
I think the problem originates from q = X.shape[-2], which assumes that X contains features extracted from the sequences rather than the raw sequences themselves.
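A tiny standalone repro of the shape mismatch described above (placeholder sequences, not repo code):

import numpy as np

X = np.array([["SEQ_A", "SEQ_B", "SEQ_C"]])  # raw strings, shape (1, batch_size) with batch_size = 3
q = X.shape[-2]
print(X.shape, q)  # (1, 3) 1 -> only one candidate's samples are sliced out downstream
# After featurization the intended shape would be (1, 3, feature_dim), giving q = 3.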

This causes two problems:

  1. [LAMBO] Incorrect logic to determine the best_batch_idx. (https://github.com/samuelstanton/lambo/blob/main/lambo/optimizers/lambo.py#L355-L365)
  2. [MBGA] Incorrect NEHVI calculation in SurrogateTask.
    (https://github.com/samuelstanton/lambo/blob/main/lambo/tasks/surrogate_task.py#L28)

There are two possible fixes, depending on the authors' original intent:

  1. If the authors intended to compute the N-NEHVI of X[0,:] (X: numpy array of shape [1, N]) when calling acq_fn(X):
def forward(self, X: array, debug:bool=True) -> Tensor:
      if isinstance(X, Tensor):
            baseline_X = self._X_baseline
            baseline_X = baseline_X.expand(*X.shape[:-2], -1, -1)
            X_full = torch.cat([baseline_X, X], dim=-2)
      elif X.ndim == 2:
            # X : (1, N)
            # compute the N-NEHVI jointly over all N candidates at once
            assert X.shape[0]==1, f"X type {type(X)}, X ndim {X.ndim}, X shape {X.shape}"
            X = self.model.get_features(X[0], self.model.bs) # X : (N, 16)
            X = X.unsqueeze(0) # X : (1, N, 16)
            baseline_X = self._X_baseline # baseline_X : (1, n, 16)
            baseline_X = baseline_X.expand(*X.shape[:-2], -1, -1) # baseline_X : (1, n, 16)
            X_full = torch.cat([baseline_X, X], dim=-2) # X_full : (1, n+N, 16)
  2. If the authors intended to compute the average 1-NEHVI over the X[0, i] (X: numpy array of shape [1, N]) when calling acq_fn(X):
def forward(self, X: array, debug:bool=True) -> Tensor:
      if isinstance(X, Tensor):
            baseline_X = self._X_baseline
            baseline_X = baseline_X.expand(*X.shape[:-2], -1, -1)
            X_full = torch.cat([baseline_X, X], dim=-2)
      elif X.ndim == 2:
            # X : (1, N)
            # compute the 1-NEHVI of each of the N candidates in parallel
            assert X.shape[0]==1, f"X type {type(X)}, X ndim {X.ndim}, X shape {X.shape}"
            X = self.model.get_features(X[0], self.model.bs) # X : (N, 16)
            X = X.unsqueeze(-2) # X : (N, 1, 16)
            baseline_X = self._X_baseline # baseline_X : (1, n, 16)
            baseline_X = baseline_X.expand(*X.shape[:-2], -1, -1) # baseline_X : (N, n, 16)
            X_full = torch.cat([baseline_X, X], dim=-2) # X_full : (N, n+1, 16)

Since I couldn't find any mention of a parallel acquisition function (q-NEHVI with q > 1) in the LaMBO paper, I implemented fix (2) and got the following results with the same commands (10 seeds).

(Results figure attached to the issue, dated 2024-01-04.)

(I'll update the figure after I finish the LaMBO-Fixed and MBGA-Fixed experiments on the other tasks. Some trials are not finished yet, but fixed versions seem to show better performance on the other tasks.)

Could you please verify if my concern is valid? Also, I'd like to know if the corrected implementation aligns with your original intent.
Please refer to Pull request #13

'NoneType' object

I ran the current version of the code with the command below:

python scripts/black_box_opt.py optimizer=lambo optimizer.encoder_obj=mlm task=proxy_rfp tokenizer=protein surrogate=multi_task_exact_gp acquisition=nehvi

and below is the full output, ending with a TypeError: 'NoneType' object is not iterable error.

wandb: Run `wandb offline` to turn off syncing.

logger:
  _target_: upcycle.logging.DataFrameLogger
  log_dir: data/experiments/test/summer-puddle-110/2023-05-01_06-34-09
task:
  _target_: lambo.tasks.proxy_rfp.proxy_rfp.ProxyRFPTask
  obj_dim: 2
  log_prefix: proxy_rfp
  batch_size: 16
  max_len: 244
  max_num_edits: null
  max_ngram_size: 1
  allow_len_change: false
  num_start_examples: 512
acquisition:
  _target_: lambo.acquisitions.ehvi.NoisyEHVI
  num_samples: 2
  batch_size: 16
encoder:
  _target_: lambo.models.lm_elements.LanguageModel
  name: mlm_cnn
  model:
    _target_: lambo.models.shared_elements.mCNN
    tokenizer:
      _target_: lambo.utils.ResidueTokenizer
    max_len: 244
    embed_dim: 64
    latent_dim: 16
    out_dim: 16
    kernel_size: 5
    p: 0.0
    layernorm: true
    max_len_delta: 0
  batch_size: 32
  num_epochs: 128
  patience: 32
  lr: 0.001
  max_shift: 0
  mask_ratio: 0.125
optimizer:
  _target_: lambo.optimizers.lambo.LaMBO
  _recursive_: false
  num_rounds: 64
  num_gens: 16
  num_opt_steps: 32
  patience: 32
  lr: 0.1
  concentrate_pool: 1
  mask_ratio: 0.125
  resampling_weight: 1.0
  encoder_obj: mlm
  optimize_latent: true
  position_sampler: uniform
  entropy_penalty: 0.01
  window_size: 1
  latent_init: null
  algorithm:
    _target_: pymoo.algorithms.soo.nonconvex.ga.GA
    pop_size: 16
    n_offsprings: null
    sampling:
      _target_: lambo.optimizers.sampler.BatchSampler
      batch_size: 16
    crossover:
      _target_: lambo.optimizers.crossover.BatchCrossover
      prob: 0.25
      prob_per_query: 0.25
    mutation:
      _target_: lambo.optimizers.mutation.LocalMutation
      prob: 1.0
      eta: 16
      safe_mut: false
    eliminate_duplicates: true
tokenizer:
  _target_: lambo.utils.ResidueTokenizer
surrogate:
  _target_: lambo.models.gp_models.MultiTaskExactGP
  max_shift: 0
  mask_size: 0
  bootstrap_ratio: null
  min_num_train: 128
  task_noise_init: 0.25
  gp_lr: 0.005
  enc_lr: 0.005
  bs: 32
  eval_bs: 16
  num_epochs: 256
  holdout_ratio: 0.2
  early_stopping: true
  patience: 32
  eval_period: 2
  out_dim: 2
  feature_dim: 16
  encoder_wd: 0.0001
  rank: null
  task_covar_prior:
    _target_: gpytorch.priors.LKJCovariancePrior
    'n': 2
    eta: 2.0
    sd_prior:
      _target_: gpytorch.priors.SmoothedBoxPrior
      a: 0.0001
      b: 1.0
  data_covar_module:
    _target_: gpytorch.kernels.MaternKernel
    ard_num_dims: 16
    lengthscale_prior:
      _target_: gpytorch.priors.NormalPrior
      loc: 0.7
      scale: 0.01
  likelihood:
    _target_: gpytorch.likelihoods.MultitaskGaussianLikelihood
    num_tasks: 2
    has_global_noise: false
    noise_constraint:
      _target_: gpytorch.constraints.GreaterThan
      lower_bound: 0.0001
seed: 0
trial_id: 0
project_name: lambo
version: v0.2.1
data_dir: data/experiments
exp_name: test
job_name: summer-puddle-110
timestamp: 2023-05-01_06-34-09
log_dir: data/experiments/test
wandb_mode: online
wandb_host: https://api.wandb.ai

GPU available: True
AdRed is non-dominated, adding to start pool
AdRed, [<lambo.utils.FoldxMutation object at 0x2aab699c5c70>]
DsRed.M1 is non-dominated, adding to start pool
DsRed.M1, [<lambo.utils.FoldxMutation object at 0x2aab69545a60>]
DsRed.T4 is non-dominated, adding to start pool
DsRed.T4, [<lambo.utils.FoldxMutation object at 0x2aab699fe580>]
RFP630 is non-dominated, adding to start pool
RFP630, [<lambo.utils.FoldxMutation object at 0x2aab6a1f4b20>]
mRouge is non-dominated, adding to start pool
mRouge, [<lambo.utils.FoldxMutation object at 0x2aab6a1f4b80>]
mScarlet is non-dominated, adding to start pool
mScarlet, [<lambo.utils.FoldxMutation object at 0x2aab6a1f4b80>]
[2023-05-01 06:34:37,890][root][ERROR] - 'NoneType' object is not iterable
Traceback (most recent call last):
  File "scripts/black_box_opt.py", line 56, in main
    metrics = optimizer.optimize(
  File "/storage/hpc/data/nmn5x/LAMBO/lambo/lambo/optimizers/lambo.py", line 75, in optimize
    is_feasible = self.bb_task.is_feasible(candidate_pool)
  File "/storage/hpc/data/nmn5x/LAMBO/lambo/lambo/tasks/base_task.py", line 72, in is_feasible
    is_feasible = np.array([len(cand) <= self.max_len for cand in candidates]).reshape(-1)
  File "/storage/hpc/data/nmn5x/LAMBO/lambo/lambo/tasks/base_task.py", line 72, in <listcomp>
    is_feasible = np.array([len(cand) <= self.max_len for cand in candidates]).reshape(-1)
  File "/storage/hpc/data/nmn5x/LAMBO/lambo/lambo/candidate.py", line 147, in __len__
    tok_idxs = self.tokenizer.encode(self.mutant_residue_seq)
  File "/storage/hpc/data/nmn5x/miniconda/envs/lambo-env-lewis/lib/python3.8/site-packages/cachetools/__init__.py", line 642, in wrapper
    v = func(*args, **kwargs)
  File "/storage/hpc/data/nmn5x/LAMBO/lambo/lambo/utils.py", line 63, in encode
    for char in seq:
TypeError: 'NoneType' object is not iterable

I would be very grateful for any guidance on this issue.
Thank you!

Error when using `chem_lsbo` with `mlm_cnn`

When using the command

python scripts/black_box_opt.py optimizer=lambo optimizer.encoder_obj=lanmt task=chem_lsbo tokenizer=selfies surrogate=single_task_svgp acquisition=ei encoder=mlm_cnn

the following error occurs:

[2023-03-16 00:23:19,683][root][ERROR] - cannot sample n_sample <= 0 samples
Traceback (most recent call last):
  File "scripts/black_box_opt.py", line 57, in main
    metrics = optimizer.optimize(
  File "lambo_test/lambo/lambo/optimizers/lambo.py", line 179, in optimize
    records = self.surrogate_model.fit(
  File "lambo_test/lambo/lambo/models/gp_models.py", line 416, in fit
    return fit_gp_surrogate(**fit_kwargs)
  File "lambo_test/lambo/lambo/models/gp_utils.py", line 160, in fit_gp_surrogate
    start_metrics.update(lanmt_eval_epoch(surrogate.encoder.model, val_loader, split='val'))
  File "lambo_test/lambo/lambo/models/lanmt.py", line 149, in lanmt_eval_epoch
    src_tok_idxs = corrupt_tok_idxs(tgt_tok_idxs, model.tokenizer, model.max_len_delta)
  File "lambo_test/lambo/lambo/models/lanmt.py", line 18, in corrupt_tok_idxs
    rand_idxs = sample_mask(tgt_tok_idxs, tokenizer, mask_size=max_len_delta)
  File "lambo_test/lambo/lambo/models/mlm.py", line 58, in sample_mask
    mask_idxs = torch.multinomial(mask_weights, mask_size, replacement=False)
RuntimeError: cannot sample n_sample <= 0 samples

This seems to be caused by the max_len_delta: 0 setting in lambo/hydra_config/encoder/mlm_cnn.yaml, which sets mask_size to 0.
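A minimal standalone repro of the underlying failure (independent of the repo code): torch.multinomial refuses to draw zero samples, which is what happens once mask_size reaches 0.

import torch

mask_weights = torch.ones(10)
mask_size = 0  # what max_len_delta: 0 ultimately propagates down to sample_mask
torch.multinomial(mask_weights, mask_size, replacement=False)
# RuntimeError: cannot sample n_sample <= 0 samples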

Calculation error in gpytorch

Hello!
I tried to run the model-based genetic baseline by following your sample command.
python scripts/black_box_opt.py optimizer=mb_genetic optimizer/algorithm=soga optimizer.encoder_obj=mll task=regex tokenizer=protein surrogate=multi_task_exact_gp acquisition=nehvi
However, it caused the following error.

[2022-05-10 21:34:54,070][root][ERROR] - Input is not a valid correlation matrix
Traceback (most recent call last):
  File "scripts/black_box_opt.py", line 55, in main
    metrics = optimizer.optimize(
  File "/home/keisuke-yamada/lambo/lambo/optimizers/pymoo.py", line 189, in optimize
    problem = self._create_inner_task(
  File "/home/keisuke-yamada/lambo/lambo/optimizers/pymoo.py", line 389, in _create_inner_task
    records = self.surrogate_model.fit(
  File "/home/keisuke-yamada/lambo/lambo/models/gp_models.py", line 321, in fit
    return fit_gp_surrogate(**fit_kwargs)
  File "/home/keisuke-yamada/lambo/lambo/models/gp_utils.py", line 238, in fit_gp_surrogate
    enc_sup_loss = fit_encoder_only(
  File "/home/keisuke-yamada/lambo/lambo/models/gp_utils.py", line 106, in fit_encoder_only
    loss = gp_train_step(surrogate, optimizer, inputs, targets, mll)
  File "/home/keisuke-yamada/lambo/lambo/models/gp_utils.py", line 91, in gp_train_step
    loss = -mll(output, targets).mean()
  File "/home/keisuke-yamada/lambo/.venv/src/gpytorch/gpytorch/module.py", line 30, in __call__
    outputs = self.forward(*inputs, **kwargs)
  File "/home/keisuke-yamada/lambo/.venv/src/gpytorch/gpytorch/mlls/exact_marginal_log_likelihood.py", line 63, in forward
    res = self._add_other_terms(res, params)
  File "/home/keisuke-yamada/lambo/.venv/src/gpytorch/gpytorch/mlls/exact_marginal_log_likelihood.py", line 43, in _add_other_terms
    res.add_(prior.log_prob(closure(module)).sum())
  File "/home/keisuke-yamada/lambo/.venv/src/gpytorch/gpytorch/priors/lkj_prior.py", line 134, in log_prob
    log_prob_corr = self.correlation_prior.log_prob(correlations)
  File "/home/keisuke-yamada/lambo/.venv/src/gpytorch/gpytorch/priors/lkj_prior.py", line 60, in log_prob
    raise ValueError("Input is not a valid correlation matrix")
ValueError: Input is not a valid correlation matrix

It seems like the code fails to compute a valid correlation matrix in gpytorch.priors.lkj_prior.LKJCovariancePrior.log_prob. Do you have any idea why this happens?

Thanks!

Device error

Hello,
I tried to run the model-based genetic baseline by following your command.

python scripts/black_box_opt.py optimizer=mb_genetic optimizer/algorithm=soga optimizer.encoder_obj=mll task=regex tokenizer=protein surrogate=multi_task_exact_gp acquisition=nehvi

I ran it on a GPU (the log shows GPU available: True), but I get the following error:

Traceback (most recent call last):
  File "scripts/black_box_opt.py", line 55, in main
    metrics = optimizer.optimize(
  File "/home/bcell/home/lambo/lambo/optimizers/pymoo.py", line 189, in optimize
    problem = self._create_inner_task(
  File "/home/bcell/home/lambo/lambo/optimizers/pymoo.py", line 389, in _create_inner_task
    records = self.surrogate_model.fit(
  File "/home/bcell/home/lambo/lambo/models/gp_models.py", line 321, in fit
    return fit_gp_surrogate(**fit_kwargs)
  File "/home/bcell/home/lambo/lambo/models/gp_utils.py", line 208, in fit_gp_surrogate
    enc_sup_loss = fit_encoder_only(
  File "/home/bcell/home/lambo/lambo/models/gp_utils.py", line 76, in fit_encoder_only
    loss = gp_train_step(surrogate, optimizer, inputs, targets, mll)
  File "/home/bcell/home/lambo/lambo/models/gp_utils.py", line 60, in gp_train_step
    loss = -mll(output, targets).mean()
  File "/home/bcell/anaconda3/envs/lambo-env/lib/python3.8/site-packages/gpytorch/module.py", line 30, in __call__
    outputs = self.forward(*inputs, **kwargs)
  File "/home/bcell/anaconda3/envs/lambo-env/lib/python3.8/site-packages/gpytorch/mlls/exact_marginal_log_likelihood.py", line 63, in forward
    res = self._add_other_terms(res, params)
  File "/home/bcell/anaconda3/envs/lambo-env/lib/python3.8/site-packages/gpytorch/mlls/exact_marginal_log_likelihood.py", line 43, in _add_other_terms
    res.add_(prior.log_prob(closure(module)).sum())
  File "/home/bcell/anaconda3/envs/lambo-env/lib/python3.8/site-packages/gpytorch/priors/lkj_prior.py", line 105, in log_prob
    log_prob_corr = self.correlation_prior.log_prob(correlations)
  File "/home/bcell/anaconda3/envs/lambo-env/lib/python3.8/site-packages/gpytorch/priors/lkj_prior.py", line 62, in log_prob
    return super().log_prob(X_cholesky)
  File "/home/bcell/anaconda3/envs/lambo-env/lib/python3.8/site-packages/gpytorch/priors/prior.py", line 27, in log_prob
    return super(Prior, self).log_prob(self.transform(x))
  File "/home/bcell/anaconda3/envs/lambo-env/lib/python3.8/site-packages/torch/distributions/lkj_cholesky.py", line 117, in log_prob
    unnormalized_log_pdf = torch.sum(order * diag_elems.log(), dim=-1)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

How do I fix this error?
Thank you!

mask_ratio during mutation

I am trying to use the LaMBO model with MLM decoding. In the code, the MLM optimization step uses the mask_ratio defined in the config, but the mutation step in lambo.py (lines 246-249) doesn't seem to introduce more than one mask per sequence.

elif self.position_sampler == 'uniform':
    mask_idxs = np.concatenate([
        random.choice(w_idxs) for w_idxs in window_mask_idxs.values()
    ])

Is this intended? Do you think setting the mutation mask ratio to match the MLM optimization step would make exploration more efficient, or would it collapse the latent feature optimization?
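For concreteness, one hypothetical way to draw several positions per window (roughly mask_ratio of each window) rather than a single one; toy data, not the repo's code:

import numpy as np

mask_ratio = 0.125
window_mask_idxs = {0: np.arange(0, 16), 1: np.arange(16, 32)}  # toy windows of maskable positions
mask_idxs = np.concatenate([
    np.random.choice(w_idxs, size=max(1, round(mask_ratio * len(w_idxs))), replace=False)
    for w_idxs in window_mask_idxs.values()
])
print(mask_idxs)  # e.g. two positions per 16-token window instead of one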

Thanks!

Could you release the notebook to reproduce figure 10 of the paper?

Hi Samuel!

Thank you for your repo and great work! The paper reports that single-objective LaMBO can greatly outperform LSBO on the penalized logP task. I found your released notebooks reproducing Figure 1 and Figure 3 of the paper very helpful, and it would be of great help if you could also release the notebook and the wandb logs needed to reproduce Figure 10 in the appendix.


Thanks a lot in advance!

A typo on the pre-processing - pH values for RFPs

Hi, lambo authors.

First of all, thanks for making your code open source. Really interesting work!

I'm currently working on replicating your RFP experiment, and I think I've found a small typo in the preprocessing: when you fetch info about the PDB entries using pypdb, there's a typo in the get_pH function:

# Current implementation
def get_pH(info):
    try:
        return info['expt1_crystal_grow']['ph']
    except KeyError:
        return float('NaN')

I think you meant info['exptl_crystal_grow'][0]['p_h']. The 1 instead of the l is likely throwing a KeyError for all entries. In other words, FoldX was run on all proteins assuming a pH of 7.0, but some of them actually have different pH values; for example, the 2vvh crystal's pH is 8.4.
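With the corrected keys, the function would look roughly like this (a sketch of the suggested fix, keeping the NaN fallback):

def get_pH(info):
    # corrected key names as suggested above; fall back to NaN when the field is absent
    try:
        return info['exptl_crystal_grow'][0]['p_h']
    except (KeyError, IndexError, TypeError):
        return float('NaN')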

Still I don't think it'd change the stability and SASA results all that much.

Problem with installing requirements

Hi!
I have an issue installing the required versions of gpytorch and botorch: I cannot fetch the specific commit hash. Could you please check whether this is a common problem or an issue with my network? If I skip this step and install gpytorch and botorch directly from pip, I encounter problems similar to #3 and #4 that I cannot trivially fix, so I suspect the gpytorch and botorch versions matter. Thanks in advance!

(Screenshots of the pip install errors attached.)

sampler INF NAN

I set up the environment according to your requirements.txt, but after installation the sampler does not sample when I run the optimization command; screenshots of the error are attached. The error occurs at line 307 of lambo.py in your optimizer folder. Any advice would be very helpful for understanding your paper and code. Thank you very much!

Installation and experiment replication [MacOS (M1/ARM)]

Hello,

I'm having trouble replicating the exact environment and results described in the readme.md and requirements.txt.
Namely, running the commands one by one fails at pip install -r requirements.txt --upgrade. The changes required to complete the installation process were the following:

  1. Changing torchvision from 0.11.1 to 0.11.2.
  2. Removing the strict version requirement on vina; there seems to be a bug in one of their __init__.py files.
  3. Installing the Rust compiler, since installing tokenizers fails without it.
  4. Pinning protobuf below 3.20.x, which running the examples shows is required.

It looks like the majority of these issues are specific to macOS (>= 13.5.*, M1/ARM); linux64-based systems don't have them, and there one can install torchvision==0.11.1 and vina as specified.
The protobuf error, however, also occurs on Linux (see below).

Once the environment is set up and we run the protein optimization task, as in

python scripts/black_box_opt.py optimizer=lambo optimizer.encoder_obj=mlm task=proxy_rfp tokenizer=protein surrogate=multi_task_exact_gp acquisition=ehvi trial_id=2

at commit 431b052 ("add LSBO comparison notebook"), we run into NaN values during the computation; see the error below (for completeness I've attached the log of the complete run).

For the sake of replicability we ran on a Linux system, with the environment set up as close to requirements.txt as possible (the environment is also attached as linux_env.txt).

Protobuf

TypeError: Descriptors cannot not be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:
 1. Downgrade the protobuf package to 3.20.x or lower.
 2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).

Experiment Run

[2023-09-28 15:07:36,464][root][ERROR] - Expected parameter logits (Tensor of shape (16, 230, 26)) of distribution Categorical(logits: torch.Size([16, 230, 26])) to satisfy the constraint IndependentConstraint(Real(), 1), but found invalid values:
tensor([[[nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         ...,
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan]],

        [[nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],

test_run.log
linux_env.txt
