This is the official repository of:
How does initial warm-up data influence Bayesian optimization in low-data experimental settings (AC BO Hackathon 2024)
Elton Pan (MIT), Jurgis Ruza (MIT), Pengfei Cai (MIT)
Real-world experiments in chemistry and materials science often involve very small initial datasets (10-100 data points). In this project, we investigate how the 1) size and 2) distribution of the warm-up dataset influence the performance of Bayesian optimization. We propose experiments on a HOMO-LUMO gap minimization task using the well-known QM9 dataset.
First, a k-means clustering algorithm determines the centroids (green), resulting in the clusters shown above. Stratified sampling (i.e. sampling the same number of datapoints per cluster) is then performed. For example, to sample 10 warm-up datapoints, we can draw 2 samples per cluster (see above). We show that stratified sampling is a more efficient way to sample a warm-up dataset (right, `molformer-stratified` vs. `molformer random`).
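The cluster-then-sample procedure above can be sketched as follows. This is a minimal illustration, not the repo's actual implementation (see `kmeans.py` and the split scripts); the function name, cluster count, and random feature matrix are stand-ins:

```python
import numpy as np
from sklearn.cluster import KMeans

def stratified_warmup(X, n_warmup=10, n_clusters=5, seed=0):
    """Pick n_warmup points spread evenly across k-means clusters."""
    rng = np.random.default_rng(seed)
    # Cluster the representations; each point gets a cluster label
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X)
    per_cluster = n_warmup // n_clusters  # e.g. 10 warm-up points -> 2 per cluster
    idx = []
    for c in range(n_clusters):
        members = np.flatnonzero(labels == c)
        # Sample the same number of datapoints from every cluster
        idx.extend(rng.choice(members, size=min(per_cluster, len(members)), replace=False))
    return np.array(idx)

X = np.random.rand(200, 16)  # stand-in for molecular representations
warmup_idx = stratified_warmup(X, n_warmup=10, n_clusters=5)
print(len(warmup_idx))
```

The returned indices would then select the warm-up molecules (and their HOMO-LUMO gaps) used to initialize BO.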
Here, we vary the number of warm-up datapoints from 5-200. We show that for simple representations such as Morgan fingerprints (left), more warm-up samples improve BO performance. However, this is not true for pretrained embeddings such as MolFormer (center), where more warm-up datapoints do not necessarily improve BO performance. In fact, only 20-50 warm-up samples perform best for MolFormer, showing that pretrained embeddings may allow fewer warm-up samples - a common scenario in real-world, low-data BO. Overall, pretrained embeddings are more efficient for optimization in chemical space (right).
The code in this repo has been tested on a Linux machine running Python 3.8.8.

Run the following terminal commands:

- Clone repo to local directory

```
git clone https://github.com/eltonpan/bayes-warmup.git
```

- Set up and activate conda environment

```
cd bayes-warmup
conda create -n bayes-warmup python=3.8
conda activate bayes-warmup
pip install -r requirements.txt
```

- Add conda environment to Jupyter notebook

```
conda install -c anaconda ipykernel
python -m ipykernel install --user --name=bayes-warmup
```

Make sure `bayes-warmup` is the selected environment under the dropdown menu Kernel > Change kernel.
The raw data required to reproduce results in the paper can be found in the `data/` folder. The BO trajectories are saved in the `saving/` folder. Results are visualized in `bo_trajectory_result_analysis.ipynb` (trajectories) and `visualize_pca.ipynb` (PCA plot).
- Get the molecular representations (Morgan fingerprint + MolFormer embeddings) using either:
  - Download `qm9_ECFP6.csv` and `qm9_molformer.csv` and place these 2 files in the `data/` folder (highly encouraged), OR
  - Run `get_ecfp.py` and `get_molformer_embeddings.py` (sub-optimal, takes very long, ~9 hours).
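For reference, the Morgan-fingerprint featurization can be sketched with RDKit (ECFP6 corresponds to radius 3). This is an illustrative sketch, not the contents of `get_ecfp.py`; the 2048-bit width is an assumption:

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

def ecfp6(smiles, n_bits=2048):
    """ECFP6 = Morgan fingerprint with radius 3, folded to n_bits bits."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles}")
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=3, nBits=n_bits)
    return np.array(fp)  # 0/1 vector usable as a BO input representation

vec = ecfp6("CCO")  # ethanol, a QM9-sized molecule
print(vec.shape)
```

The MolFormer embeddings, by contrast, come from a pretrained transformer and are much slower to compute, which is why downloading the precomputed CSVs is encouraged.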
- (Optional) Get the warm-up datasets
  - Run `get_morgan_splits.py` and `get_molformer_splits.py`. This step is optional since the `data/` folder already has the warm-up datasets pre-computed.
- Run the BO experiments using `run_training.py`
  - Example 1: if you would like to run `random` sampling with `morgan` fingerprints, run:

```
python run_training.py --save_path ./saving/morgan/random --data_path ./data/morgan/splits/random/ --test_path ./data/qm9_ECFP6.csv
```

  - Example 2: if you would like to run `stratified` sampling with `molformer` embeddings, run:

```
python run_training.py --save_path ./saving/molformer/stratified --data_path ./data/molformer/splits/stratified/ --test_path ./data/qm9_molformer.csv
```

The above 2 commands will store trajectories in the `saving/` folder.
- Visualize results using `bo_trajectory_result_analysis.ipynb` (trajectories) and `visualize_pca.ipynb` (PCA plot).
```
├── all_combi_trajs.pkl: pickle file of all saved trajectories (objective values vs. iteration)
├── bo_trajectory_result_analysis.ipynb: generate trajectory plots
├── data
│   ├── molformer: splits using molformer embeddings
│   ├── morgan: splits using morgan fingerprints
│   └── qm9.csv: QM9 dataset
├── featurizers
│   └── morgan.py: ECFP6 class
├── figures
│   ├── bo_poster.png
│   ├── bo_results.png
│   ├── bo_trajectory.gif
│   └── stratified.png
├── get_ecfp.py: get morgan fingerprints of molecules
├── get_molformer_embeddings.py: get molformer embeddings of molecules
├── get_molformer_splits.py: get splits based on molformer embeddings
├── get_morgan_splits.py: get splits based on morgan fingerprints
├── kmeans.py: functions for k-means
├── README.md
├── run_training.py: run Bayesian optimization of band gaps
├── saving
│   ├── molformer: raw trajectories (best objective so far and molecules) for molformer
│   └── morgan: raw trajectories (best objective so far and molecules) for morgan
├── visualize_pca.ipynb: visualize BO in PCA space, generate gif
└── visualize.py: helper functions for visualizations
```
If you have any questions, please feel free to contact us at [email protected], [email protected], [email protected]