
AlphaFlow

AlphaFlow is a modified version of AlphaFold, fine-tuned with a flow matching objective, designed for generative modeling of protein conformational ensembles. In particular, AlphaFlow aims to model:

  • Experimental ensembles, i.e., potential conformational states as they would be deposited in the PDB
  • Molecular dynamics ensembles at physiological temperatures

We also provide a similarly fine-tuned version of ESMFold called ESMFlow. Technical details and thorough benchmarking results can be found in our paper, AlphaFold Meets Flow Matching for Generating Protein Ensembles, by Bowen Jing, Bonnie Berger, and Tommi Jaakkola. This repository contains all code, instructions, and model weights necessary to run the method. If you have any questions, feel free to open an issue or reach out at [email protected].

June 2024 update: We have trained a 12-layer version of AlphaFlow-MD+Templates (base and distilled) which runs 2.5x faster than the 48-layer version at a small loss in performance. We recommend considering this model if reference structures (PDB or AlphaFold) are available and runtime is a high priority.

Table of Contents

  1. Installation
  2. Model weights
  3. Running inference
  4. Evaluation scripts
  5. Training
  6. Ensembles
  7. License
  8. Citation

Installation

In an environment with Python 3.9 (for example, conda create -n [NAME] python=3.9), run:

pip install numpy==1.21.2 pandas==1.5.3
pip install torch==1.12.1+cu113 -f https://download.pytorch.org/whl/torch_stable.html
pip install biopython==1.79 dm-tree==0.1.6 modelcif==0.7 ml-collections==0.1.0 scipy==1.7.1 absl-py einops
pip install pytorch_lightning==2.0.4 fair-esm mdtraj wandb
pip install 'openfold @ git+https://github.com/aqlaboratory/openfold.git@103d037'
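As a quick, optional sanity check (our suggestion, not part of the original setup), you can verify that the pinned torch build sees the GPU:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"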

We ran installation on a machine with CUDA 11.6 and have tested with A100 and A6000 GPUs. (See this link for instructions on how to install a specific CUDA version in Conda.)

Model weights

We provide several versions of AlphaFlow (and similarly named versions of ESMFlow).

  • AlphaFlow-PDB—trained on PDB structures to model experimental ensembles from X-ray crystallography or cryo-EM under different conditions
  • AlphaFlow-MD—trained on all-atom, explicit solvent MD trajectories at 300K
  • AlphaFlow-MD+Templates—trained to additionally take a PDB structure as input, and models the corresponding MD ensemble at 300K

For all models, the distilled version runs significantly faster at the cost of some loss of accuracy (benchmarked in the paper).

For AlphaFlow-MD+Templates, the 12l versions have 12 instead of 48 Evoformer layers and run 2.5x faster at a small loss in performance.

AlphaFlow models

Model                    Version        Weights
AlphaFlow-PDB            base           https://alphaflow.s3.amazonaws.com/params/alphaflow_pdb_base_202402.pt
AlphaFlow-PDB            distilled      https://alphaflow.s3.amazonaws.com/params/alphaflow_pdb_distilled_202402.pt
AlphaFlow-MD             base           https://alphaflow.s3.amazonaws.com/params/alphaflow_md_base_202402.pt
AlphaFlow-MD             distilled      https://alphaflow.s3.amazonaws.com/params/alphaflow_md_distilled_202402.pt
AlphaFlow-MD+Templates   base           https://alphaflow.s3.amazonaws.com/params/alphaflow_md_templates_base_202402.pt
AlphaFlow-MD+Templates   distilled      https://alphaflow.s3.amazonaws.com/params/alphaflow_md_templates_distilled_202402.pt
AlphaFlow-MD+Templates   12l-base       https://alphaflow.s3.amazonaws.com/params/alphaflow_12l_md_templates_base_202406.pt
AlphaFlow-MD+Templates   12l-distilled  https://alphaflow.s3.amazonaws.com/params/alphaflow_12l_md_templates_distilled_202406.pt
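For example, the AlphaFlow-PDB base weights can be fetched with any HTTP download tool; one option:

wget https://alphaflow.s3.amazonaws.com/params/alphaflow_pdb_base_202402.pt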

ESMFlow models

Model                    Version        Weights
ESMFlow-PDB              base           https://alphaflow.s3.amazonaws.com/params/esmflow_pdb_base_202402.pt
ESMFlow-PDB              distilled      https://alphaflow.s3.amazonaws.com/params/esmflow_pdb_distilled_202402.pt
ESMFlow-MD               base           https://alphaflow.s3.amazonaws.com/params/esmflow_md_base_202402.pt
ESMFlow-MD               distilled      https://alphaflow.s3.amazonaws.com/params/esmflow_md_distilled_202402.pt
ESMFlow-MD+Templates     base           https://alphaflow.s3.amazonaws.com/params/esmflow_md_templates_base_202402.pt
ESMFlow-MD+Templates     distilled      https://alphaflow.s3.amazonaws.com/params/esmflow_md_templates_distilled_202402.pt

Training checkpoints (from which fine-tuning can be resumed) are available upon request; please reach out if you'd like to collaborate!

Running inference

Preparing input files

  1. Prepare an input CSV with a name and a seqres entry for each row. See splits/atlas_test.csv for examples, and the sketch after this list.
  2. If running an AlphaFlow model, prepare an MSA directory and place the alignments in .a3m format at the following paths: {alignment_dir}/{name}/a3m/{name}.a3m. If you don't have the MSAs, there are two ways to generate them:
    1. Query the ColabFold server with python -m scripts.mmseqs_query --split [PATH] --outdir [DIR].
    2. Download UniRef30 and ColabDB according to https://github.com/sokrypton/ColabFold/blob/main/setup_databases.sh and run python -m scripts.mmseqs_search_helper --split [PATH] --db_dir [DIR] --outdir [DIR].
  3. If running an MD+Templates model, place the template PDB files into a templates directory with filenames matching the names in the input CSV. The PDB files should include only a single chain with no residue gaps.
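For reference, a minimal input CSV might look like the sketch below; the name and sequence here are hypothetical placeholders, not taken from the repository:

name,seqres
example_protein,MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ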

Running the model

The basic command for running inference with AlphaFlow is:

python predict.py --mode alphafold --input_csv [PATH] --msa_dir [DIR] --weights [PATH] --samples [N] --outpdb [DIR]

If running the PDB model, we recommend appending --self_cond --resample for improved performance.

The basic command for running inference with ESMFlow is

python predict.py --mode esmfold --input_csv [PATH] --weights [PATH] --samples [N] --outpdb [DIR]

Additional command line arguments for either model (a combined example follows this list):

  • Use the --pdb_id argument to select (one or more) rows in the CSV. If no argument is specified, inference is run on all rows.
  • If running the MD model with templates, append --templates_dir [DIR].
  • If running any distilled model, append the arguments --noisy_first --no_diffusion.
  • To truncate the inference process for increased precision and reduced diversity, append (for example) --tmax 0.2 --steps 2. The default inference settings correspond to --tmax 1.0 --steps 10. See Appendix B.1 in the paper for more details.
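Putting these together, a hypothetical invocation of the distilled MD+Templates model (all paths and directory names are placeholders) might look like:

python predict.py --mode alphafold --input_csv splits/atlas_test.csv --msa_dir ./msa_dir --templates_dir ./templates --weights alphaflow_md_templates_distilled_202402.pt --samples 10 --noisy_first --no_diffusion --outpdb ./out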

Evaluation scripts

Our ensemble evaluations may be reproduced via the following steps; a consolidated sketch follows the list:

  1. Download the ATLAS dataset by running bash scripts/download_atlas.sh from the desired root directory.
  2. Prepare the ensemble directory with a PDB file for each ATLAS target, each containing 250 structures (see the zipped AlphaFlow ensembles below for examples). Note that some results are not directly comparable if evaluated with a different number of structures.
  3. Run python -m scripts.analyze_ensembles --atlas_dir [DIR] --pdb_dir [DIR] --num_workers [N]. This will produce an analysis file named out.pkl in the pdb_dir.
  4. Run python -m scripts.print_analysis [PATH] [PATH] ... with an arbitrary number of paths to out.pkl files. A formatted comparison table will be printed.
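As a consolidated sketch (directory names are placeholders, and we assume the ATLAS download lands in ./atlas), comparing two ensemble directories could look like:

bash scripts/download_atlas.sh
python -m scripts.analyze_ensembles --atlas_dir atlas --pdb_dir alphaflow_ensembles --num_workers 8
python -m scripts.analyze_ensembles --atlas_dir atlas --pdb_dir baseline_ensembles --num_workers 8
python -m scripts.print_analysis alphaflow_ensembles/out.pkl baseline_ensembles/out.pkl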

Training

Downloading datasets

To download and preprocess the PDB,

  1. Run aws s3 sync --no-sign-request s3://pdbsnapshots/20230102/pub/pdb/data/structures/divided/mmCIF pdb_mmcif from the desired directory.
  2. Run find pdb_mmcif -name '*.gz' | xargs gunzip to extract the mmCIF files.
  3. From the repository root, run python -m scripts.unpack_mmcif --mmcif_dir [DIR] --outdir [DIR] --num_workers [N]. This will preprocess all chains into NPZ files and create a pdb_mmcif.csv index.
  4. Download OpenProteinSet with aws s3 sync --no-sign-request s3://openfold/ openfold from the desired directory.
  5. Run python -m scripts.add_msa_info --openfold_dir [DIR] to produce a pdb_mmcif_msa.csv index with OpenProteinSet MSA lookup.
  6. Run python -m scripts.cluster_chains to produce a pdb_clusters file at 40% sequence similarity (MMseqs2 installation required).
  7. Create MSAs for the PDB validation split (splits/cameo2022.csv) according to the instructions in the previous section.

To download and preprocess the ATLAS MD trajectory dataset,

  1. Run bash scripts/download_atlas.sh from the desired directory.
  2. From the repository root, run python -m scripts.prep_atlas --atlas_dir [DIR] --outdir [DIR] --num_workers [N]. This will preprocess the ATLAS trajectories into NPZ files.
  3. Create MSAs for all entries in splits/atlas.csv according to the instructions in the previous section.

Running training

Before running training, download the pretrained AlphaFold and ESMFold weights into the repository root via

wget https://storage.googleapis.com/alphafold/alphafold_params_2022-12-06.tar
tar -xvf alphafold_params_2022-12-06.tar params_model_1.npz
wget https://dl.fbaipublicfiles.com/fair-esm/models/esmfold_3B_v1.pt

The basic command for training AlphaFlow is

python train.py --lr 5e-4 --noise_prob 0.8 --accumulate_grad 8 --train_epoch_len 80000 --train_cutoff 2018-05-01 --filter_chains \
    --train_data_dir [DIR] \
    --train_msa_dir [DIR] \
    --mmcif_dir [DIR] \
    --val_msa_dir [DIR] \
    --run_name [NAME] [--wandb]

where the PDB NPZ directory, the OpenProteinSet directory, the PDB mmCIF directory, and the validation MSA directory are specified. This training run produces the AlphaFlow-PDB base version. All other models are built off this checkpoint.

To continue training on ATLAS, run

python train.py --normal_validate --sample_train_confs --sample_val_confs --num_val_confs 100 --pdb_chains splits/atlas_train.csv --val_csv splits/atlas_val.csv --self_cond_prob 0.0 --noise_prob 0.9 --val_freq 10 --ckpt_freq 10 \
    --train_data_dir [DIR] \
    --train_msa_dir [DIR] \
    --ckpt [PATH] \
    --run_name [NAME] [--wandb]

where the ATLAS MSA and NPZ directories and AlphaFlow-PDB checkpoints are specified.

To instead train on ATLAS with templates, run with the additional arguments --first_as_template --extra_input --lr 1e-4 --restore_weights_only --extra_input_prob 1.0.

Distillation: to distill a model, append --distillation and supply the --ckpt [PATH] of the model to be distilled. For PDB training, we remove --accumulate_grad 8 and recommend distilling with a shorter --train_epoch_len 16000. Note that --self_cond_prob and --noise_prob will be ignored and can be omitted.
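For instance, a hypothetical PDB distillation command, adapted from the base training command above under these guidelines, might be:

python train.py --distillation --lr 5e-4 --train_epoch_len 16000 --train_cutoff 2018-05-01 --filter_chains \
    --ckpt [PATH] \
    --train_data_dir [DIR] \
    --train_msa_dir [DIR] \
    --mmcif_dir [DIR] \
    --val_msa_dir [DIR] \
    --run_name [NAME] [--wandb]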

ESMFlow: run the same commands with --mode esmfold and --train_cutoff 2020-05-01.

Ensembles

We provide the ensembles sampled from the model which were used for the analyses and results reported in the paper.

AlphaFlow ensembles

Model                    Version        Samples
AlphaFlow-PDB            base           https://alphaflow.s3.amazonaws.com/samples/alphaflow_pdb_base_202402.zip
AlphaFlow-PDB            distilled      https://alphaflow.s3.amazonaws.com/samples/alphaflow_pdb_distilled_202402.zip
AlphaFlow-MD             base           https://alphaflow.s3.amazonaws.com/samples/alphaflow_md_base_202402.zip
AlphaFlow-MD             distilled      https://alphaflow.s3.amazonaws.com/samples/alphaflow_md_distilled_202402.zip
AlphaFlow-MD+Templates   base           https://alphaflow.s3.amazonaws.com/samples/alphaflow_md_templates_base_202402.zip
AlphaFlow-MD+Templates   distilled      https://alphaflow.s3.amazonaws.com/samples/alphaflow_md_templates_distilled_202402.zip
AlphaFlow-MD+Templates   12l-base       https://alphaflow.s3.amazonaws.com/samples/alphaflow_12l_md_templates_base_202406.zip
AlphaFlow-MD+Templates   12l-distilled  https://alphaflow.s3.amazonaws.com/samples/alphaflow_12l_md_templates_distilled_202406.zip

ESMFlow ensembles

Model                    Version        Samples
ESMFlow-PDB              base           https://alphaflow.s3.amazonaws.com/samples/esmflow_pdb_base_202402.zip
ESMFlow-PDB              distilled      https://alphaflow.s3.amazonaws.com/samples/esmflow_pdb_distilled_202402.zip
ESMFlow-MD               base           https://alphaflow.s3.amazonaws.com/samples/esmflow_md_base_202402.zip
ESMFlow-MD               distilled      https://alphaflow.s3.amazonaws.com/samples/esmflow_md_distilled_202402.zip
ESMFlow-MD+Templates     base           https://alphaflow.s3.amazonaws.com/samples/esmflow_md_templates_base_202402.zip
ESMFlow-MD+Templates     distilled      https://alphaflow.s3.amazonaws.com/samples/esmflow_md_templates_distilled_202402.zip

License

MIT. Other licenses may apply to third-party source code noted in file headers.

Citation

@misc{jing2024alphafold,
      title={AlphaFold Meets Flow Matching for Generating Protein Ensembles}, 
      author={Bowen Jing and Bonnie Berger and Tommi Jaakkola},
      year={2024},
      eprint={2402.04845},
      archivePrefix={arXiv},
      primaryClass={q-bio.BM}
}

alphaflow's People

Contributors

bjing2016, eltociear, eunos-1128, jyaacoub, maxbates, y1zhou


alphaflow's Issues

Generating Abnormal PDB results

Thanks for open-sourcing this amazing work, AlphaFlow.
After deploying this project on my computer, I found that the generated PDB results seem abnormal, whether using AlphaFlow or ESMFlow.
Has anyone else encountered this problem?
(attached images: P41440_3, Q15758_1)

Installation fails on CUDA 12

Hi,
I was wondering if there is a new wheel for a CUDA 12 installation? I tried this on a Debian 12 system with CUDA 12, but the setup fails to build the wheel. I also tried CUDA 11.3, where it fails because Debian 12 ships g++ 12 and the build requires g++ <= 10.
Thanks

TypeError: __init__() missing 1 required positional argument: 'no_column_attention'

Hi,
I've been trying to do some experiments using your model and scripts and running into a problem. The error arises when using the following command from a testing directory in the main project location:

python3 ../predict.py \
    --weights ../model_weights/alphaflow_pdb_distilled_202402.pt \
    --mode alphafold \
    --input_csv ghsr.csv \
    --msa_dir ./msas \
    --samples 10 \
    --outpdb ./out \
    --noisy_first \
    --no_diffusion

I get the error:

2024-02-27 08:10:02,292 [iwe547170:52263] [INFO] Loading the model
Traceback (most recent call last):
  File "/media/data/software/alphaflow/workdir/../predict.py", line 138, in <module>
    main()
  File "/home/iwe34/anaconda3/envs/alphaflow/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/media/data/software/alphaflow/workdir/../predict.py", line 84, in main
    model = model_class(**ckpt['hyper_parameters'], training=False)
  File "/media/data/software/alphaflow/alphaflow/model/wrapper.py", line 496, in __init__
    self.model = AlphaFold(config,
  File "/media/data/software/alphaflow/alphaflow/model/alphafold.py", line 77, in __init__
    self.evoformer = EvoformerStack(
TypeError: __init__() missing 1 required positional argument: 'no_column_attention'

To figure out why that happened, I modified the alphaflow/config.py file by manually adding the flag no_column_attention: False, and realized later that this config is only used when the predict.py script is called with the additional flag --original_weights.

However, inspecting the loaded config ckpt from the lines

if args.weights:
    ckpt = torch.load(args.weights, map_location='cpu')
    model = model_class(**ckpt['hyper_parameters'], training=False)

showed that the model weights alphaflow_pdb_distilled_202402.pt doesn't contain the no_column_attention field. It worked fine when I used the original weights params_model_1.npz (apart from getting a CUDA error, another problem).

Simple question: What am I doing wrong? Why can I provide new model weights when these could never be used because the EvoformerStack in openfold requires this argument?

Appropriate `model_config` arguments for initial training upon predicting with long sequence

Hi,

I have some long sequences that can't be predicted using alphaflow default settings.

To deal with long ones I changed arguments to predict as below.

config = model_config(
    'initial_training',
    train=False,
    low_prec=False,
    long_sequence_inference=True
)

Is the initial_training config not appropriate for long-sequence prediction? I'm afraid it results in a decrease in precision. I want conformer ensembles with a wide range of conformations and high precision.

There seems to be no description of the best practice or settings for predicting proteins with long sequences.

Issues with CUDA12 and or G++17

Hi,

I'm trying to install AlphaFlow on a machine with A30 GPUs and CUDA 12.1, and even though I found a compatible PyTorch version I get the following error after running the command pip install 'openfold @ git+https://github.com/aqlaboratory/openfold.git@103d037':

"In file included from /home/raraya/.conda/envs/alpha_flow/lib/python3.9/site-packages/torch/include/torch/extension.h:5,
from openfold/utils/kernel/csrc/softmax_cuda_kernel.cu:18:
/home/raraya/.conda/envs/alpha_flow/lib/python3.9/site-packages/torch/include/torch/csrc/api/include/torch/all.h:4:2: error: #error C++17 or later compatible compiler is required to use PyTorch.
4 | #error C++17 or later compatible compiler is required to use PyTorch.
| ^~~~~
In file included from /home/raraya/.conda/envs/alpha_flow/lib/python3.9/site-packages/torch/include/c10/util/string_view.h:4,
from /home/raraya/.conda/envs/alpha_flow/lib/python3.9/site-packages/torch/include/c10/util/StringUtil.h:6,
from /home/raraya/.conda/envs/alpha_flow/lib/python3.9/site-packages/torch/include/c10/util/Exception.h:5,
from /home/raraya/.conda/envs/alpha_flow/lib/python3.9/site-packages/torch/include/c10/core/Device.h:5,
from /home/raraya/.conda/envs/alpha_flow/lib/python3.9/site-packages/torch/include/ATen/core/TensorBody.h:11,
from /home/raraya/.conda/envs/alpha_flow/lib/python3.9/site-packages/torch/include/ATen/core/Tensor.h:3,
from /home/raraya/.conda/envs/alpha_flow/lib/python3.9/site-packages/torch/include/ATen/Tensor.h:3,
from /home/raraya/.conda/envs/alpha_flow/lib/python3.9/site-packages/torch/include/torch/csrc/autograd/function_hook.h:3,
from /home/raraya/.conda/envs/alpha_flow/lib/python3.9/site-packages/torch/include/torch/csrc/autograd/cpp_hook.h:2,
from /home/raraya/.conda/envs/alpha_flow/lib/python3.9/site-packages/torch/include/torch/csrc/autograd/variable.h:6,
from /home/raraya/.conda/envs/alpha_flow/lib/python3.9/site-packages/torch/include/torch/csrc/autograd/autograd.h:3,
from /home/raraya/.conda/envs/alpha_flow/lib/python3.9/site-packages/torch/include/torch/csrc/api/include/torch/autograd.h:3,
from /home/raraya/.conda/envs/alpha_flow/lib/python3.9/site-packages/torch/include/torch/csrc/api/include/torch/all.h:7,
from /home/raraya/.conda/envs/alpha_flow/lib/python3.9/site-packages/torch/include/torch/extension.h:5,
from openfold/utils/kernel/csrc/softmax_cuda_kernel.cu:18:
/home/raraya/.conda/envs/alpha_flow/lib/python3.9/site-packages/torch/include/c10/util/C++17.h:27:2: error: #error You need C++17 to compile PyTorch
27 | #error You need C++17 to compile PyTorch
| ^~~~~
In file included from /home/raraya/.conda/envs/alpha_flow/lib/python3.9/site-packages/torch/include/torch/csrc/api/include/torch/types.h:3,
from /home/raraya/.conda/envs/alpha_flow/lib/python3.9/site-packages/torch/include/torch/csrc/api/include/torch/data/dataloader_options.h:4,
from /home/raraya/.conda/envs/alpha_flow/lib/python3.9/site-packages/torch/include/torch/csrc/api/include/torch/data/dataloader/base.h:3,
from /home/raraya/.conda/envs/alpha_flow/lib/python3.9/site-packages/torch/include/torch/csrc/api/include/torch/data/dataloader/stateful.h:4,
from /home/raraya/.conda/envs/alpha_flow/lib/python3.9/site-packages/torch/include/torch/csrc/api/include/torch/data/dataloader.h:3,
from /home/raraya/.conda/envs/alpha_flow/lib/python3.9/site-packages/torch/include/torch/csrc/api/include/torch/data.h:3,
from /home/raraya/.conda/envs/alpha_flow/lib/python3.9/site-packages/torch/include/torch/csrc/api/include/torch/all.h:9,
from /home/raraya/.conda/envs/alpha_flow/lib/python3.9/site-packages/torch/include/torch/extension.h:5,
from openfold/utils/kernel/csrc/softmax_cuda_kernel.cu:18:
/home/raraya/.conda/envs/alpha_flow/lib/python3.9/site-packages/torch/include/ATen/ATen.h:4:2: error: #error C++17 or later compatible compiler is required to use ATen.
4 | #error C++17 or later compatible compiler is required to use ATen.
| ^~~~~
error: command '/usr/local/cuda/bin/nvcc' failed with exit code 1"

My gcc version is 11.3

Availability of MD Ensemble Evaluation Scripts

Hi all! Thanks for your work.

I'm reaching out to inquire about the availability of MD ensemble evaluation scripts, particularly for metrics beyond RMSD and RMSF. While these two metrics are relatively straightforward to generate, I've found challenges reproducing others like Root Mean W2-Dist and PCAs.
Could you provide guidance or scripts to assist with calculating these metrics? Thanks very much for your help.

Best,
Shaoning

Training with PDBs

Could you provide a version that can be trained using PDBs?
In addition, you should specify the mdtraj version in the installation instructions; otherwise a newer mdtraj may pull in different versions of numpy and scipy.

predict.py does not run: AttributeError: module 'numpy' has no attribute 'object'

I built the docker file from commit 2c27c69. When running predict.py the following stack trace is generated,

root@3f4467776483:/opt/alphaflow# /opt/conda/bin/python -V
Python 3.9.7

root@3f4467776483:/opt/alphaflow# /opt/conda/bin/python /opt/alphaflow/predict.py 

/opt/conda/lib/python3.9/site-packages/openfold-1.0.1-py3.9-linux-x86_64.egg/openfold/data/templates.py:88: FutureWarning: In the future `np.object` will be defined as the corresponding NumPy scalar.
  "template_domain_names": np.object,
Traceback (most recent call last):
  File "/opt/alphaflow/predict.py", line 29, in <module>
    from alphaflow.data.data_modules import collate_fn
  File "/opt/alphaflow/alphaflow/data/data_modules.py", line 32, in <module>
    from alphaflow.data import data_pipeline, feature_pipeline
  File "/opt/alphaflow/alphaflow/data/data_pipeline.py", line 22, in <module>
    from openfold.data import templates, parsers, mmcif_parsing
  File "/opt/conda/lib/python3.9/site-packages/openfold-1.0.1-py3.9-linux-x86_64.egg/openfold/data/templates.py", line 88, in <module>
    "template_domain_names": np.object,
  File "/opt/conda/lib/python3.9/site-packages/numpy/__init__.py", line 324, in __getattr__
    raise AttributeError(__former_attrs__[attr])
AttributeError: module 'numpy' has no attribute 'object'.
`np.object` was a deprecated alias for the builtin `object`. To avoid this error in existing code, use `object` by itself. Doing this will not modify any behavior and is safe. 
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
    https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations

I note that when the container is built, numpy 1.21.2 is first installed and later uninstalled when mdtraj is built, leaving 1.26.4 instead.

 Stored in directory: /root/.cache/pip/wheels/4f/17/89/f855cce8e6394e9029e1b972cb623c8813b706d3d1ca81832f
Successfully built mdtraj
Installing collected packages: fair-esm, typing-extensions, pyparsing, numpy, astunparse, scipy, mdtraj, pytorch_lightning
  Attempting uninstall: typing-extensions
    Found existing installation: typing-extensions 3.10.0.2
    Uninstalling typing-extensions-3.10.0.2:
      Successfully uninstalled typing-extensions-3.10.0.2
  Attempting uninstall: numpy
    Found existing installation: numpy 1.21.2
    Uninstalling numpy-1.21.2:
      Successfully uninstalled numpy-1.21.2
  Attempting uninstall: scipy
    Found existing installation: scipy 1.7.3
    Uninstalling scipy-1.7.3:
      Successfully uninstalled scipy-1.7.3
  Attempting uninstall: pytorch_lightning
    Found existing installation: pytorch-lightning 1.5.10
    Uninstalling pytorch-lightning-1.5.10:
      Successfully uninstalled pytorch-lightning-1.5.10
Successfully installed astunparse-1.6.3 fair-esm-2.0.0 mdtraj-1.10.0 numpy-1.26.4 pyparsing-3.1.2 pytorch_lightning-2.0.4 scipy-1.13.1 typing-extensions-4.12.2

I am going to try pinning numpy 1.21.2 throughout the pip installations; please advise if there is a better/different route.
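For reference, the minimal re-pin I have in mind (assuming nothing else strictly requires a newer numpy) is simply:

pip install --force-reinstall numpy==1.21.2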

scripts.unpack_mmcif.py reference to outdated betafold?

I tried to preprocess pdbs into NPZ files using "scripts/unpack_mmcif.py"
e.g.:
$ python -m scripts.unpack_mmcif.py --mmcif_dir ../testpdb/data_dir/ --outdir ../testpdb/outesmflow/

but it tries to load betafold:
"from betafold.data.data_pipeline import DataPipeline"

and fails:
"ModuleNotFoundError: No module named 'betafold'"

I tried substituting "alphaflow.data.data_pipeline" for "betafold.data.data_pipeline", but I ran into other issues that make me believe "unpack_mmcif.py" is an outdated file.

Can you confirm this?

Diffusion scheduling code making abnormal protein output

I believe your code has some discrepancies when compared to the pseudocode in your article.


Algorithm 1 Training
Input: training examples of structures, sequences, and MSAs {(S_i, A_i, M_i)}
for all (S_i, A_i, M_i) do
    Extract x_1 ← BetaCarbons(S_i)
    Sample x_0 ~ HarmonicPrior(length(A_i))
    Align x_0 ← RMSDAlign(x_0, x_1)
    Sample t ~ Uniform[0, 1]
    Interpolate x_t ← t · x_1 + (1 − t) · x_0
    Predict Ŝ_i ← AlphaFold(A_i, M_i, x_t, t)
    Optimize loss L = FAPE²(Ŝ_i, S_i)

Does this pseudocode correspond to your code in wrapper.py ModelWrapper.distillation_training_step?


for t, s in zip(schedule[:-1], schedule[1:]):
    output = self.teacher(batch, prev_outputs=prev_outputs)
    pseudo_beta = pseudo_beta_fn(batch['aatype'], output['final_atom_positions'], None)
    noisy = rmsdalign(pseudo_beta, noisy)
    noisy = (s / t) * noisy + (1 - s / t) * pseudo_beta

The same pattern holds in ModelWrapper.inference.

The atoms in the PDB output seem to be clustered together very densely, which makes it an abnormal protein structure.

(image attached)

When training alphaflow from scratch: AttributeError: 'AlphaFoldWrapper' object has no attribute 'extra_msa_stack'

Hi Bowen, I am trying to run alphaflow's train.py, but I hit the error below:
AttributeError: 'AlphaFoldWrapper' object has no attribute 'extra_msa_stack'
I am using:
python train.py --lr 5e-4 --noise_prob 0.8 --accumulate_grad 8 --train_epoch_len 80000 --train_cutoff 2018-05-01 --filter_chains --train_data_dir ../unpack_mmcif_out --train_msa_dir ../openfold/pdb --mmcif_dir ../pdb_mmcif --val_msa_dir ../openfold/alignment_db --run_name alphaflow_train
Do I need to add extra_msa_stack to the AlphaFoldWrapper class, or drop this line?

PPI ensembles

Any chance of adapting this for protein-protein interfaces?

TypeError: __init__() missing 2 required positional arguments: 'opm_first' and 'fuse_projection_weights'

Hello, I put the weights, CSV, and a3m files in folders (specifically, the a3m at /cluster/home/xxx/alphaflow/splits/6DS0_A/a3m/6DS0_A.a3m) and ran the following command:

python predict.py --mode alphafold --input_csv /cluster/home/xxx/alphaflow/splits/6DS0_test.csv --msa_dir /cluster/home/xxx/alphaflow/splits --weights /cluster/home/xxx/alphaflow/splits/alphaflow_pdb_base_202402.pt --samples 200 --outpdb /cluster/home/xxx/alphaflow/splits/output --self_cond --resample

Then I get the error:

2024-02-22 13:45:48,506 [node83:61063] [INFO] Loading the model
Traceback (most recent call last):
  File "/cluster/home/xxx/alphaflow/predict.py", line 132, in <module>
    main()
  File "/cluster/home/xxx/.conda/envs/alphaflow/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/cluster/home/xxx/alphaflow/predict.py", line 78, in main
    model = model_class(**ckpt['hyper_parameters'], training=False)
  File "/cluster/home/xxx/alphaflow/alphaflow/model/wrapper.py", line 496, in __init__
    self.model = AlphaFold(config,
  File "/cluster/home/xxx/alphaflow/alphaflow/model/alphafold.py", line 73, in __init__
    self.extra_msa_stack = ExtraMSAStack(
TypeError: __init__() missing 2 required positional arguments: 'opm_first' and 'fuse_projection_weights'

Could you offer me some help to solve it? Thanks.

PDB as input

Hi @bjing2016! Extremely useful work!
I have some PDBs with jumps in sequences (i.e. excluded IDRs).
I wonder if PDB as input is possible?
If not, I would appreciate it becoming available in the future.

Command options' help messages in predict.py missing

Hi,

Thank you for your impressive work.

There are no help messages for prediction options.

parser.add_argument('--input_csv', type=str, default='splits/transporters_only.csv')
parser.add_argument('--templates_dir', type=str, default='./data')
parser.add_argument('--msa_dir', type=str, default='./alignment_dir')
parser.add_argument('--mode', choices=['alphafold', 'esmfold'], default='alphafold')
parser.add_argument('--samples', type=int, default=10)
parser.add_argument('--steps', type=int, default=10)
parser.add_argument('--outpdb', type=str, default='./outpdb/default')
parser.add_argument('--weights', type=str, default=None)
parser.add_argument('--ckpt', type=str, default=None)
parser.add_argument('--original_weights', action='store_true')
parser.add_argument('--pdb_id', nargs='*', default=[])
parser.add_argument('--subsample', type=int, default=None)
parser.add_argument('--resample', action='store_true')
parser.add_argument('--tmax', type=float, default=1.0)
parser.add_argument('--templates', action='store_true')
parser.add_argument('--no_diffusion', action='store_true', default=False)
parser.add_argument('--self_cond', action='store_true', default=False)
parser.add_argument('--noisy_first', action='store_true', default=False)
parser.add_argument('--runtime_json', type=str, default=None)
parser.add_argument('--no_overwrite', action='store_true', default=False)

Could you add these messages, or describe what each option is for in the README?

I could figure out some of them after reading the paper and README, but some are still not very clear.

Multimer

Hi! Great work!
Is multimer supported as in ESMFold?
I was trying to use a separation token ":" as in ESM but it doesn't seem to work.

Example Input Files Not Working

Great work on the latest version of the paper and thanks for putting this repo out.
I was trying to test the basic inference you outlined using either the ESMFlow or AlphaFlow models and weights and ran into problems at every corner. I'll detail my specific issues below, but repos always see more usage when authors provide at least one full example inference command, so providing one would surely help many people checking out your code. Thanks!

Trying ESMFlow Model

mkdir output
mkdir weights
python predict.py --mode esmfold --input_csv splits/atlas_test.csv --weights weights/esmflow_md_distilled_202402.pt --samples 5 --outpdb output/

Output

2024-02-26 12:54:34,511 [---] [INFO] Loading the model
2024-02-26 12:55:16,878 [---] [INFO] Model has been loaded
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:25<00:00, 5.08s/it]
Traceback (most recent call last):
  File "/---/alphaflow/predict.py", line 132, in <module>
    main()
  File "/---/miniconda3/envs/AlphaFlow/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/---/alphaflow/predict.py", line 126, in main
    f.write(protein.prots_to_pdb(result))
  File "/---/alphaflow/alphaflow/utils/protein.py", line 163, in prots_to_pdb
    prot = to_pdb(prot)
  File "/---/miniconda3/envs/AlphaFlow/lib/python3.9/site-packages/openfold/np/protein.py", line 341, in to_pdb
    chain_index = prot.chain_index.astype(np.int32)
AttributeError: 'NoneType' object has no attribute 'astype'

Tried with esmflow_pdb_base_202402.pt weights as well...same result.

Trying AlphaFlow Model
Preparing the MSA

python -m scripts.mmseqs_query --split splits/atlas_test.csv --outdir output
COMPLETE: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 450/450 [elapsed: 00:02 remaining: 00:00]

SUCCESS!

Running Inference

python predict.py --mode alphafold --input_csv splits/atlas_test.csv --msa_dir output/ --weights weights/alphaflow_pdb_distilled_202402.pt --samples 5 --outpdb output/
2024-02-26 13:17:56,383 [---] [INFO] Loading the model
Traceback (most recent call last):
  File "/---/alphaflow/predict.py", line 132, in <module>
    main()
  File "/---/miniconda3/envs/AlphaFlow/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/---/alphaflow/predict.py", line 78, in main
    model = model_class(**ckpt['hyper_parameters'], training=False)
  File "/---/alphaflow/alphaflow/model/wrapper.py", line 496, in __init__
    self.model = AlphaFold(config,
  File "/---/alphaflow/alphaflow/model/alphafold.py", line 73, in __init__
    self.extra_msa_stack = ExtraMSAStack(
TypeError: __init__() missing 2 required positional arguments: 'opm_first' and 'fuse_projection_weights'

Thanks again for your assistance. Looking forward to trying out this great work.

Colab notebook

Thank you for this fantastic repository! Would it be possible to provide a Google Colab demo for running the selected model? It would be extremely helpful for quick tests.

Thank you!

The size of tensor a (184) must match the size of tensor b (183) at non-singleton dimension 1

I have pasted the error below. I am attempting to use the AlphaFlow MD + Template model and I am using this model + sequence:

https://alphafold.ebi.ac.uk/entry/A0A2P6NC61

predict.py 133
    main()

_contextlib.py 115 decorate_context
    return func(*args, **kwargs)

predict.py 119 main
    prots = model.inference(batch, as_protein=True, noisy_first=args.noisy_first,

wrapper.py 374 inference
    output = self.model(batch, prev_outputs=prev_outputs)

module.py 1532 _wrapped_call_impl
    return self._call_impl(*args, **kwargs)

module.py 1541 _call_impl
    return forward_call(*args, **kwargs)

alphafold.py 240 forward
    extra_pseudo_beta = pseudo_beta_fn(batch['aatype'], batch['extra_all_atom_positions'], None)

feats.py 38 pseudo_beta_fn
    pseudo_beta = torch.where(

RuntimeError: The size of tensor a (184) must match the size of tensor b (183) at non-singleton dimension 1
[!!] 2024-05-15 16:55:47,353 Command 'source activate AlphaFlow; python alphaflow/predict.py --mode alphafold --input_csv alphaflow_input.csv --msa_dir AlphaFlow_MSA_Results --weights alphaflow/alphaflow_md_templates_base_202402.pt --samples 10 --outpdb upload/ --templates_dir alphaflow_template' returned non-zero exit status 1. (main.py:252)

Dockerfile or CUDA 12

Hi,

Thanks for the wonderful work. I am planning on doing some conformational sampling using this work, but unfortunately the hard requirement of CUDA 11.6 is an issue. I've tried different installs on CUDA 12 and can't get it to work, and the machines I have access to are all CUDA 12.

It seems like OpenFold has a branch, pl_upgrades, that supports CUDA 12+. Would it be possible to port this branch or provide a Dockerfile?

Thanks.
