
Benchmarking framework for protein representation learning. Includes a large number of pre-training and downstream task datasets, models, and training/task utilities. (ICLR 2024)

Home Page: https://proteins.sh/

License: MIT License



Protein Workshop


(Figure: Overview of the Protein Workshop)

Documentation

This repository provides the code for the protein structure representation learning benchmark detailed in the paper Evaluating Representation Learning on the Protein Structure Universe (ICLR 2024).

In the benchmark, we implement numerous featurisation schemes, datasets for self-supervised pre-training and downstream evaluation, pre-training tasks, and auxiliary tasks.

The benchmark can be used as a working template for a protein representation learning research project, as a library of drop-in components for your own projects, or as a CLI tool for quickly running protein representation learning evaluation and pre-training configurations.

Processed datasets and pre-trained weights are made available. Downloading datasets is not required; on first use, each dataset will be downloaded and processed from its respective source.

Configuration files to run the experiments described in the manuscript are provided in the proteinworkshop/config/sweeps/ directory.


Installation

Below, we outline how to set up a virtual environment for proteinworkshop. Note that these installation instructions currently target Linux-like systems with NVIDIA CUDA support; Windows and macOS are not currently officially supported.

From PyPI

proteinworkshop is available for installation from PyPI. This enables training of specific configurations via the CLI, or using individual components from the benchmark (such as datasets, featurisers, or transforms) as drop-ins in other projects. Make sure to install PyTorch (specifically version 2.1.2 or newer) using its official pip installation instructions, with CUDA support as desired.

# install `proteinworkshop` from PyPI
pip install proteinworkshop

# install PyTorch Geometric using the (now-installed) CLI
workshop install pyg

# set a custom data directory for file downloads; otherwise, all data will be downloaded to `site-packages`
export DATA_PATH="where/you/want/data/" # e.g., `export DATA_PATH="proteinworkshop/data"`

However, for full exploration we recommend cloning the repository and building from source.

Building from source

With a local virtual environment activated (e.g., one created with conda create -n proteinworkshop python=3.10):

  1. Clone and install the project

    git clone https://github.com/a-r-j/ProteinWorkshop
    cd ProteinWorkshop
    pip install -e .
  2. Install PyTorch (specifically version 2.1.2 or newer) using its official pip installation instructions, with CUDA support as desired

    # e.g., to install PyTorch with CUDA 11.8 support on Linux:
    pip install torch==2.1.2+cu118 torchvision==0.16.2+cu118 torchaudio==2.1.2+cu118 --index-url https://download.pytorch.org/whl/cu118
  3. Then use the newly-installed proteinworkshop CLI to install PyTorch Geometric

    workshop install pyg
  4. Configure paths in .env (optional, will override default paths if set). See .env.example for an example.

  5. Download PDB data:

    python proteinworkshop/scripts/download_pdb_mmtf.py

Tutorials

We provide a five-part series of Jupyter notebook tutorials with examples of how to use and extend proteinworkshop, as outlined below.

  1. Training a new model
  2. Customizing an existing dataset
  3. Adding a new dataset
  4. Adding a new model
  5. Adding a new task

Quickstart

Downloading datasets

Datasets can either be built from the source structures or downloaded from Zenodo. Datasets will be built from source the first time a dataset is used in a run (or by calling the appropriate setup() method in the corresponding datamodule). We provide a CLI tool for downloading datasets:

workshop download <DATASET_NAME>
workshop download pdb
workshop download cath
workshop download afdb_rep_v4
# etc..

If you wish to build datasets from source, we recommend first downloading the entire PDB (in MMTF format, c. 24 Gb) to reuse shared PDB data as much as possible:

workshop download pdb
# or
python proteinworkshop/scripts/download_pdb_mmtf.py

Training a model

Launching an experiment minimally requires specification of a dataset, structural encoder, and task (devices can be specified with trainer=cpu/gpu):

workshop train dataset=cath encoder=egnn task=inverse_folding trainer=cpu env.paths.data=where/you/want/data/
# or
python proteinworkshop/train.py dataset=cath encoder=egnn task=inverse_folding trainer=cpu # or trainer=gpu

This command uses the default configurations in configs/train.yaml, which can be overridden by equivalently named options. For instance, you can use a different input featurisation via the features option, or set the display name of your experiment on wandb via the name option:

workshop train dataset=cath encoder=egnn task=inverse_folding features=ca_bb name=MY-EXPT-NAME trainer=cpu env.paths.data=where/you/want/data/
# or
python proteinworkshop/train.py dataset=cath encoder=egnn task=inverse_folding features=ca_bb name=MY-EXPT-NAME trainer=cpu # or trainer=gpu

Finetuning a model

Finetuning a model additionally requires specification of a checkpoint.

workshop finetune dataset=cath encoder=egnn task=inverse_folding ckpt_path=PATH/TO/CHECKPOINT trainer=cpu env.paths.data=where/you/want/data/
# or
python proteinworkshop/finetune.py dataset=cath encoder=egnn task=inverse_folding ckpt_path=PATH/TO/CHECKPOINT trainer=cpu # or trainer=gpu

Running a sweep/experiment

We can make use of the Hydra wandb sweeper plugin to configure experiments as sweeps, allowing searches over hyperparameters, architectures, pre-training/auxiliary tasks, and datasets.

See proteinworkshop/config/sweeps/ for examples.

  1. Create the sweep with Weights & Biases:

    wandb sweep proteinworkshop/config/sweeps/my_new_sweep_config.yaml
  2. Launch job workers

With wandb:

wandb agent mywandbgroup/proteinworkshop/2wwtt7oy --count 8

Or an example SLURM submission script:

#!/bin/bash
#SBATCH --nodes 1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:1
#SBATCH --array=0-32

source ~/.bashrc
source $(conda info --base)/envs/proteinworkshop/bin/activate

wandb agent mywandbgroup/proteinworkshop/2wwtt7oy --count 1

Reproduce the sweeps performed in the manuscript:

# reproduce the baseline tasks sweep (i.e., those performed without pre-training each model)
wandb sweep proteinworkshop/config/sweeps/baseline_fold.yaml
wandb agent mywandbgroup/proteinworkshop/2awtt7oy --count 8
wandb sweep proteinworkshop/config/sweeps/baseline_ppi.yaml
wandb agent mywandbgroup/proteinworkshop/2bwtt7oy --count 8
wandb sweep proteinworkshop/config/sweeps/baseline_inverse_folding.yaml
wandb agent mywandbgroup/proteinworkshop/2cwtt7oy --count 8

# reproduce the model pre-training sweep
wandb sweep proteinworkshop/config/sweeps/pre_train.yaml
wandb agent mywandbgroup/proteinworkshop/2dwtt7oy --count 8

# reproduce the pre-trained tasks sweep (i.e., those performed after pre-training each model)
wandb sweep proteinworkshop/config/sweeps/pt_fold.yaml
wandb agent mywandbgroup/proteinworkshop/2ewtt7oy --count 8
wandb sweep proteinworkshop/config/sweeps/pt_ppi.yaml
wandb agent mywandbgroup/proteinworkshop/2fwtt7oy --count 8
wandb sweep proteinworkshop/config/sweeps/pt_inverse_folding.yaml
wandb agent mywandbgroup/proteinworkshop/2gwtt7oy --count 8

Embedding a dataset

We provide a utility in proteinworkshop/embed.py for embedding a dataset using a pre-trained model. To run it:

python proteinworkshop/embed.py ckpt_path=PATH/TO/CHECKPOINT collection_name=COLLECTION_NAME

See the embed section of proteinworkshop/config/embed.yaml for additional parameters.

Visualising pre-trained model embeddings for a given dataset

We provide a utility in proteinworkshop/visualise.py for visualising the UMAP embeddings of a pre-trained model for a given dataset. To run it:

python proteinworkshop/visualise.py ckpt_path=PATH/TO/CHECKPOINT plot_filepath=VISUALISATION/FILEPATH.png

See the visualise section of proteinworkshop/config/visualise.yaml for additional parameters.

Performing attribution of a pre-trained model

We provide a utility in proteinworkshop/explain.py for performing attribution of a pre-trained model using integrated gradients.

This will write PDB files for all structures in a dataset for a supervised task, with residue-level attributions stored in the b_factor column. To visualise the attributions, we recommend using the Protein Viewer VSCode extension and changing the 3D representation to colour by Uncertainty/Disorder.

To run the attribution:

python proteinworkshop/explain.py ckpt_path=PATH/TO/CHECKPOINT output_dir=ATTRIBUTION/DIRECTORY

See the explain section of proteinworkshop/config/explain.yaml for additional parameters.
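
Because the attributions are written to the standard b-factor column, any PDB tooling can read them back. For example, a minimal sketch using BioPandas (an assumed choice of parser; the file path is illustrative):

from biopandas.pdb import PandasPdb

# read one of the attributed PDB files written by explain.py (path is illustrative)
ppdb = PandasPdb().read_pdb("ATTRIBUTION/DIRECTORY/example.pdb")

# the residue-level attribution scores live in the b_factor column
attributions = ppdb.df["ATOM"]["b_factor"]
print(attributions.describe())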

Verifying a config

To check that a given configuration composes correctly before launching a run:

python proteinworkshop/validate_config.py dataset=cath features=full_atom task=inverse_folding

Using proteinworkshop modules functionally

One may use the modules (e.g., datasets, models, featurisers, and utilities) of proteinworkshop functionally by importing them directly. When the package is installed from PyPI, this makes building on top of proteinworkshop's assets straightforward and convenient.

For example, to use any datamodule available in proteinworkshop:

from proteinworkshop.datasets.cath import CATHDataModule

datamodule = CATHDataModule(path="data/cath/", pdb_dir="data/pdb/", format="mmtf", batch_size=32)
datamodule.download()

train_dl = datamodule.train_dataloader()

To use any model or featuriser available in proteinworkshop:

from proteinworkshop.models.graph_encoders.dimenetpp import DimeNetPPModel
from proteinworkshop.features.factory import ProteinFeaturiser
from proteinworkshop.datasets.utils import create_example_batch

model = DimeNetPPModel(hidden_channels=64, num_layers=3)
ca_featuriser = ProteinFeaturiser(
    representation="CA",
    scalar_node_features=["amino_acid_one_hot"],
    vector_node_features=[],
    edge_types=["knn_16"],
    scalar_edge_features=["edge_distance"],
    vector_edge_features=[],
)

example_batch = create_example_batch()
batch = ca_featuriser(example_batch)

model_outputs = model(batch)

Read the docs for a full list of modules available in proteinworkshop.

Models

Invariant Graph Encoders

| Name | Source | Protein Specific |
| --- | --- | --- |
| GearNet | Zhang et al. | |
| DimeNet++ | Gasteiger et al. | |
| SchNet | Schütt et al. | |
| CDConv | Fan et al. | |

Equivariant Graph Encoders

(Vector-type)

| Name | Source | Protein Specific |
| --- | --- | --- |
| GCPNet | Morehead et al. | |
| GVP-GNN | Jing et al. | |
| EGNN | Satorras et al. | |

(Tensor-type)

| Name | Source | Protein Specific |
| --- | --- | --- |
| Tensor Field Network | Corso et al. | |
| Multi-ACE | Batatia et al. | |

Sequence-based Encoders

| Name | Source | Protein Specific |
| --- | --- | --- |
| ESM2 | Lin et al. | |

Datasets

To download a (processed) dataset from Zenodo, you can run

workshop download <DATASET_NAME>

where <DATASET_NAME> is given in the first column of the tables below.

Otherwise, simply starting a training run will download and process the data from source.

Structure-based Pre-training Corpuses

Pre-training corpuses (with the exception of pdb, cath, and astral) are provided in FoldComp database format. This format is highly compressed, resulting in very small disk-space requirements despite the large number of structures. pdb is provided as a collection of MMTF files, which are significantly smaller than conventional .pdb or .cif files.
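
For reference, a FoldComp database can also be inspected directly with the upstream foldcomp Python bindings; a minimal sketch following foldcomp's documented API (independent of proteinworkshop):

import foldcomp

# download a FoldComp database (here, the AlphaFold2 SwissProt predictions, c. 2.9 Gb)
foldcomp.setup("afdb_swissprot_v4")

# iterate over (name, pdb_string) pairs without unpacking the database to disk
with foldcomp.open("afdb_swissprot_v4") as db:
    for name, pdb in db:
        print(name, len(pdb))
        break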

| Name | Description | Source | Size | Disk Size | License |
| --- | --- | --- | --- | --- | --- |
| astral | SCOPe domain structures | SCOPe/ASTRAL | – | 2.2 Gb | Publicly available |
| afdb_rep_v4 | Representative structures identified from the AlphaFold database by FoldSeek structural clustering | Barrio-Hernandez et al. | 2.27M Chains | 9.6 Gb | GPL-3.0 |
| afdb_rep_dark_v4 | Dark proteome structures identified by structural clustering of the AlphaFold database | Barrio-Hernandez et al. | ~800k Chains | 2.2 Gb | GPL-3.0 |
| afdb_swissprot_v4 | AlphaFold2 predictions for SwissProt/UniProtKB | Kim et al. | 542k Chains | 2.9 Gb | GPL-3.0 |
| afdb_uniprot_v4 | AlphaFold2 predictions for UniProt | Kim et al. | 214M Chains | 1 Tb | GPL-3.0 / CC-BY 4.0 |
| cath | CATH 4.2 40% split by CATH topologies | Ingraham et al. | ~18k chains | 4.3 Gb | CC-BY 4.0 |
| esmatlas | ESMAtlas predictions (full) | Kim et al. | – | 1 Tb | GPL-3.0 / CC-BY 4.0 |
| esmatlas_v2023_02 | ESMAtlas predictions (v2023_02 release) | Kim et al. | – | 137 Gb | GPL-3.0 / CC-BY 4.0 |
| highquality_clust30 | ESMAtlas high-quality predictions | Kim et al. | 37M Chains | 114 Gb | GPL-3.0 / CC-BY 4.0 |
| igfold_paired_oas | IGFold predictions for paired OAS | Ruffolo et al. | 104,994 paired Ab chains | – | CC-BY 4.0 |
| igfold_jaffe | IGFold predictions for Jaffe2022 data | Ruffolo et al. | 1,340,180 paired Ab chains | – | CC-BY 4.0 |
| pdb | Experimental structures deposited in the RCSB Protein Data Bank | wwPDB consortium | ~800k Chains | 23 Gb | CC0 1.0 |
Additionally, we provide several species-specific compilations (mostly reference species):

| Name | Description | Source |
| --- | --- | --- |
| a_thaliana | Arabidopsis thaliana (thale cress) proteome | AlphaFold2 |
| c_albicans | Candida albicans (a fungus) proteome | AlphaFold2 |
| c_elegans | Caenorhabditis elegans (roundworm) proteome | AlphaFold2 |
| d_discoideum | Dictyostelium discoideum (slime mold) proteome | AlphaFold2 |
| d_melanogaster | Drosophila melanogaster (fruit fly) proteome | AlphaFold2 |
| d_rerio | Danio rerio (zebrafish) proteome | AlphaFold2 |
| e_coli | Escherichia coli (a bacterium) proteome | AlphaFold2 |
| g_max | Glycine max (soybean) proteome | AlphaFold2 |
| h_sapiens | Homo sapiens (human) proteome | AlphaFold2 |
| m_jannaschii | Methanocaldococcus jannaschii (an archaeon) proteome | AlphaFold2 |
| m_musculus | Mus musculus (mouse) proteome | AlphaFold2 |
| o_sativa | Oryza sativa (rice) proteome | AlphaFold2 |
| r_norvegicus | Rattus norvegicus (brown rat) proteome | AlphaFold2 |
| s_cerevisiae | Saccharomyces cerevisiae (brewer's yeast) proteome | AlphaFold2 |
| s_pombe | Schizosaccharomyces pombe (a fungus) proteome | AlphaFold2 |
| z_mays | Zea mays (corn) proteome | AlphaFold2 |
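
For example, to fetch a single species compilation by name:

workshop download h_sapiens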

Supervised Datasets

| Name | Description | Source | License |
| --- | --- | --- | --- |
| antibody_developability | Antibody developability prediction | Chen et al. | CC-BY 3.0 |
| atom3d_msp | Mutation stability prediction | Townshend et al. | MIT |
| atom3d_ppi | Protein-protein interaction prediction | Townshend et al. | MIT |
| atom3d_psr | Protein structure ranking | Townshend et al. | MIT |
| atom3d_res | Residue identity prediction | Townshend et al. | MIT |
| ccpdb_ligands | Ligand binding residue prediction | Agrawal et al. | Publicly available |
| ccpdb_metal | Metal ion binding residue prediction | Agrawal et al. | Publicly available |
| ccpdb_nucleic | Nucleic acid binding residue prediction | Agrawal et al. | Publicly available |
| ccpdb_nucleotides | Nucleotide binding residue prediction | Agrawal et al. | Publicly available |
| deep_sea_proteins | Gene Ontology prediction (Biological Process) | Sieg et al. | Public domain |
| go-bp | Gene Ontology prediction (Biological Process) | Gligorijevic et al. | CC-BY 4.0 |
| go-cc | Gene Ontology prediction (Cellular Component) | Gligorijevic et al. | CC-BY 4.0 |
| go-mf | Gene Ontology prediction (Molecular Function) | Gligorijevic et al. | CC-BY 4.0 |
| ec_reaction | Enzyme Commission (EC) number prediction | Hermosilla et al. | MIT |
| fold_fold | Fold prediction, split at the fold level | Hou et al. | CC-BY 4.0 |
| fold_family | Fold prediction, split at the family level | Hou et al. | CC-BY 4.0 |
| fold_superfamily | Fold prediction, split at the superfamily level | Hou et al. | CC-BY 4.0 |
| masif_site | Protein-protein interaction site prediction | Gainza et al. | Apache 2.0 |
| metal_3d | Zinc binding site prediction | Duerr et al. | MIT |
| ptm | Post-translational modification site prediction | Yan et al. | CC-BY 4.0 |

Tasks

Self-Supervised Tasks

| Name | Description | Source |
| --- | --- | --- |
| inverse_folding | Predict amino acid sequence given structure | |
| residue_prediction | Masked residue type prediction | |
| distance_prediction | Masked edge distance prediction | Zhang et al. |
| angle_prediction | Masked triplet angle prediction | Zhang et al. |
| dihedral_angle_prediction | Masked quadruplet dihedral prediction | Zhang et al. |
| multiview_contrast | Contrastive learning with multiple crops and InfoNCE loss | Zhang et al. |
| structural_denoising | Denoising of atomic coordinates with SE(3) decoders | |
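
For example, a pre-training run pairs one of these tasks with a structure corpus; an illustrative invocation (the exact config names should be checked against proteinworkshop/config/):

workshop train dataset=afdb_rep_v4 encoder=egnn task=multiview_contrast trainer=gpu env.paths.data=where/you/want/data/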

Generic Supervised Tasks

Generic supervised tasks can be applied broadly across datasets; the labels are extracted directly from the PDB structures.

These are likely to be most frequently used with the pdb dataset class, which wraps the PDB dataset curator from Graphein.

| Name | Description | Requires |
| --- | --- | --- |
| binding_site_prediction | Predict ligand binding residues | HETATM ligands (for training) |
| ppi_site_prediction | Predict protein binding residues | graph_y attribute in data objects specifying the desired chain to select interactions for (for training) |
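
For example, a generic supervised task can be paired with the pdb dataset like any other run; an illustrative command (check the exact config names under proteinworkshop/config/):

workshop train dataset=pdb encoder=schnet task=binding_site_prediction trainer=gpu env.paths.data=where/you/want/data/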

Featurisation Schemes

Part of the goal of the proteinworkshop benchmark is to investigate how increasing granularity of structural detail affects performance. To achieve this, we provide several featurisation schemes for protein structures.

Invariant Node Features

N.B. All angular features are provided in [sin, cos]-transformed form, e.g. $\textrm{dihedrals} = [\sin(\phi), \cos(\phi), \sin(\psi), \cos(\psi), \sin(\omega), \cos(\omega)]$; hence their dimensionality is double the number of angles.

| Name | Description | Dimensionality |
| --- | --- | --- |
| residue_type | One-hot encoding of amino acid type | 21 |
| positional_encoding | Transformer-like positional encoding of sequence position | 16 |
| alpha | Virtual torsion angle defined by four $C_\alpha$ atoms of residues $I_{-1}, I, I_{+1}, I_{+2}$ | 2 |
| kappa | Virtual bond angle (bend angle) defined by the three $C_\alpha$ atoms of residues $I_{-2}, I, I_{+2}$ | 2 |
| dihedrals | Backbone dihedral angles $(\phi, \psi, \omega)$ | 6 |
| sidechain_torsions | Sidechain torsion angles $(\chi_{1-4})$ | 8 |
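
To make the [sin, cos] convention above concrete, here is a minimal PyTorch sketch (illustrative; not the library's internal implementation):

import torch

# toy batch of per-residue backbone dihedrals (phi, psi, omega), in radians
angles = torch.randn(8, 3)

# each angle becomes a [sin, cos] pair: [sin(phi), cos(phi), sin(psi), cos(psi), ...]
features = torch.stack([angles.sin(), angles.cos()], dim=-1).reshape(angles.shape[0], -1)
print(features.shape)  # torch.Size([8, 6]), i.e. double the number of angles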

Equivariant Node Features

| Name | Description | Dimensionality |
| --- | --- | --- |
| orientation | Forward and backward node orientation vectors (unit-normalized) | 2 |
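
A sketch of how such forward/backward orientation vectors can be computed from consecutive $C_\alpha$ positions (illustrative; not the library's exact implementation):

import torch
import torch.nn.functional as F

ca = torch.randn(10, 3)  # toy chain of C-alpha coordinates, shape [num_residues, 3]

# forward vectors point to the next residue, backward vectors to the previous one
forward = F.normalize(ca[1:] - ca[:-1], dim=-1)
backward = F.normalize(ca[:-1] - ca[1:], dim=-1)

# pad the chain termini (which lack a neighbour on one side) with zero vectors
forward = torch.cat([forward, torch.zeros(1, 3)], dim=0)
backward = torch.cat([torch.zeros(1, 3), backward], dim=0)

orientations = torch.stack([forward, backward], dim=1)  # [num_residues, 2, 3]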

Edge Construction

We predominantly support two types of edges: $k$-NN and $\epsilon$ edges.

Edge types can be specified as follows:

python proteinworkshop/train.py ... features.edge_types=[knn_16, knn_32, eps_16]

where the suffix after knn or eps specifies $k$ (the number of neighbours) or $\epsilon$ (the distance threshold in angstroms).
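
These correspond to standard $k$-nearest-neighbour and radius graphs. A minimal sketch of the two constructions using torch_cluster, which PyTorch Geometric relies on (illustrative; not proteinworkshop's internal code):

import torch
from torch_cluster import knn_graph, radius_graph

pos = torch.randn(50, 3)  # toy C-alpha coordinates in angstroms

# knn_16: connect each node to its 16 nearest neighbours
knn_edges = knn_graph(pos, k=16)

# eps_16: connect all pairs of nodes within 16 angstroms
eps_edges = radius_graph(pos, r=16.0)

print(knn_edges.shape, eps_edges.shape)  # both are [2, num_edges] edge indices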

Invariant Edge Features

| Name | Description | Dimensionality |
| --- | --- | --- |
| edge_distance | Euclidean distance between source and target nodes | 1 |
| node_features | Concatenated scalar node features of the source and target nodes | Number of scalar node features $\times 2$ |
| edge_type | Type annotation for each edge | 1 |
| sequence_distance | Sequence-based distance between source and target nodes | 1 |
| pos_emb | Structured Transformer-inspired positional embedding of $i - j$ for source node $i$ and target node $j$ | 16 |

Equivariant Edge Features

| Name | Description | Dimensionality |
| --- | --- | --- |
| edge_vectors | Edge directional vectors (unit-normalized) | 1 |
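
An illustrative sketch of how unit-normalized edge vectors are derived from node positions and an edge index:

import torch
import torch.nn.functional as F

pos = torch.randn(50, 3)                     # node coordinates
edge_index = torch.randint(0, 50, (2, 200))  # toy [2, num_edges] edge list

src, dst = edge_index
# direction from each source node to its target node, normalized to unit length
edge_vectors = F.normalize(pos[dst] - pos[src], dim=-1)  # [num_edges, 3]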

For Developers

Dependency Management

We use poetry to manage the project's underlying dependencies and to push updates to the project's PyPI package. To make changes to the project's dependencies, follow the instructions below to (1) install poetry on your local machine; (2) install, add, or upgrade dependencies; and (3) (de)activate the project's virtual environment using poetry:

  1. Install poetry for platform-agnostic dependency management using its installation instructions

    After installing poetry, to avoid potential keyring errors, disable its keyring usage by adding PYTHON_KEYRING_BACKEND=keyring.backends.null.Keyring to your shell's startup configuration and restarting your shell environment (e.g., echo 'export PYTHON_KEYRING_BACKEND=keyring.backends.null.Keyring' >> ~/.bashrc && source ~/.bashrc for a Bash shell environment and likewise for other shell environments).

  2. Install, add, or upgrade project dependencies

      poetry install  # install the latest project dependencies
      # or
      poetry add XYZ  # add dependency `XYZ` to the project
      # or
      poetry show  # list all dependencies currently installed
      # or
      poetry lock  # standardize the (now-)installed dependencies
  3. Activate the newly-created virtual environment following poetry's usage documentation

      # activate the environment on a `posix`-like (e.g., macOS or Linux) system
      source $(poetry env info --path)/bin/activate
      # activate the environment on a `Windows`-like system
      & ((poetry env info --path) + "\Scripts\activate.ps1")
      # if desired, deactivate the environment
      deactivate

Code Formatting

To keep to the code style of the proteinworkshop repository, please format your commits with the following commands before opening a pull request:

# assuming you are located in the `ProteinWorkshop` top-level directory
isort .
autoflake -r --in-place --remove-unused-variables --remove-all-unused-imports --ignore-init-module-imports .
black --config=pyproject.toml .

Documentation

To build a local version of the project's Sphinx documentation web pages:

# assuming you are located in the `ProteinWorkshop` top-level directory
pip install -r docs/.docs.requirements # one-time only
rm -rf docs/build/ && sphinx-build docs/source/ docs/build/ # NOTE: errors can safely be ignored

Citing ProteinWorkshop

Please consider citing proteinworkshop if it proves useful in your work.

@inproceedings{
  jamasb2024evaluating,
  title={Evaluating Representation Learning on the Protein Structure Universe},
  author={Arian R. Jamasb and Alex Morehead and Chaitanya K. Joshi and Zuobai Zhang and Kieran Didi and Simon V. Mathis and Charles Harris and Jian Tang and Jianlin Cheng and Pietro Lio and Tom L. Blundell},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024}
}


proteinworkshop's Issues

Getting predictions for PTM model

Hi

Thanks for your interesting workshop! I'm working on a project where we want to predict (6) classes for each residue in a protein, and I have based my data on the data structure and model of the PTM dataset. I would like to obtain the predicted classes for each residue for each of my protein samples. In the script proteinworkshop/models/base.py, we have been looking at the function compute_loss(), in which we identified y_hat (output) and y (labels), both of which contain node_labels. However, when printing these objects, it seems that all samples in a batch have been put together into one tensor:

y_hat: {'node_embedding': tensor[432, 32] n=13824 (54Kb) x∈[-142.084, 178.354] μ=0.799 σ=32.241 cuda:0, 'graph_embedding': tensor[1, 32] x∈[-2.392e+04, 3.001e+04] μ=345.133 σ=1.225e+04 cuda:0, 'node_label':
tensor[432, 6] n=2592 (10Kb) x∈[0., 1.000] μ=0.333 σ=0.470 cuda:0}

y: tensor[432, 6] n=2592 (10Kb) x∈[0., 1.000] μ=0.167 σ=0.373 cuda:0

In addition, the classes are one-hot-encoded, and we also need the dictionary containing the assigned one-hot-encoding for the classes.

Is there any way to retrieve the predicted sequence of labels for each sample (in order to compare the predictions and the actual sequence of labels)?

Thank you!

Mismatch of dimensions for PTM data

Hi
Thanks for your cool workshop!

I have been trying to run the PTM data, as a test, before I run on my own data.
After training, I would like to extract the model and evaluate it with some of the metrics. I can see that finetune has a test/evaluation setting; however, when loading the checkpoint file I get an error that the dimensions don't match.

What could cause this?
Is there a way to test/evaluate without finetuning the model?

workshop train dataset=ptm encoder=schnet task=multiclass_node_classification trainer=gpu env.paths.data=/dtu/blackhole/17/126583/Post env.paths.output_dir=/dtu/blackhole/17/126583/Post/output_test

(NB: I am using epoch_000.ckpt, as the training didn't finish; I ran out of memory. But that should not be the problem, right?)

workshop finetune dataset=ptm encoder=schnet task=multiclass_node_classification trainer=gpu env.paths.data=/dtu/blackhole/17/126583/Post env.paths.output_dir=/dtu/blackhole/17/126583/Post/output_test ckpt_path=/dtu/blackhole/17/126583/Post/output_test/checkpoints/epoch_000.ckpt

The following is printed from the finetune after the config tree:

DEBUG    Requested GPUs: None.  (config.py:248)
WARNING  You are not using early stopping.  (config.py:164)
Seed set to 52
INFO     Instantiating datamodule:...  (finetune.py:34)
INFO     Instantiating model:...  (finetune.py:39)
[2023-11-29 15:36:26] torch.distributed.nn.jit.instantiator - Created a temporary directory at /tmp/tmpn530817v
[2023-11-29 15:36:26] torch.distributed.nn.jit.instantiator - Writing /tmp/tmpn530817v/_remote_module_non_scriptable.py
INFO     Instantiating encoder...  (base.py:407)
INFO     SchNetModel(hidden_channels=512, num_filters=128, num_interactions=6, num_gaussians=50, cutoff=10.0)  (base.py:409)
INFO     Instantiating decoders...  (base.py:411)
INFO     Building node_label decoder. Output dim 13  (base.py:245)
INFO     {'_target_': 'proteinworkshop.models.decoders.mlp_decoder.MLPDecoder', 'hidden_dim': [128, 128], 'dropout': 0.0, 'activations': ['relu', 'relu', 'none'], 'skip': 'concat', 'out_dim': '${dataset.num_classes}', 'input': 'node_embedding'}  (base.py:248)
INFO     Using skip connection in decoder.  (mlp_decoder.py:123)
INFO     ModuleDict(  (base.py:413)
           (node_label): MLPDecoder(
             (layers): LinearSkipBlock(
               (layers): ModuleList(
                 (0-1): 2 x LazyLinear(in_features=0, out_features=128, bias=True)
                 (2): LazyLinear(in_features=0, out_features=13, bias=True)
               )
               (activations): ModuleList(
                 (0-1): 2 x ReLU()
                 (2): Identity()
               )
               (dropout_layers): ModuleList(
                 (0-1): 2 x Dropout(p=0.0, inplace=False)
               )
             )
           )
         )
INFO     Instantiating losses...  (base.py:415)
INFO     Using losses: {'node_label': CrossEntropyLoss()}  (base.py:417)
INFO     Not using aux loss scaling  (base.py:424)
INFO     Configuring metrics...  (base.py:426)
INFO     ['accuracy', 'f1_score', 'f1_max']  (base.py:428)
INFO     Instantiating featuriser...  (base.py:430)
INFO     ProteinFeaturiser(representation=CA, scalar_node_features=['amino_acid_one_hot'], vector_node_features=[], edge_types=['knn_16'], scalar_edge_features=['edge_distance'], vector_edge_features=[])  (base.py:432)
INFO     Instantiating task transform...  (base.py:434)
INFO     None  (base.py:438)
[11/29/23 15:36:53] INFO  Instantiating callbacks...  (finetune.py:42)
INFO     Instantiating callback <lightning.pytorch.callbacks.ModelCheckpoint>  (callbacks.py:31)
INFO     Instantiating callback <lightning.pytorch.callbacks.EarlyStopping>  (callbacks.py:31)
INFO     Instantiating callback <lightning.pytorch.callbacks.RichModelSummary>  (callbacks.py:31)
INFO     Instantiating callback <lightning.pytorch.callbacks.RichProgressBar>  (callbacks.py:31)
INFO     Instantiating callback <lightning.pytorch.callbacks.LearningRateMonitor>  (callbacks.py:31)
INFO     Instantiating callback <lightning.pytorch.callbacks.EarlyStopping>  (callbacks.py:31)
INFO     Instantiating loggers:...  (finetune.py:47)
INFO     Instantiating logger <lightning.pytorch.loggers.csv_logs.CSVLogger>  (loggers.py:31)
INFO     Instantiating trainer <lightning.pytorch.trainer.Trainer>  (finetune.py:50)
Trainer already configured with model summary callbacks: [<class 'lightning.pytorch.callbacks.rich_model_summary.RichModelSummary'>]. Skipping setting a default `ModelSummary` callback.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
[11/29/23 15:36:54] INFO  Initializing lazy layers...  (finetune.py:59)
[11/29/23 15:36:55] INFO  Found 43903 examples in train  (ptm.py:223)
[11/29/23 15:36:57] INFO  Downloading 789 PDBs...  (ptm.py:172)
100%|██████████| 789/789 [00:08<00:00, 93.15it/s]
[11/29/23 15:37:06] INFO  Unavailable structures: 793  (ptm.py:148)
INFO     Found 2511 examples in test  (ptm.py:223)
INFO     Downloading 64 PDBs...  (ptm.py:172)
100%|██████████| 64/64 [00:03<00:00, 18.72it/s]
[11/29/23 15:37:10] INFO  Unavailable structures: 857  (ptm.py:148)
INFO     Found 2393 examples in val  (ptm.py:223)
INFO     Downloading 40 PDBs...  (ptm.py:172)
100%|██████████| 40/40 [00:03<00:00, 11.38it/s]
[11/29/23 15:37:14] INFO  Unavailable structures: 897  (ptm.py:148)
INFO     Caching unavailable structure list to /dtu/blackhole/17/126583/Post/PostTranslationalModification/ptm_13/unavailable_structures.txt  (ptm.py:154)
INFO     Found 2353 examples in val  (ptm.py:223)
2353it [00:00, 50776.10it/s]
2353
INFO     All structures already processed and overwrite=False. Skipping download.  (base.py:304)
INFO     All structures already processed and overwrite=False. Skipping download.  (base.py:335)
[11/29/23 15:37:16] INFO  Unfeaturized batch: DataProteinBatch(fill_value=1e-05, atom_list=[37], coords=[18125, 37, 3], residues=[32], id=[32], residue_id=[32], residue_type=[18125], chains=[18125], node_y=[18125, 13], x=[18125], amino_acid_one_hot=[18125, 23], batch=[18125], ptr=[33])  (finetune.py:63)
INFO     Featurized batch: DataProteinBatch(fill_value=1e-05, atom_list=[37], coords=[18125, 37, 3], residues=[32], id=[32], residue_id=[32], residue_type=[18125], chains=[18125], node_y=[18125, 13], x=[18125, 23], amino_acid_one_hot=[18125, 23], batch=[18125], ptr=[33], pos=[18125, 3], edge_index=[2, 290000], edge_type=[1, 290000], num_relation=1, edge_attr=[290000, 1])  (finetune.py:65)
[11/29/23 15:37:22] INFO  Model output: {'node_embedding': tensor[18125, 32], 'graph_embedding': tensor[32, 32], 'node_label': tensor[18125, 13]}  (finetune.py:67)
INFO     Loading weights from checkpoint /dtu/blackhole/17/126583/Post/output_test/checkpoints/epoch_000.ckpt...  (finetune.py:72)
INFO     Loading encoder weights: OrderedDict([('embedding.weight', tensor[512, 39]), ('embedding.bias', tensor[512]), ('distance_expansion.offset', tensor[50]), [... remaining SchNet interaction and output-layer weights elided ...]])  (finetune.py:80)
Error executing job with overrides: ['encoder=schnet', 'task=multiclass_node_classification', 'trainer=gpu', 'env.paths.data=/dtu/blackhole/17/126583/Post', 'env.paths.output_dir=/dtu/blackhole/17/126583/Post/output_test', 'ckpt_path=/dtu/blackhole/17/126583/Post/output_test/checkpoints/epoch_000.ckpt']
Traceback (most recent call last):
  File "/zhome/ce/4/118546/deeplearning/env/lib/python3.10/site-packages/proteinworkshop/finetune.py", line 161, in _main
    finetune(cfg)
  File "/zhome/ce/4/118546/deeplearning/env/lib/python3.10/site-packages/proteinworkshop/finetune.py", line 81, in finetune
    err = model.encoder.load_state_dict(encoder_weights, strict=False)
  File "/zhome/ce/4/118546/deeplearning/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for SchNetModel:
	size mismatch for embedding.weight: copying a param with shape torch.Size([512, 39]) from checkpoint, the shape in current model is torch.Size([512, 23]).
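A common workaround for this kind of shape mismatch (a hedged sketch reusing the `model` and `encoder_weights` names from the traceback above, not the project's own fix) is to drop checkpoint keys whose shapes differ from the current model before loading:

    # Sketch only: keep checkpoint tensors whose shapes match the current model,
    # so compatible weights still load while mismatched ones (e.g. an embedding
    # built for a different feature set) are skipped.
    model_state = model.encoder.state_dict()
    filtered_weights = {
        k: v for k, v in encoder_weights.items()
        if k in model_state and v.shape == model_state[k].shape
    }
    model.encoder.load_state_dict(filtered_weights, strict=False)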

`process` function of the `ProteinDataset` class iterates over tuples instead of PDB codes

  • After running the command python proteinworkshop/train.py dataset=masif_site encoder=schnet task=ppi_site_prediction trainer=gpu to train a new PPI site prediction model on the MaSIF-site dataset, I received the following error, because the `process` function of the `ProteinDataset` class iterates over tuples of (index, PDB code) instead of PDB codes alone:
  • Error: FileNotFoundError: (0, '4hcp') not found in raw directory. Are you sure it's downloaded and has the format mmtf? I can confirm that proteinworkshop/data/pdb/4hcp.mmtf exists locally and is not an empty file.
  • Related to the MaSIF-site dataset, its name in the README.md's Supervised Datasets table needs to be corrected to masif_site instead of masif-site:
    | `masif-site` | Protein-protein interaction site prediction | [Gainza et al.](https://www.nature.com/articles/s41592-019-0666-6) | [Apache 2.0](https://github.com/LPDI-EPFL/masif/blob/master/LICENSE) |

`fold_fold` dataset cannot be downloaded

When attempting to select dataset=fold_fold, I received the following file-extension error for the dataset's .ent files:

Invalid format: ent. Must be 'pdb' or 'mmtf'.

For context, I am selecting task=multiclass_graph_classification as well.
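
For what it's worth, `.ent` is the legacy file extension for the standard PDB format, so one possible workaround (an untested sketch; the directory path is a placeholder) is to rename the files before processing:

    from pathlib import Path

    # Untested workaround sketch: .ent files contain standard PDB records, so
    # renaming them to .pdb lets loaders that only accept 'pdb'/'mmtf' read them.
    for ent_file in Path("proteinworkshop/data/fold").glob("*.ent"):
        ent_file.rename(ent_file.with_suffix(".pdb"))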

unable to process mmtf file to pyg

Hi,
I tried to convert the file at '/home/zpw97/github-projs/ProteinWorkshop/proteinworkshop/data/pdb/128d.mmtf.gz' to a PyG graph:

    graph = protein_to_pyg(
        path=path,
        chain_selection=self.chains[i] if self.chains is not None else "all",
        keep_insertions=True,
        store_het=self.store_het,
    )

and got the error:

can't convert np.ndarray of type numpy.object_. The only supported types are: float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint8, and bool.
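
One thing that may be worth trying (a guess based on the error message, not a confirmed fix) is disabling the options that can attach object-dtype annotations to the structure:

    # Guess only: object-dtype arrays often originate from heteroatom or
    # insertion-code handling, so disabling both may sidestep the conversion error.
    graph = protein_to_pyg(
        path="/home/zpw97/github-projs/ProteinWorkshop/proteinworkshop/data/pdb/128d.mmtf.gz",
        chain_selection="all",
        keep_insertions=False,  # assumption: insertion codes can yield object dtype
        store_het=False,        # assumption: heteroatoms can yield object dtype
    )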

Using processed datasets without a local copy of the PDB triggers download of raw data

Consider the following:

workshop download deep_sea_proteins
workshop train task=inverse_folding dataset=deep_sea_proteins

Because the `ProteinDataset` can't find all the `raw_filenames`, it triggers the download loop from the PDB.

If we however run:

workshop download pdb
workshop download deep_sea_proteins
workshop train task=inverse_folding dataset=deep_sea_proteins

Everything works as expected since it sees all the raw files.

Fix

In `ProteinDataset.__init__`, I suggest first checking whether all `processed_filenames` are present and `overwrite=False`; if so, set `raw_filenames=None` to avoid triggering the download loop (see the sketch after this list).

However, if `overwrite=True`, then we of course need to download the raw structures.

  • As part of this fix, we should consistently expose the `overwrite` arg to the user.
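
A minimal sketch of the suggested guard, assuming the attribute names used in this issue (`processed_filenames`, `raw_filenames`, `overwrite`) rather than the exact ones in the codebase:

    import os

    # Sketch only: skip the raw-download path when every processed file already
    # exists on disk and the user has not asked to overwrite it.
    def all_processed_present(processed_dir: str, processed_filenames: list[str]) -> bool:
        return all(
            os.path.exists(os.path.join(processed_dir, f)) for f in processed_filenames
        )

    # Inside ProteinDataset.__init__, before any download logic runs:
    # if not overwrite and all_processed_present(processed_dir, processed_filenames):
    #     raw_filenames = None  # nothing to fetch; load processed graphs directly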

Thoughts @amorehead @chaitjo @Croydon-Brixton ?

Issues with feature-computations

Hi
I want to train a protein model for node prediction as part of a school project, and proteinworkshop seems very promising!

I have a directory of downloaded `.pdb` files from AlphaFold predictions which I want to use for training in my own training loop, as this is part of the project specification.

I have been wanting to use GearNet, SchNet, or DimeNet for the task, but I am having trouble figuring out at which point the protein features are computed, due to the high level of abstraction in the models' forward passes. I am also having a hard time keeping track of the flow of configs and data processing through all the sub-modules. Can you clarify at which point graph features are computed?

Ideally, I would like to create graphs and features using graphein, create protein batches, and train the models on these.
I have been trying to reproduce the approach in the SchNet main call, but am additionally facing issues with tensors being of type float instead of long, mismatched shapes, etc. (minimal exemplifying showcase here).

Is it possible to do this using graphein, or is it necessary to use the `ProteinFeaturiser`? If so, can I somehow apply the featuriser to `.pdb` files, a graphein `ProteinBatch` or graph object, or anything along these lines?

Thank you for your help!
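
For readers with the same question: the featuriser operates on batched graph objects rather than on raw `.pdb` files, so the usual flow is to build a `ProteinBatch` (e.g. with graphein) and then apply the featuriser to it. A minimal sketch, assuming the `ProteinFeaturiser` factory and the `create_example_batch` helper exposed by the package (the feature names follow the package's config conventions and may need adjusting):

    from proteinworkshop.datasets.utils import create_example_batch
    from proteinworkshop.features.factory import ProteinFeaturiser

    # Sketch: a C-alpha featuriser applied to a dummy batch; the resulting batch
    # carries the node/edge features the encoders expect in their forward pass.
    featuriser = ProteinFeaturiser(
        representation="ca",
        scalar_node_features=["amino_acid_one_hot"],
        vector_node_features=[],
        edge_types=["knn_16"],
        scalar_edge_features=["edge_distance"],
        vector_edge_features=[],
    )

    batch = create_example_batch()
    batch = featuriser(batch)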

fail to activate poetry

Hi, I followed the instructions and successfully installed poetry, but I cannot activate it.

(prot_workshop) zpw97@daisy:~/github-projs/ProteinWorkshop$ poetry --version
Poetry (version 1.6.1)
(prot_workshop) zpw97@daisy:~/github-projs/ProteinWorkshop$ source $(poetry env info --path)/bin/activate
bash: /home/zpw97/anaconda3/envs/prot_workshop/bin/activate: No such file or directory

Unable to reproduce the results on fold classification task

Hi! Thanks for this great work. I'm attempting to replicate the results of Table 2 in the paper, but I've noticed that the performance differs from what was reported. I applied the EGNN encoder to the fold classification task without using any auxiliary tasks. The command I used is shown below:

python proteinworkshop/train.py encoder=egnn task=multiclass_graph_classification dataset=fold_fold features=ca_seq +aux_task=none

The performance on each test split in my experiment was fold: 0.28, family: 0.91, super_family: 0.378.


Please let me know if there is any information I need to provide. Thanks!

Required batch attributes for GearNet encoder

The required attributes when using the GearNet encoder are stated as:

    - ``x`` Positions (shape ``[num_nodes, 3]``)
    - ``edge_index`` Edge indices (shape ``[2, num_edges]``)
    - ``edge_type`` Edge types (shape ``[num_edges]``)
    - ``edge_attr`` Edge attributes (shape ``[num_edges, num_edge_features]``)
    - ``num_nodes`` Number of nodes (int)
    - ``batch`` Batch indices (shape ``[num_nodes]``)

In the gear_net_edge_features() method, the attribute `pos` seems to be used, but it is not mentioned as a required attribute. Maybe I misunderstood something, but should the required attributes list instead look like this (with `pos` in place of `x`):

    - ``pos`` Positions (shape ``[num_nodes, 3]``)
    - ``edge_index`` Edge indices (shape ``[2, num_edges]``)
    - ``edge_type`` Edge types (shape ``[num_edges]``)
    - ``edge_attr`` Edge attributes (shape ``[num_edges, num_edge_features]``)
    - ``num_nodes`` Number of nodes (int)
    - ``batch`` Batch indices (shape ``[num_nodes]``)
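
For illustration of why `pos` would be required: geometric edge features of the kind GearNet uses are typically derived from node coordinates. A minimal sketch (not the project's actual implementation):

    import torch

    # Sketch: pairwise edge distances computed from coordinates, the kind of
    # geometric edge feature that needs ``pos`` rather than scalar features ``x``.
    def edge_distances(pos: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        src, dst = edge_index  # edge_index has shape [2, num_edges]
        return (pos[src] - pos[dst]).norm(dim=-1, keepdim=True)  # [num_edges, 1]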

Error encountered during training on GO datasets.

Hi,

I would like first to thank you for this impressive work.

I'm experiencing a problem with training a Graph Convolutional Network (GCN) using the GO datasets in the proteinworkshop framework.

Initially, I encountered a RuntimeError related to the PytorchStreamReader during data loading. I believe this was due to an environment incompatibility. To address it, I set in_memory=False in the dataset configurations (go-mf.yaml and go-cc.yaml).

However, when I ran the training commands:

python proteinworkshop/train.py dataset=go-mf encoder=gcn task=multiclass_graph_classification trainer=gpu
python proteinworkshop/train.py dataset=go-cc encoder=gcn task=multiclass_graph_classification trainer=gpu

I faced a new issue in both cases: the input and target batch sizes did not match.

I've included the full error trace below for your reference. I would greatly appreciate any advice or suggestions to resolve this issue.

Highlight local pip install option in docs

In the documentation, we describe installation from PyPI (`pip install proteinworkshop`) and local installation via poetry. However, the poetry workflow can be replaced with:

git clone https://github.com/a-r-j/ProteinWorkshop/
cd ProteinWorkshop
pip install -e .

`constants.py` not correctly detecting env vars.

  • Currently, `constants.py` looks for `.env` and falls back to a default data path. This needs to be changed to look for `.env` first, then a set environment variable, then the default. Each fallback should also be logged (see the sketch below).
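
A minimal sketch of the suggested lookup order, assuming the `DATA_PATH` variable from the installation docs, `python-dotenv` for `.env` parsing, and loguru for logging (the latter two are assumptions, not necessarily the current implementation):

    import os
    from pathlib import Path

    from dotenv import dotenv_values  # assumption: python-dotenv is available
    from loguru import logger         # assumption: loguru is the project logger

    def resolve_data_path(default: str = "proteinworkshop/data") -> Path:
        """Resolve the data path: .env file -> set env var -> package default."""
        env_file = dotenv_values(".env")
        if env_file.get("DATA_PATH"):
            logger.info("Using DATA_PATH from .env file")
            return Path(env_file["DATA_PATH"])
        if os.environ.get("DATA_PATH"):
            logger.info("Using DATA_PATH from environment variable")
            return Path(os.environ["DATA_PATH"])
        logger.info(f"DATA_PATH not set; falling back to default: {default}")
        return Path(default)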
