
biotrainer's People

Contributors

joaquimgomez, sacdallago, sebief


biotrainer's Issues

torchmetrics missing from requirements

Traceback (most recent call last):
  File "/home/cdallago/miniconda3/envs/python3.8/bin/biotrainer", line 2, in <module>
    from biotrainer.utilities.cli import main
  File "/home/cdallago/git/biotrainer/biotrainer/__init__.py", line 5, in <module>
    import biotrainer.solvers
  File "/home/cdallago/git/biotrainer/biotrainer/solvers/__init__.py", line 2, in <module>
    from .ResidueClassificationSolver import ResidueClassificationSolver
  File "/home/cdallago/git/biotrainer/biotrainer/solvers/ResidueClassificationSolver.py", line 7, in <module>
    from .ClassificationSolver import ClassificationSolver
  File "/home/cdallago/git/biotrainer/biotrainer/solvers/ClassificationSolver.py", line 4, in <module>
    from torchmetrics import Accuracy, Precision, Recall, F1Score, SpearmanCorrCoef, MatthewsCorrCoef
ModuleNotFoundError: No module named 'torchmetrics'

Split annotations are more complicated than necessary

Currently, splits are denoted as follows:

>ID1 SET=train VALIDATION=False
...
>ID2 SET=train VALIDATION=True
...
>ID3 SET=test VALIDATION=False

This could be simplified to:

>ID1 SET=train
...
>ID2 SET=val
...
>ID3 SET=test
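
For illustration, a minimal sketch (function name and default are hypothetical) of how the simplified annotation could be parsed from FASTA headers:

    def parse_split(header: str) -> str:
        """Extract the split name ('train', 'val' or 'test') from a FASTA header."""
        attributes = dict(
            field.split("=") for field in header.split()[1:] if "=" in field
        )
        return attributes.get("SET", "train")  # assumed default: unannotated -> train

    assert parse_split(">ID2 SET=val") == "val"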

Add the possibility to add extra features to the h5 file

At the moment, embeddings are calculated, stored in, and loaded from an embeddings.h5 file. This means that you basically get a vector for each residue (R) or sequence from the file, e.g. for word2vec it would be Rx512. Our models are able to handle an arbitrary feature input length, so it is possible to add extra features to the vector, as long as they have a numerical representation.

Adding this would hence enable the user to add different features to the input other than embeddings (e.g. for SAV prediction, the position of the SAV in the sequence) and would also open the possibility for feature augmentation methods (such as https://www.biorxiv.org/content/10.1101/2022.03.08.483422v1.full).
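
A minimal sketch of the idea, assuming h5py and a per-residue embedding stored under the sequence ID (file name, ID, and the SAV feature are illustrative):

    import h5py
    import numpy as np

    with h5py.File("embeddings.h5", "r") as f:
        embedding = np.array(f["ID1"])              # e.g. word2vec: shape (L, 512)

    # Illustrative extra feature: a one-hot flag marking the SAV position.
    sav_position = 42
    extra = np.zeros((embedding.shape[0], 1), dtype=embedding.dtype)
    extra[sav_position, 0] = 1.0

    augmented = np.concatenate([embedding, extra], axis=1)  # shape (L, 513)

Since the models accept arbitrary feature input lengths, the augmented matrix can be fed in like a regular embedding.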

Support multiple hyperparameters for hold_out cross validation

Once the cross_validation PR is merged, parameter search for nested cross validation will be enabled. It would be nice to extend this behaviour to hold_out cross validation as well. A common use case might be training a model multiple times with different seeds to get an error estimate for the training, as sketched below.
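
A rough sketch of the use case, not biotrainer's actual API (train_and_evaluate is a hypothetical stand-in for one hold_out run):

    import statistics
    import torch

    def train_and_evaluate(seed: int) -> float:
        """Hypothetical stand-in: one hold_out run, returning test accuracy."""
        torch.manual_seed(seed)
        # ... train on the train/val split, evaluate on the hold-out test set ...
        return torch.rand(1).item()  # placeholder result

    accuracies = [train_and_evaluate(seed) for seed in (41, 42, 43, 44, 45)]
    print(f"accuracy: {statistics.mean(accuracies):.3f} "
          f"± {statistics.stdev(accuracies):.3f}")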

Improve user documentation

As this project focuses very much on user experience and usability, the documentation should be improved before releasing it to the public.

Several possible improvements:

  • Document all options for the config.yaml files
  • Write a "First steps" tutorial
  • Document all data standardizations (Missing: Residues to class, Embeddings h5 format (Shape, especially residues_to_class!), correct protocol name)
  • Readme: Standardization on top
  • Add a readme for every example in the examples directory
  • Refactor ADR 002 (trainer module)
  • Add citation
  • Add link to biotrainer

Add mask to input

When experimentally measuring the 3D structure of a protein sequence of length L, it can happen that not all L residues can be experimentally resolved. In such cases, we still know which amino acid was given in the input, but we have no 3D coordinates for it. When we encode the 3D structure as a string of the 3 secondary structure elements "H" (helix), "B" (beta-sheet) and "O" (other), this means that we cannot assign all L residues in the input sequence to one of those classes.
One easy solution would be to simply cut out the residues that were not resolved.
However, this should not be done because a) it creates unrealistic and potentially non-viable protein sequences and b) our language models actually benefit from those additional residues in the input, as they add context.
Therefore, I used a binary mask when training secondary structure. This mask has length L and has a 1 at positions that were resolved and a 0 at positions that were unresolved. During training, I compute the loss only over those elements that have a 1 and ignore those that have a 0. During evaluation, I compute all my metrics only over the subset of residues that have a 1 in the mask and ignore the ones with a 0.
As PyTorch's loss functions already have an 'ignore_index' flag that allows you to easily ignore padded elements (by default this should be -100), you could also simplify the above scenario by simply mapping unresolved residues to -100. In this case, I assume your current code should already handle those, as it will use elements with -100 neither for loss computation nor for computing evaluation metrics.
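
A minimal sketch of the ignore_index approach (shapes and labels are illustrative):

    import torch
    import torch.nn as nn

    num_classes, L = 3, 7                      # H / B / O over a length-7 sequence
    logits = torch.randn(1, num_classes, L)    # model output: B x C x L
    labels = torch.tensor([[0, 1, 2, -100, 0, -100, 2]])  # -100 = unresolved

    loss_fn = nn.CrossEntropyLoss(ignore_index=-100)  # -100 is also the default
    loss = loss_fn(logits, labels)             # averaged over resolved residues only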

Let's discuss this on Friday in more detail if you have any questions about this.

Add residue_to_value protocol

Implement residue to value protocol:

residue_to_value --> Predict a value V for each residue encoded in D dimensions in a sequence of length L. Input BxLxD --> output BxLxV

Also note #21 for the loss function and masking.

The most important point is to define an input data standardization (a minimal shape sketch follows the list below). So, definition of done:

  • residue_to_value protocol implemented and tested
  • data standardization is documented
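
A minimal shape sketch, assuming a simple linear head applied per residue (the head itself is illustrative, not the planned architecture):

    import torch
    import torch.nn as nn

    B, L, D, V = 4, 100, 1024, 1
    head = nn.Linear(D, V)                 # simplest per-residue regression head

    embeddings = torch.randn(B, L, D)      # per-residue embeddings: B x L x D
    values = head(embeddings)              # output: B x L x V
    assert values.shape == (B, L, V)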

Add LoRA layers to fine-tune protein language models for embeddings calculation

Biotrainer is using bio_embeddings to calculate embeddings for the provided sequences. Currently, this does not allow for fine-tuning existing protein language models (pLMs) such as ProtTrans for specific tasks, which might include prediction of subcellular location, secondary structure or protein-protein interaction.

While fine-tuning a full pLM on a specific task is very costly, LoRA (Low-Rank Adaptation of Large Language Models) is one possibility to enable fine-tuning a transformer model with only a fraction of the original model's parameters.

Adding LoRA layers to biotrainer would, therefore, be a meaningful enhancement and would be in line with the overall premise of biotrainer: making protein prediction tasks easily accessible in a standardized and reproducible way. On the other hand, it also requires a significant change in the sequence of operations that biotrainer performs. Currently, all embeddings are loaded or calculated once at the beginning of training. With fine-tuning, embeddings would have to be re-calculated on the fly for every epoch. A possible implementation could replace the current embeddings object with a function that is called every epoch and might be constant if no fine-tuning is applied. Still, major adaptations must be made to the dataloader module. A sketch of the LoRA idea follows the list below.

List of required steps (non-exhaustive):

  • Refactor data loading and embeddings calculation to allow for re-calculation of embeddings for every epoch
  • Implement LoRA layers
  • Add configuration option(s) to enable fine-tuning
  • Add validation and tests for new config option(s)
  • Add documentation
  • Add a fine-tuning example
  • Evaluate implementation on (parts of) the FLIP dataset
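
A sketch of the LoRA idea using the Hugging Face peft library (an assumption; the embedder model and target module names are illustrative and depend on the pLM architecture):

    from transformers import AutoModel
    from peft import LoraConfig, get_peft_model

    base_model = AutoModel.from_pretrained("Rostlab/prot_bert")  # example pLM
    lora_config = LoraConfig(
        r=8,                                  # low-rank dimension
        lora_alpha=16,
        target_modules=["query", "value"],    # attention projections (BERT naming)
        lora_dropout=0.1,
    )
    model = get_peft_model(base_model, lora_config)
    model.print_trainable_parameters()        # only a small fraction is trainable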

Additional material:

  • LoRA paper: https://arxiv.org/abs/2106.09685

residue_to_class -> Remove predictions for padding

Currently, inference works on the longest-sequence principle and returns nonsense for the padding added in a batch. These predictions have to be removed, potentially by swapping to another padding strategy as noted in the TODO in the datasets module (see the sketch below). This needs to wait for the merge of #2 and must happen in #7.
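
A sketch of one possible fix (names are hypothetical): trim per-residue predictions back to the original sequence lengths after batched inference.

    import torch

    def trim_predictions(batched_predictions: torch.Tensor, lengths: list) -> list:
        """batched_predictions: (B, L_max, ...), padded to the longest sequence."""
        return [pred[:length] for pred, length in zip(batched_predictions, lengths)]

    preds = torch.randn(2, 10, 3)               # batch of 2, padded to length 10
    trimmed = trim_predictions(preds, [7, 10])  # first sequence keeps 7 residues
    assert trimmed[0].shape[0] == 7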

Inferencer does not set correct device

When loading the out.yml to create an Inferencer object, the device from the output variables is employed for predictions. However, if the training was done with a GPU and the model is now loaded on a system that only has a CPU available, an error is raised. The Inferencer object should automatically detect whether a GPU is available.
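
A minimal sketch of such auto-detection (the function is illustrative, not the Inferencer's actual code):

    import torch

    def resolve_device(requested: str = None) -> torch.device:
        """Fall back to CPU if the device stored in out.yml is unavailable."""
        if requested == "cuda" and not torch.cuda.is_available():
            return torch.device("cpu")
        return torch.device(requested or ("cuda" if torch.cuda.is_available() else "cpu"))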

ValueError when early stopping was engaged

Encountered a ValueError when early stopping was engaged.
Info:

bio_trainer_config.txt

Error Message:
Traceback (most recent call last):
  File "/mnt/project/otterstedt/biotrainer/run-biotrainer.py", line 6, in <module>
    biotrainer_main(sys.argv[1:])
  File "/mnt/project/otterstedt/biotrainer/biotrainer/utilities/cli.py", line 38, in main
    parse_config_file_and_execute_run(config_path)
  File "/mnt/project/otterstedt/biotrainer/biotrainer/utilities/executer.py", line 126, in parse_config_file_and_execute_run
    out_config = trainer.training_and_evaluation_routine()
  File "/mnt/project/otterstedt/biotrainer/biotrainer/trainers/trainer.py", line 123, in training_and_evaluation_routine
    test_dataset_embeddings = self._create_embeddings_dataset(test_dataset, mode="test")
  File "/mnt/project/otterstedt/biotrainer/biotrainer/trainers/trainer.py", line 184, in _create_embeddings_dataset
    return get_dataset(self._protocol, split)
  File "/mnt/project/otterstedt/biotrainer/biotrainer/datasets/__init__.py", line 28, in get_dataset
    return dataset(samples=samples)
  File "/mnt/project/otterstedt/biotrainer/biotrainer/datasets/EmbeddingsDataset.py", line 10, in __init__
    self.ids, self.inputs, self.targets = zip(
ValueError: not enough values to unpack (expected 3, got 0)

Add additional metrics

At the moment, biotrainer only supports accuracy for class prediction tasks. It would be nice to include more standard metrics:

  • Precision

  • Recall

  • F1

  • Add a specific binary prediction mode to avoid unnecessary output

Issue with dependencies

Hello, I got access to this repository for my master practicum at Rostlab. I just wanted to mention that I had some issues with this project's dependencies. (Occurred in both Windows and Linux.) For example:

Since the provided biotrainer configurations aren't that compatible with my ML task anyway, I decided to implement my own ML model, but for future users, maybe consider providing a Docker image or something similar :)

Metrics Calculation device handling is problematic

The MetricsCalculator currently has to move both predictions and targets to cpu, because otherwise it would conflict with the sklearn metrics calculation:

Traceback (most recent call last):
  File "biotrainer/biotrainer.py", line 6, in <module>
    biotrainer_main(sys.argv[1:])
  File "biotrainer/biotrainer/utilities/cli.py", line 24, in main
    parse_config_file_and_execute_run(arguments.config_path[0])
  File "biotrainer/biotrainer/utilities/executer.py", line 45, in parse_config_file_and_execute_run
    out_config = execute(output_dir=input_file_path / "output", **original_config)
  File "biotrainer/biotrainer/utilities/executer.py", line 28, in execute
    return get_trainer(**{**kwargs, **output_vars})
  File "biotrainer/biotrainer/trainers/__init__.py", line 14, in get_trainer
    return trainer.pipeline()
  File "biotrainer/biotrainer/trainers/Trainer.py", line 133, in pipeline
    self._do_and_log_training()
  File "biotrainer/biotrainer/trainers/Trainer.py", line 172, in _do_and_log_training
    _ = self._solver.train(self._train_loader, self._val_loader)
  File "biotrainer/biotrainer/solvers/Solver.py", line 112, in train
    iteration_result = self.training_iteration(X, y, step=len(epoch_iterations) * len(
  File "biotrainer/biotrainer/solvers/Solver.py", line 200, in training_iteration
    metrics = self.metrics_calculator.calculate_metrics(y, y_hat)
  File "biotrainer/biotrainer/solvers/MetricsCalculator.py", line 56, in calculate_metrics
    metrics_dict[metric_name] = metric_algo(y, y_hat)
  File "biotrainer/biotrainer/solvers/MetricsCalculator.py", line 17, in _residue_to_class_accuracy
    unmasked_accuracy = metrics.accuracy_score(flat_y, flat_predicted_classes, normalize=False)
  File "anaconda3/envs/bio_embeddings/lib/python3.8/site-packages/sklearn/utils/validation.py", line 63, in inner_f
    return f(*args, **kwargs)
  File "anaconda3/envs/bio_embeddings/lib/python3.8/site-packages/sklearn/metrics/_classification.py", line 202, in accuracy_score
    y_type, y_true, y_pred = _check_targets(y_true, y_pred)
  File "anaconda3/envs/bio_embeddings/lib/python3.8/site-packages/sklearn/metrics/_classification.py", line 85, in _check_targets
    type_pred = type_of_target(y_pred)
  File "anaconda3/envs/bio_embeddings/lib/python3.8/site-packages/sklearn/utils/multiclass.py", line 261, in type_of_target
    if is_multilabel(y):
  File "anaconda3/envs/bio_embeddings/lib/python3.8/site-packages/sklearn/utils/multiclass.py", line 147, in is_multilabel
    y = np.asarray(y)
  File "anaconda3/envs/bio_embeddings/lib/python3.8/site-packages/torch/_tensor.py", line 643, in __array__
    return self.numpy()
TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.

Possible solutions

Just use .cpu()

Easiest solution, but I do not know how much it could impact performance on GPUs.

Use pytorch - torchmetrics

This library has quite a lot of metrics to offer and handles the device automatically. It would replace sklearn; see the sketch below.
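
A small sketch of the torchmetrics option, using the current torchmetrics API (metric and shapes are illustrative): metrics live on the same device as the tensors, so no manual .cpu() calls are needed.

    import torch
    from torchmetrics import Accuracy

    device = "cuda" if torch.cuda.is_available() else "cpu"
    accuracy = Accuracy(task="multiclass", num_classes=3).to(device)

    y_hat = torch.randint(0, 3, (32,), device=device)  # predictions stay on device
    y = torch.randint(0, 3, (32,), device=device)
    print(accuracy(y_hat, y))                          # no device transfer needed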

Create tutorial on how to use a custom embedder

It would be nice to have a tutorial on how to use custom embedders with biotrainer. This way, new protein language models could be used directly in biotrainer without having to calculate the embeddings beforehand.

Improve error messages for non-matching sequence and labels files

At the moment, if there is a sequence in the sequence file that is not included in the labels file, a "cryptic" error message is thrown:

# TargetManager, line 89
# if len(sequence) != self._id2target[sequence.id].size:
KeyError: 'Seq3'

Can be worked on together with #24 in one branch.

Per-residue predictions differ between batches and single input embeddings

We've identified an issue where per-residue predictions for protein sequences differ when processed as part of a batch versus individually. This inconsistency especially affects the CNN model and likely also the LightAttention model, particularly for residues at the end of sequences.

Key observations:

Batch processing vs. single input:
    * Predictions for the same sequence differ when processed in a batch compared to individually.
    * Differences are more pronounced for residues at the end of sequences.

Padding effects:
    * The current implementation doesn't properly handle padded sequences in batches.
    * This leads to inconsistencies, especially for shorter sequences in a batch.

Model architecture considerations:
    * Both CNN and LightAttention models are affected.
    * The issue is more noticeable in the CNN model due to its convolutional nature.

Normalization layers:
    * BatchNorm layers in the LightAttention model contributed to the discrepancy.
    * Replacing BatchNorm with LayerNorm partially mitigated the issue.

Proposed solutions:

Implement mask-aware processing:
    * Introduce a mask to identify non-padded elements in batched inputs.
    * Modify model forward passes to respect this mask.

Consistent padding strategy:
    * Implement custom padding that doesn't affect original sequence lengths.
    * Use F.pad for explicit control over padding in convolutional layers (see the sketch below).
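
A sketch of mask-aware convolution under stated assumptions (zero-padded B x D x L inputs, odd kernel size; the helper is illustrative):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def masked_conv_forward(conv: nn.Conv1d, x: torch.Tensor, lengths: torch.Tensor):
        """x: (B, D, L_max) zero-padded batch; lengths: original lengths, shape (B,)."""
        mask = (torch.arange(x.shape[-1], device=x.device)[None, :]
                < lengths[:, None])             # (B, L_max), True on real residues
        x = x * mask.unsqueeze(1)               # zero out padding before the convolution
        pad = conv.kernel_size[0] // 2          # assumes an odd kernel size
        out = conv(F.pad(x, (pad, pad)))        # explicit, length-preserving padding
        return out * mask.unsqueeze(1)          # padded positions stay zero afterwards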


Consider adding a test function like this to test_inference.py:

    def test_single(self):
        r2c_dict = self.inferencer_r2c.from_embeddings(self.per_residue_embeddings,
                                                       include_probabilities=True)

        r2c_single_result_dict = {}
        for seq_id, emb in self.per_residue_embeddings.items():
            r2c_single_result_emb = self.inferencer_r2c.from_embeddings({seq_id: emb},
                                                                        include_probabilities=True)
            r2c_single_result = r2c_single_result_emb["mapped_probabilities"][seq_id]
            r2c_single_result_dict[seq_id] = r2c_single_result

        prediction_errors = self._compare_predictions(r2c_dict["mapped_probabilities"],
                                                      r2c_single_result_dict)
        if len(prediction_errors) > 0:
            print(prediction_errors)
        self.assertTrue(len(prediction_errors) == 0)

Issue written together with Anthropic Claude Sonnet 3.5

Config - Embeddings - Targets pipeline is inefficient

Currently, the config file is loaded first, but not completely sanity checked yet: for example, biotrainer does not check whether the input files actually exist, so embeddings might be loaded without the fasta file existing, resulting in an error that consumes much more time than necessary.

After that, the embeddings are loaded or calculated. Only then are the input fasta file(s) resolved and sanity checked. This might lead to some inefficiencies or inconsistencies (#46). There are also some pending improvements that could be achieved by a refactoring (#50).
In the development branch, there is also a new config option called "limited_sample_size", which limits the number of samples to train on. Still, embeddings currently have to be calculated for all sequences. Changing the pipeline will also help here.

So, I suggest the following refactoring:

  1. Add a real module that does a complete sanity check of the config file (improving on config.py)
  2. Do target loading and mapping before embeddings calculation
  3. Do embeddings calculation depending on sequence - targets mapping

Update dependencies before release

Before the release of the paper, the dependencies in the pyproject.toml file should be updated once to the latest possible versions (without conflicts).

Check dataset creation from list

At the moment, we get the following warning from PyTorch when creating the datasets in PredictionIOHandler (line 135ff.):

2022-05-30 12:02:48,905 WARNING biotrainer/biotrainer/trainers/PredictionIOHandler.py:136: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. 
Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at  ../torch/csrc/utils/tensor_new.cpp:201.)
  idx: (torch.tensor(id2emb[idx]), torch.tensor(self._id2target[idx])) for idx in training_ids
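
The fix suggested by the warning itself, sketched with illustrative shapes: stack the list into a single numpy array before creating the tensor.

    import numpy as np
    import torch

    embeddings_list = [np.random.rand(128) for _ in range(1000)]

    slow = torch.tensor(embeddings_list)                # triggers the UserWarning
    fast = torch.from_numpy(np.array(embeddings_list))  # single contiguous copy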

Enable URLs for embedding files

Currently, biotrainer only supports local files. However, it would be convenient to be able to download embeddings directly from the Internet (http(s), ftp, etc.). This would also allow biotrainer to be used in a more service-oriented way, because developers would only need to rely on URLs to correct embedding files to run the AutoEval pipeline.

Implementation-wise, it should first be checked whether the given path is actually a local file; only if it is not should it be treated as a URL and downloaded, as sketched below.
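
A hedged sketch of that check using only the standard library (function name and supported schemes are assumptions):

    import shutil
    import urllib.request
    from pathlib import Path
    from urllib.parse import urlparse

    def resolve_embeddings_file(path_or_url: str, download_dir: Path) -> Path:
        if Path(path_or_url).is_file():              # local file takes precedence
            return Path(path_or_url)
        if urlparse(path_or_url).scheme in ("http", "https", "ftp"):
            target = download_dir / Path(urlparse(path_or_url).path).name
            with urllib.request.urlopen(path_or_url) as response, open(target, "wb") as f:
                shutil.copyfileobj(response, f)      # stream the download to disk
            return target
        raise FileNotFoundError(f"{path_or_url} is neither a file nor a supported URL")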

elapsed_time_total is misleading

Currently, the elapsed_time_total variable in the output file reports only the training time in seconds. It should be changed to indicate the complete time interval from the start of the program to its end, and be supported by other variables for further information:

  • elapsed_time_embedding: Time for embedding including loading the embeddings file
  • elapsed_time_training: Time for training all splits of the model
  • elapsed_time_testing: Time for running the trained model on the test file

Create CI and docker image

A biotrainer docker image should be released with every new version. For this, we need a CI to run on every merged PR to the main branch.

Steps for Dockerfile:

  • Update Ubuntu Version
  • Update Cuda Version
  • Remove dependency on bio_embeddings (maybe add example in examples directory how to build on the Docker image)

Add random comparison baseline

As a researcher, it would be nice to have an automatic random baseline as a comparison for every run. This could be included in the final test metrics:
test set metrics: loss: 0.12, accuracy: 0.8, random_accuracy: 0.5

(Input from Prof. Rost at the lab meeting)
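
One way such a baseline could be computed (an assumption, not the planned design): predict classes uniformly at random and report the resulting accuracy next to the model's.

    import torch
    from sklearn.metrics import accuracy_score

    y_test = torch.randint(0, 2, (500,))               # illustrative binary targets
    random_predictions = torch.randint(0, 2, (len(y_test),))
    print(f"random_accuracy: {accuracy_score(y_test, random_predictions):.2f}")  # ~0.5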

BatchNorm1D does not work with batches of size 1

The LightAttention model used for the residues_to_class protocol uses BatchNorm1D. However, using a batch size of 1 is not possible with BatchNorm1D. Because a batch size of 1 is an edge case anyway, we decided not to fix this problem soon. See also #34.
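
A small demonstration of the limitation (standard PyTorch behavior): in training mode, BatchNorm1d needs more than one value per channel to compute batch statistics.

    import torch
    import torch.nn as nn

    bn = nn.BatchNorm1d(num_features=8)
    bn.train()
    try:
        bn(torch.randn(1, 8))   # batch size 1
    except ValueError as e:
        print(e)                # "Expected more than 1 value per channel when training..."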

Please reply to this issue if you encounter the problem; then we might look into a solution for it.

Inconsistency error

$ biotrainer biotrainer.yml 
2022-11-02 11:48:55,920 INFO auto_resume is enabled in the configuration file, but no valid checkpoint was found. Training new model from scratch.
2022-11-02 11:48:55,920 INFO Using Seed: 42
2022-11-02 11:48:56,060 INFO Embeddings file was found at /raid/cdallago/experiments/20221101_bionemo/secstruct/embeddings_file.h5. Embeddings have not been computed.
2022-11-02 11:48:56,061 INFO Loading embeddings from: /raid/cdallago/experiments/20221101_bionemo/secstruct/embeddings_file.h5
2022-11-02 11:48:59,623 INFO Read 10706 entries.
2022-11-02 11:48:59,623 INFO Time elapsed for reading embeddings: 3.6[s]
2022-11-02 11:48:59,623 INFO Number of features: 768
2022-11-02 11:49:02,571 WARNING Found 734 label(s) without a corresponding entry in the embeddings file! Because ignore_redundant_sequences flag is set, these labels are dropped for training. Data loss: 0.06856%
Traceback (most recent call last):
  File "/home/cdallago/miniconda3/envs/python3.8/bin/biotrainer", line 5, in <module>
    main()
  File "/home/cdallago/git/biotrainer/biotrainer/utilities/cli.py", line 24, in main
    parse_config_file_and_execute_run(arguments.config_path[0])
  File "/home/cdallago/git/biotrainer/biotrainer/utilities/executer.py", line 71, in parse_config_file_and_execute_run
    out_config = training_and_evaluation_routine(output_dir=str(output_dir), log_dir=str(log_dir), **config)
  File "/home/cdallago/git/biotrainer/biotrainer/trainers/trainer.py", line 81, in training_and_evaluation_routine
    train_dataset, val_dataset, test_dataset = target_manager.get_datasets(id2emb)
  File "/home/cdallago/git/biotrainer/biotrainer/trainers/TargetManager.py", line 196, in get_datasets
    self._validate_targets(id2emb)
  File "/home/cdallago/git/biotrainer/biotrainer/trainers/TargetManager.py", line 173, in _validate_targets
    id2emb.pop(seq_id)  # Remove redundant labels
KeyError: '3ite-A'

Check Residue2Class accuracy calculation

Maybe I simply do not get the full logic here, but I would double-check whether this actually does what you want.
My current fear would be that the non-normalized (normalize=False) accuracy gives you absolute counts of all correctly predicted residues. However, masked regions are most likely not predicted correctly (put differently: I did not see that you guarantee that those are predicted correctly). In the second step you then subtract the number of masked residues ("total_to_consider" might be a misleading variable name, as you count masked residues), which might underestimate your true performance.
My suggestion would be to remove padded elements before computing accuracy. This way you can be sure that nothing goes wrong.
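
The suggested remedy, sketched with illustrative labels: drop masked positions before computing accuracy, so no correction term is needed afterwards.

    import torch
    from sklearn.metrics import accuracy_score

    flat_y = torch.tensor([0, 2, 1, -100, 2, -100])    # -100 marks padding/mask
    flat_predicted_classes = torch.tensor([0, 2, 2, 1, 2, 0])

    keep = flat_y != -100
    accuracy = accuracy_score(flat_y[keep], flat_predicted_classes[keep])
    print(accuracy)  # 0.75: 3 of the 4 resolved residues are predicted correctly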

Originally posted by @mheinzinger in #2 (comment)

Create better logging output of evaluation

As @joaquimgomez noted, it is quite unclear to a user where the evaluation of the test set is stored and what the results were. So it seems like a good idea to add a little more extensive logging about that after the evaluation has taken place.

Embeddings do not get re-computed automatically

Currently, if a user has already run biotrainer and pre-computed embeddings therefore exist, these embeddings do not get re-computed when the sequences.fasta file has changed. This might be a problem if users add new sequences or modify existing ones after their first run. It should not be a problem that occurs regularly, and fixing it will likely require some shifts in the pipeline. It should, however, be kept in mind for future improvements.

Support more visualization platforms

Many machine learning researchers are using different platforms to visualize their parameters and model output and training. At the moment, we are only supporting tensorboard. It would be possible to add support for more platforms if needed or requested.

List of suggestions:

Improve embeddings saving strategy

#96 introduced cached saving of embeddings to the hard disk during embeddings calculation to reduce RAM usage. At the moment, saving to disk happens at a fixed interval, after every 100 calculated embeddings. This creates some overhead and might even fail for very long sequences in the first batches. Ideally, the point at which to save would be set dynamically, based on available RAM and sequence lengths.

[ppi] Interaction mode not compatible with all protocols yet

The ppi interaction mode is not yet compatible with all protocols. sequence_to_class has been tested thoroughly, and other per-sequence protocols should work as well. However, for per-residue tasks (residue_to_class), changes have to be made to the architecture and to the handling of embeddings for interactions. For residues_to_class, it needs to be tested how the interaction mode can be applied to the LightAttention architecture.

Tasks:

  • residue_to_class (Implementing/Testing)
  • residues_to_class (Implementing/Testing)

Implement embeddings calculation "on the fly"

Currently, embeddings must be pre-calculated before the training process starts. For some users, it might be beneficial to calculate the embeddings on the fly, especially if they only have limited GPU capacity available.

TODOs:

  • Identify a relevant use-case (@sacdallago ?)
  • Add option to calculate embeddings on the fly in config file
  • Modify dataset class accordingly
  • Modify trainer pipeline accordingly

Do a sanity check of the inferencer module

Predictions for a secondary structure model (dataset) should approximately match those from the ProtTrans paper.

This could also be used to create a new test for the inferencer module with pre-calculated embeddings.

Another option would be to add a complete end-to-end test, using the same configuration as for the secondary structure prediction, but with one_hot_encoding embeddings to avoid performance issues.

Class weights need to be sent to the device when use_class_weights=True

When setting use_class_weights to true, an exception like the one below occurs.

Execution example:

2022-12-14 21:02:44,019 INFO Creating output dir: /mnt/home/gomez/gomez_bio_embeddings/protein_structure_lm/biotrainer_configs/output
2022-12-14 21:02:44,036 INFO Creating log-directory: /mnt/home/gomez/gomez_bio_embeddings/protein_structure_lm/biotrainer_configs/output/CNN/custom_embeddings
2022-12-14 21:02:44,040 INFO auto_resume is enabled in the configuration file, but no valid checkpoint was found. Training new model from scratch.
2022-12-14 21:02:44,040 INFO Using Seed: 42
2022-12-14 21:02:44,136 INFO Embeddings file was found at /mnt/home/gomez/gomez_bio_embeddings/protein_structure_lm/3di_bert_tiny_absolute_batch32_lr-3_bind.h5. Embeddings have not been computed.
2022-12-14 21:02:44,136 INFO Loading embeddings from: /mnt/home/gomez/gomez_bio_embeddings/protein_structure_lm/3di_bert_tiny_absolute_batch32_lr-3_bind.h5
2022-12-14 21:02:46,663 INFO Read 1284 entries.
2022-12-14 21:02:46,663 INFO Time elapsed for reading embeddings: 2.5[s]
2022-12-14 21:02:46,663 INFO Number of features: 128
2022-12-14 21:02:50,209 INFO Number of classes: 8
2022-12-14 21:02:50,241 INFO Total number of sequences/residues: 224464
2022-12-14 21:02:50,241 INFO Individual class counts and weights:
2022-12-14 21:02:50,241 INFO 	0 : 205044 (0.137)
2022-12-14 21:02:50,241 INFO 	2 : 2568 (10.926)
2022-12-14 21:02:50,241 INFO 	3 : 12394 (2.264)
2022-12-14 21:02:50,241 INFO 	6 : 499 (56.228)
2022-12-14 21:02:50,241 INFO 	1 : 3762 (7.458)
2022-12-14 21:02:50,241 INFO 	4 : 93 (301.699)
2022-12-14 21:02:50,241 INFO 	5 : 80 (350.725)
2022-12-14 21:02:50,241 INFO 	7 : 24 (1169.083)
2022-12-14 21:02:59,978 WARNING /mnt/lsf-nas-1/os-shared/anaconda3/envs/gomez/lib/python3.8/site-packages/torchmetrics/utilities/prints.py:36: UserWarning: Metric `SpearmanCorrcoef` will save all targets and predictions in the buffer. For large datasets, this may lead to large memory footprint.
  warnings.warn(*args, **kwargs)

Traceback (most recent call last):
  File "./autoeval/biotrainer/run-biotrainer.py", line 6, in <module>
    biotrainer_main(sys.argv[1:])
  File "/mnt/project/bio_embeddings/runs/gomez/protein_structure_lm/autoeval/autoeval/biotrainer/biotrainer/utilities/cli.py", line 24, in main
    parse_config_file_and_execute_run(arguments.config_path[0])
  File "/mnt/project/bio_embeddings/runs/gomez/protein_structure_lm/autoeval/autoeval/biotrainer/biotrainer/utilities/executer.py", line 71, in parse_config_file_and_execute_run
    out_config = training_and_evaluation_routine(output_dir=str(output_dir), log_dir=str(log_dir), **config)
  File "/mnt/project/bio_embeddings/runs/gomez/protein_structure_lm/autoeval/autoeval/biotrainer/biotrainer/trainers/trainer.py", line 159, in training_and_evaluation_routine
    _do_and_log_training(solver, train_loader, val_loader)
  File "/mnt/project/bio_embeddings/runs/gomez/protein_structure_lm/autoeval/autoeval/biotrainer/biotrainer/trainers/trainer.py", line 169, in _do_and_log_training
    _ = solver.train(train_loader, val_loader)
  File "/mnt/project/bio_embeddings/runs/gomez/protein_structure_lm/autoeval/autoeval/biotrainer/biotrainer/solvers/Solver.py", line 67, in train
    iteration_result = self._training_iteration(
  File "/mnt/project/bio_embeddings/runs/gomez/protein_structure_lm/autoeval/autoeval/biotrainer/biotrainer/solvers/ResidueClassificationSolver.py", line 44, in _training_iteration
    result_dict = super()._training_iteration(x, y, step, context, lengths)
  File "/mnt/project/bio_embeddings/runs/gomez/protein_structure_lm/autoeval/autoeval/biotrainer/biotrainer/solvers/Solver.py", line 226, in _training_iteration
    loss = self.loss_function(prediction, y)
  File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/gomez/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/gomez/lib/python3.8/site-packages/torch/nn/modules/loss.py", line 1120, in forward
    return F.cross_entropy(input, target, weight=self.weight,
  File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/gomez/lib/python3.8/site-packages/torch/nn/functional.py", line 2824, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking arugment for argument weight in method wrapper_nll_loss2d_forward)
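
A minimal sketch of the fix (variable names are illustrative; the weights are taken from the log above): move the weight tensor to the training device before constructing the loss function.

    import torch
    import torch.nn as nn

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    class_weights = torch.tensor([0.137, 7.458, 10.926, 2.264,
                                  301.699, 350.725, 56.228, 1169.083])
    loss_function = nn.CrossEntropyLoss(weight=class_weights.to(device))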

Update dependencies

The project dependencies should be updated. This might include an upgrade to PyTorch 2.0 and the corresponding compile() method.
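
For reference, the PyTorch 2.0 compile() usage mentioned above looks like this (the model definition is illustrative):

    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(1024, 256), nn.ReLU(), nn.Linear(256, 8))
    compiled_model = torch.compile(model)   # same call interface, potentially faster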

Multi-class prediction

Is there a plan to include multi-class prediction in some way? It is not urgent, because only one split in FLIP (from the sub-cellular location dataset) has a multi-class target, but it is something to take into account for the future.

Adding BERT model and protocol = transformer encoder model + masked language modeling (MLM)

This is a very worthwhile effort. Are you considering adding the BERT transformer encoder model and the associated masked language modeling task for pre-training?

The task is actually the same as for the ResidueClassificationSolver, but it would only accept one sequence file (the output) and generate the randomly masked input on the fly. This could be done by a special type of Dataset; that's how fairseq implements this: https://github.com/facebookresearch/fairseq/blob/main/fairseq/data/mask_tokens_dataset.py

One issue I realized, though, is that the data might not fit into memory, so you would need to rewrite some of the logic. But at least for fine-tuning existing language models (which might be the main use case), it would work even in memory.
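
A rough sketch of such a dataset (inspired by fairseq's mask_tokens_dataset; token ids and the mask probability are illustrative):

    import torch
    from torch.utils.data import Dataset

    class MaskedSequenceDataset(Dataset):
        def __init__(self, sequences, mask_token_id=0, mask_prob=0.15):
            self.sequences = sequences          # list of 1D LongTensors of token ids
            self.mask_token_id = mask_token_id
            self.mask_prob = mask_prob

        def __len__(self):
            return len(self.sequences)

        def __getitem__(self, idx):
            target = self.sequences[idx]
            mask = torch.rand(len(target)) < self.mask_prob
            masked_input = target.clone()
            masked_input[mask] = self.mask_token_id
            labels = target.clone()
            labels[~mask] = -100                # loss only on masked positions
            return masked_input, labels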

TargetManager does not check if residues and targets have same length (for residue_to_class)

Currently, the TargetManager only reads the sequences and labels, so we end up with a number of residues and a number of labels. If their counts do not match (e.g. SEQSEQ vs. 11101), no error is thrown in the TargetManager class. The error is then, of course, thrown in the loss calculation, where the prediction and y dimensions have to match.

Of course, one could argue that the user is responsible for handing over correct datasets, but it would be nice to throw an appropriate error message for better comprehension, as sketched below.
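
A hedged sketch of the missing check (names mirror the description above but are hypothetical): fail early with an explicit message instead of failing in the loss calculation.

    def validate_target_lengths(id2sequence: dict, id2target: dict):
        for seq_id, sequence in id2sequence.items():
            target = id2target[seq_id]
            if len(sequence) != len(target):
                raise ValueError(
                    f"Sequence {seq_id} has {len(sequence)} residues "
                    f"but {len(target)} labels (e.g. SEQSEQ vs. 11101)."
                )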

Look into AI Code suggestions

Suggestions from Codium AI for the version 0.9.0 PR.

PR Code Suggestions ✨

Security
Improve security by using yaml.safe_load() instead of yaml.load()

Replace the use of yaml.load() with yaml.safe_load() to avoid potential security risks
associated with loading arbitrary YAML content.

biotrainer/inference/inferencer.py [86]

-output_vars = yaml.load(tmp_output_file, Loader=yaml.RoundTripLoader)
+output_vars = yaml.safe_load(tmp_output_file)
 
Suggestion importance[1-10]: 10

Why: This suggestion addresses a significant security concern by replacing yaml.load() with yaml.safe_load(), which is a safer method for loading YAML content.

Possible bug
Correct the superclass initialization in the DeeperFNN class

In the DeeperFNN class, the superclass initialization is incorrectly calling super(FNN,
self).__init__() which should be super(DeeperFNN, self).__init__(). This ensures that the
DeeperFNN class correctly initializes its superclass.

biotrainer/models/fnn.py [38]

-super(FNN, self).__init__()
+super(DeeperFNN, self).__init__()
 
Suggestion importance[1-10]: 10

Why: This suggestion addresses a critical bug in the superclass initialization of the DeeperFNN class, ensuring proper inheritance and functionality.

Performance
Optimize the removal of special tokens by using np.where()

Use list comprehension directly in the np.delete() function to improve code readability
and efficiency.

biotrainer/embedders/huggingface_transformer_embedder.py [62-63]

 special_tokens_mask = self._tokenizer.get_special_tokens_mask(input_id, already_has_special_tokens=True)
-embedding = np.delete(embeddings[seq_num],
-                      [index for index, mask in enumerate(special_tokens_mask) if mask != 0], axis=0)
+embedding = np.delete(embeddings[seq_num], np.where(special_tokens_mask)[0], axis=0)
 
Suggestion importance[1-10]: 9

Why: Using np.where() improves both readability and performance. This change makes the code more efficient and easier to understand.

Use torch.no_grad() to optimize inference performance

Consider using torch.no_grad() context manager during inference to disable gradient
computation, which can reduce memory consumption and increase computation speed.

biotrainer/inference/inferencer.py [232]

-inference_dict = solver.inference(dataloader, calculate_test_metrics=targets is not None)
+with torch.no_grad():
+    inference_dict = solver.inference(dataloader, calculate_test_metrics=targets is not None)
 
Suggestion importance[1-10]: 9

Why: This suggestion correctly identifies a performance optimization by using torch.no_grad() during inference, which can reduce memory consumption and increase computation speed.

Robustness
Add exception handling for model state loading to manage file-related errors

Implement exception handling for the torch.load function to manage potential errors during
the loading of model states, such as file not found or corrupted files.

biotrainer/solvers/solver.py [251]

-state = torch.load(checkpoint_path, map_location=torch.device(self.device))
+try:
+    state = torch.load(checkpoint_path, map_location=torch.device(self.device))
+except FileNotFoundError:
+    logger.error("Checkpoint file not found.")
+    return
+except Exception as e:
+    logger.error(f"Failed to load checkpoint: {str(e)}")
+    return
 
Suggestion importance[1-10]: 9

Why: Adding exception handling for torch.load is a good practice to manage potential errors such as file not found or corrupted files. This enhances the robustness of the code.

Best practice
Improve type checking by using isinstance()

Replace the direct type checks with isinstance() for better type checking, especially when
dealing with inheritance.

biotrainer/config/config_option.py [67-68]

-return ("range" in str(self.value) or type(self.value) is list or
-        (type(self.value) is str and "[" in self.value and "]" in self.value))
+return ("range" in str(self.value) or isinstance(self.value, list) or
+        (isinstance(self.value, str) and "[" in self.value and "]" in self.value))
 
Suggestion importance[1-10]: 8

Why: Using isinstance() is a best practice for type checking, especially when dealing with inheritance. This change improves code robustness and readability.

Use more specific exception types for clearer error handling

Use a more specific exception type than the general Exception to provide clearer error
handling.

biotrainer/config/configurator.py [81-82]

-except Exception as e:
-    raise Exception(f"Loading {embedder_name} automatically and as {tokenizer_class.__class__.__name__} failed!"
-                    f" Please provide a custom_embedder script for your use-case.") from e
+except ImportError as e:
+    raise ImportError(f"Loading {embedder_name} automatically and as {tokenizer_class.__class__.__name__} failed!"
+                      f" Please provide a custom_embedder script for your use-case.") from e
 
Suggestion importance[1-10]: 7

Why: Using a more specific exception type like ImportError provides clearer error handling and makes the code easier to debug. However, the improvement is minor and context-specific.

Enhancement
Enhance error messages for clarity and debugging

Replace the manual exception raising for unknown split_name with a more informative error
message that includes the available splits.

biotrainer/inference/inferencer.py [191]

-raise Exception(f"Unknown split_name {split_name} for given configuration!")
+if split_name not in self.solvers_and_loaders_by_split:
+    available_splits = ', '.join(self.solvers_and_loaders_by_split.keys())
+    raise ValueError(f"Unknown split_name '{split_name}'. Available splits are: {available_splits}")
 
Suggestion importance[1-10]: 8

Why: The suggestion improves the clarity of error messages by including available split names, which aids in debugging and provides more informative feedback to the user.

Apply dropout consistently to both feature and attention convolutions in the LightAttention class

In the LightAttention class, the dropout operation is applied only to the output of
feature_convolution but not to attention_convolution. Consistently applying dropout to
both could potentially improve model performance by regularizing both features and
attention mechanisms.

biotrainer/models/light_attention.py [47]

 o = self.dropout(o)
+attention = self.dropout(attention)
 
Suggestion importance[1-10]: 8

Why: This suggestion potentially improves model performance by regularizing both features and attention mechanisms, making it a valuable enhancement.

Enhance the _early_stop method by logging the reason for stopping

Modify the _early_stop method to log the reason for stopping, which could be due to
achieving a new minimum loss or reaching the patience limit. This enhances debugging and
monitoring capabilities.

biotrainer/solvers/solver.py [299]

 if self._stop_count == 0:
+    logger.info("Early stopping due to patience limit reached.")
 
Suggestion importance[1-10]: 8

Why: Logging the reason for early stopping enhances debugging and monitoring capabilities, making it easier to understand why the training was stopped. This is a useful enhancement for tracking the training process.

Improve variable naming for clarity in the FNN class's forward method

In the FNN class, consider using a more descriptive variable name for the input tensor x
in the forward method. Renaming x to input_tensor would improve code readability and make
the method's purpose clearer.

biotrainer/models/fnn.py [20]

-def forward(self, x):
+def forward(self, input_tensor):
 
Suggestion importance[1-10]: 5

Why: While this suggestion enhances code readability, it is a minor improvement and does not address any functional issues.

Maintainability
Simplify dictionary creation from iterable using list comprehension

Use list comprehension to simplify the creation of embeddings_dict from embeddings when it
is not a dictionary.

biotrainer/inference/inferencer.py [278]

-embeddings_dict = {str(idx): embedding for idx, embedding in enumerate(embeddings)}
+embeddings_dict = dict(enumerate(embeddings)) if not isinstance(embeddings, Dict) else embeddings
 
Suggestion importance[1-10]: 7

Why: This suggestion simplifies the code for creating embeddings_dict from an iterable, improving code readability and maintainability. However, the improvement is minor.

Refactor the mask calculation into a separate method in the LightAttention class

The mask calculation in the forward method of the LightAttention class should be moved to
a separate method to improve code readability and maintainability. This change will make
the forward method cleaner and focus primarily on the forward pass logic.

biotrainer/models/light_attention.py [43]

-mask = x.sum(dim=-1) != utils.SEQUENCE_PAD_VALUE
+mask = self.calculate_mask(x)
 
+def calculate_mask(self, x):
+    return x.sum(dim=-1) != utils.SEQUENCE_PAD_VALUE
+
Suggestion importance[1-10]: 7

Why: This suggestion improves code readability and maintainability by separating concerns, but it does not address a critical issue.

Refactor to separate training and validation into distinct methods for better modularity

Refactor the train method to separate the training and validation phases into their own
methods. This improves code readability and maintainability by modularizing the training
process.

biotrainer/solvers/solver.py [66]

 for epoch in range(self.start_epoch, self.number_of_epochs):
+    self._train_epoch(training_dataloader, epoch)
+    self._validate_epoch(validation_dataloader, epoch)
 
Suggestion importance[1-10]: 7

Why: Refactoring the train method to separate training and validation phases improves code readability and maintainability. However, this suggestion requires additional implementation details for the new methods, which are not provided.

Simplify dictionary initialization using comprehension

Use dictionary comprehension to simplify the initialization of __DATASETS and
__COLLATE_FUNCTIONS.

biotrainer/datasets/__init__.py [9-22]

-__DATASETS = {
-    Protocol.residue_to_class: ResidueEmbeddingsClassificationDataset,
-    Protocol.residues_to_class: ResidueEmbeddingsClassificationDataset,
-    Protocol.residues_to_value: ResidueEmbeddingsRegressionDataset,
-    Protocol.sequence_to_class: SequenceEmbeddingsClassificationDataset,
-    Protocol.sequence_to_value: SequenceEmbeddingsRegressionDataset,
-}
-__COLLATE_FUNCTIONS = {
-    Protocol.residue_to_class: pad_residue_embeddings,
-    Protocol.residues_to_class: pad_residues_embeddings,
-    Protocol.residues_to_value: pad_residues_embeddings,
-    Protocol.sequence_to_class: pad_sequence_embeddings,
-    Protocol.sequence_to_value: pad_sequence_embeddings,
-}
+__DATASETS = {protocol: dataset for protocol, dataset in zip(
+    [Protocol.residue_to_class, Protocol.residues_to_class, Protocol.residues_to_value,
+     Protocol.sequence_to_class, Protocol.sequence_to_value],
+    [ResidueEmbeddingsClassificationDataset, ResidueEmbeddingsClassificationDataset, ResidueEmbeddingsRegressionDataset,
+     SequenceEmbeddingsClassificationDataset, SequenceEmbeddingsRegressionDataset])}
+__COLLATE_FUNCTIONS = {protocol: function for protocol, function in zip(
+    [Protocol.residue_to_class, Protocol.residues_to_class, Protocol.residues_to_value,
+     Protocol.sequence_to_class, Protocol.sequence_to_value],
+    [pad_residue_embeddings, pad_residues_embeddings, pad_residues_embeddings,
+     pad_sequence_embeddings, pad_sequence_embeddings])}
 
Suggestion importance[1-10]: 6

Why: While dictionary comprehension can make the code more concise, it may also reduce readability for some developers. The improvement is more about code style and maintainability.


Originally posted by @CodiumAI-Agent in #92 (comment)

MSE Loss is not capable of handling ignored targets

The PyTorch nn.MSELoss is not able to handle ignored indices (as nn.CrossEntropyLoss can via ignore_index).

A solution provided by @mheinzinger would be as follows:

You could easily implement this by yourself by passing "reduce=None" when initializing your loss function. As a result it will return a loss for each residue in your input instead of a single scalar. Then you can manually multiply the masked elements by 0 and average. When averaging, divide only by the number of non-masked residues.

This must be kept in mind when implementing the residue(s)_to_value protocols in the future; see the sketch below.
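
A sketch of the suggested masked MSE (note that current PyTorch spells the flag reduction="none"; the mask convention is illustrative):

    import torch
    import torch.nn as nn

    loss_fn = nn.MSELoss(reduction="none")     # per-element losses instead of a scalar

    def masked_mse(prediction: torch.Tensor, target: torch.Tensor, mask: torch.Tensor):
        """mask: 1.0 for residues to score, 0.0 for ignored/padded residues."""
        per_element = loss_fn(prediction, target) * mask
        return per_element.sum() / mask.sum()  # average over non-masked residues only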
