ligand-classification's People

Contributors

annprzy, dabrze, jkarolczak, konradszewczyk, reachfall, wtaisner


ligand-classification's Issues

implement PCA transform (and tests!)

a wise man once said:

As an interesting test, you can take two blobs of the same molecule and check whether their difference is smaller after standardizing both (assert diff(b1, b2) > diff(rotated_b1, rotated_b2)),
where diff is probably np.sum(b1 - b2)

as well as

We did it by hand, because we computed the covariance matrix by hand, but you can probably use some PCA or SVD and it should come out the same

eigenvalues, eigenvectors = np.linalg.eig(covariance)
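
A minimal sketch of such a transform, assuming blobs are dense 3D numpy arrays; the function name and the re-use of voxel densities as weights are illustrative choices, not the project's actual API:

import numpy as np

def pca_rotate(blob: np.ndarray) -> np.ndarray:
    """Express the blob's non-zero voxels in their principal-axes frame."""
    coords = np.argwhere(blob > 0).astype(float)        # N x 3 voxel coordinates
    weights = blob[blob > 0]                            # densities used as weights
    center = np.average(coords, axis=0, weights=weights)
    centered = coords - center
    covariance = np.cov(centered.T, aweights=weights)   # 3 x 3 covariance matrix
    eigenvalues, eigenvectors = np.linalg.eig(covariance)
    order = np.argsort(eigenvalues)[::-1]               # principal axes, largest variance first
    return centered @ eigenvectors[:, order]            # rotated (standardized) coordinates

The test suggested above would then re-voxelize the rotated coordinates of two blobs of the same molecule and assert that their difference shrinks after standardization.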

Log enhancements

  • model name (python type) in eval
  • file name of model in eval
  • dataset name in eval
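
A minimal sketch of where these fields could be attached, assuming the evaluation script has a neptune-style run object that supports item assignment (the field names are illustrative):

def log_eval_metadata(run, model, model_path: str, dataset_name: str) -> None:
    """Attach model and dataset metadata to an evaluation run."""
    run['eval/model_name'] = type(model).__name__    # python type of the model
    run['eval/model_file'] = model_path              # file the weights were loaded from
    run['eval/dataset'] = dataset_name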

Generate blob surface dataset

Hello folks,
we have to generate a dataset that consists of blob surfaces instead of whole blobs. To do so I've created the branch deploy-vm. Please, let's generate the new dataset together. It's possible to do it in batches: just check out the mentioned branch and specify the start and end values in cfg/generate_dataset.yaml according to the list below (a sketch of the relevant config keys follows the list). If you are going to process a batch, mark it on the list (so we don't duplicate the work). You can upload the generated batches online or bring them on a hard drive and I will copy them at the uni.

You can track the progress in the file log.txt

Batches:

  • - start: 0, end: 50000
  • - start: 50000, end: 100000
  • - start: 100000, end: 150000
  • - start: 150000, end: 200000
  • - start: 200000, end: 250000
  • - start: 250000, end: 300000
  • - start: 300000, end: 350000
  • - start: 350000, end: 400000
  • - start: 400000, end: 450000
  • - start: 450000, end: 500000
  • - start: 500000, end: 550000
  • - start: 550000, end: 600000
  • - start: 600000, end: 650000
  • - start: 650000, end: -1
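
For reference, the relevant fragment of cfg/generate_dataset.yaml for the first batch could look roughly like this (a sketch; only the start/end values come from this issue, and any other keys in the file stay as they are):

start: 0
end: 50000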

Implement label encoding

For now, labels are represented by strings. They are supposed to be one-hot vectors of integers indicating the target class with a 1 (and 0s for all other classes).
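
A minimal sketch of such an encoding, assuming the full list of string labels is known up front (plain numpy; scikit-learn's LabelBinarizer would be an alternative):

import numpy as np

def one_hot_encode(labels: list) -> np.ndarray:
    """Encode string labels as one-hot integer vectors."""
    classes = sorted(set(labels))                          # fixed ordering of target classes
    index = {cls: i for i, cls in enumerate(classes)}
    encoded = np.zeros((len(labels), len(classes)), dtype=np.int64)
    for row, label in enumerate(labels):
        encoded[row, index[label]] = 1                     # 1 for the target class, 0 elsewhere
    return encoded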

Model serialization

Commit: 2ab52fd

import os

import torch
from datetime import datetime

def save_state_dict(
    model: torch.nn.Module,
    directory: str,
    epoch: int
) -> None:
    """
    Serialize model weigts.
    :param model: torch.nn.Module to save its weights
    :param directory: path to the directory to log experiment results
    :param epoch: int describing epoch
    """
    models_path = os.path.join(directory, 'models')
    os.makedirs(models_path, exist_ok=True)
    time = str(datetime.now()).replace(' ', '-')
    file_name = f'{time}-epoch-{epoch}.pt'
    file_path = os.path.join(models_path, file_name)
    torch.save(model.state_dict(), file_path)
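
A hypothetical call, assuming experiment results are logged to a directory such as 'logs/run-1' (the path is illustrative):

save_state_dict(model, directory='logs/run-1', epoch=3)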

Missing dependencies

The following packages are missing in the container:

  • addict
  • yapf
  • neptune-client
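
Assuming the image has pip available, a quick fix is a single install (preferably added to the image's requirements instead of being run ad hoc):

pip install addict yapf neptune-client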

Verify why our code is not working

Verify why the code below is not working:

import os

import numpy as np
import pandas as pd

import torch
from torch import nn
from torch import tensor
from torch.utils.data import Dataset

import MinkowskiEngine as ME

### dataset and dataloader ###

class LigandDataset(Dataset):
    """A class to represent a ligands dataset."""
    def __init__(
        self, 
        annotations_file_path: str,
        labels_file_path: str = None
        ):
        """
        :param annotations_file_path: path to the directory containing directory
        'blobs_full' (which contains .npz files)
        :param labels_file_path: string with path to the file containing csv definition 
        of the dataset, default '{annotations_file_path}/cmb_blob_labels.csv', this 
        file has to contain columns 'ligands' and 'blob_map_file'
        """
        self.annotations_file_path = annotations_file_path
        if labels_file_path is None: 
            labels_file_path = os.path.join(self.annotations_file_path, 'cmb_blob_labels.csv')
        
        file_ligand_map = pd.read_csv(
            labels_file_path,
            usecols=['ligand', 'blob_map_filename']
        )
        self.file_ligand_map = file_ligand_map.set_index('blob_map_filename').to_dict()['ligand']
        self.files = list(self.file_ligand_map.keys())
        self.labels = list(self.file_ligand_map.values())

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        idx = self.files[idx]
        blob_path = os.path.join(self.annotations_file_path, 'blobs_full', idx)
        blob = np.load(blob_path)['blob']
        label = tensor(np.sum(blob), dtype=torch.float32)
        blob = tensor(blob, dtype=torch.float32)
        return (blob, label)

class DataLoader:
    """A class to represent simple dataloader which doesn't perform batching."""
    def __init__(self, dataset: Dataset):
        """
        :param dataset: dataset to be loaded
        """
        self.iter = iter(dataset)

    def __iter__(self):
        return self.iter

### to minkowski tensor ###

def to_minkowski_tensor(
    batch: torch.tensor
) -> ME.SparseTensor: 
    """
    Converts torch tensor containing blob or batch of blobs into MinkowskiEngine sparse tensor.
    :param batch: torch tensor
    :return: MinkowskiEngine sparse tensor
    """
    coordinates = torch.nonzero(batch).int()
    features = []
    for idx in coordinates:
        features.append(batch[tuple(idx)])
    features = torch.tensor(features).unsqueeze(-1)
    coordinates, features = ME.utils.sparse_collate([coordinates], [features])
    coordinates, features = ME.utils.sparse_quantize(coordinates, features)
    return ME.SparseTensor(features=features, coordinates=coordinates)

### NN ###

class NN(nn.Module):
    def __init__(
        self
    ):
        nn.Module.__init__(self)
        self.linear = nn.Linear(1, 1)

    def forward(
        self, 
        x: tensor
    ):
        x = x.F.sum()
        x = self.linear(x.unsqueeze(0))
        return x


class MinkowskiNN(ME.MinkowskiNetwork):
    def __init__(
        self
    ):
        ME.MinkowskiNetwork.__init__(self, 3)

        self.linear = ME.MinkowskiLinear(
            in_features=1, 
            out_features=1,
            bias=False
        )
        self.pool = ME.MinkowskiGlobalSumPooling()

    def forward(self, x: ME.SparseTensor):
        x = self.linear(x)
        x = self.pool(x)
        return x.F.squeeze(0).squeeze(0)


### trainer ###

model = MinkowskiNN()
dataset = LigandDataset('data')
dataloader = DataLoader(dataset)
criterion = torch.nn.L1Loss()
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-3
)

for idx, (blob, y) in enumerate(dataloader):
    if idx >= 10000:
        break
        
    optimizer.zero_grad()
    blob = to_minkowski_tensor(blob)
    y_hat = model(blob) # forward pass
    loss = criterion(y, y_hat)
    loss.backward()
    optimizer.step()

    if not idx % 100:
        #print([param for param in model.parameters()]) 
        print(f'iteration:{idx:>8}', f'loss: {loss.item():.4f}', f'groundtruth: {y.item():6.4f}', f'prediction: {y_hat.item():.4f}')
    
    if not idx % 500:
        print([param for param in model.parameters()]) 

print([param for param in model.parameters()])

Prepare docker container

Prepare docker container for deployment.

Python dependencies:

  • sklearn
  • pandas
  • torch
  • torchmetrics
  • MinkowskiEngine
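
A minimal Dockerfile sketch, assuming a CUDA-enabled PyTorch base image is acceptable; the base image tag and the MinkowskiEngine build details are assumptions and may need adjusting against the MinkowskiEngine docs:

# sketch only: base image tag and build flags are assumptions
FROM pytorch/pytorch:1.9.0-cuda11.1-cudnn8-devel
RUN apt-get update && apt-get install -y git build-essential libopenblas-dev
# torch is provided by the base image
RUN pip install scikit-learn pandas torchmetrics
# MinkowskiEngine is compiled from source against the installed torch
RUN pip install -U git+https://github.com/NVIDIA/MinkowskiEngine --no-deps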

Implement regular data loader

For now our data loader yields batches containing a single blob. It is supposed to yield multiple blobs (e.g. 8, 16, or 32) per batch.
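
A minimal sketch of one possible batching loader, assuming the dataset yields (blob, label) pairs; the class name and the choice to keep blobs in a plain Python list (their spatial shapes may differ) are illustrative:

from typing import Iterator, List, Tuple

import torch
from torch.utils.data import Dataset

class BatchingDataLoader:
    """A simple data loader that groups consecutive samples into fixed-size batches."""
    def __init__(self, dataset: Dataset, batch_size: int = 16):
        self.dataset = dataset
        self.batch_size = batch_size

    def __iter__(self) -> Iterator[Tuple[List[torch.Tensor], torch.Tensor]]:
        blobs, labels = [], []
        for blob, label in self.dataset:
            blobs.append(blob)
            labels.append(label)
            if len(blobs) == self.batch_size:
                yield blobs, torch.stack(labels)   # blobs stay a list: shapes differ
                blobs, labels = [], []
        if blobs:                                  # final, possibly smaller batch
            yield blobs, torch.stack(labels)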

Implement measures computation

  • Top 20 accuracy
  • Top 10 accuracy
  • Top 5 accuracy
  • Macro-averaged recall
  • Cohen's kappa
  • Micro-averaged recall
  • Micro-averaged precision
  • Micro-averaged F1
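
One possible implementation using scikit-learn (listed as a project dependency); the function name and the assumption that predictions arrive as a score matrix of shape (n_samples, n_classes) are mine:

import numpy as np
from sklearn.metrics import (cohen_kappa_score, f1_score, precision_score,
                             recall_score, top_k_accuracy_score)

def compute_measures(y_true: np.ndarray, y_score: np.ndarray) -> dict:
    """Compute the evaluation measures listed above.
    :param y_true: integer class indices, shape (n_samples,)
    :param y_score: predicted class scores, shape (n_samples, n_classes)
    """
    y_pred = y_score.argmax(axis=1)
    labels = np.arange(y_score.shape[1])
    return {
        'top_20_accuracy': top_k_accuracy_score(y_true, y_score, k=20, labels=labels),
        'top_10_accuracy': top_k_accuracy_score(y_true, y_score, k=10, labels=labels),
        'top_5_accuracy': top_k_accuracy_score(y_true, y_score, k=5, labels=labels),
        'macro_recall': recall_score(y_true, y_pred, average='macro', zero_division=0),
        'cohen_kappa': cohen_kappa_score(y_true, y_pred),
        'micro_recall': recall_score(y_true, y_pred, average='micro', zero_division=0),
        'micro_precision': precision_score(y_true, y_pred, average='micro', zero_division=0),
        'micro_f1': f1_score(y_true, y_pred, average='micro', zero_division=0),
    }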

add git workflow to the project

add basic CI/CD functionality to the repository:

  • add tests for the pipeline module (use pytest)
  • find a tool and deploy it, so that after each push to a branch a pipeline running the tests is triggered; should the tests fail, merging should be forbidden (a sketch of one possible setup follows this list)
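
One possible tool is GitHub Actions; a minimal workflow sketch, assuming the tests live under tests/ and a requirements.txt exists (both assumptions):

# .github/workflows/tests.yml (sketch; paths and file names are assumptions)
name: tests
on: [push, pull_request]
jobs:
  pytest:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
        with:
          python-version: '3.8'
      - run: pip install -r requirements.txt pytest
      - run: pytest tests/

Blocking merges on failing checks is then configured through branch protection rules rather than in the workflow itself.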

Generate K-Means dataset

We have to generate a dataset that consists of blobs after a few k-means iterations instead of whole blobs. To do so, check out the branch mentioned in the comment and specify the start and end values in cfg/generate_dataset.yaml according to the list below. If you are going to process a batch, mark it on the list (so we don't duplicate the work).

Batches:

  • - start: 0, end: 50000
  • - start: 50000, end: 100000
  • - start: 100000, end: 150000
  • - start: 150000, end: 200000
  • - start: 200000, end: 250000
  • - start: 250000, end: 300000
  • - start: 300000, end: 350000
  • - start: 350000, end: 400000
  • - start: 400000, end: 450000
  • - start: 450000, end: 500000
  • - start: 500000, end: 550000
  • - start: 550000, end: 600000
  • - start: 600000, end: 650000
  • - start: 650000, end: -1
