ligand-classification's People

Contributors

annprzy, dabrze, jkarolczak, konradszewczyk, reachfall, wtaisner


ligand-classification's Issues

implement PCA transform (and tests!)

a wise man once said:

As an interesting test, you can take two blobs of the same molecule and check whether their difference is smaller after standardizing both (assert diff(b1, b2) > diff(rotated_b1, rotated_b2)),
where diff is probably np.sum(b1 - b2)

as well as

We did it by hand, because we computed the covariance matrix by hand, but you can probably use some PCA or SVD and it should come out the same

eigenvalues, eigenvectors = np.linalg.eig(covariance)
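
A minimal sketch of such a transform, assuming blobs are dense 3D numpy arrays; the function name and the re-use of voxel densities as weights are illustrative choices, not the project's actual API:

import numpy as np

def pca_rotate(blob: np.ndarray) -> np.ndarray:
    """Express the blob's non-zero voxels in their principal-axes frame."""
    coords = np.argwhere(blob > 0).astype(float)        # N x 3 voxel coordinates
    weights = blob[blob > 0]                            # densities used as weights
    center = np.average(coords, axis=0, weights=weights)
    centered = coords - center
    covariance = np.cov(centered.T, aweights=weights)   # 3 x 3 covariance matrix
    eigenvalues, eigenvectors = np.linalg.eig(covariance)
    order = np.argsort(eigenvalues)[::-1]               # principal axes, largest variance first
    return centered @ eigenvectors[:, order]            # rotated (standardized) coordinates

The test suggested above would then re-voxelize the rotated coordinates of two blobs of the same molecule and assert that their difference shrinks after standardization.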

Log enhancements

  • model name (python type) in eval
  • file name of model in eval
  • dataset name in eval
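
A minimal sketch of where these fields could be attached, assuming the evaluation script has a neptune-style run object that supports item assignment (the field names are illustrative):

def log_eval_metadata(run, model, model_path: str, dataset_name: str) -> None:
    """Attach model and dataset metadata to an evaluation run."""
    run['eval/model_name'] = type(model).__name__    # python type of the model
    run['eval/model_file'] = model_path              # file the weights were loaded from
    run['eval/dataset'] = dataset_name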

Generate blob surface dataset

Hello folks,
we have to generate a dataset that consists of blob surfaces instead of whole blobs. To do so I've created the branch deploy-vm. Please, let's generate the new dataset together. It's possible to do it in batches: just check out the mentioned branch and specify the start and end values in cfg/generate_dataset.yaml according to the list below (a sketch of the relevant config keys follows the list). If you are going to process a batch, mark it on the list (so we don't duplicate the work). You can upload the generated batches online or bring them on a hard drive and I will copy them at the uni.

You can track the progress in the file log.txt

Batches:

  • - start: 0, end: 50000
  • - start: 50000, end: 100000
  • - start: 100000, end: 150000
  • - start: 150000, end: 200000
  • - start: 200000, end: 250000
  • - start: 250000, end: 300000
  • - start: 300000, end: 350000
  • - start: 350000, end: 400000
  • - start: 400000, end: 450000
  • - start: 450000, end: 500000
  • - start: 500000, end: 550000
  • - start: 550000, end: 600000
  • - start: 600000, end: 650000
  • - start: 650000, end: -1
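
For reference, the relevant fragment of cfg/generate_dataset.yaml for the first batch could look roughly like this (a sketch; only the start/end values come from this issue, and any other keys in the file stay as they are):

start: 0
end: 50000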

Implement label encoding

For now, labels are represented by strings. They are supposed to be one-hot vectors of integers indicating the target class with a 1 (and 0s for all other classes).
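
A minimal sketch of such an encoding, assuming the full list of string labels is known up front (plain numpy; scikit-learn's LabelBinarizer would be an alternative):

import numpy as np

def one_hot_encode(labels: list) -> np.ndarray:
    """Encode string labels as one-hot integer vectors."""
    classes = sorted(set(labels))                          # fixed ordering of target classes
    index = {cls: i for i, cls in enumerate(classes)}
    encoded = np.zeros((len(labels), len(classes)), dtype=np.int64)
    for row, label in enumerate(labels):
        encoded[row, index[label]] = 1                     # 1 for the target class, 0 elsewhere
    return encoded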

Model serialization

Commit: 2ab52fd

import os

import torch
from datetime import datetime

def save_state_dict(
    model: torch.nn.Module,
    directory: str,
    epoch: int
) -> None:
    """
    Serialize model weigts.
    :param model: torch.nn.Module to save its weights
    :param directory: path to the directory to log experiment results
    :param epoch: int describing epoch
    """
    models_path = os.path.join(directory, 'models')
    os.makedirs(models_path, exist_ok=True)
    time = str(datetime.now()).replace(' ', '-')
    file_name = f'{time}-epoch-{epoch}.pt'
    file_path = os.path.join(models_path, file_name)
    torch.save(model.state_dict(), file_path)
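
A hypothetical call, assuming experiment results are logged to a directory such as 'logs/run-1' (the path is illustrative):

save_state_dict(model, directory='logs/run-1', epoch=3)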

Missing dependencies

The following packages are missing in the container:

  • addict
  • yapf
  • neptune-client
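
Assuming the image has pip available, a quick fix is a single install (preferably added to the image's requirements instead of being run ad hoc):

pip install addict yapf neptune-client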

Verify why our code is not working

Verify why the code below is not working:

import os

import numpy as np
import pandas as pd

import torch
from torch import nn
from torch import tensor
from torch.utils.data import Dataset

import MinkowskiEngine as ME

### dataset and dataloader ###

class LigandDataset(Dataset):
    """A class to represent a ligands dataset."""
    def __init__(
        self, 
        annotations_file_path: str,
        labels_file_path: str = None
        ):
        """
        :param annotations_file_path: path to the directory containing directory
        'blobs_full' (which contains .npz files)
        :param labels_file_path: string with path to the file containing csv definition 
        of the dataset, default '{annotations_file_path}/cmb_blob_labels.csv', this 
        file has to contain columns 'ligands' and 'blob_map_file'
        """
        self.annotations_file_path = annotations_file_path
        if labels_file_path is None: 
            labels_file_path = os.path.join(self.annotations_file_path, 'cmb_blob_labels.csv')
        
        file_ligand_map = pd.read_csv(
            labels_file_path,
            usecols=['ligand', 'blob_map_filename']
        )
        self.file_ligand_map = file_ligand_map.set_index('blob_map_filename').to_dict()['ligand']
        self.files = list(self.file_ligand_map.keys())
        self.labels = list(self.file_ligand_map.values())

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        idx = self.files[idx]
        blob_path = os.path.join(self.annotations_file_path, 'blobs_full', idx)
        blob = np.load(blob_path)['blob']
        label = tensor(np.sum(blob), dtype=torch.float32)
        blob = tensor(blob, dtype=torch.float32)
        return (blob, label)

class DataLoader:
    """A class to represent simple dataloader which doesn't perform batching."""
    def __init__(self, dataset: Dataset):
        """
        :param dataset: dataset to be loaded
        """
        self.iter = iter(dataset)

    def __iter__(self):
        return self.iter

### to minkowski tensor ###

def to_minkowski_tensor(
    batch: torch.tensor
) -> ME.SparseTensor: 
    """
    Converts torch tensor containing blob or batch of blobs into MinkowskiEngine sparse tensor.
    :param batch: torch tensor
    :return: MinkowskiEngine sparse tensor
    """
    coordinates = torch.nonzero(batch).int()
    features = []
    for idx in coordinates:
        features.append(batch[tuple(idx)])
    features = torch.tensor(features).unsqueeze(-1)
    coordinates, features = ME.utils.sparse_collate([coordinates], [features])
    coordinates, features = ME.utils.sparse_quantize(coordinates, features)
    return ME.SparseTensor(features=features, coordinates=coordinates)

### NN ###

class NN(nn.Module):
    def __init__(
        self
    ):
        nn.Module.__init__(self)
        self.linear = nn.Linear(1, 1)

    def forward(
        self, 
        x: tensor
    ):
        x = x.F.sum()
        x = self.linear(x.unsqueeze(0))
        return x


class MinkowskiNN(ME.MinkowskiNetwork):
    def __init__(
        self
    ):
        ME.MinkowskiNetwork.__init__(self, 3)

        self.linear = ME.MinkowskiLinear(
            in_features=1, 
            out_features=1,
            bias=False
        )
        self.pool = ME.MinkowskiGlobalSumPooling()

    def forward(self, x: ME.SparseTensor):
        x = self.linear(x)
        x = self.pool(x)
        return x.F.squeeze(0).squeeze(0)


### trainer ###

model = MinkowskiNN()
dataset = LigandDataset('data')
dataloader = DataLoader(dataset)
criterion = torch.nn.L1Loss()
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-3
)

for idx, (blob, y) in enumerate(dataloader):
    if idx >= 10000:
        break
        
    optimizer.zero_grad()
    blob = to_minkowski_tensor(blob)
    y_hat = model(blob) # forward pass
    loss = criterion(y, y_hat)
    loss.backward()
    optimizer.step()

    if not idx % 100:
        #print([param for param in model.parameters()]) 
        print(f'iteration:{idx:>8}', f'loss: {loss.item():.4f}', f'groundtruth: {y.item():6.4f}', f'prediction: {y_hat.item():.4f}')
    
    if not idx % 500:
        print([param for param in model.parameters()]) 

print([param for param in model.parameters()])

Prepare docker container

Prepare docker container for deployment.

Python dependencies:

  • sklearn
  • pandas
  • torch
  • torchmetrics
  • MinkowskiEngine
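
A minimal Dockerfile sketch, assuming a CUDA-enabled PyTorch base image is acceptable; the base image tag and the MinkowskiEngine build details are assumptions and may need adjusting against the MinkowskiEngine docs:

# sketch only: base image tag and build flags are assumptions
FROM pytorch/pytorch:1.9.0-cuda11.1-cudnn8-devel
RUN apt-get update && apt-get install -y git build-essential libopenblas-dev
# torch is provided by the base image
RUN pip install scikit-learn pandas torchmetrics
# MinkowskiEngine is compiled from source against the installed torch
RUN pip install -U git+https://github.com/NVIDIA/MinkowskiEngine --no-deps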

Implement regular data loader

For now our data loader yields batches containing a single blob. It is supposed to yield multiple blobs (e.g. 8, 16, or 32) per batch.
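
A minimal sketch of one possible batching loader, assuming the dataset yields (blob, label) pairs; the class name and the choice to keep blobs in a plain Python list (their spatial shapes may differ) are illustrative:

from typing import Iterator, List, Tuple

import torch
from torch.utils.data import Dataset

class BatchingDataLoader:
    """A simple data loader that groups consecutive samples into fixed-size batches."""
    def __init__(self, dataset: Dataset, batch_size: int = 16):
        self.dataset = dataset
        self.batch_size = batch_size

    def __iter__(self) -> Iterator[Tuple[List[torch.Tensor], torch.Tensor]]:
        blobs, labels = [], []
        for blob, label in self.dataset:
            blobs.append(blob)
            labels.append(label)
            if len(blobs) == self.batch_size:
                yield blobs, torch.stack(labels)   # blobs stay a list: shapes differ
                blobs, labels = [], []
        if blobs:                                  # final, possibly smaller batch
            yield blobs, torch.stack(labels)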

Implement measures computation

  • Top 20 accuracy
  • Top 10 accuracy
  • Top 5 accuracy
  • Macro-averaged recall
  • Cohen's kappa
  • Micro-averaged recall
  • Micro-averaged precision
  • Micro-averaged F1
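
One possible implementation using scikit-learn (listed as a project dependency); the function name and the assumption that predictions arrive as a score matrix of shape (n_samples, n_classes) are mine:

import numpy as np
from sklearn.metrics import (cohen_kappa_score, f1_score, precision_score,
                             recall_score, top_k_accuracy_score)

def compute_measures(y_true: np.ndarray, y_score: np.ndarray) -> dict:
    """Compute the evaluation measures listed above.
    :param y_true: integer class indices, shape (n_samples,)
    :param y_score: predicted class scores, shape (n_samples, n_classes)
    """
    y_pred = y_score.argmax(axis=1)
    labels = np.arange(y_score.shape[1])
    return {
        'top_20_accuracy': top_k_accuracy_score(y_true, y_score, k=20, labels=labels),
        'top_10_accuracy': top_k_accuracy_score(y_true, y_score, k=10, labels=labels),
        'top_5_accuracy': top_k_accuracy_score(y_true, y_score, k=5, labels=labels),
        'macro_recall': recall_score(y_true, y_pred, average='macro', zero_division=0),
        'cohen_kappa': cohen_kappa_score(y_true, y_pred),
        'micro_recall': recall_score(y_true, y_pred, average='micro', zero_division=0),
        'micro_precision': precision_score(y_true, y_pred, average='micro', zero_division=0),
        'micro_f1': f1_score(y_true, y_pred, average='micro', zero_division=0),
    }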

add git workflow to the project

add basic CI/CD functionality to the repository:

  • add tests for the pipeline module (use pytest)
  • find a tool and deploy it, so that after each push to a branch a pipeline running the tests is triggered; should the tests fail, merging should be forbidden (a sketch of one possible setup follows this list)
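
One possible tool is GitHub Actions; a minimal workflow sketch, assuming the tests live under tests/ and a requirements.txt exists (both assumptions):

# .github/workflows/tests.yml (sketch; paths and file names are assumptions)
name: tests
on: [push, pull_request]
jobs:
  pytest:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
        with:
          python-version: '3.8'
      - run: pip install -r requirements.txt pytest
      - run: pytest tests/

Blocking merges on failing checks is then configured through branch protection rules rather than in the workflow itself.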

Generate K-Means dataset

We have to generate a dataset that consists of blobs after a few k-means iterations instead of whole blobs. To do so, check out the branch mentioned in the comment and specify the start and end values in cfg/generate_dataset.yaml according to the list below. If you are going to process a batch, mark it on the list (so we don't duplicate the work).

Batches:

  • - start: 0, end: 50000
  • - start: 50000, end: 100000
  • - start: 100000, end: 150000
  • - start: 150000, end: 200000
  • - start: 200000, end: 250000
  • - start: 250000, end: 300000
  • - start: 300000, end: 350000
  • - start: 350000, end: 400000
  • - start: 400000, end: 450000
  • - start: 450000, end: 500000
  • - start: 500000, end: 550000
  • - start: 550000, end: 600000
  • - start: 600000, end: 650000
  • - start: 650000, end: -1
