catalyst-team / catalyst Goto Github PK

Accelerated deep learning R&D

License: Apache License 2.0

Python 98.77% Shell 0.99% Makefile 0.09% Dockerfile 0.16%

deep-learning reinforcement-learning machine-learning computer-vision pytorch python distributed-computing infrastructure research reproducibility

catalyst's Introduction

Accelerated Deep Learning R&D

Catalyst is a PyTorch framework for Deep Learning Research and Development. It focuses on reproducibility, rapid experimentation, and codebase reuse so you can create something new rather than write yet another train loop.
Break the cycle – use the Catalyst!

Catalyst at PyTorch Ecosystem Day 2021

Catalyst at PyTorch Developer Day 2021

Getting started

pip install -U catalyst

import os
from torch import nn, optim
from torch.utils.data import DataLoader
from catalyst import dl, utils
from catalyst.contrib.datasets import MNIST

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.02)
loaders = {
    "train": DataLoader(MNIST(os.getcwd(), train=True), batch_size=32),
    "valid": DataLoader(MNIST(os.getcwd(), train=False), batch_size=32),
}

runner = dl.SupervisedRunner(
    input_key="features", output_key="logits", target_key="targets", loss_key="loss"
)

# model training
runner.train(
    model=model,
    criterion=criterion,
    optimizer=optimizer,
    loaders=loaders,
    num_epochs=1,
    callbacks=[
        dl.AccuracyCallback(input_key="logits", target_key="targets", topk=(1, 3, 5)),
        dl.PrecisionRecallF1SupportCallback(input_key="logits", target_key="targets"),
    ],
    logdir="./logs",
    valid_loader="valid",
    valid_metric="loss",
    minimize_valid_metric=True,
    verbose=True,
)

# model evaluation
metrics = runner.evaluate_loader(
    loader=loaders["valid"],
    callbacks=[dl.AccuracyCallback(input_key="logits", target_key="targets", topk=(1, 3, 5))],
)

# model inference
for prediction in runner.predict_loader(loader=loaders["valid"]):
    assert prediction["logits"].detach().cpu().numpy().shape[-1] == 10

# model post-processing
model = runner.model.cpu()
batch = next(iter(loaders["valid"]))[0]
utils.trace_model(model=model, batch=batch)
utils.quantize_model(model=model)
utils.prune_model(model=model, pruning_fn="l1_unstructured", amount=0.8)
utils.onnx_export(model=model, batch=batch, file="./logs/mnist.onnx", verbose=True)

Step-by-step Guide

Start with Catalyst — A PyTorch Framework for Accelerated Deep Learning R&D introduction.
Try notebook tutorials or check minimal examples for first deep dive.
Read blog posts with use-cases and guides.
Learn machine learning with our "Deep Learning with Catalyst" course.
And finally, join our slack if you want to chat with the team and contributors.

Getting started
- Step-by-step Guide
Table of Contents
Overview
Community

Overview

Catalyst helps you implement compact but full-featured Deep Learning pipelines with just a few lines of code. You get a training loop with metrics, early-stopping, model checkpointing, and other features without the boilerplate.

Installation

Generic installation:

pip install -U catalyst

Specialized versions, extra requirements might apply

pip install catalyst[ml]         # installs ML-based Catalyst
pip install catalyst[cv]         # installs CV-based Catalyst
# master version installation
pip install git+https://github.com/catalyst-team/catalyst@master --upgrade
# all available extensions are listed here:
# https://github.com/catalyst-team/catalyst/blob/master/setup.py

Catalyst is compatible with: Python 3.7+. PyTorch 1.4+.
Tested on Ubuntu 16.04/18.04/20.04, macOS 10.15, Windows 10, and Windows Subsystem for Linux.

Documentation

master
22.02
2021 edition
2020 edition
- 20.12
- 20.11
- 20.10
- 20.09
- 20.08.2
- 20.07 (dev blog: 20.07 release)
- 20.06
- 20.05, 20.05.1
- 20.04, 20.04.1, 20.04.2

Minimal Examples

Introduction tutorial "Customizing what happens in train"
Demo with customization examples
Reinforcement Learning with Catalyst
And more

CustomRunner – PyTorch for-loop decomposition

import os
from torch import nn, optim
from torch.nn import functional as F
from torch.utils.data import DataLoader
from catalyst import dl, metrics
from catalyst.contrib.datasets import MNIST

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
optimizer = optim.Adam(model.parameters(), lr=0.02)

train_data = MNIST(os.getcwd(), train=True)
valid_data = MNIST(os.getcwd(), train=False)
loaders = {
    "train": DataLoader(train_data, batch_size=32),
    "valid": DataLoader(valid_data, batch_size=32),
}

class CustomRunner(dl.Runner):
    def predict_batch(self, batch):
        # model inference step
        return self.model(batch[0].to(self.engine.device))

    def on_loader_start(self, runner):
        super().on_loader_start(runner)
        self.meters = {
            key: metrics.AdditiveMetric(compute_on_call=False)
            for key in ["loss", "accuracy01", "accuracy03"]
        }

    def handle_batch(self, batch):
        # model train/valid step
        # unpack the batch
        x, y = batch
        # run model forward pass
        logits = self.model(x)
        # compute the loss
        loss = F.cross_entropy(logits, y)
        # compute the metrics
        accuracy01, accuracy03 = metrics.accuracy(logits, y, topk=(1, 3))
        # log metrics
        self.batch_metrics.update(
            {"loss": loss, "accuracy01": accuracy01, "accuracy03": accuracy03}
        )
        for key in ["loss", "accuracy01", "accuracy03"]:
            self.meters[key].update(self.batch_metrics[key].item(), self.batch_size)
        # run model backward pass
        if self.is_train_loader:
            self.engine.backward(loss)
            self.optimizer.step()
            self.optimizer.zero_grad()

    def on_loader_end(self, runner):
        for key in ["loss", "accuracy01", "accuracy03"]:
            self.loader_metrics[key] = self.meters[key].compute()[0]
        super().on_loader_end(runner)

runner = CustomRunner()
# model training
runner.train(
    model=model,
    optimizer=optimizer,
    loaders=loaders,
    logdir="./logs",
    num_epochs=5,
    verbose=True,
    valid_loader="valid",
    valid_metric="loss",
    minimize_valid_metric=True,
)
# model inference
for logits in runner.predict_loader(loader=loaders["valid"]):
    assert logits.detach().cpu().numpy().shape[-1] == 10

ML - linear regression

import torch
from torch.utils.data import DataLoader, TensorDataset
from catalyst import dl

# data
num_samples, num_features = int(1e4), int(1e1)
X, y = torch.rand(num_samples, num_features), torch.rand(num_samples)
dataset = TensorDataset(X, y)
loader = DataLoader(dataset, batch_size=32, num_workers=1)
loaders = {"train": loader, "valid": loader}

# model, criterion, optimizer, scheduler
model = torch.nn.Linear(num_features, 1)
criterion = torch.nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters())
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, [3, 6])

# model training
runner = dl.SupervisedRunner()
runner.train(
    model=model,
    criterion=criterion,
    optimizer=optimizer,
    scheduler=scheduler,
    loaders=loaders,
    logdir="./logdir",
    valid_loader="valid",
    valid_metric="loss",
    minimize_valid_metric=True,
    num_epochs=8,
    verbose=True,
)

ML - multiclass classification

import torch
from torch.utils.data import DataLoader, TensorDataset
from catalyst import dl

# sample data
num_samples, num_features, num_classes = int(1e4), int(1e1), 4
X = torch.rand(num_samples, num_features)
y = (torch.rand(num_samples,) * num_classes).to(torch.int64)

# pytorch loaders
dataset = TensorDataset(X, y)
loader = DataLoader(dataset, batch_size=32, num_workers=1)
loaders = {"train": loader, "valid": loader}

# model, criterion, optimizer, scheduler
model = torch.nn.Linear(num_features, num_classes)
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters())
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, [2])

# model training
runner = dl.SupervisedRunner(
    input_key="features", output_key="logits", target_key="targets", loss_key="loss"
)
runner.train(
    model=model,
    criterion=criterion,
    optimizer=optimizer,
    scheduler=scheduler,
    loaders=loaders,
    logdir="./logdir",
    num_epochs=3,
    valid_loader="valid",
    valid_metric="accuracy03",
    minimize_valid_metric=False,
    verbose=True,
    callbacks=[
        dl.AccuracyCallback(input_key="logits", target_key="targets", num_classes=num_classes),
        # uncomment for extra metrics:
        # dl.PrecisionRecallF1SupportCallback(
        #     input_key="logits", target_key="targets", num_classes=num_classes
        # ),
        # dl.AUCCallback(input_key="logits", target_key="targets"),
        # catalyst[ml] required ``pip install catalyst[ml]``
        # dl.ConfusionMatrixCallback(
        #     input_key="logits", target_key="targets", num_classes=num_classes
        # ),
    ],
)

ML - multilabel classification

import torch
from torch.utils.data import DataLoader, TensorDataset
from catalyst import dl

# sample data
num_samples, num_features, num_classes = int(1e4), int(1e1), 4
X = torch.rand(num_samples, num_features)
y = (torch.rand(num_samples, num_classes) > 0.5).to(torch.float32)

# pytorch loaders
dataset = TensorDataset(X, y)
loader = DataLoader(dataset, batch_size=32, num_workers=1)
loaders = {"train": loader, "valid": loader}

# model, criterion, optimizer, scheduler
model = torch.nn.Linear(num_features, num_classes)
criterion = torch.nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters())
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, [2])

# model training
runner = dl.SupervisedRunner(
    input_key="features", output_key="logits", target_key="targets", loss_key="loss"
)
runner.train(
    model=model,
    criterion=criterion,
    optimizer=optimizer,
    scheduler=scheduler,
    loaders=loaders,
    logdir="./logdir",
    num_epochs=3,
    valid_loader="valid",
    valid_metric="accuracy01",
    minimize_valid_metric=False,
    verbose=True,
    callbacks=[
        dl.BatchTransformCallback(
            transform=torch.sigmoid,
            scope="on_batch_end",
            input_key="logits",
            output_key="scores"
        ),
        dl.AUCCallback(input_key="scores", target_key="targets"),
        # uncomment for extra metrics:
        # dl.MultilabelAccuracyCallback(input_key="scores", target_key="targets", threshold=0.5),
        # dl.MultilabelPrecisionRecallF1SupportCallback(
        #     input_key="scores", target_key="targets", threshold=0.5
        # ),
    ]
)

ML - multihead classification

import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset
from catalyst import dl

# sample data
num_samples, num_features, num_classes1, num_classes2 = int(1e4), int(1e1), 4, 10
X = torch.rand(num_samples, num_features)
y1 = (torch.rand(num_samples,) * num_classes1).to(torch.int64)
y2 = (torch.rand(num_samples,) * num_classes2).to(torch.int64)

# pytorch loaders
dataset = TensorDataset(X, y1, y2)
loader = DataLoader(dataset, batch_size=32, num_workers=1)
loaders = {"train": loader, "valid": loader}

class CustomModule(nn.Module):
    def __init__(self, in_features: int, out_features1: int, out_features2: int):
        super().__init__()
        self.shared = nn.Linear(in_features, 128)
        self.head1 = nn.Linear(128, out_features1)
        self.head2 = nn.Linear(128, out_features2)

    def forward(self, x):
        x = self.shared(x)
        y1 = self.head1(x)
        y2 = self.head2(x)
        return y1, y2

# model, criterion, optimizer, scheduler
model = CustomModule(num_features, num_classes1, num_classes2)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters())
scheduler = optim.lr_scheduler.MultiStepLR(optimizer, [2])

class CustomRunner(dl.Runner):
    def handle_batch(self, batch):
        x, y1, y2 = batch
        y1_hat, y2_hat = self.model(x)
        self.batch = {
            "features": x,
            "logits1": y1_hat,
            "logits2": y2_hat,
            "targets1": y1,
            "targets2": y2,
        }

# model training
runner = CustomRunner()
runner.train(
    model=model,
    criterion=criterion,
    optimizer=optimizer,
    scheduler=scheduler,
    loaders=loaders,
    num_epochs=3,
    verbose=True,
    callbacks=[
        dl.CriterionCallback(metric_key="loss1", input_key="logits1", target_key="targets1"),
        dl.CriterionCallback(metric_key="loss2", input_key="logits2", target_key="targets2"),
        dl.MetricAggregationCallback(metric_key="loss", metrics=["loss1", "loss2"], mode="mean"),
        dl.BackwardCallback(metric_key="loss"),
        dl.OptimizerCallback(metric_key="loss"),
        dl.SchedulerCallback(),
        dl.AccuracyCallback(
            input_key="logits1", target_key="targets1", num_classes=num_classes1, prefix="one_"
        ),
        dl.AccuracyCallback(
            input_key="logits2", target_key="targets2", num_classes=num_classes2, prefix="two_"
        ),
        # catalyst[ml] required ``pip install catalyst[ml]``
        # dl.ConfusionMatrixCallback(
        #     input_key="logits1", target_key="targets1", num_classes=num_classes1, prefix="one_cm"
        # ),
        # dl.ConfusionMatrixCallback(
        #     input_key="logits2", target_key="targets2", num_classes=num_classes2, prefix="two_cm"
        # ),
        dl.CheckpointCallback(
            logdir="./logs/one",
            loader_key="valid", metric_key="one_accuracy01", minimize=False, topk=1
        ),
        dl.CheckpointCallback(
            logdir="./logs/two",
            loader_key="valid", metric_key="two_accuracy03", minimize=False, topk=3
        ),
    ],
    loggers={"console": dl.ConsoleLogger(), "tb": dl.TensorboardLogger("./logs/tb")},
)

ML – RecSys

import torch
from torch.utils.data import DataLoader, TensorDataset
from catalyst import dl

# sample data
num_users, num_features, num_items = int(1e4), int(1e1), 10
X = torch.rand(num_users, num_features)
y = (torch.rand(num_users, num_items) > 0.5).to(torch.float32)

# pytorch loaders
dataset = TensorDataset(X, y)
loader = DataLoader(dataset, batch_size=32, num_workers=1)
loaders = {"train": loader, "valid": loader}

# model, criterion, optimizer, scheduler
model = torch.nn.Linear(num_features, num_items)
criterion = torch.nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters())
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, [2])

# model training
runner = dl.SupervisedRunner(
    input_key="features", output_key="logits", target_key="targets", loss_key="loss"
)
runner.train(
    model=model,
    criterion=criterion,
    optimizer=optimizer,
    scheduler=scheduler,
    loaders=loaders,
    num_epochs=3,
    verbose=True,
    callbacks=[
        dl.BatchTransformCallback(
            transform=torch.sigmoid,
            scope="on_batch_end",
            input_key="logits",
            output_key="scores"
        ),
        dl.CriterionCallback(input_key="logits", target_key="targets", metric_key="loss"),
        # uncomment for extra metrics:
        # dl.AUCCallback(input_key="scores", target_key="targets"),
        # dl.HitrateCallback(input_key="scores", target_key="targets", topk=(1, 3, 5)),
        # dl.MRRCallback(input_key="scores", target_key="targets", topk=(1, 3, 5)),
        # dl.MAPCallback(input_key="scores", target_key="targets", topk=(1, 3, 5)),
        # dl.NDCGCallback(input_key="scores", target_key="targets", topk=(1, 3, 5)),
        dl.BackwardCallback(metric_key="loss"),
        dl.OptimizerCallback(metric_key="loss"),
        dl.SchedulerCallback(),
        dl.CheckpointCallback(
            logdir="./logs", loader_key="valid", metric_key="loss", minimize=True
        ),
    ]
)

CV - MNIST classification

import os
from torch import nn, optim
from torch.utils.data import DataLoader
from catalyst import dl
from catalyst.contrib.datasets import MNIST

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.02)

train_data = MNIST(os.getcwd(), train=True)
valid_data = MNIST(os.getcwd(), train=False)
loaders = {
    "train": DataLoader(train_data, batch_size=32),
    "valid": DataLoader(valid_data, batch_size=32),
}

runner = dl.SupervisedRunner()
# model training
runner.train(
    model=model,
    criterion=criterion,
    optimizer=optimizer,
    loaders=loaders,
    num_epochs=1,
    logdir="./logs",
    valid_loader="valid",
    valid_metric="loss",
    minimize_valid_metric=True,
    verbose=True,
# uncomment for extra metrics:
#     callbacks=[
#         dl.AccuracyCallback(input_key="logits", target_key="targets", num_classes=10),
#         dl.PrecisionRecallF1SupportCallback(
#             input_key="logits", target_key="targets", num_classes=10
#         ),
#         dl.AUCCallback(input_key="logits", target_key="targets"),
#         # catalyst[ml] required ``pip install catalyst[ml]``
#         dl.ConfusionMatrixCallback(
#             input_key="logits", target_key="targets", num_classes=num_classes
#         ),
#     ]
)

CV - MNIST segmentation

import os
import torch
from torch import nn
from torch.utils.data import DataLoader
from catalyst import dl
from catalyst.contrib.datasets import MNIST
from catalyst.contrib.losses import IoULoss


model = nn.Sequential(
    nn.Conv2d(1, 1, 3, 1, 1), nn.ReLU(),
    nn.Conv2d(1, 1, 3, 1, 1), nn.Sigmoid(),
)
criterion = IoULoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.02)

train_data = MNIST(os.getcwd(), train=True)
valid_data = MNIST(os.getcwd(), train=False)
loaders = {
    "train": DataLoader(train_data, batch_size=32),
    "valid": DataLoader(valid_data, batch_size=32),
}

class CustomRunner(dl.SupervisedRunner):
    def handle_batch(self, batch):
        x = batch[self._input_key]
        x_noise = (x + torch.rand_like(x)).clamp_(0, 1)
        x_ = self.model(x_noise)
        self.batch = {self._input_key: x, self._output_key: x_, self._target_key: x}

runner = CustomRunner(
    input_key="features", output_key="scores", target_key="targets", loss_key="loss"
)
# model training
runner.train(
    model=model,
    criterion=criterion,
    optimizer=optimizer,
    loaders=loaders,
    num_epochs=1,
    callbacks=[
        dl.IOUCallback(input_key="scores", target_key="targets"),
        dl.DiceCallback(input_key="scores", target_key="targets"),
        dl.TrevskyCallback(input_key="scores", target_key="targets", alpha=0.2),
    ],
    logdir="./logdir",
    valid_loader="valid",
    valid_metric="loss",
    minimize_valid_metric=True,
    verbose=True,
)

CV - MNIST metric learning

import os
from torch.optim import Adam
from torch.utils.data import DataLoader
from catalyst import dl
from catalyst.contrib.data import HardTripletsSampler
from catalyst.contrib.datasets import MnistMLDataset, MnistQGDataset
from catalyst.contrib.losses import TripletMarginLossWithSampler
from catalyst.contrib.models import MnistSimpleNet
from catalyst.data.sampler import BatchBalanceClassSampler


# 1. train and valid loaders
train_dataset = MnistMLDataset(root=os.getcwd())
sampler = BatchBalanceClassSampler(
    labels=train_dataset.get_labels(), num_classes=5, num_samples=10, num_batches=10
)
train_loader = DataLoader(dataset=train_dataset, batch_sampler=sampler)

valid_dataset = MnistQGDataset(root=os.getcwd(), gallery_fraq=0.2)
valid_loader = DataLoader(dataset=valid_dataset, batch_size=1024)

# 2. model and optimizer
model = MnistSimpleNet(out_features=16)
optimizer = Adam(model.parameters(), lr=0.001)

# 3. criterion with triplets sampling
sampler_inbatch = HardTripletsSampler(norm_required=False)
criterion = TripletMarginLossWithSampler(margin=0.5, sampler_inbatch=sampler_inbatch)

# 4. training with catalyst Runner
class CustomRunner(dl.SupervisedRunner):
    def handle_batch(self, batch) -> None:
        if self.is_train_loader:
            images, targets = batch["features"].float(), batch["targets"].long()
            features = self.model(images)
            self.batch = {"embeddings": features, "targets": targets,}
        else:
            images, targets, is_query = \
                batch["features"].float(), batch["targets"].long(), batch["is_query"].bool()
            features = self.model(images)
            self.batch = {"embeddings": features, "targets": targets, "is_query": is_query}

callbacks = [
    dl.ControlFlowCallbackWrapper(
        dl.CriterionCallback(input_key="embeddings", target_key="targets", metric_key="loss"),
        loaders="train",
    ),
    dl.ControlFlowCallbackWrapper(
        dl.CMCScoreCallback(
            embeddings_key="embeddings",
            labels_key="targets",
            is_query_key="is_query",
            topk=[1],
        ),
        loaders="valid",
    ),
    dl.PeriodicLoaderCallback(
        valid_loader_key="valid", valid_metric_key="cmc01", minimize=False, valid=2
    ),
]

runner = CustomRunner(input_key="features", output_key="embeddings")
runner.train(
    model=model,
    criterion=criterion,
    optimizer=optimizer,
    callbacks=callbacks,
    loaders={"train": train_loader, "valid": valid_loader},
    verbose=False,
    logdir="./logs",
    valid_loader="valid",
    valid_metric="cmc01",
    minimize_valid_metric=False,
    num_epochs=10,
)

CV - MNIST GAN

import os
import torch
from torch import nn
from torch.utils.data import DataLoader
from catalyst import dl
from catalyst.contrib.datasets import MNIST
from catalyst.contrib.layers import GlobalMaxPool2d, Lambda

latent_dim = 128
generator = nn.Sequential(
    # We want to generate 128 coefficients to reshape into a 7x7x128 map
    nn.Linear(128, 128 * 7 * 7),
    nn.LeakyReLU(0.2, inplace=True),
    Lambda(lambda x: x.view(x.size(0), 128, 7, 7)),
    nn.ConvTranspose2d(128, 128, (4, 4), stride=(2, 2), padding=1),
    nn.LeakyReLU(0.2, inplace=True),
    nn.ConvTranspose2d(128, 128, (4, 4), stride=(2, 2), padding=1),
    nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(128, 1, (7, 7), padding=3),
    nn.Sigmoid(),
)
discriminator = nn.Sequential(
    nn.Conv2d(1, 64, (3, 3), stride=(2, 2), padding=1),
    nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(64, 128, (3, 3), stride=(2, 2), padding=1),
    nn.LeakyReLU(0.2, inplace=True),
    GlobalMaxPool2d(),
    nn.Flatten(),
    nn.Linear(128, 1),
)

model = nn.ModuleDict({"generator": generator, "discriminator": discriminator})
criterion = {"generator": nn.BCEWithLogitsLoss(), "discriminator": nn.BCEWithLogitsLoss()}
optimizer = {
    "generator": torch.optim.Adam(generator.parameters(), lr=0.0003, betas=(0.5, 0.999)),
    "discriminator": torch.optim.Adam(discriminator.parameters(), lr=0.0003, betas=(0.5, 0.999)),
}
train_data = MNIST(os.getcwd(), train=False)
loaders = {"train": DataLoader(train_data, batch_size=32)}

class CustomRunner(dl.Runner):
    def predict_batch(self, batch):
        batch_size = 1
        # Sample random points in the latent space
        random_latent_vectors = torch.randn(batch_size, latent_dim).to(self.engine.device)
        # Decode them to fake images
        generated_images = self.model["generator"](random_latent_vectors).detach()
        return generated_images

    def handle_batch(self, batch):
        real_images, _ = batch
        batch_size = real_images.shape[0]

        # Sample random points in the latent space
        random_latent_vectors = torch.randn(batch_size, latent_dim).to(self.engine.device)

        # Decode them to fake images
        generated_images = self.model["generator"](random_latent_vectors).detach()
        # Combine them with real images
        combined_images = torch.cat([generated_images, real_images])

        # Assemble labels discriminating real from fake images
        labels = \
            torch.cat([torch.ones((batch_size, 1)), torch.zeros((batch_size, 1))]).to(self.engine.device)
        # Add random noise to the labels - important trick!
        labels += 0.05 * torch.rand(labels.shape).to(self.engine.device)

        # Discriminator forward
        combined_predictions = self.model["discriminator"](combined_images)

        # Sample random points in the latent space
        random_latent_vectors = torch.randn(batch_size, latent_dim).to(self.engine.device)
        # Assemble labels that say "all real images"
        misleading_labels = torch.zeros((batch_size, 1)).to(self.engine.device)

        # Generator forward
        generated_images = self.model["generator"](random_latent_vectors)
        generated_predictions = self.model["discriminator"](generated_images)

        self.batch = {
            "combined_predictions": combined_predictions,
            "labels": labels,
            "generated_predictions": generated_predictions,
            "misleading_labels": misleading_labels,
        }


runner = CustomRunner()
runner.train(
    model=model,
    criterion=criterion,
    optimizer=optimizer,
    loaders=loaders,
    callbacks=[
        dl.CriterionCallback(
            input_key="combined_predictions",
            target_key="labels",
            metric_key="loss_discriminator",
            criterion_key="discriminator",
        ),
        dl.BackwardCallback(metric_key="loss_discriminator"),
        dl.OptimizerCallback(
            optimizer_key="discriminator",
            metric_key="loss_discriminator",
        ),
        dl.CriterionCallback(
            input_key="generated_predictions",
            target_key="misleading_labels",
            metric_key="loss_generator",
            criterion_key="generator",
        ),
        dl.BackwardCallback(metric_key="loss_generator"),
        dl.OptimizerCallback(
            optimizer_key="generator",
            metric_key="loss_generator",
        ),
    ],
    valid_loader="train",
    valid_metric="loss_generator",
    minimize_valid_metric=True,
    num_epochs=20,
    verbose=True,
    logdir="./logs_gan",
)

# visualization (matplotlib required):
# import matplotlib.pyplot as plt
# %matplotlib inline
# plt.imshow(runner.predict_batch(None)[0, 0].cpu().numpy())

CV - MNIST VAE

import os
import torch
from torch import nn, optim
from torch.nn import functional as F
from torch.utils.data import DataLoader
from catalyst import dl, metrics
from catalyst.contrib.datasets import MNIST

LOG_SCALE_MAX = 2
LOG_SCALE_MIN = -10

def normal_sample(loc, log_scale):
    scale = torch.exp(0.5 * log_scale)
    return loc + scale * torch.randn_like(scale)

class VAE(nn.Module):
    def __init__(self, in_features, hid_features):
        super().__init__()
        self.hid_features = hid_features
        self.encoder = nn.Linear(in_features, hid_features * 2)
        self.decoder = nn.Sequential(nn.Linear(hid_features, in_features), nn.Sigmoid())

    def forward(self, x, deterministic=False):
        z = self.encoder(x)
        bs, z_dim = z.shape

        loc, log_scale = z[:, : z_dim // 2], z[:, z_dim // 2 :]
        log_scale = torch.clamp(log_scale, LOG_SCALE_MIN, LOG_SCALE_MAX)

        z_ = loc if deterministic else normal_sample(loc, log_scale)
        z_ = z_.view(bs, -1)
        x_ = self.decoder(z_)

        return x_, loc, log_scale

class CustomRunner(dl.IRunner):
    def __init__(self, hid_features, logdir, engine):
        super().__init__()
        self.hid_features = hid_features
        self._logdir = logdir
        self._engine = engine

    def get_engine(self):
        return self._engine

    def get_loggers(self):
        return {
            "console": dl.ConsoleLogger(),
            "csv": dl.CSVLogger(logdir=self._logdir),
            "tensorboard": dl.TensorboardLogger(logdir=self._logdir),
        }

    @property
    def num_epochs(self) -> int:
        return 1

    def get_loaders(self):
        loaders = {
            "train": DataLoader(MNIST(os.getcwd(), train=False), batch_size=32),
            "valid": DataLoader(MNIST(os.getcwd(), train=False), batch_size=32),
        }
        return loaders

    def get_model(self):
        model = self.model if self.model is not None else VAE(28 * 28, self.hid_features)
        return model

    def get_optimizer(self, model):
        return optim.Adam(model.parameters(), lr=0.02)

    def get_callbacks(self):
        return {
            "backward": dl.BackwardCallback(metric_key="loss"),
            "optimizer": dl.OptimizerCallback(metric_key="loss"),
            "checkpoint": dl.CheckpointCallback(
                self._logdir,
                loader_key="valid",
                metric_key="loss",
                minimize=True,
                topk=3,
            ),
        }

    def on_loader_start(self, runner):
        super().on_loader_start(runner)
        self.meters = {
            key: metrics.AdditiveMetric(compute_on_call=False)
            for key in ["loss_ae", "loss_kld", "loss"]
        }

    def handle_batch(self, batch):
        x, _ = batch
        x = x.view(x.size(0), -1)
        x_, loc, log_scale = self.model(x, deterministic=not self.is_train_loader)

        loss_ae = F.mse_loss(x_, x)
        loss_kld = (
            -0.5 * torch.sum(1 + log_scale - loc.pow(2) - log_scale.exp(), dim=1)
        ).mean()
        loss = loss_ae + loss_kld * 0.01

        self.batch_metrics = {"loss_ae": loss_ae, "loss_kld": loss_kld, "loss": loss}
        for key in ["loss_ae", "loss_kld", "loss"]:
            self.meters[key].update(self.batch_metrics[key].item(), self.batch_size)

    def on_loader_end(self, runner):
        for key in ["loss_ae", "loss_kld", "loss"]:
            self.loader_metrics[key] = self.meters[key].compute()[0]
        super().on_loader_end(runner)

    def predict_batch(self, batch):
        random_latent_vectors = torch.randn(1, self.hid_features).to(self.engine.device)
        generated_images = self.model.decoder(random_latent_vectors).detach()
        return generated_images

runner = CustomRunner(128, "./logs", dl.CPUEngine())
runner.run()
# visualization (matplotlib required):
# import matplotlib.pyplot as plt
# %matplotlib inline
# plt.imshow(runner.predict_batch(None)[0].cpu().numpy().reshape(28, 28))

AutoML - hyperparameters optimization with Optuna

import os
import optuna
import torch
from torch import nn
from torch.utils.data import DataLoader
from catalyst import dl
from catalyst.contrib.datasets import MNIST


def objective(trial):
    lr = trial.suggest_loguniform("lr", 1e-3, 1e-1)
    num_hidden = int(trial.suggest_loguniform("num_hidden", 32, 128))

    train_data = MNIST(os.getcwd(), train=True)
    valid_data = MNIST(os.getcwd(), train=False)
    loaders = {
        "train": DataLoader(train_data, batch_size=32),
        "valid": DataLoader(valid_data, batch_size=32),
    }
    model = nn.Sequential(
        nn.Flatten(), nn.Linear(784, num_hidden), nn.ReLU(), nn.Linear(num_hidden, 10)
    )
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()

    runner = dl.SupervisedRunner(input_key="features", output_key="logits", target_key="targets")
    runner.train(
        model=model,
        criterion=criterion,
        optimizer=optimizer,
        loaders=loaders,
        callbacks={
            "accuracy": dl.AccuracyCallback(
                input_key="logits", target_key="targets", num_classes=10
            ),
            # catalyst[optuna] required ``pip install catalyst[optuna]``
            "optuna": dl.OptunaPruningCallback(
                loader_key="valid", metric_key="accuracy01", minimize=False, trial=trial
            ),
        },
        num_epochs=3,
    )
    score = trial.best_score
    return score

study = optuna.create_study(
    direction="maximize",
    pruner=optuna.pruners.MedianPruner(
        n_startup_trials=1, n_warmup_steps=0, interval_steps=1
    ),
)
study.optimize(objective, n_trials=3, timeout=300)
print(study.best_value, study.best_params)

Config API - minimal example

runner:
  _target_: catalyst.runners.SupervisedRunner
  model:
    _var_: model
    _target_: torch.nn.Sequential
    args:
      - _target_: torch.nn.Flatten
      - _target_: torch.nn.Linear
        in_features: 784  # 28 * 28
        out_features: 10
  input_key: features
  output_key: &output_key logits
  target_key: &target_key targets
  loss_key: &loss_key loss

run:
  # ≈ stage 1
  - _call_: train  # runner.train(...)

    criterion:
      _target_: torch.nn.CrossEntropyLoss

    optimizer:
      _target_: torch.optim.Adam
      params:  # model.parameters()
        _var_: model.parameters
      lr: 0.02

    loaders:
      train:
        _target_: torch.utils.data.DataLoader
        dataset:
          _target_: catalyst.contrib.datasets.MNIST
          root: data
          train: y
        batch_size: 32

      &valid_loader_key valid:
        &valid_loader
        _target_: torch.utils.data.DataLoader
        dataset:
          _target_: catalyst.contrib.datasets.MNIST
          root: data
          train: n
        batch_size: 32

    callbacks:
      - &accuracy_metric
        _target_: catalyst.callbacks.AccuracyCallback
        input_key: *output_key
        target_key: *target_key
        topk: [1,3,5]
      - _target_: catalyst.callbacks.PrecisionRecallF1SupportCallback
        input_key: *output_key
        target_key: *target_key

    num_epochs: 1
    logdir: logs
    valid_loader: *valid_loader_key
    valid_metric: *loss_key
    minimize_valid_metric: y
    verbose: y

  # ≈ stage 2
  - _call_: evaluate_loader  # runner.evaluate_loader(...)
    loader: *valid_loader
    callbacks:
      - *accuracy_metric

catalyst-run --config example.yaml

Tests

All Catalyst code, features, and pipelines are fully tested. We also have our own catalyst-codestyle and a corresponding pre-commit hook. During testing, we train a variety of different models: image classification, image segmentation, text classification, GANs, and much more. We then compare their convergence metrics in order to verify the correctness of the training procedure and its reproducibility. As a result, Catalyst provides fully tested and reproducible best practices for your deep learning research and development.

Blog Posts

Talks

Community

Accelerated with Catalyst

Research Papers

Blog Posts

Competitions

Kaggle Quick, Draw! Doodle Recognition Challenge - 11th place
Catalyst.RL - NeurIPS 2018: AI for Prosthetics Challenge – 3rd place
Kaggle Google Landmark 2019 - 30th place
iMet Collection 2019 - FGVC6 - 24th place
ID R&D Anti-spoofing Challenge - 14th place
NeurIPS 2019: Recursion Cellular Image Classification - 4th place
MICCAI 2019: Automatic Structure Segmentation for Radiotherapy Planning Challenge 2019
- 3rd place solution for Task 3: Organ-at-risk segmentation from chest CT scans
- and 4th place solution for Task 4: Gross Target Volume segmentation of lung cancer
Kaggle Seversteal steel detection - 5th place
RSNA Intracranial Hemorrhage Detection - 5th place
APTOS 2019 Blindness Detection – 7th place
Catalyst.RL - NeurIPS 2019: Learn to Move - Walk Around – 2nd place
xView2 Damage Assessment Challenge - 3rd place

Toolkits

Catalyst.RL – A Distributed Framework for Reproducible RL Research by Scitator
Catalyst.Classification - Comprehensive classification pipeline with Pseudo-Labeling by Bagxi and Pdanilov
Catalyst.Segmentation - Segmentation pipelines - binary, semantic and instance, by Bagxi
Catalyst.Detection - Anchor-free detection pipeline by Avi2011class and TezRomacH
Catalyst.GAN - Reproducible GANs pipelines by Asmekal
Catalyst.Neuro - Brain image analysis project, in collaboration with TReNDS Center
MLComp – Distributed DAG framework for machine learning with UI by Lightforever
Pytorch toolbelt - PyTorch extensions for fast R&D prototyping and Kaggle farming by BloodAxe
Helper functions - An assorted collection of helper functions by Ternaus
BERT Distillation with Catalyst by elephantmipt

Other

CamVid Segmentation Example - Example of semantic segmentation for CamVid dataset
Notebook API tutorial for segmentation in Understanding Clouds from Satellite Images Competition
Catalyst.RL - NeurIPS 2019: Learn to Move - Walk Around – starter kit
Catalyst.RL - NeurIPS 2019: Animal-AI Olympics - starter kit
Inria Segmentation Example - An example of training segmentation model for Inria Sattelite Segmentation Challenge
iglovikov_segmentation - Semantic segmentation pipeline using Catalyst
Logging Catalyst Runs to Comet - An example of how to log metrics, hyperparameters and more from Catalyst runs to Comet

See other projects at the GitHub dependency graph.

If your project implements a paper, a notable use-case/tutorial, or a Kaggle competition solution, or if your code simply presents interesting results and uses Catalyst, we would be happy to add your project to the list above! Do not hesitate to send us a PR with a brief description of the project similar to the above.

Contribution Guide

We appreciate all contributions. If you are planning to contribute back bug-fixes, there is no need to run that by us; just send a PR. If you plan to contribute new features, new utility functions, or extensions, please open an issue first and discuss it with us.

Please see the Contribution Guide for more information.
By participating in this project, you agree to abide by its Code of Conduct.

User Feedback

We've created [email protected] as an additional channel for user feedback.

If you like the project and want to thank us, this is the right place.
If you would like to start a collaboration between your team and Catalyst team to improve Deep Learning R&D, you are always welcome.
If you don't like Github Issues and prefer email, feel free to email us.
Finally, if you do not like something, please, share it with us, and we can see how to improve it.

We appreciate any type of feedback. Thank you!

Acknowledgments

Since the beginning of the Сatalyst development, a lot of people have influenced it in a lot of different ways.

Catalyst.Team

Catalyst.Contributors

Trusted by

Awecom
Researchers at the Center for Translational Research in Neuroimaging and Data Science (TReNDS)
Deep Learning School
Researchers at Emory University
Evil Martians
Researchers at the Georgia Institute of Technology
Researchers at Georgia State University
Helios
HPCD Lab
iFarm
Kinoplan
Researchers at the Moscow Institute of Physics and Technology
Neuromation
Poteha Labs
Provectus
Researchers at the Skolkovo Institute of Science and Technology
SoftConstruct
Researchers at Tinkoff
Researchers at Yandex.Research

Citation

Please use this bibtex if you want to cite this repository in your publications:

@misc{catalyst,
    author = {Kolesnikov, Sergey},
    title = {Catalyst - Accelerated deep learning R&D},
    year = {2018},
    publisher = {GitHub},
    journal = {GitHub repository},
    howpublished = {\url{https://github.com/catalyst-team/catalyst}},
}

catalyst's People

Contributors

Stargazers

Watchers

Forkers

belskikh onisimchukv gridl stacywebb puppywst intuitionmachine codeaudit little1tow anirband cclauss shafiahmed mckenzypg n01z3 delip ngxbac panda4us iamymind qitong h-alk hexfaker hulalazz dbrainio erikajob91 kelvinson jingweiz arunkumarramanan cxz alexgrinch luck1ess ivbelkin mikkohypponen jalajthanaki angleboy8 vikascartgithub yangyuren03 intuitionfabricorg oklander eycab fedor-chervinskii deep-warrior ferrine gazay qrltrader tlwzzy manik-hossain velikodniy mave5 stalkermustang khrulkovv wook133 alevitskyy daniellutscher pokidyshev lightforever 162 russ76 codeslord donbobka erlemar dimaxano ram-iyer foromik nicobach asmekal murari-goswami jaykimbravekjh felix-neko roymachinelearning kyuhwas olgaiv39 battyone devforfu dipet vikasmech useric dbusai pandinosaurus jamshaidsohail5 ahatamiz bigmb evezhier thohidu nethask virajmehta akterskii dalide ternaus yulits sergeyshilin quetzalcohuatl jchen42703 cristysui yuv4r4j dsblank xelibrion andrey-avdeev ppleskov rdpli qubvel numb3r33

catalyst's Issues

[feature] weight decay autoremove for biases

[feature] FP16 & distributed training improved support

@todo:

refactor and simplify FP16 support
add distributed training support

good start here

Memory Leak Issue

I just want to inform that I am countering with Memory leak issue.
I am not 100% sure that It comes from catalyst. But I am sure that when I use my own code written by myself (without catalyst), there is no leaking issue. I tried to pre-procedure this issue from your examples, but it runs very fast and we could not see the leak.

Following is my leak descriptions:

Image classification task with ~17M images
The memory leak occurs every epoch and it happens at the next epoch again.
EX:
At the beginning of epoch 0, my memory is:
- RAM: 13G/16G
- Swap: 0/50G
At the end of epoch 0, my memory is:
- RAM: 16G/16G
- Swap: 25G/50G
When epoch 0 ends, the memories (RAM + Swap) are released, (RAM: 13G/16G, SwapL: 0/50G). The same phenomena occurs at the next epoch 1.

I am debugging to see where the issue comes from. Could you please double check with your bigger data?

[feature] task-specific callbacks

Proposal by @BloodAxe, use task-specific callbacks, like:

BinaryClassificationMetricsCallback
MulticlassClassificationMetricsCallback
MultiLabelClassificationMetricsCallback
BinarySegmentationMetricsCallback
SemanticSegmentationMetricsCallback
ObjectDetectionMetricsCallback

With internal metrics definition like:

SemanticSegmentationMetricsCallback(
    need_confusion_matrix=True, 
    need_mAP=True, 
    need_IoU=False)

Move metric managment to separate class

Currently RunnerState owns all metric management

Move it to MetricState

'async' is a reserved word in Python 3.7

'async' is a reserved word in Python 3.7. Cuda has shifted to 'non_blocking' instead. pytorch/pytorch#4999

flake8 testing of https://github.com/Scitator/prometheus on Python 3.7.0

$ flake8 . --count --select=E901,E999,F821,F822,F823 --show-source --statistics

./losses/unet.py:8:23: E999 SyntaxError: invalid syntax
    return x.cuda(async=True) if torch.cuda.is_available() else x
                      ^
1     E999 SyntaxError: invalid syntax
1

[fix] fix registry exception for duplicate factory

@todo:

return exception for duplicate factory
check that all registries usage are correct: both in __init__.py and model.py for example

refactor legacy csv parsers

[feature] add pytorch.SWA support

add Pytorch.SWA to catalyst optimizers

[feature] OneCycle general schduler

idea:

assume we have training stage process 0.0 -> 1.0
on 0 -> {warm_fraction}, increase lr from init_lr to max_lr
on {warm_fraction} -> {cool_fraction}, use linear/cosine decay from max_lr to min_lr
on {cool_fraction} -> 1.0, use min_lr
batch or epoch lr schduling

the same thing with momentum

[feature] parse tensorboard logs without tensorflow

@todo:

rewrite tensorboard logs parsing
remove tensorflow dependency

add style transfer runner

add registry for models

[container] Advanced dockerization with Nvidia GPU support

Create folder docker
Move there cpu based Docker file
Create DockerFile based on Nvidia image
Update Readme with installation guide, hardware\software requirements and usage

Segmentation quickstartnotebook: TypeError: tensor is not a torch image.

I have several png images with corresponding masks:
Getting data:

data_dir = './data_objects'

def load_data(root_dir):
    data = []

    for stage in ['train']:
        for content in ['images', 'segmentation']:
            
            # construct path to each image
            directory = os.path.join(root_dir,  content)
            fps = [os.path.join(directory, filename) for filename in os.listdir(directory)]

            # read images

            images = [imread(filepath) for filepath in fps]
 
            # if images have different sizes you have to resize them before:
            resized_images = [resize(image, (64, 64)) for image in images]
            
            # stack to one np.array 
            np_images = np.stack(resized_images, axis=0)
            
            data.append(np_images)
            
    return data
x_train, y_train  = load_data(data_dir)
y_train = y_train.reshape(19, 64, 64, 1)

Making data looks like in the example notebook:

x_train, X_test, y_train, y_test = train_test_split(
         x_train, y_train, test_size=0.33, random_state=42)

train_data = list(zip(x_train, y_train))
valid_data = list(zip(X_test, y_test))

train_data[0][0].shape, train_data[0][1].shape, len(train_data)
(64, 64, 3), (64, 64, 1), 12

Calling train returns error:

0/10 * Epoch (train):   0% 0/3 [00:00<?, ?it/s]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-43-69a57e657d9d> in <module>()
     26     logdir=logdir,
     27     num_epochs=num_epochs,
---> 28     verbose=True
     29 )

~/anaconda3/lib/python3.6/site-packages/catalyst/dl/experiments/runner.py in train(self, model, criterion, optimizer, loaders, logdir, callbacks, scheduler, num_epochs, valid_loader, main_metric, minimize_metric, verbose, state_kwargs, check)
    269             state_kwargs=state_kwargs
    270         )
--> 271         self.run_experiment(experiment, check=check)
    272 
    273     def infer(

~/anaconda3/lib/python3.6/site-packages/catalyst/dl/experiments/runner.py in run_experiment(self, experiment, check)
    192         self.experiment = experiment
    193         for stage in self.experiment.stages:
--> 194             self._run_stage(stage)
    195         return self
    196 

~/anaconda3/lib/python3.6/site-packages/catalyst/dl/experiments/runner.py in _run_stage(self, stage)
    173 
    174             self._run_event("epoch_start")
--> 175             self._run_epoch(loaders)
    176             self._run_event("epoch_end")
    177 

~/anaconda3/lib/python3.6/site-packages/catalyst/dl/experiments/runner.py in _run_epoch(self, loaders)
    158             self._run_event("loader_start")
    159             with torch.set_grad_enabled(self.state.need_backward):
--> 160                 self._run_loader(loaders[loader_name])
    161             self._run_event("loader_end")
    162 

~/anaconda3/lib/python3.6/site-packages/catalyst/dl/experiments/runner.py in _run_loader(self, loader)
    119         self.state.timer.start("base/data_time")
    120 
--> 121         for i, batch in enumerate(loader):
    122             batch = self._batch2device(batch, self.device)
    123             self.state.timer.stop("base/data_time")

~/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py in __next__(self)
    635                 self.reorder_dict[idx] = batch
    636                 continue
--> 637             return self._process_next_batch(batch)
    638 
    639     next = __next__  # Python 2 compatibility

~/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py in _process_next_batch(self, batch)
    656         self._put_indices()
    657         if isinstance(batch, ExceptionWrapper):
--> 658             raise batch.exc_type(batch.exc_msg)
    659         return batch
    660 

TypeError: Traceback (most recent call last):
  File "/home/dex/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 138, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/dex/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 138, in <listcomp>
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/dex/anaconda3/lib/python3.6/site-packages/catalyst/data/dataset.py", line 74, in __getitem__
    dict_ = self.dict_transform(dict_)
  File "/home/dex/anaconda3/lib/python3.6/site-packages/torchvision-0.2.0-py3.6.egg/torchvision/transforms/transforms.py", line 42, in __call__
    img = t(img)
  File "/home/dex/anaconda3/lib/python3.6/site-packages/catalyst/data/augmentor.py", line 25, in __call__
    ] = self.augment_fn(dict_[self.dict_key], **self.default_kwargs)
  File "/home/dex/anaconda3/lib/python3.6/site-packages/torchvision-0.2.0-py3.6.egg/torchvision/transforms/transforms.py", line 118, in __call__
    return F.normalize(tensor, self.mean, self.std)
  File "/home/dex/anaconda3/lib/python3.6/site-packages/torchvision-0.2.0-py3.6.egg/torchvision/transforms/functional.py", line 158, in normalize
    raise TypeError('tensor is not a torch image.')
TypeError: tensor is not a torch image.

Then I tried adding more transforms, but failed to fix that problem, for example something like this:

data_transform = transforms.Compose([
    Augmentor(
        dict_key="features",
        augment_fn=lambda x: \
            torch.from_numpy(x.copy().astype(np.float32) / 255.).unsqueeze_(0)),
    Augmentor(
        dict_key="features",
        augment_fn=transforms.ToPILImage()),  #transforms.ToTensor()),
    Augmentor(
        dict_key="features",
        augment_fn=transforms.Normalize(
            (0.5, 0.5, 0.5),
            (0.5, 0.5, 0.5))),
    Augmentor(
        dict_key="targets",
        augment_fn=lambda x: \
            torch.from_numpy(x.copy().astype(np.float32) / 255.).unsqueeze_(0)),
    Augmentor(
        dict_key="targets",
        augment_fn= transforms.ToPILImage())#transforms.ToTensor())
])

What is wrong here?

[feature] add jit.trace support

Add automated model tracing & packing.

docs (site, etc)

[feature] hyperopt intergration

Use Ax with Catalyst.Experiments for automatic hyperparameters search.

Docs for checkpoints logic

Hey, great framework!

I can't find in docs and don't see from API how to easily change checkpointer behavior - which metric it should monitor for checkpoints, max/min etc.

And I'm not sure about checkpoints at all - where should I find saved models, what is the format, are they enabled by default, what is the logic for saving them right now. And can I customize all of that?

some bugs in LRFinder

In case of default value for LRFinder parameter n_steps=None training terminates with

File "/home/ivb/Repos/contrib/catalyst/catalyst/dl/callbacks/schedulers.py", line 188, in on_loader_start
    self.n_steps = self.n_steps or len(state.loader)
AttributeError: 'RunnerState' object has no attribute 'loader'

To reproduce just replace sheduler section in examples/cifar_simple/config.yml with

scheduler:
  callback: LRFinder
  final_lr: 10

Also doc for LRFinder contains unused parameter init_lr.

[solved] ResnetEncoder 'frozen' argument doesn't work

Passing frozen=True to catalyst.contrib.models.ResnetEncoder does not actually make encoder not trainable. Simple code to check that tensor value is changing during training:

import torch
from catalyst.contrib.models import ResnetEncoder

class Net(torch.nn.Module):
    
    def __init__(self):
        super().__init__()
        self.enc = ResnetEncoder(
            arch="resnet18", 
            pooling="GlobalAvgPool2d"
        )
        self.logits = torch.nn.Linear(
            self.enc.out_features, 1)
    
    def forward(self, x):
        return self.logits(self.enc(x))
    
    
model = Net()

old_value = model.enc.feature_net[0].weight[0,0,0,0].item()

inputs = torch.randn(8, 3, 224, 224)
targets = torch.ones(8, 1)
optimizer = torch.optim.Adam(model.parameters())
criterion = torch.nn.BCEWithLogitsLoss()

optimizer.zero_grad()
loss = criterion(model(inputs), targets)
loss.backward()
optimizer.step()

new_value = model.enc.feature_net[0].weight[0,0,0,0].item()

print(old_value, new_value)

It will print different values. It is so because nn.Module has no attribute 'requires_grad', only nn.Parameter has it. I fix it in my fork, but I don't see 'dev' branch to open PR.

[fix] fix json metric logger

@todo:

add json formatter to Logger
check (import json; data=json.load(open(f'{logdir}/metrics.json')); print(data[-1]))
add this check to travis CI

CI: check examples

KeyError: 'loss'

Raises this exception after first epoch.

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-31-eecc50de0067> in <module>
      3     callbacks=callbacks,
      4     logdir=logdir,
----> 5     epochs=n_epochs, verbose=True)

~/miniconda3/lib/python3.6/site-packages/catalyst/dl/runner.py in train(self, loaders, callbacks, state_params, epochs, start_epoch, verbose, logdir)
    214             mode="train",
    215             verbose=verbose,
--> 216             logdir=logdir,
    217         )
    218 

~/miniconda3/lib/python3.6/site-packages/catalyst/dl/runner.py in run(self, loaders, callbacks, state_params, epochs, start_epoch, mode, verbose, logdir)
    180                 self.run_event(callbacks=callbacks, event="on_loader_end")
    181 
--> 182             self.run_event(callbacks=callbacks, event="on_epoch_end")
    183 
    184         self.run_event(callbacks=callbacks, event=f"on_{mode}_end")

~/miniconda3/lib/python3.6/site-packages/catalyst/dl/runner.py in run_event(self, callbacks, event)
     93         :param event:
     94         """
---> 95         getattr(self.state, f"{event}_pre")(state=self.state)
     96         for callback in callbacks.values():
     97             getattr(callback, event)(state=self.state)

~/miniconda3/lib/python3.6/site-packages/catalyst/dl/state.py in on_epoch_end_pre(state)
    146                 valid_loader=state.valid_loader,
    147                 main_metric=state.main_metric,
--> 148                 minimize=state.minimize_metric)
    149         valid_metrics = {
    150             key: value

~/miniconda3/lib/python3.6/site-packages/catalyst/dl/callbacks/utils.py in process_epoch_metrics(epoch_metrics, best_metrics, valid_loader, main_metric, minimize)
     27         if best_metrics is None \
     28         else (minimize != (
---> 29             valid_metrics[main_metric] > best_metrics[main_metric]))
     30     best_metrics = valid_metrics if is_best else best_metrics
     31     return best_metrics, valid_metrics, is_best

KeyError: 'loss'

I'm using these callbacks:

callbacks["loss"] = LossCallback()
callbacks["optimizer"] = OptimizerCallback()
callbacks["precision"] = PrecisionCallback(
    precision_args=[1])

callbacks["scheduler"] = SchedulerCallback(
    reduce_metric="precision01")

callbacks["saver"] = CheckpointCallback()
callbacks["logger"] = Logger()
callbacks["tflogger"] = TensorboardLogger()

Data and processing are taken from here:: https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html

[solved] IOU metric (IouCallback) is bigger than 1

I have been using IOUcallback with default parameter to train the Unet model:

runner.train(
    model=model, 
    main_metric = 'iou',
    minimize_metric = False,
    criterion=criterion,
    optimizer=optimizer,
    loaders=loaders,
    logdir=logdir,
    scheduler = scheduler,
    callbacks=[ 
        IouCallback(),
    ],
    num_epochs=num_epochs,
    verbose=True
)

As result metric log looks this way:

[2019-05-26 13:25:18,540] 
0/10 * Epoch 0 (train): _base/lr=0.0010 | _base/momentum=0.9000 | _timers/_fps=14.2914 | _timers/batch_time=0.4415 | _timers/data_time=0.3346 | _timers/model_time=0.1068 | iou=0.2843 | loss=1.2827
0/10 * Epoch 0 (valid): _base/lr=0.0010 | _base/momentum=0.9000 | _timers/_fps=49.2591 | _timers/batch_time=0.1080 | _timers/data_time=0.0271 | _timers/model_time=0.0809 | iou=0.6478 | loss=0.5063
[2019-05-26 13:25:50,207] 
1/10 * Epoch 1 (train): _base/lr=0.0001 | _base/momentum=0.9000 | _timers/_fps=12.3666 | _timers/batch_time=0.3442 | _timers/data_time=0.3376 | _timers/model_time=0.0065 | iou=0.5010 | loss=0.4697
1/10 * Epoch 1 (valid): _base/lr=0.0001 | _base/momentum=0.9000 | _timers/_fps=46.6509 | _timers/batch_time=0.0335 | _timers/data_time=0.0289 | _timers/model_time=0.0045 | iou=0.9374 | loss=-1.6844
[2019-05-26 13:26:21,913] 
2/10 * Epoch 2 (train): _base/lr=0.0001 | _base/momentum=0.9000 | _timers/_fps=12.1218 | _timers/batch_time=0.3469 | _timers/data_time=0.3409 | _timers/model_time=0.0059 | iou=0.5980 | loss=0.0623
2/10 * Epoch 2 (valid): _base/lr=0.0001 | _base/momentum=0.9000 | _timers/_fps=44.7021 | _timers/batch_time=0.0343 | _timers/data_time=0.0293 | _timers/model_time=0.0050 | iou=1.0979 | loss=-2.1567
[2019-05-26 13:26:52,914] 
3/10 * Epoch 3 (train): _base/lr=0.0001 | _base/momentum=0.9000 | _timers/_fps=12.3602 | _timers/batch_time=0.3504 | _timers/data_time=0.3443 | _timers/model_time=0.0061 | iou=0.5408 | loss=0.0550
3/10 * Epoch 3 (valid): _base/lr=0.0001 | _base/momentum=0.9000 | _timers/_fps=48.0609 | _timers/batch_time=0.0321 | _timers/data_time=0.0276 | _timers/model_time=0.0044 | iou=0.9644 | loss=-1.6202
[2019-05-26 13:27:24,687] 
4/10 * Epoch 4 (train): _base/lr=0.0001 | _base/momentum=0.9000 | _timers/_fps=12.0270 | _timers/batch_time=0.3544 | _timers/data_time=0.3484 | _timers/model_time=0.0059 | iou=0.7157 | loss=-0.8354
4/10 * Epoch 4 (valid): _base/lr=0.0001 | _base/momentum=0.9000 | _timers/_fps=45.5536 | _timers/batch_time=0.0348 | _timers/data_time=0.0301 | _timers/model_time=0.0046 | iou=0.9381 | loss=-1.9926
[2019-05-26 13:27:57,148] 
5/10 * Epoch 5 (train): _base/lr=0.0001 | _base/momentum=0.9000 | _timers/_fps=11.9987 | _timers/batch_time=0.3576 | _timers/data_time=0.3516 | _timers/model_time=0.0059 | iou=0.6653 | loss=-0.7207
5/10 * Epoch 5 (valid): _base/lr=0.0001 | _base/momentum=0.9000 | _timers/_fps=47.2183 | _timers/batch_time=0.0330 | _timers/data_time=0.0285 | _timers/model_time=0.0044 | iou=1.0183 | loss=-2.6983
[2019-05-26 13:28:28,836] 
6/10 * Epoch 6 (train): _base/lr=0.0001 | _base/momentum=0.9000 | _timers/_fps=12.2940 | _timers/batch_time=0.3451 | _timers/data_time=0.3389 | _timers/model_time=0.0061 | iou=0.7951 | loss=-1.6150
6/10 * Epoch 6 (valid): _base/lr=0.0001 | _base/momentum=0.9000 | _timers/_fps=45.1915 | _timers/batch_time=0.0339 | _timers/data_time=0.0290 | _timers/model_time=0.0048 | iou=1.3360 | loss=-5.5529
[2019-05-26 13:28:59,628] 
7/10 * Epoch 7 (train): _base/lr=0.0001 | _base/momentum=0.9000 | _timers/_fps=12.3576 | _timers/batch_time=0.3424 | _timers/data_time=0.3360 | _timers/model_time=0.0063 | iou=0.7155 | loss=-1.0354
7/10 * Epoch 7 (valid): _base/lr=0.0001 | _base/momentum=0.9000 | _timers/_fps=44.9911 | _timers/batch_time=0.0338 | _timers/data_time=0.0288 | _timers/model_time=0.0049 | iou=1.0820 | loss=-1.8514
[2019-05-26 13:29:30,871] 
8/10 * Epoch 8 (train): _base/lr=0.0001 | _base/momentum=0.9000 | _timers/_fps=12.5222 | _timers/batch_time=0.3394 | _timers/data_time=0.3337 | _timers/model_time=0.0057 | iou=0.6650 | loss=-0.8128
8/10 * Epoch 8 (valid): _base/lr=0.0001 | _base/momentum=0.9000 | _timers/_fps=49.4219 | _timers/batch_time=0.0318 | _timers/data_time=0.0272 | _timers/model_time=0.0046 | iou=1.2081 | loss=-6.1373
[2019-05-26 13:30:01,903] 
9/10 * Epoch 9 (train): _base/lr=0.0001 | _base/momentum=0.9000 | _timers/_fps=12.5503 | _timers/batch_time=0.3436 | _timers/data_time=0.3380 | _timers/model_time=0.0056 | iou=0.7887 | loss=-1.5298
9/10 * Epoch 9 (valid): _base/lr=0.0001 | _base/momentum=0.9000 | _timers/_fps=45.4443 | _timers/batch_time=0.0339 | _timers/data_time=0.0291 | _timers/model_time=0.0047 | iou=1.2653 | loss=-3.6054

As you can see some of them higher than 1, is it sum of IOU or I'm doing something wrong?

Several quickstart notebooks

Classification
Object-detection
Segmentation

[devops] Move dependencies to PipEnv

Now:

pip freeze -> requirements.txt - doesn't contains interpreter's version, hash sums e.t.c.
ToDo:
install pipenv
pipenv init && create Pipfile and Pipfile.lock
test

P.S. pipenv repo - https://github.com/pypa/pipenv

Redundant pooling layers

Global pooling layers are already easy to implement using adaptive pooling in pytorch: https://pytorch.org/docs/stable/nn.html#pooling-layers, so they can be removed from e.g.

https://github.com/Scitator/catalyst/blob/e55719e74717754c79d190e1aeeb6335bee24402/modules/pooling.py#L16

[feature] change precision metrics

The metrics in the function precision is not actually precision, rather an accuracy. Same for average_precision and mean_average_precision. More detailed here

Suggestion:

name functions properly as accuracy, average_accuracy, and mean_average_accuracy;
add correct precision and recall metrics.

[feature] add pytorch.tensorboard support

Use pytorch.tensorboard instead of tensorboardX.

type object 'UtilsFactory' has no attribute 'create_tflogger'

It looks like UtilsFactory is missing a method.
The error is in this line: https://github.com/Scitator/catalyst/blob/master/utils/factory.py#L50

[BUG] CheckpointCallback is wrong

Hi,
I did my experiment, I dont know why best checkpoints are not saved. I tried to pdb and found a bug.

Settings

metrics: map05
minimize_metric: False

Scenario

From opoch 0->10, valid map05 = 0 for all
From epoch 11, valid map05 starts increasing.
From epoch 11, the best checkpoints are not saved

Following are console logs and checkpoints folder screenshot
From epoch 0->10:

....
8/100 * Epoch (train): _fps=758.1664 | base/batch_time=0.0454 | base/data_time=0.0043 | base/lr=0.0001 | base/model_time=0.0411 | base/momentum=0.9000 | loss=28.4960 | map05=0.0000
8/100 * Epoch (valid): _fps=753.5475 | base/batch_time=0.0635 | base/data_time=0.0245 | base/lr=0.0001 | base/model_time=0.0390 | base/momentum=0.9000 | loss=30.4348 | map05=0.0000
9/100 * Epoch (train): 100% 133/133 [00:06<00:00, 22.16it/s, _fps=733.518, loss=26.232, map05=0.000]
9/100 * Epoch (valid): 100% 14/14 [00:00<00:00, 15.60it/s, _fps=787.242, loss=28.358, map05=0.000]
> /home/ngxbac/anaconda3/envs/general/lib/python3.6/site-packages/catalyst/dl/callbacks/base.py(83)save_checkpoint()
-> last_item = self.top_best_metrics.pop(-1)
(Pdb) checkpoint_metric
9
(Pdb) c
[2019-03-21 10:20:30,871] 
9/100 * Epoch (train): _fps=765.4649 | base/batch_time=0.0453 | base/data_time=0.0045 | base/lr=0.0001 | base/model_time=0.0408 | base/momentum=0.9000 | loss=27.7296 | map05=0.0000
9/100 * Epoch (valid): _fps=752.4676 | base/batch_time=0.0630 | base/data_time=0.0239 | base/lr=0.0001 | base/model_time=0.0391 | base/momentum=0.9000 | loss=29.9424 | map05=0.0000
10/100 * Epoch (train): 100% 133/133 [00:06<00:00, 20.10it/s, _fps=779.565, loss=24.555, map05=0.000]
10/100 * Epoch (valid): 100% 14/14 [00:00<00:00, 15.95it/s, _fps=822.962, loss=27.833, map05=0.000]
> /home/ngxbac/anaconda3/envs/general/lib/python3.6/site-packages/catalyst/dl/callbacks/base.py(83)save_checkpoint()
-> last_item = self.top_best_metrics.pop(-1)
(Pdb) c
[2019-03-21 10:20:46,505] 
10/100 * Epoch (train): _fps=758.8756 | base/batch_time=0.0458 | base/data_time=0.0047 | base/lr=0.0001 | base/model_time=0.0411 | base/momentum=0.9000 | loss=26.9246 | map05=0.0004
10/100 * Epoch (valid): _fps=754.0560 | base/batch_time=0.0616 | base/data_time=0.0225 | base/lr=0.0001 | base/model_time=0.0391 | base/momentum=0.9000 | loss=29.3501 | map05=0.0000
11/100 * Epoch (train): 100% 133/133 [00:06<00:00, 22.27it/s, _fps=785.593, loss=26.380, map05=0.000]
11/100 * Epoch (valid): 100% 14/14 [00:00<00:00, 14.73it/s, _fps=774.013, loss=27.053, map05=0.000]
...

From epoch 11:

11/100 * Epoch (train): _fps=761.4851 | base/batch_time=0.0455 | base/data_time=0.0045 | base/lr=0.0001 | base/model_time=0.0410 | base/momentum=0.9000 | loss=26.1523 | map05=0.0014
11/100 * Epoch (valid): _fps=751.8519 | base/batch_time=0.0668 | base/data_time=0.0277 | base/lr=0.0001 | base/model_time=0.0391 | base/momentum=0.9000 | loss=28.8740 | map05=0.0000
12/100 * Epoch (train): 100% 133/133 [00:06<00:00, 20.17it/s, _fps=765.323, loss=26.579, map05=0.000]
12/100 * Epoch (valid): 100% 14/14 [00:00<00:00, 15.92it/s, _fps=824.337, loss=26.422, map05=0.016]
> /home/ngxbac/anaconda3/envs/general/lib/python3.6/site-packages/catalyst/dl/callbacks/base.py(83)save_checkpoint()
-> last_item = self.top_best_metrics.pop(-1)
(Pdb) checkpoint_metric
0.0011160714285714285
(Pdb) c
[2019-03-21 10:21:46,089] 
12/100 * Epoch (train): _fps=758.7695 | base/batch_time=0.0456 | base/data_time=0.0046 | base/lr=0.0001 | base/model_time=0.0410 | base/momentum=0.9000 | loss=25.5722 | map05=0.0015
12/100 * Epoch (valid): _fps=762.2009 | base/batch_time=0.0616 | base/data_time=0.0229 | base/lr=0.0001 | base/model_time=0.0387 | base/momentum=0.9000 | loss=28.3761 | map05=0.0011
13/100 * Epoch (train): 100% 133/133 [00:06<00:00, 19.99it/s, _fps=783.159, loss=23.634, map05=0.000]
13/100 * Epoch (valid): 100% 14/14 [00:00<00:00, 16.25it/s, _fps=807.500, loss=26.013, map05=0.016]
> /home/ngxbac/anaconda3/envs/general/lib/python3.6/site-packages/catalyst/dl/callbacks/base.py(83)save_checkpoint()
-> last_item = self.top_best_metrics.pop(-1)
(Pdb) c
[2019-03-21 10:21:55,990] 
13/100 * Epoch (train): _fps=760.6421 | base/batch_time=0.0461 | base/data_time=0.0051 | base/lr=0.0001 | base/model_time=0.0410 | base/momentum=0.9000 | loss=24.9090 | map05=0.0021
13/100 * Epoch (valid): _fps=758.2683 | base/batch_time=0.0604 | base/data_time=0.0215 | base/lr=0.0001 | base/model_time=0.0389 | base/momentum=0.9000 | loss=27.9700 | map05=0.0011
14/100 * Epoch (train): 100% 133/133 [00:06<00:00, 22.07it/s, _fps=770.857, loss=26.029, map05=0.000]
14/100 * Epoch (valid): 100% 14/14 [00:00<00:00, 15.66it/s, _fps=811.955, loss=25.584, map05=0.031]
> /home/ngxbac/anaconda3/envs/general/lib/python3.6/site-packages/catalyst/dl/callbacks/base.py(83)save_checkpoint()
-> last_item = self.top_best_metrics.pop(-1)
(Pdb) c
[2019-03-21 10:22:08,043] 
14/100 * Epoch (train): _fps=762.4881 | base/batch_time=0.0452 | base/data_time=0.0042 | base/lr=0.0001 | base/model_time=0.0410 | base/momentum=0.9000 | loss=24.1805 | map05=0.0037
14/100 * Epoch (valid): _fps=752.5557 | base/batch_time=0.0627 | base/data_time=0.0236 | base/lr=0.0001 | base/model_time=0.0391 | base/momentum=0.9000 | loss=27.5076 | map05=0.0022
15/100 * Epoch (train): 100% 133/133 [00:06<00:00, 20.10it/s, _fps=760.488, loss=22.381, map05=0.000]
15/100 * Epoch (valid): 100% 14/14 [00:00<00:00, 16.09it/s, _fps=787.039, loss=25.456, map05=0.031]
> /home/ngxbac/anaconda3/envs/general/lib/python3.6/site-packages/catalyst/dl/callbacks/base.py(83)save_checkpoint()
-> last_item = self.top_best_metrics.pop(-1)
(Pdb) c
[2019-03-21 10:22:20,274] 
15/100 * Epoch (train): _fps=754.6090 | base/batch_time=0.0455 | base/data_time=0.0044 | base/lr=0.0001 | base/model_time=0.0411 | base/momentum=0.9000 | loss=23.6162 | map05=0.0060
15/100 * Epoch (valid): _fps=739.9028 | base/batch_time=0.0610 | base/data_time=0.0212 | base/lr=0.0001 | base/model_time=0.0398 | base/momentum=0.9000 | loss=27.1939 | map05=0.0033
16/100 * Epoch (train): 100% 133/133 [00:06<00:00, 19.69it/s, _fps=778.896, loss=24.231, map05=0.000]
16/100 * Epoch (valid): 100% 14/14 [00:00<00:00, 15.89it/s, _fps=807.296, loss=24.941, map05=0.031]
> /home/ngxbac/anaconda3/envs/general/lib/python3.6/site-packages/catalyst/dl/callbacks/base.py(83)save_checkpoint()
-> last_item = self.top_best_metrics.pop(-1)
(Pdb) c
[2019-03-21 10:22:45,175] 
16/100 * Epoch (train): _fps=756.6095 | base/batch_time=0.0466 | base/data_time=0.0054 | base/lr=0.0001 | base/model_time=0.0412 | base/momentum=0.9000 | loss=23.0421 | map05=0.0070
16/100 * Epoch (valid): _fps=749.7147 | base/batch_time=0.0618 | base/data_time=0.0225 | base/lr=0.0001 | base/model_time=0.0393 | base/momentum=0.9000 | loss=26.7421 | map05=0.0045
17/100 * Epoch (train): 100% 133/133 [00:06<00:00, 19.92it/s, _fps=771.987, loss=23.759, map05=0.000]
17/100 * Epoch (valid): 100% 14/14 [00:00<00:00, 15.72it/s, _fps=805.377, loss=24.638, map05=0.062]

Checkpoints folder:
Imgur

Problem

I found the problem is here

From epoch 0->11:

checkpoint_metric = 0
checkpoint_metric = checkpoint_metric or epoch
=> checkpoint_metric = epoch

checkpoint_metric now is a number greater than 1.

From epoch > 11:

checkpoint_metric = 0.xxxx 
checkpoint_metric = checkpoint_metric or epoch 
=> checkpoint_metric = 0.xxxx

checkpoint_metric now is a float number less than 1
=> The next best checkpoints will be removed after sorting.

Following is the log of self.top_best_metrics

[('/media/ngxbac/DATA/logs_aivivn/face-recognition/arcface/checkpoints//stage1.11.pth', 11), ('/media/ngxbac/DATA/logs_aivivn/face-recognition/arcface/checkpoints//stage1.10.pth', 10), ('/media/ngxbac/DATA/logs_aivivn/face-recognition/arcface/checkpoints//stage1.9.pth', 9), ('/media/ngxbac/DATA/logs_aivivn/face-recognition/arcface/checkpoints//stage1.8.pth', 8), ('/media/ngxbac/DATA/logs_aivivn/face-recognition/arcface/checkpoints//stage1.7.pth', 7), ('/media/ngxbac/DATA/logs_aivivn/face-recognition/arcface/checkpoints//stage1.18.pth', 0.006324404761904762)]

=> The last checkpoint is removed after

            last_item = self.top_best_metrics.pop(-1)
            last_filepath = last_item[0]
            os.remove(last_filepath)

[BUG] RL: Observation Buffer IndexError

Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "catalyst/rl/offpolicy/scripts/run_samplers.py", line 162, in run_sampler
    sampler.run()
  File "/home/fedor/catalyst/catalyst/rl/offpolicy/sampler.py", line 317, in run
    self.buffer.push_transition(transition)
  File "/home/fedor/catalyst/catalyst/rl/offpolicy/sampler.py", line 107, in push_transition
    self.observations[self.pointer + 1] = s_tp1
IndexError: index 5000 is out of bounds for axis 0 with size 5000

From the code it seems like there is nothing preventing from buffer overflow. I cant try to fix it if someone will approve that it's really that easy as it seems or tell me if I am missing something. I would just increase self.observations size by one.

UPD: Oh gosh it keeps iterating further, seems like buffer size is not the limit

Should this repo be 100% flake8 compliant?

Very few issues to address https://travis-ci.com/Scitator/prometheus/jobs/149428974#L465

[feature] python dependencies logging

For better experiment reproducibility:

add something like pip/conda freeze in the beginning of catalyst-dl run
or just after the logdir creation

sometimes package versions matters.

Out of memory during validation step

I got OOM error during validation step. Here is the log

0/10 * Epoch (train): 100% 32/32 [00:41<00:00,  2.39s/it, _fps=11.486, loss=0.898] 
0/10 * Epoch (valid):   3% 1/32 [00:01<00:46,  1.50s/it, _fps=21.531, loss=1.656]
Traceback (most recent call last):
....
RuntimeError: CUDA error: out of memory

My model uses only 80% of GPU during training. However, in validation step, It is out of memory. That is so weired since I though validation consums less memory than training.
I am not sure it is normal or not. But I guess, probably, GPU does not have time to release GPU before going to validation step.

I also tried to add some callback to freeze GPU:

class FreeGPU(Callback):

    def on_stage_start(self, state):
        torch.cuda.empty_cache()

    def on_loader_start(self, state):
        torch.cuda.empty_cache()

    def on_loader_end(self, state):
        torch.cuda.empty_cache()

    def on_stage_end(self, state):
        torch.cuda.empty_cache()

    def on_epoch_start(self, state):
        torch.cuda.empty_cache()

    def on_epoch_end(self, state):
        torch.cuda.empty_cache()

It does not help at all.
Do you have any ideas?.
P/S: It is not the first time I face to this problem. The only thing I can prevent it is reducing the batch size. But, It hurts the performance.

Refactor Algorithms

default/softmax/quantile as parameter
prepare_for trainer/sampler to algorithm class method
ALGO -> ALGORITHM (let's use full name)
ddpg/sac/td3 -> rl/offpolicy/algorithms
example

update how to contribute guide

Suggestion: Add some automated testing like Travis CI, Circle CI, Appveyor

They are all free for Open Source projects like this one.

https://github.com/marketplace/category/continuous-integration

Bug: ZeroDivisionError while computing FPS counter

Hi.

I've got some weird unhanded exception during simple train run:

  File "C:/Develop/pytorch-toolbelt/examples/canny-edge-detector-in-cnn/example_canny_cnn.py", line 127, in <module>
    #     metrics=["loss", "precision01", "precision03", "base/lr"])
  File "C:/Develop/pytorch-toolbelt/examples/canny-edge-detector-in-cnn/example_canny_cnn.py", line 117, in main
    ],
  File "C:\Anaconda3\envs\kaggle\lib\site-packages\catalyst\dl\experiments\runner.py", line 271, in train
    self.run_experiment(experiment, check=check)
  File "C:\Anaconda3\envs\kaggle\lib\site-packages\catalyst\dl\experiments\runner.py", line 194, in run_experiment
    self._run_stage(stage)
  File "C:\Anaconda3\envs\kaggle\lib\site-packages\catalyst\dl\experiments\runner.py", line 175, in _run_stage
    self._run_epoch(loaders)
  File "C:\Anaconda3\envs\kaggle\lib\site-packages\catalyst\dl\experiments\runner.py", line 160, in _run_epoch
    self._run_loader(loaders[loader_name])
  File "C:\Anaconda3\envs\kaggle\lib\site-packages\catalyst\dl\experiments\runner.py", line 132, in _run_loader
    self._run_event("batch_end")
  File "C:\Anaconda3\envs\kaggle\lib\site-packages\catalyst\dl\experiments\runner.py", line 97, in _run_event
    getattr(self.state, f"on_{event}_post")()
  File "C:\Anaconda3\envs\kaggle\lib\site-packages\catalyst\dl\state.py", line 153, in on_batch_end_post
    self._handle_runner_metrics()
  File "C:\Anaconda3\envs\kaggle\lib\site-packages\catalyst\dl\state.py", line 114, in _handle_runner_metrics
    self.batch_size / self.timer.elapsed["base/batch_time"]
ZeroDivisionError: float division by zero

Before it happened, training logs looked totally fine:

C:\Anaconda3\envs\kaggle\python.exe C:/Develop/pytorch-toolbelt/examples/canny-edge-detector-in-cnn/example_canny_cnn.py
0/100 * Epoch (train): 100% 87/87 [01:37<00:00,  2.28it/s, _fps=16383.000, jaccard=0.000, loss=0.068]
0/100 * Epoch (valid): 100% 10/10 [00:39<00:00,  4.53s/it, _fps=240.926, jaccard=0.000, loss=0.085]
[2019-04-03 14:35:00,371] 
0/100 * Epoch (train): _fps=13657.5198 | base/batch_time=0.7826 | base/data_time=0.7441 | base/lr=0.0010 | base/model_time=0.0385 | base/momentum=0.9000 | jaccard=0.0074 | loss=0.0802
0/100 * Epoch (valid): _fps=8236.0160 | base/batch_time=3.8731 | base/data_time=3.7668 | base/lr=0.0010 | base/model_time=0.1063 | base/momentum=0.9000 | jaccard=0.0001 | loss=0.0837
1/100 * Epoch (train): 100% 87/87 [01:36<00:00,  2.72it/s, _fps=16395.258, jaccard=0.001, loss=0.061]
1/100 * Epoch (valid): 100% 10/10 [00:37<00:00,  3.78s/it, _fps=16383.000, jaccard=0.004, loss=0.077]
[2019-04-03 14:37:15,192] 
1/100 * Epoch (train): _fps=13985.1989 | base/batch_time=0.8083 | base/data_time=0.8061 | base/lr=0.0010 | base/model_time=0.0022 | base/momentum=0.9000 | jaccard=0.0001 | loss=0.0642
1/100 * Epoch (valid): _fps=6082.4207 | base/batch_time=3.6415 | base/data_time=3.6382 | base/lr=0.0010 | base/model_time=0.0032 | base/momentum=0.9000 | jaccard=0.0033 | loss=0.0761
2/100 * Epoch (train): 100% 87/87 [01:34<00:00,  2.51it/s, _fps=16382.500, jaccard=0.186, loss=0.049]
2/100 * Epoch (valid): 100% 10/10 [00:37<00:00,  6.22s/it, _fps=16382.000, jaccard=0.188, loss=0.064]
[2019-04-03 14:39:29,013] 
2/100 * Epoch (train): _fps=13115.4018 | base/batch_time=0.7891 | base/data_time=0.7886 | base/lr=0.0010 | base/model_time=0.0005 | base/momentum=0.9000 | jaccard=0.0762 | loss=0.0538
2/100 * Epoch (valid): _fps=12309.1793 | base/batch_time=3.6892 | base/data_time=3.6892 | base/lr=0.0010 | base/model_time=0.0000 | base/momentum=0.9000 | jaccard=0.1834 | loss=0.0630
3/100 * Epoch (train): 100% 87/87 [01:35<00:00,  2.46it/s, _fps=16387.001, jaccard=0.267, loss=0.043]
3/100 * Epoch (valid): 100% 10/10 [00:37<00:00,  3.76s/it, _fps=16382.750, jaccard=0.273, loss=0.056]
[2019-04-03 14:41:42,811] 
3/100 * Epoch (train): _fps=13093.6248 | base/batch_time=0.7970 | base/data_time=0.7961 | base/lr=0.0010 | base/model_time=0.0009 | base/momentum=0.9000 | jaccard=0.2246 | loss=0.0447
3/100 * Epoch (valid): _fps=13118.9414 | base/batch_time=3.6233 | base/data_time=3.6233 | base/lr=0.0010 | base/model_time=0.0000 | base/momentum=0.9000 | jaccard=0.2659 | loss=0.0549
4/100 * Epoch (train): 100% 87/87 [01:36<00:00,  2.62it/s, _fps=16249.120, jaccard=0.354, loss=0.038]
4/100 * Epoch (valid): 100% 10/10 [00:39<00:00,  3.93s/it, _fps=16383.000, jaccard=0.356, loss=0.049]
[2019-04-03 14:43:58,902] 
4/100 * Epoch (train): _fps=13008.1172 | base/batch_time=0.8044 | base/data_time=0.8031 | base/lr=0.0010 | base/model_time=0.0014 | base/momentum=0.9000 | jaccard=0.3206 | loss=0.0382
4/100 * Epoch (valid): _fps=13161.2595 | base/batch_time=3.7907 | base/data_time=3.7907 | base/lr=0.0010 | base/model_time=0.0000 | base/momentum=0.9000 | jaccard=0.3512 | loss=0.0484
5/100 * Epoch (train): 100% 87/87 [01:36<00:00,  2.57it/s, _fps=16387.001, jaccard=0.437, loss=0.032]
5/100 * Epoch (valid): 100% 10/10 [00:37<00:00,  3.78s/it, _fps=16384.000, jaccard=0.431, loss=0.044]
[2019-04-03 14:46:13,698] 
5/100 * Epoch (train): _fps=14208.6617 | base/batch_time=0.8076 | base/data_time=0.8067 | base/lr=0.0010 | base/model_time=0.0009 | base/momentum=0.9000 | jaccard=0.4012 | loss=0.0333
5/100 * Epoch (valid): _fps=11602.1947 | base/batch_time=3.6362 | base/data_time=3.6362 | base/lr=0.0010 | base/model_time=0.0000 | base/momentum=0.9000 | jaccard=0.4264 | loss=0.0436
6/100 * Epoch (train): 100% 87/87 [01:36<00:00,  2.82it/s, _fps=16382.750, jaccard=0.477, loss=0.030]
6/100 * Epoch (valid): 100% 10/10 [00:37<00:00,  4.33s/it, _fps=19348.791, jaccard=0.481, loss=0.041]
[2019-04-03 14:48:28,650] 
6/100 * Epoch (train): _fps=14899.4017 | base/batch_time=0.8033 | base/data_time=0.8029 | base/lr=0.0010 | base/model_time=0.0004 | base/momentum=0.9000 | jaccard=0.4614 | loss=0.0299
6/100 * Epoch (valid): _fps=10905.9308 | base/batch_time=3.6788 | base/data_time=3.6784 | base/lr=0.0010 | base/model_time=0.0003 | base/momentum=0.9000 | jaccard=0.4788 | loss=0.0399
7/100 * Epoch (train): 100% 87/87 [01:37<00:00,  2.58it/s, _fps=16387.001, jaccard=0.509, loss=0.030]
7/100 * Epoch (valid): 100% 10/10 [00:38<00:00,  4.41s/it, _fps=16382.500, jaccard=0.506, loss=0.039]
[2019-04-03 14:50:45,999] 
7/100 * Epoch (train): _fps=14034.5155 | base/batch_time=0.8213 | base/data_time=0.8204 | base/lr=0.0010 | base/model_time=0.0009 | base/momentum=0.9000 | jaccard=0.5000 | loss=0.0278
7/100 * Epoch (valid): _fps=10034.7244 | base/batch_time=3.7788 | base/data_time=3.7787 | base/lr=0.0010 | base/model_time=0.0001 | base/momentum=0.9000 | jaccard=0.5043 | loss=0.0382
8/100 * Epoch (train): 100% 87/87 [01:36<00:00,  2.66it/s, _fps=16351.813, jaccard=0.533, loss=0.028]
8/100 * Epoch (valid): 100% 10/10 [00:40<00:00,  4.00s/it, _fps=16210.115, jaccard=0.527, loss=0.038]
[2019-04-03 14:53:03,136] 
8/100 * Epoch (train): _fps=13962.9568 | base/batch_time=0.8072 | base/data_time=0.8066 | base/lr=0.0010 | base/model_time=0.0006 | base/momentum=0.9000 | jaccard=0.5228 | loss=0.0264
8/100 * Epoch (valid): _fps=11596.8021 | base/batch_time=3.8571 | base/data_time=3.8570 | base/lr=0.0010 | base/model_time=0.0001 | base/momentum=0.9000 | jaccard=0.5253 | loss=0.0369
9/100 * Epoch (train): 100% 87/87 [01:36<00:00,  2.50it/s, _fps=16382.750, jaccard=0.535, loss=0.025]
9/100 * Epoch (valid): 100% 10/10 [00:37<00:00,  3.08s/it, _fps=16382.750, jaccard=0.541, loss=0.037]
[2019-04-03 14:55:18,959] 
9/100 * Epoch (train): _fps=13944.8145 | base/batch_time=0.8145 | base/data_time=0.8135 | base/lr=0.0010 | base/model_time=0.0008 | base/momentum=0.9000 | jaccard=0.5375 | loss=0.0256
9/100 * Epoch (valid): _fps=9995.8059 | base/batch_time=3.6704 | base/data_time=3.6704 | base/lr=0.0010 | base/model_time=0.0000 | base/momentum=0.9000 | jaccard=0.5390 | loss=0.0359
10/100 * Epoch (train): 100% 87/87 [01:35<00:00,  2.55it/s, _fps=16382.750, jaccard=0.560, loss=0.024]
10/100 * Epoch (valid):  80% 8/10 [00:37<00:12,  6.05s/it, _fps=16383.000, jaccard=0.552, loss=0.036]

For reference, train script: https://github.com/BloodAxe/pytorch-toolbelt/blob/feature/example-canny-cnn/examples/canny-edge-detector-in-cnn/example_canny_cnn.py

Environment:
OS: Windows 10
Python: 3.6
Catalyst: 19.3
PyTorch: 1.0.1

Multiple dataset source for tag2label

It will be nice if I can choose not one input dir here https://github.com/catalyst-team/catalyst/blob/master/catalyst/contrib/scripts/tag2label.py

For example it can looks like:
catalyst-contrib tag2label --in-dir=dataset1,dataset2

Can you provide some examples of reinforcement learning?

foe example : A pointer-generation model with Actor-Critic model.

Problem with plot metrics

Seems like need numpy update to >=1.16.0
and in requirements https://github.com/catalyst-team/catalyst/blob/master/requirements.txt#L2 i see 1.14.6

Problem was on 1.15.3 version, and update resolve problem (check PR)

Feature reauest - add n_steps in train for train loop debug

Some improvements of LRFinder

As typical use case for LRFinder just set some large value for final_lr, say 10, it would be convenient to stop iterating in case of divergence. And probably add default value for final_lr.
If this sounds good, I'll contribute.

[solved] inconsistence learning rate print

Hi guys!
I tried to run notebook example again (code: https://gist.github.com/Daiver/b4f9115a9e33a1ca233d0defbabee6d9 but i set verbose to False) on ubuntu machine

Script gives me following output

Python version 3.6.3 (default, Oct  3 2017, 21:45:48) 
[GCC 7.2.0]
Catalyst version: 0.6
Files already downloaded and verified
Files already downloaded and verified
[2019-01-17 11:39:03,741] 0 * Epoch (train) metrics: base/data_time: 0.00477 | base/batch_time: 0.00520 | base/sample_per_second: 6180.13137 | precision01: 41.76663 | precision03: 75.73776 | precision05: 89.51935 | lr: 0.00100 | momentum: 0.90000 | loss: 1.58647
[2019-01-17 11:39:03,741] 0 * Epoch (valid) metrics: base/data_time: 0.00465 | base/batch_time: 0.00503 | base/sample_per_second: 6365.33178 | precision01: 50.07987 | precision03: 81.99880 | precision05: 93.29073 | lr: 0.00100 | momentum: 0.90000 | loss: 1.37917
[2019-01-17 11:39:03,741] 

[2019-01-17 11:39:16,967] 1 * Epoch (train) metrics: base/data_time: 0.00477 | base/batch_time: 0.00519 | base/sample_per_second: 6168.41000 | precision01: 53.47689 | precision03: 84.46497 | precision05: 94.58573 | lr: 0.00100 | momentum: 0.90000 | loss: 1.28700
[2019-01-17 11:39:16,967] 1 * Epoch (valid) metrics: base/data_time: 0.00473 | base/batch_time: 0.00513 | base/sample_per_second: 6246.95064 | precision01: 55.16174 | precision03: 85.15375 | precision05: 94.75839 | lr: 0.00100 | momentum: 0.90000 | loss: 1.24638
[2019-01-17 11:39:16,967] 

[2019-01-17 11:39:30,171] 2 * Epoch (train) metrics: base/data_time: 0.00476 | base/batch_time: 0.00518 | base/sample_per_second: 6184.96553 | precision01: 58.61124 | precision03: 86.94818 | precision05: 95.64539 | lr: 0.00100 | momentum: 0.90000 | loss: 1.15438
[2019-01-17 11:39:30,171] 2 * Epoch (valid) metrics: base/data_time: 0.00476 | base/batch_time: 0.00516 | base/sample_per_second: 6210.20164 | precision01: 58.80591 | precision03: 87.11062 | precision05: 95.48722 | lr: 0.00100 | momentum: 0.90000 | loss: 1.14073
[2019-01-17 11:39:30,171] 

Epoch     3: reducing learning rate of group 0 to 5.0000e-04.
[2019-01-17 11:39:43,269] 3 * Epoch (train) metrics: base/data_time: 0.00472 | base/batch_time: 0.00514 | base/sample_per_second: 6235.06455 | precision01: 61.89419 | precision03: 88.56366 | precision05: 96.18722 | lr: 0.00100 | momentum: 0.90000 | loss: 1.07171
[2019-01-17 11:39:43,270] 3 * Epoch (valid) metrics: base/data_time: 0.00464 | base/batch_time: 0.00503 | base/sample_per_second: 6367.18977 | precision01: 61.06230 | precision03: 87.85942 | precision05: 95.62700 | lr: 0.00100 | momentum: 0.90000 | loss: 1.10485
[2019-01-17 11:39:43,270] 

[2019-01-17 11:39:56,491] 4 * Epoch (train) metrics: base/data_time: 0.00476 | base/batch_time: 0.00518 | base/sample_per_second: 6176.92439 | precision01: 66.00688 | precision03: 90.30910 | precision05: 97.08893 | lr: 0.00050 | momentum: 0.90000 | loss: 0.95941
[2019-01-17 11:39:56,491] 4 * Epoch (valid) metrics: base/data_time: 0.00474 | base/batch_time: 0.00513 | base/sample_per_second: 6240.70324 | precision01: 63.70807 | precision03: 89.10743 | precision05: 96.40575 | lr: 0.00050 | momentum: 0.90000 | loss: 1.03544
[2019-01-17 11:39:56,491] 

[2019-01-17 11:40:09,791] 5 * Epoch (train) metrics: base/data_time: 0.00480 | base/batch_time: 0.00523 | base/sample_per_second: 6126.94727 | precision01: 67.57038 | precision03: 91.11484 | precision05: 97.31086 | lr: 0.00050 | momentum: 0.90000 | loss: 0.91745
[2019-01-17 11:40:09,792] 5 * Epoch (valid) metrics: base/data_time: 0.00476 | base/batch_time: 0.00516 | base/sample_per_second: 6216.97721 | precision01: 63.73802 | precision03: 89.18730 | precision05: 96.29593 | lr: 0.00050 | momentum: 0.90000 | loss: 1.03911
[2019-01-17 11:40:09,792] 

Epoch     6: reducing learning rate of group 0 to 2.5000e-04.
[2019-01-17 11:40:23,136] 6 * Epoch (train) metrics: base/data_time: 0.00480 | base/batch_time: 0.00522 | base/sample_per_second: 6135.11402 | precision01: 68.87796 | precision03: 91.71065 | precision05: 97.54479 | lr: 0.00050 | momentum: 0.90000 | loss: 0.88481
[2019-01-17 11:40:23,136] 6 * Epoch (valid) metrics: base/data_time: 0.00483 | base/batch_time: 0.00523 | base/sample_per_second: 6131.59957 | precision01: 63.83786 | precision03: 89.16733 | precision05: 96.18610 | lr: 0.00050 | momentum: 0.90000 | loss: 1.03159
[2019-01-17 11:40:23,136] 

[2019-01-17 11:40:36,215] 7 * Epoch (train) metrics: base/data_time: 0.00471 | base/batch_time: 0.00512 | base/sample_per_second: 6255.08350 | precision01: 70.90931 | precision03: 92.46041 | precision05: 97.84269 | lr: 0.00025 | momentum: 0.90000 | loss: 0.82662
[2019-01-17 11:40:36,215] 7 * Epoch (valid) metrics: base/data_time: 0.00465 | base/batch_time: 0.00504 | base/sample_per_second: 6357.74667 | precision01: 64.66653 | precision03: 89.55671 | precision05: 96.43570 | lr: 0.00025 | momentum: 0.90000 | loss: 1.01120
[2019-01-17 11:40:36,216] 

[2019-01-17 11:40:49,560] 8 * Epoch (train) metrics: base/data_time: 0.00476 | base/batch_time: 0.00518 | base/sample_per_second: 6183.07309 | precision01: 71.50512 | precision03: 92.77031 | precision05: 97.91667 | lr: 0.00025 | momentum: 0.90000 | loss: 0.80761
[2019-01-17 11:40:49,560] 8 * Epoch (valid) metrics: base/data_time: 0.00497 | base/batch_time: 0.00540 | base/sample_per_second: 5983.79479 | precision01: 65.85463 | precision03: 89.98602 | precision05: 96.55551 | lr: 0.00025 | momentum: 0.90000 | loss: 0.99543
[2019-01-17 11:40:49,560] 

Epoch     9: reducing learning rate of group 0 to 1.2500e-04.
[2019-01-17 11:41:03,294] 9 * Epoch (train) metrics: base/data_time: 0.00495 | base/batch_time: 0.00540 | base/sample_per_second: 5959.02559 | precision01: 72.08093 | precision03: 93.00224 | precision05: 98.04063 | lr: 0.00025 | momentum: 0.90000 | loss: 0.79061
[2019-01-17 11:41:03,294] 9 * Epoch (valid) metrics: base/data_time: 0.00479 | base/batch_time: 0.00519 | base/sample_per_second: 6174.30215 | precision01: 64.95607 | precision03: 89.87620 | precision05: 96.50559 | lr: 0.00025 | momentum: 0.90000 | loss: 1.01265
[2019-01-17 11:41:03,294] 

Top best models:
./logs/cifar_simple_notebook/checkpoint.None.8.pth.tar	0.9954
./logs/cifar_simple_notebook/checkpoint.None.7.pth.tar	1.0112
./logs/cifar_simple_notebook/checkpoint.None.9.pth.tar	1.0127
./logs/cifar_simple_notebook/checkpoint.None.6.pth.tar	1.0316
./logs/cifar_simple_notebook/checkpoint.None.4.pth.tar	1.0354

Everything looks ok, except one thing

Epoch     3: reducing learning rate of group 0 to 5.0000e-04.
[2019-01-17 11:39:43,269] 3 * Epoch (train) metrics: base/data_time: 0.00472 | base/batch_time: 0.00514 | base/sample_per_second: 6235.06455 | precision01: 61.89419 | precision03: 88.56366 | precision05: 96.18722 | lr: 0.00100 | momentum: 0.90000 | loss: 1.07171

lr should be decreased (at epoch 3), but epoch (3) summary shows me old lr.
Summary of next epoch is ok.

[2019-01-17 11:39:56,491] 4 * Epoch (train) metrics: base/data_time: 0.00476 | base/batch_time: 0.00518 | base/sample_per_second: 6176.92439 | precision01: 66.00688 | precision03: 90.30910 | precision05: 97.08893 | lr: 0.00050 | momentum: 0.90000 | loss: 0.95941

Same stuff for other epochs

features support

Does it support multiple learning rates and tta?

[container] Simple dockerizatino without GPU support

Dockerfile with all required dependencies, build on ubuntu image
DockerIgnore file
Docker compose file for simplification
Docker hub repository creation
Docker readme

"float division by zero" exception during notebook-example on anaconda/windows

Hi guys, thank you for great project.

I tried to run notebook-example over my Anaconda environment and faced with "float division by zero exception"
This is a code which causes exception: https://gist.github.com/Daiver/b4f9115a9e33a1ca233d0defbabee6d9 (basically notebook-example copy-pasted inside one .py file)

This is stacktrace:

C:\Users\Daiver\Anaconda3\python.exe C:/Users/Daiver/PycharmProjects/untitled/main.py
Python version 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)]
Catalyst version: 0.6
Files already downloaded and verified
Files already downloaded and verified
0 * Epoch (train):   0% 1/1563 [00:00<15:06,  1.72it/s, base/batch_time=0.01562, base/data_time=0.01562, base/sample_per_second=2048.43759, loss=2.32288, lr=0.00100, momentum=0.90000, precision01=3.12500, precision03=18.75000, precision05=56.25000]Traceback (most recent call last):
  File "C:/Users/Daiver/PycharmProjects/untitled/main.py", line 113, in <module>
    main()
  File "C:/Users/Daiver/PycharmProjects/untitled/main.py", line 109, in main
    epochs=n_epochs, verbose=True)
  File "C:\Users\Daiver\PycharmProjects\untitled\catalyst\dl\runner.py", line 210, in train
    verbose=verbose
  File "C:\Users\Daiver\PycharmProjects\untitled\catalyst\dl\runner.py", line 159, in run
    self.run_event(callbacks=callbacks, event="on_batch_end")
  File "C:\Users\Daiver\PycharmProjects\untitled\catalyst\dl\runner.py", line 92, in run_event
    getattr(self.state, f"{event}_pre")(state=self.state)
  File "C:\Users\Daiver\PycharmProjects\untitled\catalyst\dl\state.py", line 203, in on_batch_end_pre
    state.batch_size / elapsed_time
ZeroDivisionError: float division by zero


Process finished with exit code 1

It can be fixed by adding zero check on elapsed_time but i have no idea, why elapsed_time is zero

My python/catalyst versions

Python version 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)]
Catalyst version: 0.6

Catalyst was installed by clonning current repo (master branch, last commit 892d5e5 "Merge pull request #56 from dbrainio/master")

[solved] How to train with KFold

Thank for your awesome library.
I am wondering if this library supports training with KFold Cross-validation?

catalyst-team / catalyst Goto Github PK

catalyst's Introduction

Getting started

Step-by-step Guide

Table of Contents

Overview

Installation

Documentation

Minimal Examples

Tests

Community

Accelerated with Catalyst

Contribution Guide

User Feedback

Acknowledgments

Catalyst.Team

Catalyst.Contributors

Trusted by

Citation

catalyst's People

Contributors

Stargazers

Watchers

Forkers

catalyst's Issues

Settings

Scenario

Problem

Recommend Projects

Recommend Topics

Recommend Org