Giter VIP home page Giter VIP logo

ml2021-coreset-selection's Introduction

Core-Set Selection for Effective Data Selection

This repository contains the source code of the Skoltech ML2021 course project "Core-Set Selection for Effective Data Selection".

Abstract

Core-set selection techniques aim to select a subset of a given training data set, with the goal of maximizing the subsets utility in training a neural network. In this repository, we analyze the core-set selection methods GLISTER-ONLINE and Greedy K-Centers and compare them to a random core-set selection as the baseline quality measure. We replicated the results of previous papers and studied how well the produced core-sets generalize across different deep neural network architectures. In our findings, the tested core-set algorithms produced core-sets that were well generalizable across multiple neural networks. Furthermore, no significant accuracy gains could be achieved by using GLISTER-ONLINE or Greedy K-Centers over random core-set sampling.

Structure and contents

  • src/
    • greedy_k_centers contains the implementation of the Greedy K-Centers algorithm taken from the original paper we based our work on, and added some utility functions.
    • Glister contains the implementation of all GLISTER algorithms. We extracted one solution from the code of the original paper and implemented one from pseudocode given in the original paper. We also added some utility functions.
  • dataset/ contains a data manager. Since we are using CIFAR-10 throughout this repository, we unified access to the data set and wrote a data manager for maintaining access to the data set, so we don't have copies stored in every folder.
  • results/ contains the main results of our measurements, together with visualization code that we implemented in jupyter notebooks.
  • submodules/ contains other repositories we used, linked as submodules to this repo.

Getting started

Dependencies

In order to use the repository, you need to have the following packages installed:

  • pytorch 1.8.0
  • torchvision 0.9.0

It is also highly recommended to use CUDA GPU cores for most of the tasks described below, as they are computationally very demanding.

Creating the latent space from CIFAR-10 for Greedy K-Centers

To do this, make sure the model weights in submodules/PyTorch_CIFAR10 are downloaded. You can go there and execute the download_weights.py script to download them.

Go to src/greedy_k_centers and execute the following:

>>> from latent_space_generator import generate_latent_space
>>> generate_latent_space(dataset_type='fullset')

With the parameter, either 'fullset' or 'subset' can be selected. This creates the latent space either from the full CIFAR-10 dataset (50'000 samples), or from the subset (20'000 samples). The result is saved in a CSV file in the same directory.

The resulting file will be several hundets of Megabytes in size, that's why it is not included in this repository. It can be might be added upon request.

Creating core-set indices

Greedy K-Centers

To do this, make sure you have the latent space created.

Go to src/greedy_k_centers and execute the following:

>>> from k_centers_coreset_generator import generate_coreset_indices
>>> generate_coreset_indices(dataset_type='fullset', frac_of_full_set=0.5)

As with the latent space creation, you can either create the core-set for the full CIFAR-10 data set (50'000), or for the data subset (20'000).

The frac_of_full_set parameter describes the size of the core-set with regards to the full dataset (or subset).

GLISTER

Go to src/Glister/GlisterImage and execute the following:

>>> from glister_coreset_generator import generate_coreset_indices
>>> generate_coreset_indices(dataset_type='subset', frac_of_full_set=0.1)

Again, you can choose between fullset and subset. In our experience, we only used subset, as generating the GLISTER indices on the full set was unfeasible. Also, in GLISTER, frac_of_full_set only allows 3 settings: 0.1, 0.3 and 0.5.

For both Greedy K-Centers and GLISTER, the resulting core-sets will be saved as CSV files directly in the same directory.

Training models

If you have generated the core-set indices of either Greedy K-Center, GLISTER, or both, you can train models and check the achieved accuracy.

For this, the models from the submodules/PyTorch_CIFAR10 submodule will be used.

Go to src/experiments/models_generalization and execute the following:

>>> from training_utilities import train_and_save_model
>>> train_and_save_model(model_name='resnet', coreset_selector='k-centers', coreset_percentage='0.1', trainset_size='subset', device='cuda')
  • model_name lets you choose between resnet, mobilenet, vgg and densenet
  • coreset_selector accepts glister, k-centers and random
  • coreset_percentage accepts 0.1, 0.3 and 0.5
  • trainset_size can be fullset or subset, as above
  • device lets you select the device for the training. We would strongly recommend only executing this command if you have a CUDA core available.

This function trains the model in a very verbose way. The resulting accuracies over the 100 epochs are logged in accuracy_results/.

Working with Jupyter Notebooks

Some jupyter notebooks are sprinkled across this repository, as we like to work with them for visualization and interactive purposes. If you want to use them, we'd suggest launching the notebook in the root of the repository, as to not break any path dependencies that might be there.

Report

The report for this project can be found here:

https://www.overleaf.com/7232121456hvjtgyvnfwcd

Project team

  • Julia Tukmacheva
  • Vladimir Omelyusik
  • Roland Konlechner

Relevant papers

ml2021-coreset-selection's People

Contributors

juniperjey avatar rolkon avatar v-marco avatar

Watchers

 avatar

Forkers

juniperjey

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.