Binary-Dataset-Splitter

Creates a file structure to split a dataset into train/test/validation for binary classification (malign/benign, in this script).

I wrote this script for some simple ML tasks in my machine learning class, and it helped me learn a bit about file manipulation with Python. It is not intended to be a super-useful utility for ML practitioners. In fact, the more I think about it, the less sense it makes.

This script should NOT be used for cross-validation or K-fold cross-validation, which are the most popular methods and are the recommended approach. See https://towardsdatascience.com/train-validation-and-test-sets-72cb40cba9e7

Popular ML libraries such as TensorFlow already support cross-validation, e.g. https://medium.com/fenwicks/tutorial-5-cross-validation-with-tensorflow-flowers-34f7ac36230b
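
For comparison with the fixed split this script produces, K-fold cross-validation rotates which part of the data is held out. The following is only a minimal pure-Python sketch of that idea. The function name k_fold_indices is hypothetical and is not part of this script or of any library.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) index lists for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Slicing with a stride of k gives k folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val_idx = folds[i]
        # Every fold except the i-th one is used for training in this round.
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, val_idx


for train_idx, val_idx in k_fold_indices(10, 5):
    print(len(train_idx), len(val_idx))
```

Each sample lands in the held-out fold exactly once, which is the main difference from a single train/test/validation split.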

Usage

Run the script in a folder that contains only the script itself and the folders all_benign and all_malign, which hold the samples. Optionally, if you have created data augmentation samples, place them in a folder named augmented inside each of all_benign and all_malign.

The script splits the samples into train/test/validation and also generates a .csv file for each split directory, listing every file in it.
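
As a rough illustration of the split-and-list step, here is a minimal sketch. The 70/15/15 proportions, the one-name-per-row CSV layout, and the helper names split_names and write_listing are all assumptions made for this example. The actual script may use different ratios and a different CSV format.

```python
import csv
import random
from pathlib import Path


def split_names(names, train_frac=0.7, test_frac=0.15, seed=0):
    """Shuffle file names and partition them into train/test/validation."""
    names = sorted(names)  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(names)
    n_train = int(len(names) * train_frac)
    n_test = int(len(names) * test_frac)
    return {
        "train": names[:n_train],
        "test": names[n_train:n_train + n_test],
        # Whatever is left over goes to validation.
        "validation": names[n_train + n_test:],
    }


def write_listing(csv_path, names):
    """Write one file name per row to csv_path (assumed layout)."""
    csv_path = Path(csv_path)
    csv_path.parent.mkdir(parents=True, exist_ok=True)
    with csv_path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        for name in names:
            writer.writerow([name])
```

For the benign class, one could pass the file names found in all_benign to split_names and then call write_listing once per split, for example with the path code/benign_train.csv for the train list.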

Note that when this script runs, it REMOVES any existing folders named code and split_samples in the working directory. Do not keep important data in them, as it will be deleted and replaced by the script's output!
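
The clean-up described above could be sketched roughly as follows. This is an assumption about how such a reset might be written, not the script's actual code, and reset_output_dirs is a hypothetical name.

```python
import shutil
from pathlib import Path


def reset_output_dirs(base=".", names=("code", "split_samples")):
    """Delete and recreate the output folders under base."""
    base = Path(base)
    created = []
    for name in names:
        target = base / name
        if target.is_dir():
            # Removes the folder and everything inside it, without asking.
            shutil.rmtree(target)
        target.mkdir()
        created.append(target)
    return created
```

Because shutil.rmtree does not ask for confirmation, it is worth checking the working directory before running anything like this.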

Also, if augmentation is enabled and an augmented file has the same name as a file in the main folder, the non-augmented file takes precedence. For example, suppose an augmented file named 1.png exists, and so does a non-augmented file named 1.png. If both are sent to the train directory, the one that ends up in the directory will be the NON-AUGMENTED version!
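
One way to get this precedence is to copy the augmented files first and the non-augmented files second, so that a later copy with the same name overwrites the earlier one. The sketch below assumes that approach. The script itself may implement it differently, and copy_with_precedence is a hypothetical helper name.

```python
import shutil
from pathlib import Path


def copy_with_precedence(augmented_files, original_files, dest_dir):
    """Copy files into dest_dir so that originals win on name clashes."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    # Augmented files go first; an original with the same file name is
    # copied afterwards and overwrites the augmented copy.
    for src in list(augmented_files) + list(original_files):
        src = Path(src)
        shutil.copy2(src, dest_dir / src.name)
```

Giving augmented files a distinct prefix or suffix avoids the clash entirely.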

Example directory before starting:

- all_malign
    - sample_1.png
    - sample_2.png
- all_benign
    - sample_1.png
    - sample_2.png
- this script (binary_dataset_splitter.py)

And the resulting directory after running this script:

- all_malign
    - sample_1.png
    - sample_2.png
- all_benign
    - sample_1.png
    - sample_2.png
- this script (binary_dataset_splitter.py)
- code
    - benign_test.csv
    - benign_train.csv
    - benign_validation.csv
    - malign_test.csv
    - malign_train.csv
    - malign_validation.csv
- split_samples
    - test
        - benign
            - some samples...
        - malign
            - some samples...
    - train
        - benign
            - some samples...
        - malign
            - some samples...
    - validation
        - benign
            - some samples...
        - malign
            - some samples...

TODO - implement below features if necessary in the future

  • Speed up the script
  • Add print statements to indicate progress
  • Use a config file (text file or YAML) instead of directly editing this file
  • Change the hard-coded labels benign and malign to user-defined labels
  • Extend capabilities for multiple classes instead of just binary

Contributors

jessedeppisch
