icedata's Introduction

Datasets Hub for the IceVision Framework


Note: We Need Your Help! If you find this work useful, please let other people know by starring and sharing it. Thank you!



Contributors

Documentation

Installation

pip install icedata

For more installation options, check our extensive documentation.

Important: We currently only support Linux/MacOS.

Why IceData?

  • IceData is a dataset hub for the IceVision Framework

  • It includes community maintained datasets and parsers and has out-of-the-box support for common annotation formats (COCO, VOC, etc.)

  • It provides an overview of each included dataset with a description, an annotation example, and other helpful information

  • It makes end-to-end training straightforward thanks to IceVision's unified API

  • It enables practitioners to get moving with object detection technology quickly

Datasets

Source

The Datasets class is designed to simplify loading and parsing a wide range of computer vision datasets.

Main Features:

  • Caches data so you don't need to download it over and over

  • Lightweight and fast

  • Transparent and pythonic API

  • Out-of-the-box parsers convert common dataset annotation formats into the unified IceVision Data Format

IceData provides several ready-to-use datasets that use common annotation formats such as COCO and VOC, as well as custom formats such as the WheatParser used in the Kaggle Global Wheat Competition.
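Because downloads are cached, loading a dataset twice only downloads once. A minimal sketch using the fridge dataset (shown in full later in this README):

import icedata

# The first call downloads the dataset into the cache directory
data_dir = icedata.fridge.load()

# Subsequent calls reuse the cached copy and return immediately
data_dir = icedata.fridge.load()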

Usage

Object detection datasets use multiple annotation formats (COCO, VOC, and others). IceVision makes it easy to work across all of them with parsers that are easy to use and to extend.

COCO and VOC compatible datasets

For COCO or VOC compatible datasets - especially ones that are not included in IceData - it is easiest to use the IceData COCO or VOC parser.

Example: Raccoon - a dataset using the VOC parser

# Imports
from icevision.all import *
import icedata


# WARNING: Make sure you have already cloned the raccoon dataset so that the paths below exist
# Set images and annotations directories
data_dir = Path("raccoon_dataset")
images_dir = data_dir / "images"
annotations_dir = data_dir / "annotations"

# Define the class_map
class_map = ClassMap(["raccoon"])

# Create a parser for dataset using the predefined icevision VOC parser
parser = parsers.voc(
    annotations_dir=annotations_dir, images_dir=images_dir, class_map=class_map
)

# Parse the annotations to create the train and validation records
train_records, valid_records = parser.parse()
show_records(train_records[:3], ncols=3, class_map=class_map)

!!! info "Note" Notice how we use the predifined parsers.voc() function:

**parser = parsers.voc(
annotations_dir=annotations_dir, images_dir=images_dir, class_map=class_map
)**

Datasets included in IceData

Datasets included in IceData always have their own parser. It can be invoked with icedata.datasetname.parser(...).

Example: The IceData Fridge dataset

Please check out the fridge folder for more information on how this dataset is structured.

# Imports
from icevision.all import *
import icedata

# Load the Fridge Objects dataset
data_dir = icedata.fridge.load()

# Get the class_map
class_map = icedata.fridge.class_map()

# Parse the annotations
parser = icedata.fridge.parser(data_dir, class_map)
train_records, valid_records = parser.parse()

# Show images with their boxes and labels
show_records(train_records[:3], ncols=3, class_map=class_map)

!!! info "Note" Notice how we use the parser associated with the fridge dataset icedata.fridge.parser():

**parser = icedata.fridge.parser(data_dir, class_map)**

Datasets with a new annotation format

Sometimes you will need to define a new annotation format for your dataset. Additional information can be found in the documentation. In this case, we strongly recommend following the file structure and naming conventions used in examples such as the Fridge dataset or the PETS dataset.


Disclaimer

Inspired by the excellent HuggingFace Datasets project, icedata is a utility library that downloads and prepares computer vision datasets. We do not host or distribute these datasets, vouch for their quality or fairness, or claim that you have a license to use them. It is your responsibility to determine whether you have permission to use a dataset under its license.

If you are a dataset owner and wish to update any of the information in IceData (description, citation, etc.), or do not want your dataset to be included, please get in touch through a GitHub issue. Thanks for your contribution to the ML community!

If you are interested in learning more about responsible AI practices, including fairness, please see Google AI's Responsible AI Practices.

icedata's People

Contributors

adamfarquhar, ai-fast-track, buckley-w-david, burntcarrot, davanstrien, frapochetti, fstroth, ganesh3, jpoberhauser, lgvaz, ribenamaplesyrup, yrodriguezmd


icedata's Issues

Add COCO, VOC, and Birds README

📓 Documentation Update

What part of documentation was unclear or wrong?
Add the COCO, VOC, and Birds README, and create their corresponding documentation

Rename parsers.py to parser.py

The old datasets use the parsers.py convention but the template uses parser.py (which I believe is a better name)

So we need to rename the files to parser.py and update their respective __init__.py file.

Add a destination directory to `load_data()`

🚀 Feature

Is your feature request related to a problem? Please describe.
Presently, load_data() automatically saves the downloaded data to the /root/.icevision/data directory. The new feature allows the user to choose the destination directory.

Describe the solution you'd like
Add a new argument called dest_dir to load_data() to let the user choose the destination directory. dest_dir should default to None in order to preserve compatibility with the existing API. If the user does not provide the dest_dir argument (dest_dir=None), load_data() automatically saves the downloaded data to the /root/.icevision/data directory.
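A minimal sketch of what this could look like (the signature and download logic below are assumptions, not the final API):

from pathlib import Path

def load_data(url, name, dest_dir=None):
    # Preserve the current behavior when no destination is given
    if dest_dir is None:
        dest_dir = Path.home() / ".icevision" / "data"
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    # ... download `url` into dest_dir / name as load_data() does today ...
    return dest_dir / name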

Updates install guide with poetry

📓 Documentation Update

Update the installation instructions and other relevant parts of the documentation with poetry.

For this library, it's important to use a poetry version > 1.1.x which is currently only in pre-release.

To update poetry to this version do:

poetry self update --preview

Add a README template

📓 Documentation Update

We need a README template that will be added when generating a new dataset using the template generator.

Simple automatic tests for templates

🚀 Feature

We will eventually have automatic tests that check the validity of the data structure, whether the download link works, etc.

But let's start with simple tests: checking that a newly added dataset can be imported and that all the necessary functions are present is enough for now.
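A minimal sketch of such a test (pytest assumed; the exact per-dataset function names may differ):

import importlib
import pytest

DATASETS = ["fridge", "pennfudan"]  # extend as datasets are added

@pytest.mark.parametrize("name", DATASETS)
def test_dataset_is_importable(name):
    module = importlib.import_module(f"icedata.datasets.{name}")
    # Every dataset should expose the standard entry points
    for fn in ("load_data", "class_map", "parser"):
        assert hasattr(module, fn)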

voc parser

๐Ÿ› Bug

When running the starter code for the VOC dataset, an error occurs at the parsing step.

To Reproduce
Steps to reproduce the behavior:

  1. Go to https://airctic.github.io/icedata/examples/voc_exp/
  2. Click on '....'
  3. Scroll down to '....'
  4. See error

Expected behavior

Screenshots
If applicable, add screenshots to help explain your problem.

Desktop (please complete the following information):
Mac

Additional context
Add any other context about the problem here.

Template generation for new datasets

🚀 Feature

Automate the process for adding a new dataset.

All datasets follow a very similar structure, a simple script can be made to automatically generate the initial skeleton for implementing a new dataset.

Update notebooks in docs

📓 Documentation Update

We need to update the notebooks found in the current version of the icedata documentation

Slow import times

🚀 Feature

Importing icedata takes a lot of time. This happens because internally we are importing unnecessary stuff from icevision (heavy libraries like pytorch).

I think the problem mostly comes from icevision.imports.

It might be a good idea to curate our own imports in icedata.imports to alleviate this issue.
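One common pattern for this (a sketch, not icedata's actual code) is to defer heavy imports until first use:

import importlib

def _lazy(name):
    # Return a zero-argument loader that imports `name` on first call
    module = None

    def get():
        nonlocal module
        if module is None:
            module = importlib.import_module(name)
        return module

    return get

# torch is imported only when get_torch() is actually called
get_torch = _lazy("torch")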

Add PETS README

📓 Documentation Update

Add PETS README as well as its corresponding documentation

Add docs to dataset template generator

🚀 Feature

Automatically create/update the necessary documentation files when using the generator.

autogen.py
Automate the README copying; currently we have to copy each file individually (see the sketch after the mkdocs.yml snippet below):

    # Copy Birds README
    shutil.copyfile(icedata_dir / "icedata/datasets/birds/README.md", dest_dir / "birds.md")

    # Copy COCO README
    shutil.copyfile(icedata_dir / "icedata/datasets/coco/README.md", dest_dir / "coco.md")

mkdocs.yml
Update the Datasets section:

  - Datasets:
    - Birds: birds.md
    - COCO: coco.md
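A possible automation sketch (assuming icedata_dir and dest_dir are defined as in autogen.py today):

import shutil

# Copy every per-dataset README into the docs folder in one loop
for name in ("birds", "coco"):
    src = icedata_dir / "icedata" / "datasets" / name / "README.md"
    shutil.copyfile(src, dest_dir / f"{name}.md")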

Automatically generates sample data

🚀 Feature

When creating a new dataset it's currently necessary to manually add data to sample_data. Can we somehow create a script that automates the process?

I'm not sure this is possible for all annotation formats, but we can start with the most common ones: COCO and VOC.
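For COCO, such a script could simply trim an annotation file down to its first few images. A sketch (file paths are hypothetical):

import json
from pathlib import Path

def make_coco_sample(src_json, dst_json, n_images=5):
    coco = json.loads(Path(src_json).read_text())
    images = coco["images"][:n_images]
    image_ids = {img["id"] for img in images}
    annotations = [a for a in coco["annotations"] if a["image_id"] in image_ids]
    # Keep categories and metadata, but only the sampled images/annotations
    sample = {**coco, "images": images, "annotations": annotations}
    Path(dst_json).write_text(json.dumps(sample))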

pip install is not working

๐Ÿ› Bug

Describe the bug
icedata cannot be pip installed.

Solution
Add setup.py and settings.ini

Fastcore 1.1 breaking changes

๐Ÿ› Bug

Describe the bug
Fastcore 1.1 contains a breaking change described here:

  • Remove Path.{read,write} (use Path.{read_text,write_text} instead) and change Path.{load,save} to functions load_pickle and save_pickle (#121)

We need to update our calls to Path.read to fix this issue.
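In practice the fix is a mechanical rename (a sketch; the actual call sites in icedata may differ):

from pathlib import Path

p = Path("annotations.json")
p.write_text("{}")    # plain pathlib, replaces the removed fastcore Path.write
text = p.read_text()  # plain pathlib, replaces the removed fastcore Path.read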

Add Fridge README

📓 Documentation Update

Add Fridge README as well as its corresponding documentation

How to filter predictions of specified classes

📓 New <Tutorial/Example>

Is this a request for a tutorial or for an example?
Tutorial.

What is the task?
Filter predictions of object detection model according to specified classes.

Is this example for a specific model?
FasterRCNN

Is this example for a specific dataset?
Dataset is COCO.
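Until such a tutorial exists, a generic sketch of the idea (the flat labels/bboxes/scores structure below is assumed for illustration, not icevision's actual prediction API):

# Keep only detections whose class label is in `wanted`
def filter_detections(labels, bboxes, scores, wanted):
    keep = [i for i, label in enumerate(labels) if label in wanted]
    return (
        [labels[i] for i in keep],
        [bboxes[i] for i in keep],
        [scores[i] for i in keep],
    )

labels, bboxes, scores = filter_detections(
    ["cat", "dog", "cat"],
    [(0, 0, 10, 10), (5, 5, 20, 20), (1, 1, 8, 8)],
    [0.9, 0.8, 0.7],
    wanted={"cat"},
)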


Don't remove
Main issue for examples: #39

Update README

📓 Documentation Update

Update README by replacing icevision references with icedata ones.

Test package with icevision master

Icevision and icedata versions go hand in hand; some changes made to icedata will depend on changes to icevision that have not yet been released, so we also need to test against icevision master.

Add pennfudan README

📓 Documentation Update

Add pennfudan README as well as its corresponding documentation

Add documentation

📓 Documentation Update

Add the documentation first draft using MkDocs

Allowing custom path for `icedata.load_data()`

🚀 Feature

  • Currently, icedata.load_data() calls the get_data_dir() function from icevision, which always returns Path.home()/".icevision"/"data". It would be helpful if we could choose a different location to store the datasets.

ssl error when trying to download the model weights for mmdet.fcos

๐Ÿ› Bug

Describe the bug
An SSL error occurs while trying to download the weights of the mmdet.fcos model, like so:

SSLError: HTTPSConnectionPool(host='openmmlab.oss-cn-hangzhou.aliyuncs.com', port=443): Max retries exceeded with url: /mmdetection/v2.0/fcos/fcos_r101_caffe_fpn_gn-head_1x_coco/fcos_r101_caffe_fpn_gn-head_1x_coco-0e37b982.pth (Caused by SSLError(SSLError(1, '[SSL] unknown error (_ssl.c:1123)')))

To Reproduce
Steps to reproduce the behavior:

  1. model_type = models.mmdet.fcos
  2. backbone = model_type.backbones.resnet101_caffe_fpn_gn_head_1x_coco
  3. model = model_type.model(backbone=backbone(pretrained=True), num_classes=len(parser.class_map), **extra_args)

Expected behavior
Downloading of the weights and instantiation of the model

Screenshots
If applicable, add screenshots to help explain your problem.

Desktop (please complete the following information):

  • OS: [e.g. ubuntu 18.04]

Additional context
Add any other context about the problem here.

Fix CI workflow: mk-docs-build.yml

๐Ÿ› Bug

Describe the bug
The documentation is not deployed.

To Reproduce
  1. Make some changes to the documentation (.md files)
  2. Merge a pull request
  3. Check whether the documentation is built and deployed

Expected behavior
The documentation should be built and deployed.

Add a Parser CheatSheet Document

📓 Documentation Update

What part of documentation was unclear or wrong?
The goal is to create a resource that will help both beginners and advanced users easily create their parsers by providing frequently used code snippets and best practices. The code snippets are divided into sections:

Section 1: file-related code snippets
Section 2: parsing-related code snippets
Useful example

Custom Parser

Use template generator

The first step is to create a class that inherits from these smaller building blocks:

This is just an example; choose the mixins that are relevant to your use case:

class WheatParser(parsers.FasterRCNN, parsers.FilepathMixin, parsers.SizeMixin):
    pass

We use a method called generate_template that will print out all the necessary methods we have to implement.

WheatParser.generate_template()

Output:

def __iter__(self) -> Any:
def height(self, o) -> int:
def width(self, o) -> int:
def filepath(self, o) -> Union[str, Path]:
def bboxes(self, o) -> List[BBox]:
def labels(self, o) -> List[int]:
def imageid(self, o) -> Hashable:

If, for example, all the images are .jpg and located in the data_dir folder, the image_paths attribute will be set as follows:

def __init__(self, data_dir):
    self.image_paths = get_files(data_dir, extensions=[".jpg"])

File-related code snippets

Let's suppose we have the following fname variable:

fname = Path("PennFudanPed/PNGImages/FudanPed00002.png")

fname                       # PennFudanPed/PNGImages/FudanPed00002.png
fname.exists()              # True
fname.with_suffix(".txt")   # PennFudanPed/PNGImages/FudanPed00002.txt
fname.stem                  # FudanPed00002

Parsing-related code snippets

Read a CSV file using pandas

import pandas as pd
df = pd.read_csv("path/to/csv/file")
df.head() # or df.sample()

Then the bboxes attribute will be created this way:

import numpy as np

bbox = "[834.0, 222.0, 56.0, 36.0]"
xywh = np.fromstring(bbox[1:-1], sep=",")
# Output: array([834., 222., 56., 36.])

Coordinates as text with separators:

label = "2 0.527267 0.702972 0.945466 0.467218"
xywh = np.fromstring(label, sep=" ")[1:]
# Output: array([0.527267, 0.702972, 0.945466, 0.467218])

Update examples in docs

📓 Documentation Update

What part of documentation was unclear or wrong?
We need to update the examples found in the current version of the icedata documentation

icedata.coco.parser is a module not a function

📓 Documentation Update

What part of documentation was unclear or wrong?
I think the docs for icedata are incorrect, based on the version that is installed when icevision is installed from master with the bash script installer.

When I try this from the main front page https://airctic.github.io/icedata/coco/:

# COCO parser: provided out-of-the-box
parser = icedata.coco.parser(data_dir=path, class_map=class_map)

I get this error

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_15591/3116397168.py in <module>
      1 import icedata
      2 
----> 3 icedata.coco.parser(data_dir=f'{path}/instances_slick_train.json', class_map=class_map)

TypeError: 'module' object is not callable

Describe the solution you'd like
I think the correct way to load annotations is defined in this test file: https://github.com/airctic/icevision/blob/f20a938956663d1aa8a17320caed815830dc4cd0/tests/parsers/test_coco_parser.py
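Until the docs are updated, something along these lines should work (a sketch based on icevision's COCOBBoxParser; the paths are hypothetical and argument names may differ across versions):

from icevision.all import *

parser = parsers.COCOBBoxParser(
    annotations_filepath=path / "instances_train.json",
    img_dir=path / "images",
)
train_records, valid_records = parser.parse()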

Fix OCHuman Colab Badge

📓 Documentation Update

The OCHuman Colab badge points to Francesco's branch. We just need to update it to point to icedata/notebooks/dev.

Add `dataset` function for each dataset

🚀 Feature

For each dataset we can also add a dataset function that returns train and valid datasets with default transforms included (see the sketch after the HOW TO list below).

This would also reduce the amount of repeated code in icevision tutorials.

HOW TO

Follow the structure of datasets/fridge/dataset.py or datasets/pennfudan/dataset.py:

  1. Create dataset.py file
  2. Add from icedata.datasets.<DATASET_FOLDER>.dataset import * to datasets/<DATASET_FOLDER>/__init__.py
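A minimal sketch of what such a dataset function might look like (the transform sizes and exact tfms calls are assumptions, with fridge used as the example dataset):

from icevision.all import *
import icedata

def dataset(data_dir, class_map, size=384):
    # Parse the records, then wrap them in Datasets with default transforms
    parser = icedata.fridge.parser(data_dir, class_map)
    train_records, valid_records = parser.parse()

    train_tfms = tfms.A.Adapter([*tfms.A.aug_tfms(size=size), tfms.A.Normalize()])
    valid_tfms = tfms.A.Adapter([*tfms.A.resize_and_pad(size), tfms.A.Normalize()])

    return Dataset(train_records, train_tfms), Dataset(valid_records, valid_tfms)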

TODO (no specific order)

  • fridge
  • pennfudan
  • birds
  • biwi
  • coco (In progress by: @jpoberhauser)
  • ochuman
  • pets (in progress by: @ganesh3)
  • voc

Add remaining tests

🚀 Feature

Some datasets still need to be finished and tested:

  • coco
  • voc
  • birds

[dev-install] Co-developing with icevision

Icedata and icevision are developed hand in hand; currently it's tricky to get an editable "master" installation of icevision while also installing icedata. This use case is also being discussed here.

Currently, what we have to do is manually modify pyproject.toml to use a path dependency instead of a PyPI dependency for local installation.
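The manual tweak looks roughly like this (a sketch; the local checkout path is an assumption):

[tool.poetry.dependencies]
# Point at a local icevision checkout instead of PyPI while co-developing
icevision = { path = "../icevision", develop = true }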
