icedata's Introduction

Datasets Hub for the IceVision Framework


Note: We Need Your Help! If you find this work useful, please let other people know by starring and sharing it. Thank you!



Contributors

Documentation

Installation

pip install icedata

For more installation options, check our extensive documentation.

Important: We currently only support Linux/MacOS.

Why IceData?

  • IceData is a dataset hub for the IceVision Framework

  • It includes community maintained datasets and parsers and has out-of-the-box support for common annotation formats (COCO, VOC, etc.)

  • It provides an overview of each included dataset with a description, an annotation example, and other helpful information

  • It makes end-to-end training straightforward thanks to IceVision's unified API

  • It enables practitioners to get moving with object detection technology quickly

Datasets

Source

The Datasets class is designed to simplify loading and parsing a wide range of computer vision datasets.

Main Features:

  • Caches data so you don't need to download it over and over

  • Lightweight and fast

  • Transparent and pythonic API

  • Out-of-the-box parsers convert common dataset annotation formats into the unified IceVision Data Format

IceData provides several ready-to-use datasets that use common annotation formats such as COCO and VOC, as well as custom formats such as the WheatParser used in the Kaggle Global Wheat Competition.
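Because downloads are cached, loading a dataset twice only downloads once. A minimal sketch using the fridge dataset (shown in full later in this README):

import icedata

# The first call downloads the dataset into the cache directory
data_dir = icedata.fridge.load()

# Subsequent calls reuse the cached copy and return immediately
data_dir = icedata.fridge.load()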

Usage

Object detection datasets use multiple annotation formats (COCO, VOC, and others). IceVision makes it easy to work across all of them with parsers that are easy to use and to extend.

COCO and VOC compatible datasets

For COCO or VOC compatible datasets - especially ones that are not included in IceData - it is easiest to use the IceData COCO or VOC parser.

Example: Raccoon - a dataset using the VOC parser

# Imports
from icevision.all import *
import icedata


# WARNING: Make sure you have already cloned the raccoon dataset so that the paths below exist
# Set images and annotations directories
data_dir = Path("raccoon_dataset")
images_dir = data_dir / "images"
annotations_dir = data_dir / "annotations"

# Define the class_map
class_map = ClassMap(["raccoon"])

# Create a parser for dataset using the predefined icevision VOC parser
parser = parsers.voc(
    annotations_dir=annotations_dir, images_dir=images_dir, class_map=class_map
)

# Parse the annotations to create the train and validation records
train_records, valid_records = parser.parse()
show_records(train_records[:3], ncols=3, class_map=class_map)

!!! info "Note" Notice how we use the predifined parsers.voc() function:

**parser = parsers.voc(
annotations_dir=annotations_dir, images_dir=images_dir, class_map=class_map
)**

Datasets included in IceData

Datasets included in IceData always have their own parser. It can be invoked with icedata.datasetname.parser(...).

Example: The IceData Fridge dataset

Please check out the fridge folder for more information on how this dataset is structured.

# Imports
from icevision.all import *
import icedata

# Load the Fridge Objects dataset
data_dir = icedata.fridge.load()

# Get the class_map
class_map = icedata.fridge.class_map()

# Parse the annotations
parser = icedata.fridge.parser(data_dir, class_map)
train_records, valid_records = parser.parse()

# Show images with their boxes and labels
show_records(train_records[:3], ncols=3, class_map=class_map)

!!! info "Note" Notice how we use the parser associated with the fridge dataset icedata.fridge.parser():

**parser = icedata.fridge.parser(data_dir, class_map)**

Datasets with a new annotation format

Sometimes you will need to define a new annotation format for your dataset. Additional information can be found in the documentation. In this case, we strongly recommend following the file structure and naming conventions used in examples such as the Fridge dataset or the PETS dataset.


Disclaimer

Inspired by the excellent HuggingFace Datasets project, icedata is a utility library that downloads and prepares computer vision datasets. We do not host or distribute these datasets, vouch for their quality or fairness, or claim that you have a license to use them. It is your responsibility to determine whether you have permission to use a dataset under its license.

If you are a dataset owner and wish to update any of the information in IceData (description, citation, etc.), or do not want your dataset to be included, please get in touch through a GitHub issue. Thanks for your contribution to the ML community!

If you are interested in learning more about responsible AI practices, including fairness, please see Google AI's Responsible AI Practices.

icedata's People

Contributors

adamfarquhar, ai-fast-track, buckley-w-david, burntcarrot, davanstrien, frapochetti, fstroth, ganesh3, jpoberhauser, lgvaz, ribenamaplesyrup, yrodriguezmd


icedata's Issues

Add COCO, VOC, and Birds README

📓 Documentation Update

What part of documentation was unclear or wrong?
Add the COCO, VOC, and Birds README, and create their corresponding documentation

Rename parsers.py to parser.py

The old datasets use the parsers.py convention but the template uses parser.py (which I believe is a better name)

So we need to rename the files to parser.py and update their respective __init__.py file.

Add a destination directory to `load_data()`

🚀 Feature

Is your feature request related to a problem? Please describe.
Presently, load_data() automatically saves the downloaded data to the /root/.icevision/data directory. The new feature allows the user to choose the destination directory.

Describe the solution you'd like
Add a new argument called dest_dir to load_data() to let the user choose the destination directory. dest_dir should default to None in order to preserve compatibility with the existing API. If the user does not provide the dest_dir argument (dest_dir=None), load_data() automatically saves the downloaded data to the /root/.icevision/data directory.
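A minimal sketch of what this could look like (the signature and download logic below are assumptions, not the final API):

from pathlib import Path

def load_data(url, name, dest_dir=None):
    # Preserve the current behavior when no destination is given
    if dest_dir is None:
        dest_dir = Path.home() / ".icevision" / "data"
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    # ... download `url` into dest_dir / name as load_data() does today ...
    return dest_dir / name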

Updates install guide with poetry

📓 Documentation Update

Update the installation instructions and other relevant parts of the documentation with poetry.

For this library, it's important to use a poetry version > 1.1.x which is currently only in pre-release.

To update poetry to this version do:

poetry self update --preview

Add a README template

📓 Documentation Update

We need a README template that will be added when generating a new dataset using the template generator.

Simple automatic tests for templates

🚀 Feature

We will eventually have automatic tests that check the validity of the data structure, whether the download link works, etc.

But let's start with simple tests: checking that a newly added dataset can be imported and that all the necessary functions are present is enough for now.
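A minimal sketch of such a test (pytest assumed; the exact per-dataset function names may differ):

import importlib
import pytest

DATASETS = ["fridge", "pennfudan"]  # extend as datasets are added

@pytest.mark.parametrize("name", DATASETS)
def test_dataset_is_importable(name):
    module = importlib.import_module(f"icedata.datasets.{name}")
    # Every dataset should expose the standard entry points
    for fn in ("load_data", "class_map", "parser"):
        assert hasattr(module, fn)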

voc parser

๐Ÿ› Bug

When running the starter code for the VOC dataset, an error occurs at the parsing step.

To Reproduce
Steps to reproduce the behavior:

  1. Go to https://airctic.github.io/icedata/examples/voc_exp/
  2. Click on '....'
  3. Scroll down to '....'
  4. See error

Expected behavior

Screenshots
If applicable, add screenshots to help explain your problem.

Desktop (please complete the following information):
Mac

Additional context
Add any other context about the problem here.

Template generation for new datasets

🚀 Feature

Automate the process for adding a new dataset.

All datasets follow a very similar structure, a simple script can be made to automatically generate the initial skeleton for implementing a new dataset.

Update notebooks in docs

📓 Documentation Update

We need to update the notebooks found in the current version of the icedata documentation

Slow import times

🚀 Feature

Importing icedata takes a lot of time. This happens because internally we are importing unnecessary stuff from icevision (heavy libraries like pytorch).

I think the problem mostly comes from icevision.imports.

It might be a good idea to curate our own imports in icedata.imports to alleviate this issue.
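One common pattern for this (a sketch, not icedata's actual code) is to defer heavy imports until first use:

import importlib

def _lazy(name):
    # Return a zero-argument loader that imports `name` on first call
    module = None

    def get():
        nonlocal module
        if module is None:
            module = importlib.import_module(name)
        return module

    return get

# torch is imported only when get_torch() is actually called
get_torch = _lazy("torch")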

Add PETS README

📓 Documentation Update

Add PETS README as well as its corresponding documentation

Add docs to dataset template generator

🚀 Feature

Automatically create/update the necessary documentation files when using the generator.

autogen.py
Automate the README copying; currently we have to copy each file individually (see the sketch after the mkdocs.yml snippet below):

    # Copy Birds README
    shutil.copyfile(icedata_dir / "icedata/datasets/birds/README.md", dest_dir / "birds.md")

    # Copy COCO README
    shutil.copyfile(icedata_dir / "icedata/datasets/coco/README.md", dest_dir / "coco.md")

mkdocs.yml
Update the Datasets section:

  - Datasets:
    - Birds: birds.md
    - COCO: coco.md
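A possible automation sketch (assuming icedata_dir and dest_dir are defined as in autogen.py today):

import shutil

# Copy every per-dataset README into the docs folder in one loop
for name in ("birds", "coco"):
    src = icedata_dir / "icedata" / "datasets" / name / "README.md"
    shutil.copyfile(src, dest_dir / f"{name}.md")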

Automatically generates sample data

🚀 Feature

When creating a new dataset it's currently necessary to manually add data to sample_data. Can we somehow create a script that automates the process?

I'm not sure this is possible for all annotation formats, but we can start with the most common ones: COCO and VOC.
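For COCO, such a script could simply trim an annotation file down to its first few images. A sketch (file paths are hypothetical):

import json
from pathlib import Path

def make_coco_sample(src_json, dst_json, n_images=5):
    coco = json.loads(Path(src_json).read_text())
    images = coco["images"][:n_images]
    image_ids = {img["id"] for img in images}
    annotations = [a for a in coco["annotations"] if a["image_id"] in image_ids]
    # Keep categories and metadata, but only the sampled images/annotations
    sample = {**coco, "images": images, "annotations": annotations}
    Path(dst_json).write_text(json.dumps(sample))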

pip install is not working

๐Ÿ› Bug

Describe the bug
icedata cannot be pip installed.

Solution
Add setup.py and settings.ini

Fastcore 1.1 breaking changes

๐Ÿ› Bug

Describe the bug
Fastcore 1.1 contains a breaking change described here:

  • Remove Path.{read,write} (use Path.{read_text,write_text} instead) and change Path.{load,save} to functions load_pickle and save_pickle (#121)

We need to update our calls to Path.read to fix this issue.
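In practice the fix is a mechanical rename (a sketch; the actual call sites in icedata may differ):

from pathlib import Path

p = Path("annotations.json")
p.write_text("{}")    # plain pathlib, replaces the removed fastcore Path.write
text = p.read_text()  # plain pathlib, replaces the removed fastcore Path.read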

Add Fridge README

📓 Documentation Update

Add Fridge README as well as its corresponding documentation

How to filter predictions of specified classes

📓 New <Tutorial/Example>

Is this a request for a tutorial or for an example?
Tutorial.

What is the task?
Filter predictions of object detection model according to specified classes.

Is this example for a specific model?
FasterRCNN

Is this example for a specific dataset?
Dataset is COCO.
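Until such a tutorial exists, a generic sketch of the idea (the flat labels/bboxes/scores structure below is assumed for illustration, not icevision's actual prediction API):

# Keep only detections whose class label is in `wanted`
def filter_detections(labels, bboxes, scores, wanted):
    keep = [i for i, label in enumerate(labels) if label in wanted]
    return (
        [labels[i] for i in keep],
        [bboxes[i] for i in keep],
        [scores[i] for i in keep],
    )

labels, bboxes, scores = filter_detections(
    ["cat", "dog", "cat"],
    [(0, 0, 10, 10), (5, 5, 20, 20), (1, 1, 8, 8)],
    [0.9, 0.8, 0.7],
    wanted={"cat"},
)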


Don't remove
Main issue for examples: #39

Update README

📓 Documentation Update

Update README by replacing icevision references with icedata ones.

Test package with icevision master

Icevision and icedata versions go hand in hand; some changes made to icedata will depend on changes to icevision that have not yet been released, so we also need to test against icevision master.

Add pennfudan README

📓 Documentation Update

Add pennfudan README as well as its corresponding documentation

Add documentation

📓 Documentation Update

Add the documentation first draft using MkDocs

Allowing custom path for `icedata.load_data()`

🚀 Feature

  • Currently, icedata.load_data() calls the get_data_dir() function from icevision, which always returns Path.home()/".icevision"/"data". It would be helpful if we could choose a different location to store the datasets.

ssl error when trying to download the model weights for mmdet.fcos

๐Ÿ› Bug

Describe the bug
An SSL error occurs while trying to download the weights of the mmdet.fcos model, like so:

SSLError: HTTPSConnectionPool(host='openmmlab.oss-cn-hangzhou.aliyuncs.com', port=443): Max retries exceeded with url: /mmdetection/v2.0/fcos/fcos_r101_caffe_fpn_gn-head_1x_coco/fcos_r101_caffe_fpn_gn-head_1x_coco-0e37b982.pth (Caused by SSLError(SSLError(1, '[SSL] unknown error (_ssl.c:1123)')))

To Reproduce
Steps to reproduce the behavior:

  1. model_type = models.mmdet.fcos
  2. backbone = model_type.backbones.resnet101_caffe_fpn_gn_head_1x_coco
  3. model = model_type.model(backbone=backbone(pretrained=True), num_classes=len(parser.class_map), **extra_args)

Expected behavior
Downloading of the weights and instantiation of the model

Screenshots
If applicable, add screenshots to help explain your problem.

Desktop (please complete the following information):

  • OS: [e.g. ubuntu 18.04]

Additional context
Add any other context about the problem here.

Fix CI workflow: mk-docs-build.yml

๐Ÿ› Bug

Describe the bug
The documentation is not deployed.

To Reproduce
  1. Make some changes to the documentation (.md files)
  2. Merge a pull request
  3. Check whether the documentation is built and deployed

Expected behavior
The documentation should be built and deployed.

Add a Parser CheatSheet Document

📓 Documentation Update

What part of documentation was unclear or wrong?
The goal is to create a resource that will help both beginners and advanced users easily create their parsers by providing frequently used code snippets and best practices. The code snippets are divided into sections:

Section 1: file-related code snippets
Section 2: parsing-related code snippets
Useful example

Custom Parser

Use template generator

The first step is to create a class that inherits from these smaller building blocks:

This is just an example; choose the mixins that are relevant to your use case:

class WheatParser(parsers.FasterRCNN, parsers.FilepathMixin, parsers.SizeMixin):
    pass

We use a method called generate_template that will print out all the necessary methods we have to implement.

WheatParser.generate_template()

Output:

def __iter__(self) -> Any:
def height(self, o) -> int:
def width(self, o) -> int:
def filepath(self, o) -> Union[str, Path]:
def bboxes(self, o) -> List[BBox]:
def labels(self, o) -> List[int]:
def imageid(self, o) -> Hashable:

If, for example, all the images are .jpg and located in the data_dir folder, the image_paths attribute will be set as follows:

def __init__(self, data_dir):
    self.image_paths = get_files(data_dir, extensions=[".jpg"])

File-related code snippets

Let's suppose we have the following fname variable:

fname = Path("PennFudanPed/PNGImages/FudanPed00002.png")

fname                       # PennFudanPed/PNGImages/FudanPed00002.png
fname.exists()              # True
fname.with_suffix(".txt")   # PennFudanPed/PNGImages/FudanPed00002.txt
fname.stem                  # FudanPed00002

Parsing-related code snippets

Read a CSV file using pandas

import pandas as pd
df = pd.read_csv("path/to/csv/file")
df.head() # or df.sample()

Then the bboxes attribute will be created this way:

import numpy as np

bbox = "[834.0, 222.0, 56.0, 36.0]"
xywh = np.fromstring(bbox[1:-1], sep=",")
# Output: array([834., 222., 56., 36.])

Coordinates as text with separators:

label = "2 0.527267 0.702972 0.945466 0.467218"
xywh = np.fromstring(label, sep=" ")[1:]
# Output: array([0.527267, 0.702972, 0.945466, 0.467218])

Update examples in docs

📓 Documentation Update

What part of documentation was unclear or wrong?
We need to update the examples found in the current version of the icedata documentation

icedata.coco.parser is a module not a function

📓 Documentation Update

What part of documentation was unclear or wrong?
I think the docs for icedata are incorrect, based on the version that is installed when icevision is installed from master with the bash script installer.

When I try this from the main front page https://airctic.github.io/icedata/coco/:

# COCO parser: provided out-of-the-box
parser = icedata.coco.parser(data_dir=path, class_map=class_map)

I get this error

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_15591/3116397168.py in <module>
      1 import icedata
      2 
----> 3 icedata.coco.parser(data_dir=f'{path}/instances_slick_train.json', class_map=class_map)

TypeError: 'module' object is not callable

Describe the solution you'd like
I think the correct way to load annotations is defined in this test file: https://github.com/airctic/icevision/blob/f20a938956663d1aa8a17320caed815830dc4cd0/tests/parsers/test_coco_parser.py
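Until the docs are updated, something along these lines should work (a sketch based on icevision's COCOBBoxParser; the paths are hypothetical and argument names may differ across versions):

from icevision.all import *

parser = parsers.COCOBBoxParser(
    annotations_filepath=path / "instances_train.json",
    img_dir=path / "images",
)
train_records, valid_records = parser.parse()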

Fix OCHuman Colab Badge

📓 Documentation Update

The OCHuman Colab badge points to Francesco's branch. We just need to update it to point to icedata/notebooks/dev.

Add `dataset` function for each dataset

🚀 Feature

For each dataset we can also add a dataset function that returns train and valid datasets with default transforms included (see the sketch after the HOW TO list below).

This would also reduce the amount of repeated code in icevision tutorials.

HOW TO

Follow the structure of datasets/fridge/dataset.py or datasets/pennfudan/dataset.py:

  1. Create dataset.py file
  2. Add from icedata.datasets.<DATASET_FOLDER>.dataset import * to datasets/<DATASET_FOLDER>/__init__.py
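A minimal sketch of what such a dataset function might look like (the transform sizes and exact tfms calls are assumptions, with fridge used as the example dataset):

from icevision.all import *
import icedata

def dataset(data_dir, class_map, size=384):
    # Parse the records, then wrap them in Datasets with default transforms
    parser = icedata.fridge.parser(data_dir, class_map)
    train_records, valid_records = parser.parse()

    train_tfms = tfms.A.Adapter([*tfms.A.aug_tfms(size=size), tfms.A.Normalize()])
    valid_tfms = tfms.A.Adapter([*tfms.A.resize_and_pad(size), tfms.A.Normalize()])

    return Dataset(train_records, train_tfms), Dataset(valid_records, valid_tfms)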

TODO (no specific order)

  • fridge
  • pennfudan
  • birds
  • biwi
  • coco (In progress by: @jpoberhauser)
  • ochuman
  • pets (in progress by: @ganesh3)
  • voc

Add remaining tests

🚀 Feature

Some datasets still need to be finished and tested:

  • coco
  • voc
  • birds

[dev-install] Co-developing with icevision

Icedata and icevision are developed hand in hand; currently it's tricky to get an editable "master" installation of icevision while also installing icedata. This use case is also being discussed here.

Currently, what we have to do is manually modify pyproject.toml to use a path dependency instead of a PyPI dependency for local installation.
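The manual tweak looks roughly like this (a sketch; the local checkout path is an assumption):

[tool.poetry.dependencies]
# Point at a local icevision checkout instead of PyPI while co-developing
icevision = { path = "../icevision", develop = true }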
