bioscan-1m's Introduction

BIOSCAN-1M Insect

Overview

This repository contains the code and data for the BIOSCAN-1M-Insect project. In this project we introduce the BIOSCAN-1M Insect dataset, which can be downloaded via the provided links. The repository includes code for data sampling and splitting, dataset statistics analysis, and image-based experiments on the taxonomic classification of insects.

If you use the BIOSCAN-1M Insect dataset and/or the corresponding code repository, please cite the paper:

@inproceedings{gharaee2023step,
    title={A Step Towards Worldwide Biodiversity Assessment: The {BIOSCAN-1M} Insect Dataset},
    booktitle={Advances in Neural Information Processing Systems},
    author={Gharaee, Z. and Gong, Z. and Pellegrino, N. and Zarubiieva, I. and Haurum, J. B. and Lowe, S. C. and McKeown, J. T. A. and Ho, C. Y. and McLeod, J. and Wei, Y. C. and Agda, J. and Ratnasingham, S. and Steinke, D. and Chang, A. X. and Taylor, G. W. and Fieguth, P.},
    editor={A. Oh and T. Neumann and A. Globerson and K. Saenko and M. Hardt and S. Levine},
    pages={43593--43619},
    publisher={Curran Associates, Inc.},
    year={2023},
    volume={36},
    url={https://proceedings.neurips.cc/paper_files/paper/2023/file/87dbbdc3a685a97ad28489a1d57c45c1-Paper-Datasets_and_Benchmarks.pdf},
}

Dataset Access

The BIOSCAN-1M Insect dataset is available on GoogleDrive, Zenodo, Kaggle, and HuggingFace. To download a file from GoogleDrive, run the following:

python main.py --file_to_download <file_name>

The files available for download from GoogleDrive are:

  • Metadata (TSV file format): BIOSCAN_1M_Insect_Dataset_metadata.tsv
  • Metadata (JSONLD file format): BIOSCAN_1M_Insect_Dataset_metadata.jsonld
  • Original images resized to 256 on smaller dimension (ZIP file format): original_256.zip
  • Original images resized to 256 on smaller dimension (HDF5 file format): original_256.hdf5
  • Cropped images resized to 256 on smaller dimension (ZIP file format): cropped_256.zip
  • Cropped images resized to 256 on smaller dimension (HDF5 file format): cropped_256.hdf5
  • Original full size images (113 ZIP files): bioscan_images_original_full_part{1:113}.zip
  • Cropped images (113 ZIP files): bioscan_images_cropped_full_part{1:113}.zip
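
The part-file patterns above can be expanded programmatically. The sketch below assumes that `{1:113}` denotes the inclusive range 1 to 113 with no zero padding; that reading is an assumption, and `part_file_names` is a hypothetical helper that is not part of the repository.

```python
def part_file_names(prefix: str, n_parts: int = 113) -> list[str]:
    """Expand a '<prefix>{1:n_parts}.zip' pattern into explicit file names.

    Assumes parts are numbered 1..n_parts inclusive, without zero padding.
    """
    return [f"{prefix}{i}.zip" for i in range(1, n_parts + 1)]


original_parts = part_file_names("bioscan_images_original_full_part")
cropped_parts = part_file_names("bioscan_images_cropped_full_part")
```

Each resulting name could then be passed in turn to `python main.py --file_to_download <file_name>`.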

Dataset

The BIOSCAN-1M Insect dataset provides researchers with information about insects. Each record contains four primary attributes:

  • DNA Barcode Sequence
  • Barcode Index Number (BIN)
  • Biological Taxonomy Classification
  • RGB image

I. DNA Barcode Sequence

The DNA barcode sequence gives the order of the nucleotides Adenine (A), Thymine (T), Cytosine (C), and Guanine (G) within a designated gene region, such as the mitochondrial cytochrome c oxidase subunit I (COI) gene. The example sequence below is visualized as blocks of distinct colors:

TTTATATTTTATTTTTGGAGCATGATCAGGAATAGTTGGAACTTCAATAAGTTTATTAATTCGAACAGAATTAAGCCAACCAGGAATTTTTATTGGTAATGACCAAATTTATAATGTAATTGTTACAGCTCATGCCTTTATTATAATTTTTTTTATAGTTATACCTATTATAATTGGAGGATTCGGAAATTGACTAGTCCCATTAATATTAGGAGCTCCTGATATAGCTTTCCCTCGAATAAATAATATAAGTTTTTGAATGTTACCTCCTTCATTAACTCTATTATTATCAAGAAGAATAGTTGAAAATGGAGCTGGAACAGGATGAACTGTTTATCCCCCTTTATCCTCAGGAACTGCTCATGCAGGAGCTTCTGTTGATCTTGCTATTTTCTCTTTACATTTAGCAGGAATTTCTTCAATTCTTGGAGCTGTAAATTTTATTACAACAATTATTAATATACGATCTTCAGGAATTACACTTGATCGAATACCTTTATTTGTTTGATCTGTAATTATTACAGCTATTCTACTTTTACTGTCTCTTCCAGTATTAGCTGGAGCTATTACAATATTATTAACTGATCGTAATTTAAATACATCTTTTTTTGACCCAATTGGAGGAGGAGATCCAATTCTATATCAACATTTAT

The visualization uses the following color scheme:

  • Adenine (A): Red
  • Thymine (T): Blue
  • Cytosine (C): Green
  • Guanine (G): Yellow

These nucleotides, represented by their respective colors, play a pivotal role in defining the genetic information encoded within the DNA sequence.
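
The color scheme above can be expressed as a simple lookup. The following is a minimal sketch, assuming that a "block" means a run of identical consecutive bases; whether the figure draws one block per base or per run is not stated, and `color_blocks` is a hypothetical helper.

```python
from itertools import groupby

# Color scheme as listed above.
NUCLEOTIDE_COLORS = {"A": "red", "T": "blue", "C": "green", "G": "yellow"}


def color_blocks(sequence: str) -> list[tuple[str, int]]:
    """Collapse a barcode into (color, run_length) blocks of identical bases."""
    blocks = []
    for base, run in groupby(sequence.upper()):
        if base not in NUCLEOTIDE_COLORS:
            raise ValueError(f"unexpected symbol in barcode: {base!r}")
        blocks.append((NUCLEOTIDE_COLORS[base], len(list(run))))
    return blocks


# First ten bases of the example barcode above.
blocks = color_blocks("TTTATATTTT")
```

The sketch rejects any symbol other than A, T, C, and G; barcodes containing other symbols would need an extra entry in the lookup.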

II. Barcode Index Number (BIN)

Organisms are grouped into Operational Taxonomic Units (OTUs) through genetic similarity, forming a genetic proxy for species. Each OTU is assigned a unique Barcode Index Number (BIN), serving as a Uniform Resource Identifier (URI). This BIN ensures that genetically identical taxa share the same identifier, registered in the Barcode Of Life Data system (BOLD).

BOLD:AER5166

BINs, acting as an alternative to Linnean names, provide a genetic-centric classification for organisms, emphasizing the significance of genetic code in taxonomy.
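
When working with BIN values in the metadata, a quick format check can catch malformed entries. The pattern below is an assumption inferred only from the example BOLD:AER5166 (the prefix "BOLD:", three uppercase letters, four digits); it is not taken from an official BOLD specification, and `is_bin_uri` is a hypothetical helper.

```python
import re

# Assumed pattern: "BOLD:" + three uppercase letters + four digits,
# inferred from the example BOLD:AER5166 (not an official specification).
BIN_PATTERN = re.compile(r"BOLD:[A-Z]{3}\d{4}")


def is_bin_uri(value: str) -> bool:
    """Return True if value matches the assumed BIN format exactly."""
    return BIN_PATTERN.fullmatch(value) is not None
```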

III. Biological Taxonomy Classification (Linnean names)

Taxonomic rank annotations categorize organisms hierarchically based on evolutionary relationships, organizing species into groups by shared characteristics and genetic relatedness.

The figure illustrates the taxonomic classifications of five distinct living organisms within the insect class.

IV. RGB Images

We have published six packages (the four image variants listed below, with each resized variant provided in both ZIP and HDF5 formats), each containing the 1,128,313 images of the BIOSCAN-1M Insect dataset. The packages follow a consistent data structure in which the images are divided into 113 data chunks. Each chunk consists of 10,000 images, except for chunk 113, which contains 8,313 images.

  • (1) Original JPEG images (113 zip files).
  • (2) Cropped JPEG images (113 zip files).
  • (3) Original JPEG images resized to 256 on the smaller dimension (ZIP and HDF5).
  • (4) Cropped JPEG images resized to 256 on their smaller dimension (ZIP and HDF5).
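
The chunk counts quoted above are consistent with splitting 1,128,313 images into groups of 10,000. A small sketch to check the arithmetic follows; `chunk_sizes` is a hypothetical helper, and how individual images are assigned to chunks is not specified here.

```python
def chunk_sizes(total_images: int, chunk_size: int) -> list[int]:
    """Sizes of consecutive chunks when total_images are split into groups of chunk_size."""
    full, remainder = divmod(total_images, chunk_size)
    return [chunk_size] * full + ([remainder] if remainder else [])


sizes = chunk_sizes(1_128_313, 10_000)
```
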

Number of images per order:

  • Diptera: 896,324
  • Hymenoptera: 89,311
  • Coleoptera: 47,328
  • Hemiptera: 46,970
  • Lepidoptera: 32,538
  • Psocodea: 9,635
  • Thysanoptera: 2,088
  • Trichoptera: 1,296
  • Orthoptera: 1,057
  • Blattodea: 824
  • Neuroptera: 676
  • Ephemeroptera: 96
  • Dermaptera: 66
  • Archaeognatha: 63
  • Plecoptera: 30
  • Embioptera: 6

Figure shows original insect images from 16 orders of the BIOSCAN-1M Insect dataset. The numbers below each image identify the number of images in each order group, and clearly illustrate the degree of class imbalance in the BIOSCAN-1M Insect dataset.
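
To put the class imbalance in numbers, the per-order counts from the figure can be turned into shares. This is an illustrative sketch only; `class_shares` is a hypothetical helper, and the shares are relative to the sum of the 16 listed counts.

```python
# Image counts per order, copied from the figure above.
ORDER_COUNTS = {
    "Diptera": 896_324, "Hymenoptera": 89_311, "Coleoptera": 47_328,
    "Hemiptera": 46_970, "Lepidoptera": 32_538, "Psocodea": 9_635,
    "Thysanoptera": 2_088, "Trichoptera": 1_296, "Orthoptera": 1_057,
    "Blattodea": 824, "Neuroptera": 676, "Ephemeroptera": 96,
    "Dermaptera": 66, "Archaeognatha": 63, "Plecoptera": 30, "Embioptera": 6,
}


def class_shares(counts: dict[str, int]) -> dict[str, float]:
    """Fraction of all listed images that belongs to each class."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


shares = class_shares(ORDER_COUNTS)
# Ratio between the largest and the smallest class.
imbalance_ratio = max(ORDER_COUNTS.values()) / min(ORDER_COUNTS.values())
```

With these counts, Diptera alone accounts for roughly 79% of the listed images, while the smallest order has only 6.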

Metadata

In addition to the images, we have published a metadata file for the dataset, named BIOSCAN_Insect_Dataset_metadata. It is available in both tab-separated (.tsv) and JSON-LD (.jsonld) formats. The metadata includes taxonomy annotations, DNA barcode sequences, and indexes and labels for each data sample, along with the image names and unique IDs that reference the storage location of each image. It also records the role of each image in the split sets, that is, whether the image is used for training, validation, or testing in the six experiments conducted in our paper.

To run the following steps, first download the dataset and the metadata file, and set the paths accordingly.
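
Once downloaded, the TSV metadata can be inspected with the Python standard library. The sketch below is a minimal example under stated assumptions: the column names (`image_file`, `order`) and the sample rows are made up for illustration, so check the real header before relying on any name. `count_column_values` is a hypothetical helper.

```python
import csv
import io
from collections import Counter


def count_column_values(tsv_file, column: str) -> Counter:
    """Count occurrences of each value in one column of a TSV file object."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    return Counter(row[column] for row in reader)


# Tiny in-memory stand-in for the real file; columns and rows are made up.
sample = io.StringIO(
    "image_file\torder\n"
    "a.jpg\tDiptera\n"
    "b.jpg\tDiptera\n"
    "c.jpg\tColeoptera\n"
)
counts = count_column_values(sample, "order")
```

For the real file, the same function could be called on `open("BIOSCAN_1M_Insect_Dataset_metadata.tsv", newline="")` with whichever column name the header actually uses.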

Dataset Statistics

To see the statistics of the BIOSCAN-1M Insect dataset, run the following:

python main.py --print_statistics --exp_name <experiment_name>

Dataset Sampling

To split the BIOSCAN-1M Insect dataset into Train, Validation and Test sets using stratified class-based sampling, run the following:

python main.py --make_split 
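
For readers unfamiliar with stratified class-based sampling, the idea is to split each class separately so that every split keeps roughly the same class proportions. The sketch below is a generic illustration and not the repository's implementation; the 70/10/20 ratios, the seed, and the helper name `stratified_split` are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(labels, ratios=(0.7, 0.1, 0.2), seed=0):
    """Split sample indices into train/val/test, class by class.

    labels: sequence of class labels, one per sample.
    ratios: assumed train/val/test fractions (illustrative, not the repo's).
    """
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = int(len(indices) * ratios[0])
        n_val = int(len(indices) * ratios[1])
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to test
    return train, val, test


# Toy labels for illustration only.
labels = ["Diptera"] * 80 + ["Hymenoptera"] * 20
train, val, test = stratified_split(labels)
```

With very small classes (for example, one with only 6 images), the integer truncation above can leave the validation share empty, so a real pipeline needs an explicit policy for rare classes.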

To see the statistics of the BIOSCAN-1M Insect dataset split sets, run the following:

python main.py --print_split_statistics --exp_name <experiment_name>

Preprocessing

To save time and computational resources when running experiments on the BIOSCAN-1M Insect dataset's RGB images, we implemented an offline preprocessing step composed of two main modules:

  • Resize tool
  • Crop tool

The resize and crop tools are applied to the original RGB images ahead of time so that the subsequent experiments run more efficiently.

Figure: example original images (top row) and cropped images (bottom row).

To resize and save original full size images, run the following:

python main.py --resize_image --resized_image_path <path_to_resized_images> --resized_hdf5_path <path_to_resized_hdf5>
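
"Resized to 256 on the smaller dimension" means the shorter side becomes 256 pixels while the aspect ratio is preserved. The sketch below shows only the dimension arithmetic; the rounding convention and the helper name `resized_dims` are assumptions, and the repository's resize tool may round differently.

```python
def resized_dims(width: int, height: int, target: int = 256) -> tuple[int, int]:
    """Scale (width, height) so the smaller side equals target, keeping aspect ratio."""
    scale = target / min(width, height)
    # max() guards against floating-point rounding pushing a side below target.
    return max(target, round(width * scale)), max(target, round(height * scale))


landscape = resized_dims(1024, 768)
portrait = resized_dims(768, 1024)
```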

To use our cropping tool, download the checkpoint BIOSCAN_Insect_crop_tool_checkpoint.ckpt from the project's GoogleDrive, where it is stored in the directory BIOSCAN_1M_Insect_checkpoints/crop_tool_checkpoint, and set its path correctly in the main.py script. Then run the following to create and save cropped images as well as their resized versions:

python main.py --crop_image --cropped_image_path <path_to_cropped_images> --resized_cropped_image_path <path_to_resized_cropped_images>

If --cropped_hdf5_path and --resized_cropped_hdf5_path are set, the cropped images and resized cropped images are also saved in HDF5 file format.

Classification Experiments

Two image-based classification experiments were conducted, focusing on the taxonomy ranking of insects. The first set of experiments involved classifying BIOSCAN-1M Insect dataset's images into 16 orders. The second set of experiments specifically targeted the Order Diptera and aimed to classify its members into 40 families, which constitute a significant portion of the order.

The figure depicts the class distribution and class imbalance in the BIOSCAN-1M Insect dataset. We focus on the 16 most densely populated orders (top) and the 40 most densely populated Diptera families (bottom). The image demonstrates that class imbalance is an inherent characteristic within the insect community.

Train

To train the model on a classification task using a baseline model, you can run the following command, setting the name of the experiment:

python main.py --loader --train --data_format <hdf5/folder> --exp_name <experiment_name>

Both the folder and HDF5 data formats are supported, making it convenient to conduct experiments using dataset packages.

Test

To evaluate our top-performing models, which were trained in the experiments described in the BIOSCAN-1M-Insect paper, download the available checkpoints from GoogleDrive, where they are stored in the directory BIOSCAN_1M_Insect_checkpoints/classification_checkpoints. Make sure the paths to the dataset images, metadata file and results are set correctly in the main.py script.

Then, for order-level classification using the resized and cropped images of the BIOSCAN-1M Insect Large dataset, run the following:

python main.py --loader --test --exp_name large_insect_order --best_model large_insect_order_vit_base_patch16_224_CE_s2 --model vit_base_patch16_224 --loss CE --seed 2 

The figure presents the per-class top-1 test accuracy of the Insect-Order and Diptera-Family classification experiments of the Large dataset.
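
Per-class top-1 accuracy is the fraction of samples of each ground-truth class whose top prediction matches the label, which makes it more informative than overall accuracy on an imbalanced dataset. The sketch below is a generic illustration with toy labels; it is not the repository's evaluation code, and `per_class_top1_accuracy` is a hypothetical helper.

```python
from collections import Counter


def per_class_top1_accuracy(y_true, y_pred):
    """Fraction of correctly predicted samples for each ground-truth class."""
    totals = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / totals[label] for label in totals}


# Toy labels for illustration only; these are not results from the paper.
y_true = ["Diptera", "Diptera", "Diptera", "Coleoptera", "Coleoptera"]
y_pred = ["Diptera", "Diptera", "Coleoptera", "Coleoptera", "Diptera"]
accuracy = per_class_top1_accuracy(y_true, y_pred)
```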

Generalization

To assess the generalization of our models trained on the BIOSCAN-1M-Insect dataset, specifically for order-level classification of resized and cropped images from the BIOSCAN-1M Insect Large dataset, set the paths to the new images and the trained model in the generalization.py script, then run the following:

python generalization.py 

Requirements

The packages required to run the experiments are listed in the requirements.txt file.

Copyright and License

The images included in the BIOSCAN-1M Insect dataset available through this repository are subject to the following copyright and licensing restrictions:

  • Copyright Holder: CBG Photography Group
  • Copyright Institution: Centre for Biodiversity Genomics (email:[email protected])
  • Photographer: CBG Robotic Imager
  • Copyright License: Creative Commons-Attribution Non-Commercial Share-Alike (CC BY-NC-SA 4.0)
  • Copyright Contact: [email protected]
  • Copyright Year: 2021

Collaborators

"Ming Gong" [email protected]

bioscan-1m's Issues

Reading hdf5 files

Hello,

Thanks a lot for making the dataset available. I'm really looking forward to trying to work with it.

I have just started looking at the data and am having trouble reading the images from cropped_256.hdf5. Perhaps this is just due to my own unfamiliarity with hdf5, but it seems like the images are 1 dimensional and that dimension does not seem necessarily divisible by 256.

I am reading the images in as follows:

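import h5py
import numpy as np

# Open read-only; 'r+' would also allow accidental writes to the file.
with h5py.File('cropped_256.hdf5', 'r') as hdf5:
    img = np.array(hdf5['bioscan_dataset']['3652597.jpg'])
    print(img.shape)
    # shape is a tuple, so index it before dividing
    print(img.shape[0] / 256)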

and I get the following output:

(7633,)
29.81640625

I have tried with a few different images and I get similar results. For now I'll use the zip files so this probably isn't a barrier for many, but if I am doing something wrong in the reading of the hdf5 data, maybe a how-to in the readme would be a good idea.

Thanks again for the dataset!
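
A likely explanation, stated as an assumption rather than something confirmed in this thread: each HDF5 entry holds the encoded JPEG file as a 1-D array of bytes, not a decoded pixel grid, so its length is the compressed file size and has no relation to 256. The stdlib-only sketch below checks that hypothesis by reading the width and height from a JPEG header. `jpeg_size` is a hypothetical helper, and `img.tobytes()` from the snippet above would supply its input.

```python
import struct

# Start-of-frame markers that carry the image dimensions
# (0xC0-0xCF, excluding DHT 0xC4, JPG 0xC8 and DAC 0xCC).
SOF_MARKERS = set(range(0xC0, 0xD0)) - {0xC4, 0xC8, 0xCC}


def jpeg_size(buf: bytes) -> tuple[int, int]:
    """Return (width, height) read from the header of an encoded JPEG."""
    if buf[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG stream (missing SOI marker)")
    pos = 2
    while pos + 4 <= len(buf):
        if buf[pos] != 0xFF:
            raise ValueError("malformed JPEG: expected a marker")
        marker = buf[pos + 1]
        if marker == 0xFF:  # padding byte before the real marker
            pos += 1
            continue
        if marker == 0x01 or 0xD0 <= marker <= 0xD9:  # markers without payload
            pos += 2
            continue
        (segment_length,) = struct.unpack(">H", buf[pos + 2:pos + 4])
        if marker in SOF_MARKERS:
            _precision, height, width = struct.unpack(">BHH", buf[pos + 4:pos + 9])
            return width, height
        pos += 2 + segment_length  # skip marker bytes plus segment payload
    raise ValueError("no start-of-frame segment found")
```

If the check succeeds, the bytes can be decoded into pixels with an image library, for example Pillow's `Image.open(io.BytesIO(img.tobytes()))`.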
