BIOSCAN-1M Insect

Overview

This repository contains the code and data for the BIOSCAN-1M-Insect project. The project introduces the BIOSCAN-1M Insect dataset, which can be downloaded via the provided links. The repository includes code for data sampling and splitting, dataset statistics analysis, and image-based experiments on the taxonomic classification of insects.

If you use the BIOSCAN-1M Insect dataset or the code in this repository, please cite the paper:

@inproceedings{gharaee2023step,
    title={A Step Towards Worldwide Biodiversity Assessment: The {BIOSCAN-1M} Insect Dataset},
    booktitle={Advances in Neural Information Processing Systems},
    author={Gharaee, Z. and Gong, Z. and Pellegrino, N. and Zarubiieva, I. and Haurum, J. B. and Lowe, S. C. and McKeown, J. T. A. and Ho, C. Y. and McLeod, J. and Wei, Y. C. and Agda, J. and Ratnasingham, S. and Steinke, D. and Chang, A. X. and Taylor, G. W. and Fieguth, P.},
    editor={A. Oh and T. Neumann and A. Globerson and K. Saenko and M. Hardt and S. Levine},
    pages={43593--43619},
    publisher={Curran Associates, Inc.},
    year={2023},
    volume={36},
    url={https://proceedings.neurips.cc/paper_files/paper/2023/file/87dbbdc3a685a97ad28489a1d57c45c1-Paper-Datasets_and_Benchmarks.pdf},
}

Dataset Access

The BIOSCAN-1M Insect dataset is available on GoogleDrive, Zenodo, Kaggle, and HuggingFace. To download a file from GoogleDrive, run the following:

python main.py --file_to_download <file_name>

The files available for download from GoogleDrive are:

Metadata (TSV file format): BIOSCAN_1M_Insect_Dataset_metadata.tsv
Metadata (JSONLD file format): BIOSCAN_1M_Insect_Dataset_metadata.jsonld
Original images resized to 256 on smaller dimension (ZIP file format): original_256.zip
Original images resized to 256 on smaller dimension (HDF5 file format): original_256.hdf5
Cropped images resized to 256 on smaller dimension (ZIP file format): cropped_256.zip
Cropped images resized to 256 on smaller dimension (HDF5 file format): cropped_256.hdf5
Original full size images (113 ZIP files): bioscan_images_original_full_part{1:113}.zip
Cropped images (113 ZIP files): bioscan_images_cropped_full_part{1:113}.zip
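
The `{1:113}` suffix in the last two entries denotes a range of part numbers. As a minimal sketch, assuming the parts are numbered 1 through 113 inclusive with no zero padding (an assumption read off the notation above, not confirmed here), the individual file names could be enumerated as follows and passed one at a time to `--file_to_download`. The helper `part_file_names` is hypothetical and not part of the repository.

```python
# Sketch: expand the "{1:113}" range notation into individual file names.
# Assumption: parts are numbered 1..113 inclusive, without zero padding.
NUM_PARTS = 113


def part_file_names(kind: str, num_parts: int = NUM_PARTS) -> list[str]:
    """Return the ZIP part names for kind 'original' or 'cropped'."""
    if kind not in ("original", "cropped"):
        raise ValueError(f"unknown image kind: {kind!r}")
    return [
        f"bioscan_images_{kind}_full_part{i}.zip"
        for i in range(1, num_parts + 1)
    ]


names = part_file_names("original")
print(len(names), names[0], names[-1])
```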

Dataset

The BIOSCAN-1M Insect dataset provides researchers with information about insects. Each record contains four primary attributes:

DNA Barcode Sequence
Barcode Index Number (BIN)
Biological Taxonomy Classification
RGB image

I. DNA Barcode Sequence

The DNA barcode sequence gives the order of the nucleotides Adenine (A), Thymine (T), Cytosine (C), and Guanine (G) within a designated gene region, such as the mitochondrial cytochrome c oxidase subunit I (COI) gene. The sequence is visually represented in blocks of distinct colors:

TTTATATTTTATTTTTGGAGCATGATCAGGAATAGTTGGAACTTCAATAAGTTTATTAATTCGAACAGAATTAAGCCAACCAGGAATTTTTATTGGTAATGACCAAATTTATAATGTAATTGTTACAGCTCATGCCTTTATTATAATTTTTTTTATAGTTATACCTATTATAATTGGAGGATTCGGAAATTGACTAGTCCCATTAATATTAGGAGCTCCTGATATAGCTTTCCCTCGAATAAATAATATAAGTTTTTGAATGTTACCTCCTTCATTAACTCTATTATTATCAAGAAGAATAGTTGAAAATGGAGCTGGAACAGGATGAACTGTTTATCCCCCTTTATCCTCAGGAACTGCTCATGCAGGAGCTTCTGTTGATCTTGCTATTTTCTCTTTACATTTAGCAGGAATTTCTTCAATTCTTGGAGCTGTAAATTTTATTACAACAATTATTAATATACGATCTTCAGGAATTACACTTGATCGAATACCTTTATTTGTTTGATCTGTAATTATTACAGCTATTCTACTTTTACTGTCTCTTCCAGTATTAGCTGGAGCTATTACAATATTATTAACTGATCGTAATTTAAATACATCTTTTTTTGACCCAATTGGAGGAGGAGATCCAATTCTATATCAACATTTAT

This visual representation offers a glimpse into the intricate structure of DNA. The color scheme is designed as follows:

Adenine (A): Red
Thymine (T): Blue
Cytosine (C): Green
Guanine (G): Yellow

These nucleotides, represented by their respective colors, play a pivotal role in defining the genetic information encoded within the DNA sequence.
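
The color scheme above can be expressed as a simple lookup table. The following is a minimal sketch, not code from the repository: the function names are hypothetical, and mapping unknown symbols to gray is an assumption made only for illustration.

```python
from collections import Counter

# Color scheme from the list above.
NUCLEOTIDE_COLORS = {"A": "red", "T": "blue", "C": "green", "G": "yellow"}


def color_blocks(sequence: str) -> list[str]:
    """Map each nucleotide to its display color.

    Symbols outside A/T/C/G map to 'gray' (an assumption for this sketch).
    """
    return [NUCLEOTIDE_COLORS.get(base, "gray") for base in sequence.upper()]


def composition(sequence: str) -> dict[str, float]:
    """Return the fraction of each symbol in the sequence."""
    counts = Counter(sequence.upper())
    total = sum(counts.values())
    return {base: n / total for base, n in counts.items()}


prefix = "TTTATATTTTATTTTTGGAGC"  # opening bases of the barcode shown above
print(color_blocks(prefix)[:4])
print(composition(prefix))
```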

II. Barcode Index Number (BIN)

Organisms are grouped into Operational Taxonomic Units (OTUs) through genetic similarity, forming a genetic proxy for species. Each OTU is assigned a unique Barcode Index Number (BIN), serving as a Uniform Resource Identifier (URI). This BIN ensures that genetically identical taxa share the same identifier, registered in the Barcode Of Life Data system (BOLD).

BOLD:AER5166

BINs, acting as an alternative to Linnean names, provide a genetic-centric classification for organisms, emphasizing the significance of genetic code in taxonomy.
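
Because records with the same BIN belong to the same OTU, the BIN can serve as a grouping key. The sketch below assumes a BIN is written as `BOLD:` followed by uppercase letters and digits, as in the example above; the exact grammar is not specified in this README, and the sample IDs are made up for illustration.

```python
import re
from collections import defaultdict

# Assumption: "BOLD:" prefix followed by uppercase letters and digits.
BIN_PATTERN = re.compile(r"^BOLD:([A-Z0-9]+)$")


def parse_bin(uri: str) -> str:
    """Return the identifier part of a BIN such as 'BOLD:AER5166'."""
    match = BIN_PATTERN.match(uri.strip())
    if match is None:
        raise ValueError(f"not a recognizable BIN: {uri!r}")
    return match.group(1)


def group_by_bin(records: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Group (sample_id, bin_uri) pairs by BIN identifier."""
    groups: dict[str, list[str]] = defaultdict(list)
    for sample_id, bin_uri in records:
        groups[parse_bin(bin_uri)].append(sample_id)
    return dict(groups)


# Hypothetical sample IDs, for illustration only.
print(group_by_bin([("s1", "BOLD:AER5166"), ("s2", "BOLD:AER5166")]))
```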

III. Biological Taxonomy Classification (Linnean names)

Taxonomic rank annotations categorize organisms hierarchically based on evolutionary relationships, organizing species into groups by shared characteristics and genetic relatedness.

The figure illustrates the taxonomic classifications of five distinct living organisms within the insect class.

IV. RGB Images

We have published six packages, each containing the 1,128,313 images of the BIOSCAN-1M Insect dataset (packages 3 and 4 below are each provided in two formats, ZIP and HDF5). These packages follow a consistent data structure in which the images are divided into 113 data chunks. Each chunk consists of 10,000 images, except for chunk 113, which contains 8,313 images.

(1) Original JPEG images (113 zip files).
(2) Cropped JPEG images (113 zip files).
(3) Original JPEG images resized to 256 on the smaller dimension (ZIP and HDF5).
(4) Cropped JPEG images resized to 256 on their smaller dimension (ZIP and HDF5).
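
The chunk sizes follow from simple arithmetic: 112 full chunks of 10,000 images plus a final chunk of 8,313 gives 1,128,313. A minimal sketch of that bookkeeping is shown below. The `chunk_of` helper is hypothetical and assumes images are assigned to chunks in consecutive index order, which this README does not state.

```python
CHUNK_SIZE = 10_000
NUM_CHUNKS = 113
TOTAL_IMAGES = 1_128_313


def chunk_sizes() -> list[int]:
    """Sizes of chunks 1..113: full chunks, then the remainder."""
    sizes = [CHUNK_SIZE] * (NUM_CHUNKS - 1)
    sizes.append(TOTAL_IMAGES - CHUNK_SIZE * (NUM_CHUNKS - 1))
    return sizes


def chunk_of(index: int) -> int:
    """1-based chunk number for a 0-based image index.

    Assumption: images are assigned to chunks in consecutive order.
    """
    if not 0 <= index < TOTAL_IMAGES:
        raise IndexError(f"image index out of range: {index}")
    return index // CHUNK_SIZE + 1


sizes = chunk_sizes()
print(len(sizes), sizes[-1], sum(sizes))
```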


Diptera: 896,324
Hymenoptera: 89,311
Coleoptera: 47,328
Hemiptera: 46,970
Lepidoptera: 32,538
Psocodea: 9,635
Thysanoptera: 2,088
Trichoptera: 1,296
Orthoptera: 1,057
Blattodea: 824
Neuroptera: 676
Ephemeroptera: 96
Dermaptera: 66
Archaeognatha: 63
Plecoptera: 30
Embioptera: 6

Figure shows original insect images from 16 orders of the BIOSCAN-1M Insect dataset. The numbers below each image identify the number of images in each order group, and clearly illustrate the degree of class imbalance in the BIOSCAN-1M Insect dataset.
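
To make the degree of imbalance concrete, the per-order counts listed above can be turned into shares. This is an illustrative calculation only; shares are computed relative to the sum of the 16 listed counts.

```python
# Per-order image counts, copied from the figure labels above.
ORDER_COUNTS = {
    "Diptera": 896_324, "Hymenoptera": 89_311, "Coleoptera": 47_328,
    "Hemiptera": 46_970, "Lepidoptera": 32_538, "Psocodea": 9_635,
    "Thysanoptera": 2_088, "Trichoptera": 1_296, "Orthoptera": 1_057,
    "Blattodea": 824, "Neuroptera": 676, "Ephemeroptera": 96,
    "Dermaptera": 66, "Archaeognatha": 63, "Plecoptera": 30,
    "Embioptera": 6,
}

total = sum(ORDER_COUNTS.values())
shares = {order: n / total for order, n in ORDER_COUNTS.items()}

# Ratio between the largest and smallest class, a common imbalance measure.
imbalance_ratio = max(ORDER_COUNTS.values()) / min(ORDER_COUNTS.values())

for order, share in shares.items():
    print(f"{order:15s} {share:8.4%}")
print(f"largest/smallest class ratio: {imbalance_ratio:,.0f}")
```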

Metadata

In addition to the images, we have published a metadata file for the dataset, named BIOSCAN_Insect_Dataset_metadata, in both dataframe format (.tsv) and JSON-LD format (.jsonld). The metadata file contains taxonomy annotations, DNA barcode sequences, and indexes and labels for each data sample, along with the image names and unique IDs that reference the storage location of each image. It also records the role of each image in the split sets, that is, whether the image is used for training, validation, or testing in the six experiments conducted in our paper.

To run the following steps, first download the dataset and the metadata file, then set the paths accordingly.
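
As a rough illustration of working with the TSV version, the sketch below counts records per value of one column using only the standard library. The column names `sampleid` and `order` are hypothetical stand-ins, since the actual header of the metadata file is not listed in this README, and the in-memory sample replaces the real file.

```python
import csv
import io
from collections import Counter


def count_by_column(tsv_file, column: str) -> Counter:
    """Count rows per distinct value of `column` in a tab-separated file."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise KeyError(f"column not found: {column!r}")
    return Counter(row[column] for row in reader)


# Tiny in-memory stand-in for the real metadata file.
# The column names here are hypothetical.
sample = io.StringIO(
    "sampleid\torder\n"
    "S1\tDiptera\n"
    "S2\tDiptera\n"
    "S3\tColeoptera\n"
)
print(count_by_column(sample, "order"))
```

For the real file, pass a handle opened with `open(path, newline="")` on the downloaded .tsv file and substitute a column name taken from its header.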

Dataset Statistics

To see the statistics of the BIOSCAN-1M Insect dataset, run the following:

python main.py --print_statistics --exp_name <experiment_name>

Dataset Sampling

To split the BIOSCAN-1M Insect dataset into train, validation, and test sets using stratified class-based sampling, run the following:

python main.py --make_split
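
For readers unfamiliar with the idea, stratified class-based sampling splits each class separately so that every split keeps roughly the same class proportions. The sketch below is a generic illustration, not the repository's implementation; the 70/10/20 percentages, the function name, and the toy labels are all assumptions.

```python
import random
from collections import defaultdict


def stratified_split(labels, percents=(70, 10, 20), seed=0):
    """Split sample indices into train/val/test, class by class.

    `percents` are hypothetical train/val/test percentages; the test
    split receives whatever remains after train and val are filled.
    """
    if sum(percents) != 100:
        raise ValueError("percents must sum to 100")
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    splits = {"train": [], "val": [], "test": []}
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = len(indices) * percents[0] // 100
        n_val = len(indices) * percents[1] // 100
        splits["train"] += indices[:n_train]
        splits["val"] += indices[n_train:n_train + n_val]
        splits["test"] += indices[n_train + n_val:]
    return splits


# Toy labels with an 80/20 class imbalance.
labels = ["Diptera"] * 80 + ["Hymenoptera"] * 20
parts = stratified_split(labels)
print({name: len(idx) for name, idx in parts.items()})
```

Note that with integer rounding, very small classes can end up with no training or validation samples, so rare classes may need special handling.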

To see the statistics of the BIOSCAN-1M Insect dataset split sets, run the following:

python main.py --print_split_statistics --exp_name <experiment_name>

Preprocessing

To save time and computational resources when running experiments on the BIOSCAN-1M Insect dataset's RGB images, we implemented an offline preprocessing step composed of two main modules:

Resize tool
Crop tool

The resizing and cropping tools are applied to the original RGB images ahead of time, so that the subsequent experiments run more efficiently.


[Figure: four example original images (top row) and four cropped images (bottom row).]

To resize and save original full size images, run the following:

python main.py --resize_image --resized_image_path <path_to_resized_images> --resized_hdf5_path <path_to_resized_hdf5>
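
"Resized to 256 on the smaller dimension" means the shorter side becomes 256 pixels while the aspect ratio is preserved. The following sketch shows only the size calculation. It is not the repository's resize code, and rounding the longer side to the nearest integer is an assumption.

```python
def resized_size(width: int, height: int, target: int = 256) -> tuple[int, int]:
    """Scale (width, height) so the smaller side equals `target`.

    The aspect ratio is preserved; the longer side is rounded to the
    nearest pixel (an assumption for this sketch).
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    scale = target / min(width, height)
    # max() guards against floating-point error on the smaller side.
    new_width = max(target, round(width * scale))
    new_height = max(target, round(height * scale))
    return new_width, new_height


print(resized_size(1024, 768))
print(resized_size(768, 1024))
```

The resulting size can be passed to any image library's resize function.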

To use our cropping tool, download the checkpoint BIOSCAN_Insect_crop_tool_checkpoint.ckpt from the project's GoogleDrive (stored in the directory BIOSCAN_1M_Insect_checkpoints/crop_tool_checkpoint) and make sure the path to it is configured correctly in the main.py script. Then run the following to create and save the cropped images as well as their resized versions:

python main.py --crop_image --cropped_image_path <path_to_cropped_images> --resized_cropped_image_path <path_to_resized_cropped_images>

If --cropped_hdf5_path and --resized_cropped_hdf5_path are also set, the cropped images and resized cropped images are saved in HDF5 file format as well.

Classification Experiments

Two image-based classification experiments were conducted, focusing on the taxonomy ranking of insects. The first set of experiments involved classifying BIOSCAN-1M Insect dataset's images into 16 orders. The second set of experiments specifically targeted the Order Diptera and aimed to classify its members into 40 families, which constitute a significant portion of the order.

The figure depicts the class distribution and class imbalance in the BIOSCAN-1M Insect dataset. We focus on the 16 most densely populated orders (top) and the 40 most densely populated Diptera families (bottom). The figure demonstrates that class imbalance is an inherent characteristic of the insect community.

Train

To train the model on a classification task using a baseline model, you can run the following command, setting the name of the experiment:

python main.py --loader --train --data_format <hdf5/folder> --exp_name <experiment_name>

Both the folder and HDF5 data formats are supported, making it convenient to conduct experiments using dataset packages.

Test

To evaluate our top-performing models, which were trained in the experiments described in the BIOSCAN-1M-Insect paper, download the available checkpoints from the GoogleDrive (stored in the directory BIOSCAN_1M_Insect_checkpoints/classification_checkpoints) and make sure the paths to the dataset images, metadata file, and results are configured correctly in the main.py script.

Then, for order-level classification using the resized and cropped images of the BIOSCAN-1M Insect Large dataset, run the following:

python main.py --loader --test --exp_name large_insect_order --best_model large_insect_order_vit_base_patch16_224_CE_s2 --model vit_base_patch16_224 --loss CE --seed 2

Figure presents per-class top-1 test accuracy of the Insect-Order and Diptera-Family classification experiments of the Large dataset.
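
Per-class top-1 accuracy is the fraction of samples of each true class whose highest-scoring prediction matches that class. On an imbalanced dataset it reveals weak minority classes that overall accuracy can hide. A minimal sketch with toy labels (not results from the paper) is given below; the function name is hypothetical.

```python
from collections import defaultdict


def per_class_top1(y_true, y_pred) -> dict:
    """Return {class: accuracy} over samples whose true label is that class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        correct[truth] += int(truth == pred)
    return {cls: correct[cls] / total[cls] for cls in total}


# Toy labels for illustration only.
y_true = ["Diptera", "Diptera", "Coleoptera", "Coleoptera"]
y_pred = ["Diptera", "Diptera", "Diptera", "Coleoptera"]
print(per_class_top1(y_true, y_pred))
```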

Generalization

To assess the generalization capabilities of our models, which were trained on the BIOSCAN-1M Insect dataset for order-level classification using the resized and cropped images of the BIOSCAN-1M Insect Large dataset, set the paths to the new images and to the trained model in the generalization.py script. Then run the following:

python generalization.py

Requirements

The dependencies needed to run the experiments are listed in the requirements.txt file.

Copyright and License

The images included in the BIOSCAN-1M Insect dataset available through this repository are subject to the following copyright and licensing restrictions:

Copyright Holder: CBG Photography Group
Copyright Institution: Centre for Biodiversity Genomics (email:[email protected])
Photographer: CBG Robotic Imager
Copyright License: Creative Commons-Attribution Non-Commercial Share-Alike (CC BY-NC-SA 4.0)
Copyright Contact: [email protected]

Collaborators

"Ming Gong" [email protected]
