Gwern2DeepDanbooru

Reorganizes Danbooru Datasets from Gwern to be valid for DeepDanbooru

Format Comparison

Gwern vs. DeepDanbooru, by category:

File Structure
  • Gwern: Images and metadata have separate subdirectories.
  • DeepDanbooru: Metadata is a single file alongside the Images subdirectory.

Image Subdirectories
  • Gwern: Images are bucketed into 4-digit, zero-padded subdirectories based on the final 3 digits of the image's ID.
  • DeepDanbooru: Images are bucketed into subdirectories based on the first 2 digits of the image's md5 hash. [1]

Images
  • Gwern: Images are available at full size, but a script is provided to downsample them to 512x512px jpg format for machine learning. Downloads are also available in this format. Image filenames are their IDs.
  • DeepDanbooru: Images are assumed to be named with their md5 hash.

Metadata
  • Gwern: Metadata is split across multiple json files. The files are not strictly json compliant: they are formatted as newline-separated json objects (the "array" of objects is missing its enclosing brackets and comma separators). All metadata available via Danbooru's API is included.
  • DeepDanbooru: Metadata for training is stored in a single SQLite database in a table called posts. The table's columns are id, file_ext, md5, tag_string, and tag_count_general. There is some infrastructure for rating and score, but it is not documented.

[1] Gwern notes in the introduction to the dataset that Danbooru's MD5 hashes are not always correct; accordingly, using MD5 hashes may cause issues.
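
As a rough illustration of the two bucketing schemes compared above, the sketch below builds the relative path of an image under each layout. The function names are hypothetical and are not part of Gwern2DeepDanbooru; the jpg extension is assumed from the 512px downloads.

```python
from pathlib import PurePosixPath


def gwern_image_path(image_id: int, ext: str = "jpg") -> PurePosixPath:
    """Gwern layout: the bucket is the final 3 digits of the ID, zero-padded to 4."""
    bucket = f"{image_id % 1000:04d}"
    return PurePosixPath(bucket) / f"{image_id}.{ext}"


def deepdanbooru_image_path(md5: str, ext: str = "jpg") -> PurePosixPath:
    """DeepDanbooru layout: the bucket is the first 2 hex digits of the md5 hash."""
    return PurePosixPath(md5[:2]) / f"{md5}.{ext}"


print(gwern_image_path(2345678))  # 0678/2345678.jpg
print(deepdanbooru_image_path("d41d8cd98f00b204e9800998ecf8427e"))
# d4/d41d8cd98f00b204e9800998ecf8427e.jpg
```

Both paths are relative to the image root of the respective dataset.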

Installation

A PyPI package has not yet been published, so either clone this repository or install directly from GitHub:
pip install git+https://github.com/AdamantLife/Gwern2DeepDanbooru

Basic Usage

(Remember to always maintain a backup of your data in case you wish to use the Gwern data in its original format)

While Gwern2DeepDanbooru offers a variety of methods, the baseline usage is a single command:

cd {gwern data location}
python -m Gwern2DeepDanbooru run

This command will:

  • create a new directory called Project/ in the current working directory
  • compile all metadata into a single, valid json file
  • move all images that have metadata available from this directory to the appropriate subdirectory in Project/Images/
  • create Project/project.sqlite3 and Project/tags.txt
  • populate the database and text file with the required data to train DeepDanbooru.
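
The metadata step can be pictured with the following minimal sketch, which reads newline-separated json objects and writes them back out as one strictly valid json array. It is not the package's implementation: the helper names and the part0.json file are hypothetical, and the whole record list is held in memory, which the real run command may well avoid.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterable, Iterator


def iter_ndjson(paths: Iterable[Path]) -> Iterator[dict]:
    """Yield one dict per non-empty line across all newline-separated json files."""
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:
                    yield json.loads(line)


def combine_metadata(paths: Iterable[Path], output: Path) -> int:
    """Write every record into a single json array and return the record count."""
    records = list(iter_ndjson(paths))
    with open(output, "w", encoding="utf-8") as handle:
        json.dump(records, handle)
    return len(records)


# Small demonstration on a throwaway file with two made-up records.
with tempfile.TemporaryDirectory() as tmp:
    part = Path(tmp) / "part0.json"
    part.write_text('{"id": "1", "md5": "aa"}\n{"id": "2", "md5": "bb"}\n', encoding="utf-8")
    combined = Path(tmp) / "allmetadata.json"
    count = combine_metadata([part], combined)
    loaded = json.loads(combined.read_text(encoding="utf-8"))

print(count, loaded)
```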

Alternative Usage

The following produces virtually the same result as the command line above, but its behavior can be modified. It also takes longer to complete and requires more resources.

from Gwern2DeepDanbooru import G2DD

g2dd = G2DD()

## Locates all available Gwern resources and creates the base structure for a DeepDanbooru project
g2dd.initialize__directories()

## Combines all metadata files into a file called "allmetadata.json" in the same directory
## as the metadata, strips out information not used by DeepDanbooru, and removes the metadata
## for missing images
g2dd.create_allmetadata_minimal()

## Removes images which do not have corresponding metadata
g2dd.clean_images()

## Performs the following modifications to the dataset:
##      Updates allmetadata.json with the correct file extension (Gwern converts all images to .jpg)
##      Updates allmetadata.json with the correct md5 hash (in case the md5 hash was incorrect)
##      If a hash collision occurs (two images with the same md5 hash) checks if the images are
##          actually identical: if so, combines the tags from both images and removes the subsequently
##          found image
##      Checks if the image is completely blank: if so, removes it
## The last two points can be modified with the appropriate arguments: consult the docs for more information
g2dd.gwern.prepare_images_for_project()

## Creates the DeepDanbooru Project by adding allmetadata to Project/project.sqlite3 and moving all
## images in allmetadata.json to their appropriate folder in Project/images/
g2dd.create_project()
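
To make the md5 correction and tag-combining steps in the comments above more concrete, here is a minimal standalone sketch. It does not call Gwern2DeepDanbooru; file_md5 and merge_duplicates are hypothetical helpers, tag_string is assumed to be space-separated, and records with equal md5 values are treated as identical images (the byte-for-byte check mentioned above is omitted).

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large images are not read into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def merge_duplicates(records: List[dict]) -> List[dict]:
    """Keep the first record per md5 and fold the tags of later duplicates into it."""
    kept: Dict[str, dict] = {}
    for record in records:
        first = kept.get(record["md5"])
        if first is None:
            kept[record["md5"]] = dict(record)
            continue
        tags = first["tag_string"].split()
        tags += [tag for tag in record["tag_string"].split() if tag not in tags]
        first["tag_string"] = " ".join(tags)
    return list(kept.values())


# Made-up records: posts 1 and 2 share an md5, so post 2 is folded into post 1.
records = [
    {"id": 1, "md5": "aa", "tag_string": "1girl solo"},
    {"id": 2, "md5": "aa", "tag_string": "solo smile"},
    {"id": 3, "md5": "bb", "tag_string": "scenery"},
]
merged = merge_duplicates(records)
print(merged)
```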

Additional Information

Tags Table

Gwern2DeepDanbooru offers a number of other utilities for working with the dataset. One important utility to be aware of is the tags table created in Project/project.sqlite3. This table records all tags added to posts in the database via the methods in Gwern2DeepDanbooru.project (which G2DD instances also use), and it makes some tag-querying methods faster. If you modify the tag_string column of posts manually, call Gwern2DeepDanbooru.project.sync_tags(database, postid) so that the tags table is updated to match.
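
The sketch below illustrates why a manual edit to tag_string calls for a resync. It is not the package's schema or its sync_tags implementation: the post_tag_index table and resync_post function are hypothetical, the column types are guesses, and tag_string is assumed to be space-separated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Column names follow the posts table described for DeepDanbooru; types are assumptions.
conn.execute(
    "CREATE TABLE posts (id INTEGER PRIMARY KEY, file_ext TEXT, md5 TEXT, "
    "tag_string TEXT, tag_count_general INTEGER)"
)
# Hypothetical lookup table: one row per (post, tag) pair.
conn.execute("CREATE TABLE post_tag_index (post_id INTEGER, tag TEXT)")


def resync_post(conn: sqlite3.Connection, post_id: int) -> None:
    """Rebuild the index rows for one post from its current tag_string."""
    row = conn.execute(
        "SELECT tag_string FROM posts WHERE id = ?", (post_id,)
    ).fetchone()
    conn.execute("DELETE FROM post_tag_index WHERE post_id = ?", (post_id,))
    if row is not None:
        conn.executemany(
            "INSERT INTO post_tag_index (post_id, tag) VALUES (?, ?)",
            [(post_id, tag) for tag in (row[0] or "").split()],
        )


conn.execute("INSERT INTO posts VALUES (1, 'jpg', 'aa', '1girl solo', 2)")
resync_post(conn, 1)

# A manual edit to tag_string leaves the index stale until it is resynced.
conn.execute("UPDATE posts SET tag_string = '1girl smile' WHERE id = 1")
stale = [r[0] for r in conn.execute("SELECT tag FROM post_tag_index ORDER BY tag")]
resync_post(conn, 1)
fresh = [r[0] for r in conn.execute("SELECT tag FROM post_tag_index ORDER BY tag")]
print(stale, fresh)  # ['1girl', 'solo'] ['1girl', 'smile']
```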

Test Set

A relatively small Test Set can be found here on Google Drive. It is Gwern-formatted and contains the following:

  • 1004 semi-random images:
    • The images are organized across 10 directories
    • 1 image does not have metadata and should be ignored by most cleaning operations
    • 1 is a blank image, which should be ignored by cleaning operations that include the ignore_blanks argument
    • 2 are the same image and should have their tags combined by operations that support combining
  • 1503 metadata dicts:
    • 1003 for the images that have metadata
    • 500 additional dicts that do not have an associated image and should be ignored by many operations
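
As a sanity check on these counts, the sketch below reconciles a set of image IDs against a set of metadata IDs. The IDs are synthetic and only mirror the numbers listed above; they are not the real contents of the Test Set, and reconcile is a hypothetical helper.

```python
def reconcile(image_ids: set, metadata_ids: set) -> dict:
    """Split IDs into matched pairs and the two kinds of orphans."""
    return {
        "matched": image_ids & metadata_ids,
        "images_without_metadata": image_ids - metadata_ids,
        "metadata_without_images": metadata_ids - image_ids,
    }


image_ids = set(range(1, 1005))     # 1004 images
metadata_ids = set(range(2, 1505))  # 1503 metadata dicts; image 1 has none
result = reconcile(image_ids, metadata_ids)
counts = {name: len(ids) for name, ids in result.items()}
print(counts)
# {'matched': 1003, 'images_without_metadata': 1, 'metadata_without_images': 500}
```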

Documentation

TODO

gwern2deepdanbooru's People

Contributors

adamantlife

gwern2deepdanbooru's Issues

Multiprocessing Execution

Issue #2 (Optimization for run command) also makes me want to introduce a Multiprocessing option. This will likely hog system resources (which Gwern2DeepDanbooru run was specifically written to avoid), so arguments to limit the impact should be included.

Syntax error project.py

After installing with
pip install git+https://github.com/AdamantLife/Gwern2DeepDanbooru

When running
python -m Gwern2DeepDanbooru run

Error
Traceback (most recent call last):
  File "C:\Python38\lib\runpy.py", line 185, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "C:\Python38\lib\runpy.py", line 144, in _get_module_details
    return _get_module_details(pkg_main_name, error)
  File "C:\Python38\lib\runpy.py", line 111, in _get_module_details
    __import__(pkg_name)
  File "C:\Python38\lib\site-packages\Gwern2DeepDanbooru\__init__.py", line 26, in <module>
    from Gwern2DeepDanbooru import utils, gwern, project
  File "C:\Python38\lib\site-packages\Gwern2DeepDanbooru\project.py", line 95
    def get_metadata_by_id(self, _id, , rowid = False, db = None):
                                      ^
SyntaxError: invalid syntax
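
The error can be reproduced without the package. The sketch below compiles the reported signature and a variant with the doubled comma removed. Whether the extra comma is simply a typo, or a parameter is missing between the two commas, is an assumption here and not confirmed by the issue.

```python
# The signature exactly as reported, and a variant with the stray comma removed.
broken = "def get_metadata_by_id(self, _id, , rowid = False, db = None): pass"
fixed = "def get_metadata_by_id(self, _id, rowid = False, db = None): pass"


def parses(source: str) -> bool:
    """Return True if the source compiles, False if it raises SyntaxError."""
    try:
        compile(source, "<sketch>", "exec")
    except SyntaxError:
        return False
    return True


print(parses(broken), parses(fixed))  # False True
```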

New Optimizations for run command

After the most recent commit, the base command Gwern2DeepDanbooru run takes significantly longer on the full-size dataset. I'm working on optimizing it further.
