anansi-project / rfcs

An initiative to structure the world of metadata for Comic Books, Mangas and other graphic novels.

rfcs's Introduction

The Anansi Project

What is it?

The Anansi Project is an initiative to bring structure and cohesion to the world of metadata for Comic Books, Mangas, and other graphic novels.

The basic premise of this project is that the current state of things is inadequate. There are multiple existing formats (ComicRack, ComicBookInfo, ACBF, CoMet…), but most of them lack a clear specification, governance, or both. As a result, all producing and consuming applications are forced to support multiple formats, often interpreting them differently because of the missing specifications, and with limited capabilities, since no single format can handle all the use cases.

How will that work?

The Anansi Project starts with an RFC (Request For Comments) process, to collect feedback from the community.

It will focus on 3 different areas:

Target metadata model
Metadata containers
Metadata sources of truth

When talking about metadata containers, it is important to distinguish between the WHAT and the HOW. The WHAT is the information we want to represent, the HOW is the way we store the information.

Let's take for example the ComicRack ComicInfo.xml format:

the HOW is the file format itself, an XML based document
the WHAT is the various fields that are described in the XML schema. For example, the StoryArc element, describing a single Story Arc the comic book belongs to.
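To make the distinction concrete, here is a minimal sketch in Python. The XML fragment is a made-up sample, not taken from a real file; only the StoryArc element name comes from the text above, and the helper name read_story_arc is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Hypothetical ComicInfo.xml content, invented for illustration only.
SAMPLE = """<ComicInfo>
  <Series>Example Series</Series>
  <Number>1</Number>
  <StoryArc>Example Arc</StoryArc>
</ComicInfo>"""


def read_story_arc(xml_text: str) -> Optional[str]:
    """Parse the XML container (the HOW) and return the story arc (the WHAT)."""
    root = ET.fromstring(xml_text)
    # findtext returns None when the element is absent.
    return root.findtext("StoryArc")


print(read_story_arc(SAMPLE))
```

The parsing code only deals with the HOW; the value it returns is the WHAT, and would be the same information if it were stored in a different container.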

Target metadata model (the WHAT)

Before discussing any implementation details (the HOW), we should work toward a target metadata model which can cater for all the different use cases the community has around metadata.

Using the existing metadata models from the various containers will help to highlight their limitations, and to work toward a better model.

It is important to note that the target metadata model will need to evolve over time, and as such will need to be versioned.

The following UML data model serves as the current state of the target data model:

If you want to comment on the target data model, create a GitHub Issue.

Metadata containers (the HOW)

Metadata containers are numerous, but most of the time those formats have a strong coupling between their data model and container format.

A container format is no more than a means to serialize data, so that it is easier to store alongside the file it relates to.

The idea behind this stream is two-fold:

Focus on the existing container formats themselves, but not their data model, to highlight their limitations.
Discuss the required properties of an ideal container format.
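As a rough illustration of decoupling the data model from the container, the sketch below serializes one model to two different containers. The Book class and its two fields are assumptions made for the example, not part of the Anansi model.

```python
import json
import xml.etree.ElementTree as ET
from dataclasses import asdict, dataclass


@dataclass
class Book:
    # Hypothetical minimal model (the WHAT); field names are illustrative.
    title: str
    number: str


def to_json(book: Book) -> str:
    """One possible container (the HOW): a JSON document."""
    return json.dumps(asdict(book), sort_keys=True)


def to_xml(book: Book) -> str:
    """Another possible container: an XML document with one element per field."""
    root = ET.Element("Book")
    for key, value in asdict(book).items():
        ET.SubElement(root, key).text = value
    return ET.tostring(root, encoding="unicode")


book = Book(title="Example Title", number="1")
print(to_json(book))
print(to_xml(book))
```

Both outputs carry the same information, which is the property a decoupled container format would need to preserve.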

Metadata sources of truth

If the first 2 are the WHAT and HOW, this one could be considered as the WHERE.

In the U.S. Comics world, ComicVine is considered the source of truth for metadata. Other mediums have their own source of reference data.

The idea of this stream is to gather as much information as possible on the various metadata sources, including their limitations, both regarding the data itself and the ownership of that data.

It might be too early to decide on building an open-source metadata source of truth, but that could be the end goal.

Great! How can I participate?

The Anansi Project is just starting, first with the RFC process. We will use GitHub Issues to discuss topics related to the 3 areas of work. Once we have enough information on a topic, it will be consolidated into dedicated Markdown files.

rfcs's Issues

Small Suggestions based on recent Data Model

The UML diagram that was recently added is looking good. A few small things that might be useful for consumers of the metadata:

Artist: would sit alongside Author, and would cover most use cases where someone other than the Author does the artwork, which is common in comics. Additionally, this could carry a type field to specify the exact work, such as Cover Artist, Artist, etc., if you wanted to go down that route.
A Series Type field could also be of benefit, allowing a series to be specified as a Comic, Manga, Manhwa, Graphic Novel, etc.
Rating: this one could be tricky, as every site and every country has a different rating system.
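A minimal sketch of the first two suggestions, assuming a typed credit and a series type enumeration. All class, field, and value names here are hypothetical and only mirror the examples given in the list above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Role(Enum):
    AUTHOR = "author"
    ARTIST = "artist"
    COVER_ARTIST = "cover artist"


class SeriesType(Enum):
    COMIC = "comic"
    MANGA = "manga"
    MANHWA = "manhwa"
    GRAPHIC_NOVEL = "graphic novel"


@dataclass
class Credit:
    name: str
    role: Role  # the "type field" specifying the exact work


@dataclass
class Series:
    title: str
    series_type: SeriesType
    credits: List[Credit] = field(default_factory=list)


def names_with_role(series: Series, role: Role) -> List[str]:
    """Return the names of everyone credited with the given role."""
    return [credit.name for credit in series.credits if credit.role is role]


series = Series(
    title="Example Series",
    series_type=SeriesType.MANGA,
    credits=[
        Credit("Writer Name", Role.AUTHOR),
        Credit("Artist Name", Role.ARTIST),
        Credit("Cover Name", Role.COVER_ARTIST),
    ],
)
print(names_with_role(series, Role.ARTIST))
```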

Metron (DB w/API) Open Source

I was doing some searching to try to find a good solution and ran into this. It's an open source DB with an API for comics. Would it be feasible to fork this project, get a download of the data from Grand Comics Database to jump-start populating it, provide a page for the Komga community to maintain it, and have Komga use it for scraping?

https://metron.cloud/
https://github.com/bpepple/metron
https://www.comics.org/

How do you intend to avoid the "competing standards" issue?

I absolutely love this project, combining all of the best parts of the current comic metadata formats into a proper, open-source standard. But I have to ask: how do you intend to avoid the issue so easily represented by this xkcd comic? Will it be easy to convert our older formats to this new one, and will it be easy for devs to integrate it?

ComicVine

The behemoth of U.S. Comics.

Website: https://comicvine.gamespot.com/
API: yes, public

Tags that should be added (based on komga)

- Data on whether the series is read right to left
- Whether the series is finished or ongoing

The Doujinshi and Manga Lexicon

website: https://www.doujinshi.org/

(potentially NSFW)

CoMet

Website: http://www.denvog.com/comet/

Format: XML, apparently stored inside the archive at the root

There should be an XML schema available, but I can't find it.

MyAnimeList

Website: https://myanimelist.net/

Comixology

Website: https://www.comixology.com/
API: not that I am aware of

I created a very basic scraper for this site.

https://github.com/SenorSmartyPants/Comixology-Scraper

Cover dates

I’m not sure how widespread this is, but some publishers of US comics have two publishing dates: the date the issue was released, and a cover date, which is typically a couple months ahead, and also fuzzy. ComicVine distinguishes the two as “In Store Date” and “Cover Date” respectively.

For example, Wonder Woman #761 was released on August 25, but has a cover date of October.

It might be useful to distinguish the two, at the very least to prevent confusion for people writing scrapers or manually writing metadata.
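One way to keep the two dates apart is sketched below, with an exact in-store date and a deliberately fuzzy cover date. The class and field names are hypothetical, and the year 2020 in the sample is an assumption, since the text above gives only the day and months.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class CoverDate:
    # Fuzzy by design: a cover date often carries no day, sometimes no month.
    year: int
    month: Optional[int] = None


@dataclass
class Release:
    store_date: Optional[date]        # the "In Store Date"
    cover_date: Optional[CoverDate]   # the "Cover Date"


def months_ahead(release: Release) -> Optional[int]:
    """Whole months between the store date and the cover date, if both are known."""
    store, cover = release.store_date, release.cover_date
    if store is None or cover is None or cover.month is None:
        return None
    return (cover.year - store.year) * 12 + (cover.month - store.month)


# Wonder Woman #761 from the example above; the year is assumed for illustration.
release = Release(store_date=date(2020, 8, 25), cover_date=CoverDate(2020, 10))
print(months_ahead(release))
```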

External metadata files

I don't know if this project is concerned with where the metadata file is stored, i.e. like how ComicInfo.xml is stored inside a compressed comic archive. If the project is, I would propose support for external metadata files.

An external metadata file would look something like this:

Comic File	Metadata File
`Batman 001 (1940).pdf`	`Batman 001 (1940).xml`

Here are a few reasons why this would be good:

Not all comics are archives. Some are EPUB, PDF, etc.
Editing a metadata file within an archive is generally not an atomic write operation, so the risk of corrupting the comic file is higher. It's better for a metadata file to become corrupt than a comic book file.

Some projects which solved this problem in a similar way:

Kodi: they store external XML metadata in a similar way to what I described above.
Calibre: they use a metadata.opf file which describes a book.
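A minimal sketch of the sidecar idea, assuming the metadata file simply takes the comic's name with an .xml extension as in the table above. The function names are hypothetical. The write goes to a temporary file in the same directory and is then moved into place with os.replace, so the comic file itself is never opened for writing.

```python
import os
import tempfile
from pathlib import Path


def sidecar_path(comic: Path) -> Path:
    """Map 'Batman 001 (1940).pdf' to 'Batman 001 (1940).xml'."""
    return comic.with_suffix(".xml")


def write_sidecar(comic: Path, xml_text: str) -> Path:
    """Write the metadata next to the comic without touching the comic file."""
    target = sidecar_path(comic)
    # The temporary file lives in the target directory so the final rename
    # stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=str(target.parent), suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(xml_text)
        os.replace(tmp_name, target)
    except BaseException:
        # Clean up the temporary file if anything went wrong before the rename.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return target


print(sidecar_path(Path("Batman 001 (1940).pdf")))
```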

Metadata container vs Source of Truth consistency

Hello! First off, thank you so much for starting this. Comic book metadata has needed some freshening up for a while.
So, my question. Do you envision some sort of embedded link in the metadata container to its original source of truth, which will reside alongside the other metadata?
I mention this because metadata container obsolescence will always be with us, I think, and some mechanism for linking sources of truth with one another might facilitate keeping the data alive and accurate. I dunno. It looks like a problem with no clear answers.

Splitting of Static and Dynamic Data

Problem

A problem I've had with the various formats that have existed so far, as well as with this project, is the storing of dynamic data. Dynamic data is constantly changing, so unless I'm constantly checking, my data is suddenly out of date.

Example:

Series status
Total issue count in a series
User Ratings

Proposition

I don't have an idea of how to keep the dynamic data current, but I would suggest that the file(s) created from this RFC should only contain static data.

Manga Updates

Website: https://www.mangaupdates.com/index.html

Character Appearances

WHAT: List of character appearances in book
I'm unsure if this is something that could simply be chalked up to tags or collections, but it would be nice if you could link a book to specific characters based on their appearance in the book. ComicVine integrates this into the Characters section of each issue (e.g. X Lives of Wolverine). This information would make it much easier to create smart collections and search for issues featuring a particular character.
HOW: A list of strings (one per character). I'm not sure if there are issues with referring to them as simple strings (characters from different publishers with the same name spring to mind as a potential issue).
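The name-collision concern can be illustrated with a small sketch in which a character reference carries a publisher next to the name. This is only one possible disambiguator, chosen for the example; all names and sample data are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CharacterRef:
    name: str
    publisher: str  # assumed disambiguator; an identifier would work as well


@dataclass
class Book:
    title: str
    characters: List[CharacterRef]


def books_featuring(books: List[Book], character: CharacterRef) -> List[str]:
    """Titles of the books in which this exact character appears."""
    return [book.title for book in books if character in book.characters]


hero_a = CharacterRef("Example Hero", "Publisher A")
hero_b = CharacterRef("Example Hero", "Publisher B")  # same name, other publisher

books = [
    Book("Book 1", [hero_a]),
    Book("Book 2", [hero_b]),
    Book("Book 3", [hero_a, hero_b]),
]
print(books_featuring(books, hero_a))
```

With plain strings, a search for "Example Hero" would return all three books; the qualified reference keeps the two characters apart.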

Support for multiple Series

Would it be possible to support a given Title being part of multiple Series? It would be useful for crossovers.

In an example of Witchblade / The Darkness, the title would be part of both the Witchblade and The Darkness series. I understand most would simply file it under a single series called "Witchblade / The Darkness", but I think some would appreciate the option to individually catalog them.

Grand Comics Database

Website: https://www.comics.org/
API: No, however there is an advanced search capability on the website: https://www.comics.org/search/advanced/

Two additional key items:

Wiki page on database schema: https://docs.comics.org/wiki/Current_Schema
The SQL database itself is publicly available: https://www.comics.org/download/

OPDS 2 and Readium Web Publication Manifest

Regarding the target data model for comics, some work has already been done in the context of the OPDS 2 protocol.
It could be worth looking into.

How do magazine serializations fit into the model?

Especially prevalent in manga, Japanese comics tend to start by being published to a weekly/monthly/other magazine.

Each issue of the magazine has a 1..n relationship to series.

It's also worth noting that the magazine is not the same as the publisher of the volumes that the issues are collected in.

The magazine itself can be ongoing or complete. And the multiple series in it can be ongoing or complete.

Baka-Updates shows all of this information: https://www.mangaupdates.com/series.html?id=130757
Shuukan Shounen Jump (Shueisha) aka Weekly Shonen Jump is the magazine for Demon Blade / Kimetsu no Yaiba.
Though it's worth noting Amazon doesn't even have a place to display "Shuukan Shounen Jump (Shueisha)".

As a use case for this data, a user may want to see "In this issue, these Series are present", where each Series on the magazine's page could link to the published volumes of that Series.

You could then search for a Series and see which magazine issues it appears in as well.

How would this fit into the anansi model?
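To frame the question, here is a rough sketch of the relationships described above: a magazine and its series each have their own status, and each magazine issue lists 1..n series. None of these classes exist in the Anansi model; every name and the sample data are assumptions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Status(Enum):
    ONGOING = "ongoing"
    COMPLETE = "complete"


@dataclass
class Magazine:
    title: str
    status: Status


@dataclass
class Series:
    title: str
    status: Status


@dataclass
class MagazineIssue:
    magazine: Magazine
    number: int
    series: List[Series]  # 1..n series serialized in this issue


def series_in_issue(issue: MagazineIssue) -> List[str]:
    """'In this issue, these Series are present'."""
    return [s.title for s in issue.series]


def issues_containing(issues: List[MagazineIssue], series_title: str) -> List[int]:
    """Issue numbers in which the given series appears."""
    return [i.number for i in issues if series_title in series_in_issue(i)]


magazine = Magazine("Example Magazine", Status.ONGOING)
series_a = Series("Series A", Status.COMPLETE)
series_b = Series("Series B", Status.ONGOING)
issues = [
    MagazineIssue(magazine, 1, [series_a, series_b]),
    MagazineIssue(magazine, 2, [series_b]),
]
print(series_in_issue(issues[0]))
print(issues_containing(issues, "Series B"))
```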

Bedetheque

Bedetheque is one of the oldest and most well-known sources of truth for French BDs. The data behind the website is also used in their commercial product, BD Gest'.

From what I could gather, the data is manually input by users of the BD Gest' software.

Website: https://www.bedetheque.com/
API: no, but there is a scraper

Advanced Comic Book Format

Website: https://acbf.fandom.com/wiki/Advanced_Comic_Book_Format_Wiki

Format: XML

Has an XML schema. The schema is versioned, and uses XML namespaces.

Collected issues

How will collected issues be handled? ComicVine handles them pretty awfully, making them their own series rather than a part of the series of issues they're collecting. For example, at https://comicvine.gamespot.com/transformers/4050-29903/, each of the 7 collected issues has its own series page and is issue one of its respective series; I had to manually edit each one in ComicTagger. Most add "tpb" to the end despite being available digitally, which is also annoying.

I think a solution would be to have a specific tag or option for collected editions. A simple true/false toggle to say whether it's a collected issue in a series would work well: it would be a part of the same series as its non-collected counterparts, but apps could separate them into their own categories. You could use the same issue # and title fields; only that one tag would change what it's categorized as.

Bubble

A French company which partners with local comic book stores for deliveries. They also have an app to organize your physical collection of BDs.

Website: https://www.appbubble.co/
API: Yes, private.
Data redistribution: No, the data they use is coming from a third-party provider (ORB).

What is a Series?

There's been some talk about series in a few issues, and my feeling is that the definition of a Series is not exactly clear. It also highlights some problems:

The elephant in the room: the Volume
A typical construct of the US comics industry, the volume is used alongside series to identify some kind of portion of a series. It is safe to assume that we most likely need to add the Volume in the data model, else the US comics fanbase would be lost without it!
Collected editions (discussed in #21)
As suggested in #21, having a way to specify the Edition for books in a Series could help to address this point. Some examples of Editions could be:
- collected editions (TPB for US comics, Intégrales for Franco-Belge, Tankobon for manga)
- single issues (US comics floppies, manga published in chapters)
- Black & White, or colorized edition
- different translation of a manga. For example in the early days of manga published in France, the images would be reversed to be presented in Left-to-Right direction, but nowadays everything is published in the original reading direction. Some manga have been republished in the original reading direction. It could also be using a newly updated translation, or original onomatopoeia (again, in the early days, even drawn onomatopoeia would be translated and redrawn).
Translation language
The same series could be translated into multiple languages (think Tintin). Some people consider the original series and its translation to be the same series, some consider them to be 2 different series.

I also wanted to highlight the difference between:

the data model
This is how the data can be represented. The various fields in the database, the relationship between objects.
the data organization
This is how the data model is filled with actual data.

Some problems can be solved by either a good data model, or a good data organization, or both. Even a good data model can lead to bad data organization.

A good example of bad data organization is how ComicVine represents the TPB, where each TPB is its own Series. Many people consider this to be a problem, and would much prefer to have a Series for all the TPBs.

Kitsu

Website: https://kitsu.io/explore/manga

Reading List Data

This seems to have been touched on briefly in #3, but doesn't seem to be incorporated into the data structure in any way.

I want to propose a new link to the "Book" structure called "StoryArc". Each Book would be "part of" 0..* different StoryArc structures. Each StoryArc contains a string for title.

There would need to be a StoryArcEntry containing an int for readingOrderNumber.

The readingOrderNumber may change depending on user's manual modifications to the reading list, so not sure if it's as simple as I've described above. Unless this system contains a single readingOrderNumber as the absolute source of truth that can't be modified (I imagine modifications to the reading order would then need to be handled by each specific app, rather than within this structure).
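The proposal can be sketched as follows, assuming each Book carries its own StoryArcEntry records so that a tagged file stays self-contained. StoryArc, StoryArcEntry, and readingOrderNumber come from the proposal above; the Python shapes, the snake_case names, and the sample data are assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StoryArcEntry:
    arc_title: str              # title of the StoryArc this entry belongs to
    reading_order_number: int   # position of the book within that arc


@dataclass
class Book:
    title: str
    story_arcs: List[StoryArcEntry] = field(default_factory=list)  # 0..* arcs


def reading_list(books: List[Book], arc_title: str) -> List[str]:
    """Book titles of one story arc, sorted by reading order number."""
    entries = [
        (entry.reading_order_number, book.title)
        for book in books
        for entry in book.story_arcs
        if entry.arc_title == arc_title
    ]
    return [title for _, title in sorted(entries)]


books = [
    Book("Book A", [StoryArcEntry("Example Arc", 2)]),
    Book("Book B", [StoryArcEntry("Example Arc", 1), StoryArcEntry("Other Arc", 1)]),
    Book("Book C"),
]
print(reading_list(books, "Example Arc"))
```

In this sketch the reading order is rebuilt from the books themselves; user-specific reordering would have to live in the consuming application, as the paragraph above suggests.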

Under the current model, does all info about the relevant collections, series, etc. get stored within each book? Or is the intention to reference specific collection/series/story arc IDs which contain more info about these objects? I imagine each tagged file needs to be completely self-contained without dependence on external structures; however, it doesn't seem efficient to try to keep everything within the book.

Marvel Developer Portal

Looks like Marvel has their own API for accessing comic metadata. Is there any concern with using official sources of truth?

https://developer.marvel.com/

Inducks

Inducks database at https://inducks.org/index.php is THE source of truth for all Disney comics.
There's no native API provided, although some time ago I started a project exposing at least part of the db (publicly available as a nightly snapshot) via REST API.

Brainstorming on adoption

Hello! I am developing a comic book curation webapp, and have made some progress on various fronts. I stumbled across this initiative a little too late, I must admit. My app offers ComicVine scraping against a local library of comic books (no manga support)

To that end, my biggest frustrations have been:

Defining what a comic book even is, in the context of volumes, series, trade paperbacks, alternate covers?
ComicVine, and the documentation of what seems to be the only reliable API that gives us some metadata about comic books. Parsing meaning out of it and presenting that to users whose primary objective is to curate their collection is like pulling teeth. Their metadata, while comprehensive, is inconsistent, missing, and sometimes duplicative or incorrect.

That said, it is still a very easy setup and possibly a good starting point to look at in terms of adoption.

Offering ways to meaningfully support the popular metadata formats (ComicRack et al)

My initial efforts are here: https://github.com/rishighan/threetwo-import-service/blob/76722ab6a1f1ebb8811e9f68f1803dccda20a9d9/models/comic.model.ts

I would love to get to know the extent of this effort and even adopt it.

Thanks for starting this and hope I can take pieces of it and find a way to contribute to this effort.

Comic Book Ontology

A metadata vocabulary for describing comic books and comic book collections: https://comicmeta.org/cbo/

Tables of contents haven't really been a thing in comic formats. Some manga/comics I convert to CBZ had a table of contents in their ebook format; it would be great to be able to carry them over into the CBZ too.

Open source library development

I wanted to start a conversation about developing open source libraries to support the selected format.

Which languages should we start with? My project is in Java, so I'd be glad to be a reference implementation for building and testing the format, but we would need to identify a reader that can do the remote side to provide a POC.

Anthology comics

I am wondering how this standard would propose to handle British comics such as 2000 AD, The Beano, and Sonic the Comic, which are generally published in an anthology style where each physical issue contains parts of multiple separate series.

For example, an average issue of 2000AD might contain:

A contents page.
Part 12 of an ongoing 20-part "Judge Dredd" story.
Part 2 of an ongoing 6-part "Rogue Trooper" story.
A single-part "Future Shock" story in its entirety.
A one-off original story that's not part of any established series.
A letters page.

A reader with a collection of 2000AD back issues may wish to read the complete issues back-to-back, but they're equally likely to just want to read a particular Judge Dredd storyline without seeing all the other stuff which was included in the same issues.

Right now, the proposed data model seems to assume that each "Book" contains only a single part of a single series. It would be nice if the metadata were capable of expressing this anthology structure, so that comic book reading software could extract and recombine stories as the reader likes.
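One possible way to express the anthology structure is sketched below: each issue is divided into segments with a page range, and a segment optionally belongs to a named story. The class names, fields, and page numbers are all hypothetical, and the sketch assumes every segment of a named story has a part number.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Segment:
    story: Optional[str]   # None for one-offs, contents or letters pages
    part: Optional[int]    # part number within the story, if any
    first_page: int
    last_page: int


@dataclass
class Issue:
    number: int
    segments: List[Segment]


def story_parts(issues: List[Issue], story: str) -> List[Tuple[int, int, int]]:
    """(issue number, first page, last page) for each part of a story, in part order."""
    parts = [
        (seg.part, issue.number, seg.first_page, seg.last_page)
        for issue in issues
        for seg in issue.segments
        if seg.story == story
    ]
    return [(number, first, last) for _, number, first, last in sorted(parts)]


issues = [
    Issue(1, [Segment("Story A", 1, 2, 7), Segment("Story B", 1, 8, 13),
              Segment(None, None, 14, 18)]),
    Issue(2, [Segment("Story B", 2, 2, 6), Segment("Story A", 2, 7, 12)]),
]
print(story_parts(issues, "Story A"))
```

A reader application could use the returned page ranges to present one storyline back-to-back while skipping the other content in each issue.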

Open Packaging Format (EPUB)

Industry standard used in ebooks. It relies on other standards like the Dublin Core Metadata Initiative for the metadata fields, or the MARC Code List for Relators for author roles.

Website: http://idpf.org/epub/20/spec/OPF_2.0.1_draft.htm

Format: XML

with schema, versioned, using namespaces

Anilist

Website: https://anilist.co/search/manga
API: yes

ComicRack / ComicInfo.xml

Part of ComicRack, which is not developed anymore. The single developer, cYo, disappeared from the internet. ComicRack is/was also shareware, which could limit the steering capability of the format.

Format: XML, usually stored inside the archive at the root

A schema is available, but is not versioned. At least two versions have been lying around (1, 2).
Does not use XML Namespaces

Known limitations:

There is a StoryArc field, but no field to indicate the ordering of the book in this story arc.
It is impossible to specify more than one Story Arc per book.
No field to specify the day of a release date. Only has Year and Month. ComicTagger is using the unofficial Day tag.
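The release date limitation can be illustrated with a short sketch that reads Year and Month and treats the unofficial Day tag as optional. The helper name release_date_parts and the sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Optional, Tuple


def release_date_parts(xml_text: str) -> Tuple[Optional[int], Optional[int], Optional[int]]:
    """Return (year, month, day); each part is None when its element is missing."""
    root = ET.fromstring(xml_text)

    def as_int(tag: str) -> Optional[int]:
        text = root.findtext(tag)
        return int(text) if text and text.strip() else None

    # Day is the unofficial tag mentioned above, so it is frequently absent.
    return as_int("Year"), as_int("Month"), as_int("Day")


# Hypothetical documents, invented for illustration only.
official = "<ComicInfo><Year>2020</Year><Month>8</Month></ComicInfo>"
with_day = "<ComicInfo><Year>2020</Year><Month>8</Month><Day>25</Day></ComicInfo>"
print(release_date_parts(official))
print(release_date_parts(with_day))
```

A consumer has to handle both shapes, which is the kind of divergent interpretation that an unversioned schema invites.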

Should one-shots have no series or a series of the same name?

At the moment in ComicVine, one-shots are contained in a series of the same name (example).

Does it bring any value to have a series with a single book?

COMET Standard

Not exactly useful for the anansi project as this is basically a database schema for businesses selling data, but it's still good to know about.

https://cometstandard.com/

Book Media Type & Multiple Series Possibility

WHAT: Differentiate between trades, issues and omnis as a Book Type?
HOW: The "Book" object could include a string called mediaType to handle this.

WHAT: Differentiate between trades, issues and omnis as a Series Type?
This depends on whether you have separate series for one-shots (#32), trades (#21), etc., but if so you would need a way to specify the series media type.
HOW: The "Series" object could include a string called mediaType to handle this.

WHAT: Handling Books containing multiple issues (potentially from different series)
Handling TPBs and omnis as "Book" objects would also mean that a single book can be a part of multiple series (e.g. a trade incorporating issues from multiple series).
HOW: Allow the possibility of 0..* Series objects, instead of 0..1
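The three HOW items above can be sketched together. The mediaType string comes from the proposal (written media_type here to follow Python naming); the class shapes and sample values are assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Series:
    title: str
    media_type: str  # e.g. "issue", "trade", "omni"


@dataclass
class Book:
    title: str
    media_type: str
    series: List[Series] = field(default_factory=list)  # 0..* instead of 0..1


def books_in_series(books: List[Book], series_title: str) -> List[str]:
    """Titles of all books that belong to the given series."""
    return [
        book.title
        for book in books
        if any(s.title == series_title for s in book.series)
    ]


series_a = Series("Series A", "issue")
series_b = Series("Series B", "issue")
books = [
    Book("Series A #1", "issue", [series_a]),
    Book("Crossover Trade", "trade", [series_a, series_b]),
    Book("Standalone", "issue"),
]
print(books_in_series(books, "Series B"))
```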

League of Comic Geeks

Website: https://leagueofcomicgeeks.com/
API: no