laion-ai / big-interleaved-dataset

Big-Interleaved-Dataset

License: Apache License 2.0

Python 96.21% HTML 3.79%

big-interleaved-dataset's People

subramanyamchalla24 harry-stark karthikrangasai justhungryman siddheshmhatre blackbeans327z stjordanis afareed007

big-interleaved-dataset's Issues

BILD Tracking

This issue will track the overall progress of BILD.

Presently, BILD is divided into three phases. All of these are being tracked separately.

Phase 1: Data extraction from Common Crawl, and possibly from the licensed part of the Internet Archive provided by the Webis group. Being tracked at #2
Phase 2: Data filtering for NSFW content, data quality, duplicated data, and other broad concerns. Being tracked at #3
Phase 3: The filtered data can be used to create datasets of various modalities. However, this project aims to tackle the interleaved format. Being tracked at #4

Small scale example to run in colab

BILD Phase 2

The Phase 2 pipeline will handle the various filtering steps required for the extracted data. More description will be added soon.

Some initial resources for filtering data sources: https://github.com/StevenBlack/hosts
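As a rough sketch of how a hosts-style blocklist such as the one linked above could be applied to extracted links: the snippet below parses lines of the form `0.0.0.0 example.com` and checks a URL's host against the result. The function names are hypothetical, and matching parent domains as well as exact hosts is an assumption of this sketch, not something the project has decided.

```python
from urllib.parse import urlsplit


def parse_hosts(text):
    """Collect blocked hostnames from hosts-file text."""
    blocked = set()
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        parts = line.split()
        # Blocklist entries look like "0.0.0.0 example.com".
        if len(parts) >= 2 and parts[0] in ("0.0.0.0", "127.0.0.1"):
            blocked.update(h.lower() for h in parts[1:])
    # Hosts files also map loopback names; those are not blocklist entries.
    blocked -= {"localhost", "localhost.localdomain", "local", "0.0.0.0"}
    return blocked


def is_blocked(url, blocked):
    """True if the URL's host or any parent domain is on the blocklist."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    # Assumption: a blocked domain also blocks its subdomains.
    return any(".".join(labels[i:]) in blocked for i in range(len(labels)))
```

For example, `is_blocked("https://ads.example.com/banner.png", parse_hosts("0.0.0.0 ads.example.com"))` evaluates to `True`, while a URL on `example.com` itself is not blocked.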

Quality of extractor

What is it extracting? I see video links, audio links, and images. Does it support all variations of those tags?
Is it fast?
Does it return all the info for 100 random WARC files?

End to End example

A complete pipeline for a fully end to end example

Add support for different CC MIME types

Data extraction pipeline from data sources.

Data Sources:

Common Crawl
Possibly the licensed part of the Internet Archive from the Webis.de group.
Other sources that the community can recommend.

Pipeline:

Common Crawl provides most of its dataset in the form of WARC files containing HTTP responses. The pipeline will therefore have to parse each WARC file and then the HTML response inside it to extract the required data (mainly text and the various media links), disregarding scripts, CSS, and other components.

Naturally, it'll be divided into two parts.

WARC file parser.
HTML parser.

WARC parser

There are many open-source WARC parsers available in the wild. WARCIO is the most commonly used, but FastWARC is a faster alternative inspired by it.
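To illustrate what the WARC parsing stage has to do, here is a minimal standard-library sketch that walks an uncompressed WARC stream and yields the HTML responses. It is not a replacement for WARCIO or FastWARC: it assumes uncompressed input, a `Content-Length` header on every record, and the conventional header capitalisation, and the function names are made up for this example.

```python
import io


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the WARC header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


def html_responses(stream):
    """Yield (url, html_bytes) for HTTP responses whose Content-Type is HTML."""
    for headers, payload in iter_warc_records(stream):
        if headers.get("WARC-Type") != "response":
            continue
        # The payload is a raw HTTP response: header block, blank line, body.
        http_head, _, body = payload.partition(b"\r\n\r\n")
        if b"content-type: text/html" in http_head.lower():
            yield headers.get("WARC-Target-URI"), body


# Tiny in-memory example (made-up record, not real Common Crawl data).
http = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<p>hi</p>"
sample = (b"WARC/1.0\r\nWARC-Type: response\r\n"
          b"WARC-Target-URI: http://example.com/\r\n"
          b"Content-Length: %d\r\n\r\n" % len(http)) + http + b"\r\n\r\n"
pages = list(html_responses(io.BytesIO(sample)))
```

Real Common Crawl files are gzip-compressed per record, which is one of the reasons to use a dedicated parser in the actual pipeline.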

HTML parsers

There are various HTML parsers available, so we need to select the one best suited to our requirements, which essentially means preserving the text and the media-link attributes.
Our ideal parser should retain the text as well as the multimodal attributes in the corresponding HTML, along with their locality.
We've tried several HTML parsers, but only one has been able to suit our requirements: Resiliparse
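The locality requirement can be sketched with Python's built-in `html.parser`, used here only as a stand-in and not as the Resiliparse API. The example collects text and media links in document order and skips script and style content. The class name, the tag-to-kind mapping, and the choice to read only the `src` attribute are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Hypothetical mapping from tag name to the kind of media it references.
MEDIA_TAGS = {"img": "image", "video": "video", "audio": "audio", "source": "media"}
SKIP_TAGS = {"script", "style", "noscript"}


class InterleavedExtractor(HTMLParser):
    """Collect text and media links in document order."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.items = []       # ("text", str) or (media kind, url), in order
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in MEDIA_TAGS and not self._skip_depth:
            src = dict(attrs).get("src")
            if src:
                self.items.append((MEDIA_TAGS[tag], src))

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if text and not self._skip_depth:
            self.items.append(("text", text))


def extract_interleaved(html):
    """Return the interleaved list of text and media items for an HTML string."""
    parser = InterleavedExtractor()
    parser.feed(html)
    parser.close()
    return parser.items
```

Calling `extract_interleaved("<p>Hello</p><img src='a.png'><p>World</p>")` yields the text and the image link in the order they appear on the page, which is the property the interleaved format depends on.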

Pipeline prototypes

Provided below are the prototypes of parsers used

Resiliparse on WARC file
Resiliparse standalone HTML
HTML2TEXT
inscriptis

Integration in fast PySpark pipeline

See WIP at https://github.com/LAION-AI/CommonCrawlS3Bench

Idea

Read CC from S3 using PySpark on AWS
Should be tested to confirm it is optimal
Keep the code as minimal as possible
Have a clean pip package for reproducibility (example https://github.com/rom1504/python-template)
Support subsampling to a small part of CC for testing
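One possible way to implement the subsampling item above is a deterministic, hash-based filter over WARC paths, so that repeated test runs always pick the same subset. This is a plain-Python sketch under assumptions: the function names, the `seed` parameter, and the example path pattern are hypothetical, and the same predicate could presumably be applied to the list of paths before it is handed to the distributed job.

```python
import hashlib


def keep_path(path, fraction, seed="bild"):
    """Deterministically keep roughly `fraction` of paths, based on a hash."""
    digest = hashlib.sha256(f"{seed}:{path}".encode("utf-8")).digest()
    # Compare the first 8 bytes, read as an integer in [0, 2**64),
    # against the matching share of that range.
    return int.from_bytes(digest[:8], "big") < fraction * 2**64


def subsample(paths, fraction, seed="bild"):
    """Return the paths selected by keep_path, preserving input order."""
    return [p for p in paths if keep_path(p, fraction, seed)]


# Hypothetical path names, for illustration only.
paths = [f"crawl-data/segment-{i}.warc.gz" for i in range(1000)]
small = subsample(paths, 0.1)
```

Because the decision depends only on the path and the seed, each worker can evaluate it independently without coordinating on a shared random state.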

Packaging: setup.py

Feature tracker: HTML parsers

Support for processing/extracting or deleting html elements

Hi, is this dataset still a work in progress?

Thanks to all the selfless contributors for this open dataset!
I believe such an interleaved dataset can help in building general chat intelligence for both vision and language tasks (like GPT-4 or Flamingo). Is this wonderful work done, or is it still in progress?

If it is still WIP, I think I could serve as a contributor (if permitted).