LAION-AI / Big-Interleaved-Dataset
License: Apache License 2.0
This issue will track the overall progress of BILD.
Presently, BILD is divided into three phases. All of these are being tracked separately.
Phase 1: Data extraction from Common Crawl, and possibly the licensed part of the Internet Archive from the Webis group. Being tracked at #2
Phase 2: Data filtering for NSFW components, data quality, duplicated data, and other broad concerns. Being tracked at #3
Phase 3: Filtered data can be used for creating datasets of various modalities. However, this project would like to tackle the interleaved format. Being tracked at #4
The Phase 2 pipeline will handle the various filtering steps required for the extracted data. More description will be added soon.
Some initial resources for filtering data sources: https://github.com/StevenBlack/hosts
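As a rough illustration of how a hosts-style blocklist such as the one linked above could be applied to extracted links, here is a minimal Python sketch. It assumes the list uses the usual hosts-file layout (`0.0.0.0 domain` per line, `#` for comments). The function names `parse_hosts_blocklist` and `is_blocked` are hypothetical, and matching subdomains of a blocked host is a design choice of this sketch, not something the project has specified.

```python
from urllib.parse import urlsplit

# Names that hosts files define for the local machine; these are not blocklist entries.
LOCAL_NAMES = {"localhost", "localhost.localdomain", "local", "broadcasthost", "0.0.0.0"}


def parse_hosts_blocklist(text):
    """Collect blocked domains from hosts-format text (lines like '0.0.0.0 example.com')."""
    blocked = set()
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        parts = line.split()
        if len(parts) >= 2 and parts[0] in ("0.0.0.0", "127.0.0.1"):
            blocked.update(p.lower() for p in parts[1:] if p.lower() not in LOCAL_NAMES)
    return blocked


def is_blocked(url, blocked):
    """Return True if the URL's host, or any parent domain of it, is in the blocklist."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    # Assumption of this sketch: a blocked domain also blocks its subdomains.
    return any(".".join(labels[i:]) in blocked for i in range(len(labels)))
```

A filtering step could then drop any media link for which `is_blocked(link, blocked)` is true before the link reaches later stages.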
What is it extracting? I see video links, audio links, images? Does it support all variations of tags?
Is it fast?
Does it return all the info for 100 random WARC files?
A complete pipeline for a fully end-to-end example
Common Crawl provides most of its dataset in the form of WARC files consisting of HTTP responses. Thus the pipeline will have to parse the WARC file and then the underlying HTML response to extract the required data (mainly text and the different media links), disregarding scripts, CSS, and other components.
Naturally, it'll be divided into two parts: WARC parsing and HTML parsing.
There are many open-source WARC parsers available in the wild. WARCIO is the most commonly used, but there is a faster alternative known as FastWARC.
Provided below are the prototypes of parsers used
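The snippet here is not one of those prototypes. It is a minimal standard-library sketch of the WARC half of the pipeline, meant only to show the record layout that libraries such as WARCIO or FastWARC handle for you. It assumes well-formed records (a `WARC/x.y` version line, headers, a blank line, then `Content-Length` bytes of payload) and does no error recovery. The helper names `open_warc`, `iter_warc_records`, and `iter_html_responses` are hypothetical.

```python
import gzip


def open_warc(path):
    """Open a .warc or .warc.gz file as a binary stream (gzip handles multi-member files)."""
    return gzip.open(path, "rb") if path.endswith(".gz") else open(path, "rb")


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in a binary WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return                      # end of stream
        if not line.strip():
            continue                    # blank lines separating records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if not line.strip():
                break                   # blank line ends the WARC header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()
        payload = stream.read(int(headers.get("Content-Length", "0")))
        yield headers, payload


def iter_html_responses(stream):
    """Yield (url, html_bytes) for response records whose HTTP headers mention text/html."""
    for headers, payload in iter_warc_records(stream):
        if headers.get("WARC-Type") != "response":
            continue
        # The payload is a raw HTTP response: status line and headers, blank line, body.
        http_head, _, body = payload.partition(b"\r\n\r\n")
        if b"text/html" in http_head.lower():
            yield headers.get("WARC-Target-URI"), body
```

Given a local file (the path is a placeholder), the HTML bodies could be pulled out with `for url, body in iter_html_responses(open_warc("example.warc.gz")): ...` and handed to the HTML parsing stage.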
See wip at https://github.com/LAION-AI/CommonCrawlS3Bench
Idea
Support for processing/extracting or deleting html elements
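One way this idea could look, sketched with Python's standard-library `html.parser`: text and media links are collected in document order, while everything inside `script`, `style`, and `noscript` is dropped. The class name `InterleavedExtractor`, the helper `extract_interleaved`, and the choice of tags are assumptions for illustration, not the project's implementation.

```python
from html.parser import HTMLParser

SKIPPED_TAGS = {"script", "style", "noscript"}   # elements whose content is deleted
MEDIA_TAGS = {"img", "video", "audio", "source"}  # elements whose src link is kept


class InterleavedExtractor(HTMLParser):
    """Collect ("text", str) and (tag, url) items in the order they appear."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.items = []
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag in MEDIA_TAGS and self._skip_depth == 0:
            src = dict(attrs).get("src")
            if src:
                self.items.append((tag, src))

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and self._skip_depth == 0:
            self.items.append(("text", text))


def extract_interleaved(html):
    """Return the ordered list of text and media items found in an HTML string."""
    parser = InterleavedExtractor()
    parser.feed(html)
    parser.close()
    return parser.items
```

Keeping a single ordered list preserves where each image or video sits relative to the surrounding text, which is the property an interleaved dataset needs. Supporting other tag variations (for example `srcset`, `picture`, or lazy-loading attributes) would mean extending `MEDIA_TAGS` and the attribute lookup.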
Thanks to all the selfless contributors for this open dataset!
I believe such an interleaved dataset can help build general chat intelligence for both vision and language tasks (like GPT-4 or Flamingo). Is this wonderful work done or still in progress?
If it is still WIP, I think I can serve as a contributor (if permitted).
I'm interested in this project. How could I join and contribute?
Starting point: https://gist.github.com/rom1504/67ada3dedbecc113ae2dbdfd9c642d83