big-interleaved-dataset's People

Contributors

alvin319, christophschuhmann, harry-stark

big-interleaved-dataset's Issues

BILD Tracking

This issue will track the overall progress of BILD.

Presently, BILD is divided into three phases. All of these are being tracked separately.

  • Phase 1: Data extraction from Common Crawl and possibly the licensed part of the Internet Archive from the Webis group. Tracked at #2.

  • Phase 2: Data filtering for NSFW content, data quality, duplicated data, and other broad criteria. Tracked at #3.

  • Phase 3: The filtered data can be used to create datasets of various modalities, but this project focuses on the interleaved format. Tracked at #4.

Quality of extractor

What is it extracting? I see video links, audio links, and images. Does it support all variations of those tags?
Is it fast?
Does it return all the info for 100 random WARC files?

BILD: Phase 1

Data extraction pipeline from data sources.

Data Sources:

  • Common Crawl
  • Maybe, licensed part of Internet archive from Webis.de group.
  • Other sources that the community can recommend.

Pipeline:

Common Crawl provides most of its dataset as WARC files containing HTTP responses. The pipeline therefore has to parse each WARC file and then the underlying HTML response to extract the required data, mainly text and the various media links, while discarding scripts, CSS, and other components.

Naturally, it'll be divided into two parts.

  1. WARC file parser.
  2. HTML parser.

WARC parser

There are many open-source WARC parsers available. WARCIO is the most commonly used, but FastWARC, a separate library with a WARCIO-inspired API, is a faster alternative.
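To make the two-stage structure concrete, here is a minimal sketch of what a WARC parser has to do, written with the Python standard library only. It is not how WARCIO or FastWARC work internally and is no substitute for them: it assumes an uncompressed WARC stream (Common Crawl files are gzip-compressed) and skips all validation. The function names `iter_warc_records` and `split_http_response` are hypothetical, chosen for illustration.

```python
import io
from typing import BinaryIO, Dict, Iterator, Tuple


def iter_warc_records(stream: BinaryIO) -> Iterator[Tuple[Dict[str, str], bytes]]:
    """Yield (headers, payload) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank lines separate consecutive records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")
        headers: Dict[str, str] = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # an empty line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # The record body is exactly Content-Length bytes long.
        yield headers, stream.read(int(headers["Content-Length"]))


def split_http_response(payload: bytes) -> Tuple[bytes, bytes]:
    """Split the payload of a 'response' record into (HTTP headers, body)."""
    head, _, body = payload.partition(b"\r\n\r\n")
    return head, body


# Build one synthetic 'response' record and read it back.
html = b"<html><body><p>Hello</p></body></html>"
http = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n" + html
record = b"".join([
    b"WARC/1.0\r\n",
    b"WARC-Type: response\r\n",
    b"WARC-Target-URI: http://example.com/\r\n",
    b"Content-Length: " + str(len(http)).encode("ascii") + b"\r\n",
    b"\r\n",
    http,
    b"\r\n\r\n",
])

for warc_headers, payload in iter_warc_records(io.BytesIO(record)):
    if warc_headers.get("WARC-Type") == "response":
        _, body = split_http_response(payload)
        print(warc_headers["WARC-Target-URI"], len(body))
```

The HTML body recovered in the last step is what the second stage, the HTML parser, would receive.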

HTML parsers

  • Various HTML parsers are available; we need to select the one best suited to our requirements, which mainly come down to preserving text and media-link attributes.
  • Our ideal parser should retain the text as well as the multimodal attributes of the HTML, along with their locality, that is, their position in the document.
  • We've tried several HTML parsers, but only one has met our requirements: Resiliparse.
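The requirement above (keep text and media links together, in document order) can be illustrated with a small sketch. It uses Python's built-in `html.parser` as a stand-in, not the parser selected in this issue, and the class name `InterleavedExtractor`, the helper `extract_interleaved`, and the tag-to-label mapping are assumptions made for this example.

```python
from html.parser import HTMLParser
from typing import List, Tuple

# Hypothetical mapping from tag name to the label stored with each link.
MEDIA_TAGS = {"img": "image", "video": "video", "audio": "audio", "source": "media"}
SKIPPED_TAGS = {"script", "style"}


class InterleavedExtractor(HTMLParser):
    """Collect text and media links in the order they appear in the page."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.items: List[Tuple[str, str]] = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag in MEDIA_TAGS:
            src = dict(attrs).get("src")
            if src:
                self.items.append((MEDIA_TAGS[tag], src))

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and not self._skip_depth:
            self.items.append(("text", text))


def extract_interleaved(html: str) -> List[Tuple[str, str]]:
    """Return (kind, value) pairs, where kind is 'text' or a media label."""
    parser = InterleavedExtractor()
    parser.feed(html)
    parser.close()
    return parser.items


page = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<p>Before</p><img src='a.png'><p>After</p>"
    "<script>var x = 1;</script></body></html>"
)
for kind, value in extract_interleaved(page):
    print(kind, value)
```

Because items are appended as the parser encounters them, the image link stays between the two paragraphs it appeared between, which is the locality property an interleaved dataset needs. A production extractor would also have to handle attributes such as `srcset`, relative URLs, and malformed markup.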

Pipeline prototypes

The prototypes of the parsers we tried are listed below:

  • Resiliparse on WARC file
  • Resiliparse standalone HTML
  • HTML2TEXT
  • inscriptis

Hi, is this dataset still a work in progress?

Thanks to all the selfless contributors for this open dataset!
I believe such an interleaved dataset can help build general chat intelligence for both vision and language tasks (like GPT-4 or Flamingo). Is this wonderful work done or still in progress?

If it is still WIP, I think I can serve as a contributor (if permitted).
