
tarp's Introduction


%matplotlib inline
import matplotlib.pyplot as plt
import torch.utils.data
import torch.nn
from random import randrange
import os
# make webdataset's cache activity visible; keep gopen (URL opening) quiet
os.environ["WDS_VERBOSE_CACHE"] = "1"
os.environ["GOPEN_VERBOSE"] = "0"

The WebDataset Format

WebDataset format files are tar files, with two conventions:

  • within each tar file, files that belong together and make up a training sample share the same basename when stripped of all filename extensions
  • the shards of a tar file are numbered like something-000000.tar to something-012345.tar, usually specified using brace notation something-{000000..012345}.tar (expanded as shown in the snippet below)
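For example, the brace notation expands like this (a quick check using the braceexpand library that webdataset itself depends on):

from braceexpand import braceexpand

list(braceexpand("something-{000000..000002}.tar"))
# ['something-000000.tar', 'something-000001.tar', 'something-000002.tar']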

You can find a longer, more detailed specification of the WebDataset format in the WebDataset Format Specification

WebDataset can read files from local disk or from any pipe, which allows it to access files in common cloud object stores. WebDataset can also read concatenated MsgPack and CBOR sources.
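As a sketch of pipe-based access: the "pipe:" URL scheme runs a shell command and streams its stdout, with brace notation expanded into one command per shard. The bucket name below is hypothetical.

import webdataset as wds

# the command after "pipe:" is executed by a shell; its stdout is the tar stream
dataset = wds.WebDataset("pipe:aws s3 cp s3://my-bucket/shards-{000000..000099}.tar -")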

The WebDataset representation allows writing purely sequential I/O pipelines for large scale deep learning. This is important for achieving high I/O rates from local storage (3x-10x for local drives compared to random access) and for using object stores and cloud storage for training.

The WebDataset format represents images, movies, audio, etc. in their native file formats, making the creation of WebDataset format data as easy as creating a tar archive. Because data is aligned on predictable boundaries, WebDataset also works well with block deduplication.
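As a sketch of how such data can be created from Python (the filenames and sample contents here are made up), the library's ShardWriter writes samples and starts a new tar file every maxcount samples:

import io, json
import webdataset as wds
from PIL import Image

with wds.ShardWriter("mydata-%06d.tar", maxcount=1000) as sink:
    for i in range(3000):
        img = Image.new("RGB", (32, 32), (i % 256, 0, 0))
        buf = io.BytesIO()
        img.save(buf, format="PNG")
        sink.write({
            "__key__": f"sample{i:06d}",                     # shared basename
            "png": buf.getvalue(),                           # raw PNG bytes
            "json": json.dumps({"label": i % 10}).encode(),  # metadata as bytes
        })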

Standard tools can be used for accessing and processing WebDataset-format files.

bucket = "https://storage.googleapis.com/webdataset/testdata/"
dataset = "publaynet-train-{000000..000009}.tar"

url = bucket + dataset
!curl -s {url} | tar tf - | sed 10q
PMC4991227_00003.json
PMC4991227_00003.png
PMC4537884_00002.json
PMC4537884_00002.png
PMC4323233_00003.json
PMC4323233_00003.png
PMC5429906_00004.json
PMC5429906_00004.png
PMC5592712_00002.json
PMC5592712_00002.png
tar: stdout: write error

Note that in these .tar files, we have pairs of .json and .png files; each such pair makes up a training sample. (The tar: stdout: write error above is expected: sed 10q closes the pipe after printing ten lines.)

WebDataset Libraries

There are several libraries supporting the WebDataset format:

  • webdataset for Python 3 (includes the wids library); this repository
  • Webdataset.jl a Julia implementation
  • tarp, a Golang implementation and command line tool
  • Ray Data sources and sinks

The webdataset library can be used with PyTorch, Tensorflow, and Jax.

The webdataset Library

The webdataset library is an implementation of PyTorch IterableDataset (or a mock implementation thereof if you aren't using PyTorch). It implements a form of stream processing. Some of its features are:

  • large scale parallel data access through sharding
  • high performance disk I/O due to purely sequential reads
  • latency insensitive due to big fat pipes
  • no local storage required
  • instant startup for training jobs
  • only requires reading from file descriptors/network streams, no special APIs
  • its API encourages high performance I/O pipelines
  • scalable from tiny desktop datasets to petascale datasets
  • provides local caching if desired
  • requires no dataset metadata; any collection of shards can be read and used instantly

The main limitations people run into are that IterableDataset is less commonly used in PyTorch, so some existing code may not support it as well, and that achieving an exactly balanced number of training samples across many compute nodes for a fixed epoch size is tricky. For multinode training, webdataset is therefore usually used with shard resampling.
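As a sketch of that resampling setup (assuming a webdataset version that supports resampled=True and with_epoch, and using the fluid interface introduced below): shards are drawn with replacement, so every worker sees a statistically identical stream, which is then cut to a nominal epoch length.

import webdataset as wds

dataset = (
    wds.WebDataset(url, resampled=True)   # draw shards with replacement
    .shuffle(1000)
    .decode("pil")
    .to_tuple("png", "json")
    .with_epoch(10000)                    # nominal epoch of 10000 samples per worker
)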

There are two interfaces, the concise "fluid" interface and a longer "pipeline" interface. We'll show examples using the fluid interface, which is usually what you want.

import webdataset as wds
pil_dataset = wds.WebDataset(url).shuffle(1000).decode("pil").to_tuple("png", "json")

The resulting datasets are standard PyTorch IterableDataset instances.

isinstance(pil_dataset, torch.utils.data.IterableDataset)
True
for image, json in pil_dataset:
    break
plt.imshow(image)
<matplotlib.image.AxesImage at 0x7f73806db970>

(image output)

We can add onto the existing pipeline for augmentation and data preparation.

import torchvision.transforms as transforms
from PIL import Image

preproc = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    lambda x: 1-x,
])

def preprocess(sample):
    image, json = sample
    try:
        label = json["annotations"][0]["category_id"]
    except (KeyError, IndexError):
        # samples without annotations get a default label
        label = 0
    return preproc(image), label

dataset = pil_dataset.map(preprocess)

for image, label in dataset:
    break
plt.imshow(image.numpy().transpose(1, 2, 0))
<matplotlib.image.AxesImage at 0x7f7375fc2230>

(image output)

WebDataset is just an instance of a standard IterableDataset. It's a single-threaded way of iterating over a dataset. Since image decompression and data augmentation can be compute intensive, PyTorch usually uses the DataLoader class to parallelize data loading and preprocessing. WebDataset is fully compatible with the standard DataLoader.
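A typical wiring (a sketch; batch size and worker count are illustrative) is to batch inside the pipeline and tell DataLoader not to batch again:

loader = torch.utils.data.DataLoader(
    dataset.batched(16), batch_size=None, num_workers=4
)

images, labels = next(iter(loader))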

A number of notebooks in the repository show how to use WebDataset for image classification and LLM training.

The wds-notes notebook contains some additional documentation and information about the library.

The webdataset Pipeline API

The wds.WebDataset fluid interface is just a convenient shorthand for writing down pipelines. The underlying pipeline is an instance of the wds.DataPipeline class, and you can construct data pipelines explicitly, similar to the way you use nn.Sequential inside models.

dataset = wds.DataPipeline(
    wds.SimpleShardList(url),

    # at this point we have an iterator over all the shards
    wds.shuffle(100),

    # add wds.split_by_node here if you are using multiple nodes
    wds.split_by_worker,

    # at this point, we have an iterator over the shards assigned to each worker
    wds.tarfile_to_samples(),

    # this shuffles the samples in memory
    wds.shuffle(1000),

    # this decodes the images and json
    wds.decode("pil"),
    wds.to_tuple("png", "json"),
    wds.map(preprocess),
    wds.shuffle(1000),
    wds.batched(16)
)

batch = next(iter(dataset))
batch[0].shape, batch[1].shape
(torch.Size([16, 3, 224, 224]), (16,))

The wids Library for Indexed WebDatasets

Installing the webdataset library installs a second library called wids. This library provides fully indexed/random access to the same datasets that webdataset accesses using iterators/streaming.

Like the webdataset library, wids is highly scalable and provides efficient access to very large datasets. Being indexed, it is easily backwards compatible with existing data pipelines based on indexed datasets, including precise epochs for multinode training. The library comes with its own ChunkedSampler and DistributedChunkedSampler classes, which provide shuffling across nodes while still preserving enough locality of reference for efficient training.

Internally, the library uses an mmap-based tar file reader implementation; this allows very fast access without precomputed indexes, and it also means that shards and the equivalent of "shuffle buffers" are shared in memory between workers on the same machine.

This additional power comes at some cost: the library requires a small metadata file listing all the shards in a dataset and the number of samples in each; it requires local storage for as many shards as there are I/O workers on a node; it uses shared memory and mmap; and the availability of indexing makes it easy to accidentally use inefficient access patterns.

Generally, the recommendation is to use webdataset for all data generation, data transformation, and training code, and to use wids only if you need fully random access to datasets (e.g., for browsing or sparse sampling), need an index-based sampler, or are converting tricky legacy code.

import wids

train_url = "https://storage.googleapis.com/webdataset/fake-imagenet/imagenet-train.json"

dataset = wids.ShardListDataset(train_url)

sample = dataset[1900]

print(sample.keys())
print(sample[".txt"])
plt.imshow(sample[".jpg"])
dict_keys(['.cls', '.jpg', '.txt', '__key__', '__dataset__', '__index__', '__shard__', '__shardindex__'])
a high quality color photograph of a dog


base: https://storage.googleapis.com/webdataset/fake-imagenet
name: imagenet-train
nfiles: 1282
nbytes: 31242280960
samples: 128200
cache: /tmp/_wids_cache
<matplotlib.image.AxesImage at 0x7f7373669e70>

(image output)
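For multinode training with precise epochs, the sampler classes mentioned above can be wired in roughly as follows. This is a sketch: the exact ChunkedSampler arguments may differ between versions, and the transformation assumes ".jpg" decodes to a PIL image and ".cls" to an integer label.

import torchvision.transforms as transforms

to_tensor = transforms.ToTensor()

def make_pair(sample):
    # convert a wids sample dict into a (tensor, label) pair for default collation
    return to_tensor(sample[".jpg"]), sample[".cls"]

dataset.add_transformation(make_pair)
sampler = wids.ChunkedSampler(dataset, chunksize=1000, shuffle=True)
loader = torch.utils.data.DataLoader(dataset, sampler=sampler, batch_size=16, num_workers=4)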

There are several examples of how to use wids in the examples directory.

Note that the APIs between webdataset and wids are not fully consistent:

  • wids keeps the extension's "." in the keys, while webdataset removes it (".txt" vs "txt")
  • wids doesn't have a fully fluid interface, and add_transformation just adds to a list of transformations
  • webdataset currently can't read the wids JSON specifications
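A quick check of the first difference, using the url and train_url datasets loaded earlier:

sample = next(iter(wds.WebDataset(url)))
print("json" in sample)    # True: webdataset strips the "."

sample = wids.ShardListDataset(train_url)[0]
print(".txt" in sample)    # True: wids keeps the "."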

Installation and Documentation

$ pip install webdataset

For the Github version:

$ pip install git+https://github.com/tmbdev/webdataset.git

There are also several videos talking about WebDataset and large-scale deep learning.

Dependencies

The WebDataset library only requires PyTorch, NumPy, and a small library called braceexpand.

WebDataset loads a few additional libraries dynamically only when they are actually needed and only in the decoder:

  • PIL/Pillow for image decoding
  • torchvision, torchvideo, torchaudio for image/video/audio decoding
  • msgpack for MessagePack decoding
  • the curl command line tool for accessing HTTP servers
  • the Google/Amazon/Azure command line tools for accessing cloud storage buckets

Loading of one of these libraries is triggered by configuring a decoder that attempts to decode content in the given format and encountering a file in that format during decoding. (Eventually, the torch... dependencies will be refactored into those libraries.)
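For example (a sketch using the url defined earlier), selecting the "torchrgb" decoder means torchvision is only imported if an image file is actually encountered in the stream:

import webdataset as wds

# images are decoded to CHW float tensors in [0, 1]; json is decoded as usual
dataset = wds.WebDataset(url).decode("torchrgb").to_tuple("png", "json")
image, meta = next(iter(dataset))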

tarp's People

Contributors: bryant1410, tmbdev, tmbnv


tarp's Issues

Usage of tarp sort

Hello,

I have two issues/questions with the tarp sort command:

No output produced

When running:

tarp sort data/* -f rgb.png -o - | tarp split -c 512 - -o "test-%06d.tar"

No files are created; I only see progress messages as it moves to each new source file, but nothing is written.

tarp cat data/*  -o - | tarp split -c 512 - -o "test-%06d.tar"

works fine though

How to use the -f field option

It seems that split needs an extra argument to select the fields we want to keep. How can we select all of them? Listing them individually gets very tedious when many fields are present.

Thank you for your help

Add LICENSE

This repo does not contain any information about licensing. Can you please add the relevant information?

Issues Installing

Installation instructions need some updating.

On a fresh EC2 machine w/ Ubuntu or Amazon Linux---

installation method 1

go get -v github.com/tmbdev/tarp/tarp =>

go: go.mod file not found in current directory or any parent directory.
	'go get' is no longer supported outside a module.
	To build and install a command, use 'go install' with a version,
	like 'go install example.com/cmd@latest'
	For more information, see https://golang.org/doc/go-get-install-deprecation
	or run 'go help get' or 'go help install'. 

installation method 1.5

go install github.com/tmbdev/tarp/tarp@latest =>

go: github.com/tmbdev/tarp/tarp@latest (in github.com/tmbdev/tarp/[email protected]):
	The go.mod file for the module providing named packages contains one or
	more replace directives. It must not contain directives that would cause
	it to be interpreted differently than if it were the main module.

installation method 2:

make bin/tarp =>

cd tarp && make tarp
make[1]: Entering directory `/home/ec2-user/tarp/tarp'
go clean
go mod tidy
go: downloading github.com/dgraph-io/badger/v3 v3.2103.2
go: downloading github.com/jessevdk/go-flags v1.5.0
go: downloading github.com/shamaton/msgpack v1.2.1
go: downloading github.com/dgraph-io/ristretto v0.1.0
go: downloading github.com/dustin/go-humanize v1.0.0
go: downloading github.com/golang/protobuf v1.5.2
go: downloading github.com/pkg/errors v0.9.1
go: downloading go.opencensus.io v0.23.0
go: downloading golang.org/x/sys v0.0.0-20221006211917-84dc82d7e875
go: downloading github.com/stretchr/testify v1.6.1
go: downloading github.com/Masterminds/squirrel v1.5.3
go: downloading github.com/mattn/go-sqlite3 v1.14.15
go: downloading gopkg.in/zeromq/goczmq.v4 v4.1.0
go: downloading github.com/gogo/protobuf v1.3.2
go: downloading github.com/golang/snappy v0.0.4
go: downloading github.com/google/flatbuffers v22.9.29+incompatible
go: downloading github.com/cespare/xxhash v1.1.0
go: downloading github.com/klauspost/compress v1.15.11
go: downloading golang.org/x/net v0.0.0-20221004154528-8021a29435af
go: downloading github.com/cespare/xxhash/v2 v2.1.2
go: downloading github.com/golang/glog v1.0.0
go: downloading github.com/dgryski/go-farm v0.0.0-20200201041132-a6ae2369ad13
go: downloading google.golang.org/protobuf v1.28.1
go: downloading github.com/google/go-cmp v0.5.5
go: downloading github.com/golang/groupcache v0.0.0-20210331224755-41bb18bfe9da
go: downloading github.com/lann/builder v0.0.0-20180802200727-47ae307949d0
go: downloading github.com/davecgh/go-spew v1.1.1
go: downloading github.com/pmezard/go-difflib v1.0.0
go: downloading gopkg.in/yaml.v3 v3.0.0-20200313102051-9f266ea9e77c
go: downloading github.com/OneOfOne/xxhash v1.2.2
go: downloading github.com/spaolacci/murmur3 v1.1.0
go: downloading github.com/lann/ps v0.0.0-20150810152359-62de8c46ede0
go: downloading golang.org/x/xerrors v0.0.0-20200804184101-5ec99f83aff1
go get -u
go: downloading github.com/dgraph-io/badger/v3 v3.2103.3
go: downloading github.com/dgraph-io/badger v1.6.2
go: downloading github.com/tmbdev/tarp v0.0.2
go: downloading github.com/dgraph-io/ristretto v0.1.1
go: downloading golang.org/x/sys v0.1.0
go: downloading github.com/mattn/go-sqlite3 v1.14.16
go: downloading github.com/google/flatbuffers v22.10.26+incompatible
go: downloading github.com/klauspost/compress v1.15.12
go: downloading golang.org/x/net v0.1.0
go: upgraded github.com/dgraph-io/badger/v3 v3.2103.2 => v3.2103.3
go: upgraded github.com/dgraph-io/ristretto v0.1.0 => v0.1.1
go: upgraded github.com/google/flatbuffers v22.9.29+incompatible => v22.10.26+incompatible
go: upgraded github.com/klauspost/compress v1.15.11 => v1.15.12
go: upgraded github.com/mattn/go-sqlite3 v1.14.15 => v1.14.16
go: upgraded github.com/tmbdev/tarp/dpipes v0.0.0-20220223203531-468ca2eefc90 => v0.0.0-20221009163818-4aac5677b928
go: upgraded golang.org/x/net v0.0.0-20221004154528-8021a29435af => v0.1.0
go: upgraded golang.org/x/sys v0.0.0-20221006211917-84dc82d7e875 => v0.1.0
go build -ldflags "-X main.version=`date -Iseconds`" -o tarp split.go sort.go main.go cat.go proc.go create.go
# pkg-config --cflags  -- libczmq libzmq libsodium
Package libczmq was not found in the pkg-config search path.
Perhaps you should add the directory containing `libczmq.pc'
to the PKG_CONFIG_PATH environment variable
No package 'libczmq' found
Package libzmq was not found in the pkg-config search path.
Perhaps you should add the directory containing `libzmq.pc'
to the PKG_CONFIG_PATH environment variable
No package 'libzmq' found
Package libsodium was not found in the pkg-config search path.
Perhaps you should add the directory containing `libsodium.pc'
to the PKG_CONFIG_PATH environment variable
No package 'libsodium' found
pkg-config: exit status 1
make[1]: *** [tarp] Error 2
make[1]: Leaving directory `/home/ec2-user/tarp/tarp'
make: *** [bin/tarp] Error 2

Usage of shuffle in tarp

I have a large tar file in which the data is sorted, and I want to shard it with tarp into smaller tar files such that the data is shuffled both across and within the smaller tar files.

tarp cat --shuffle=5000 in.tar -o - | tarp split -c 10000 - -o ./out_-%06d.tar

This command works, but I want to understand how the data is shuffled across and within the smaller tar files.

tarp split not creating shards with the correct number of samples per shard.

Command:

tarp cat --shuffle=5000 my_dataset.tar -o - | tarp split -c 10000 - -o ./images_and_xmls-%06d.tar

Output

[info] version  false
[progress] # source -
[info] version  false
[info] # shuffle 5000
[progress] # writing -
[progress] # source images_and_xmls.tar
[progress] # shard ./my_dataset-000000.tar
[progress] # shard ./my_dataset-000001.tar
[progress] # shard ./my_dataset-000002.tar
[progress] # shard ./my_dataset-000003.tar
[progress] # shard ./my_dataset-000004.tar
[progress] # shard ./my_dataset-000005.tar
[progress] # shard ./my_dataset-000006.tar
[progress] # shard ./my_dataset-000007.tar
[progress] # shard ./my_dataset-000008.tar
[progress] # shard ./my_dataset-000009.tar
[progress] # shard ./my_dataset-000010.tar
[progress] # shard ./my_dataset-000011.tar
[progress] # shard ./my_dataset-000012.tar
[progress] # shard ./my_dataset-000013.tar
[progress] # shard ./my_dataset-000014.tar
[progress] # shard ./my_dataset-000015.tar
[progress] # shard ./my_dataset-000016.tar
[progress] # shard ./my_dataset-000017.tar

But for some reason each shard contains approximately 2700 items. Note that the original tar file contains almost 50k items.

Thanks,

Cannot install tarp with golang v1.17.2

On Google Colab:

Try 1

go version
go get github.com/tmbdev/tarp/tarp
go version go1.17.2 linux/amd64
go get: installing executables with 'go get' in module mode is deprecated.
	Use 'go install pkg@version' instead.
	For more information, see https://golang.org/doc/go-get-install-deprecation
	or run 'go help get' or 'go help install'.

Try 2

go version
go install github.com/tmbdev/tarp/tarp@latest
go version go1.17.2 linux/amd64
go install: github.com/tmbdev/tarp/tarp@latest (in github.com/tmbdev/tarp/[email protected]):
	The go.mod file for the module providing named packages contains one or
	more replace directives. It must not contain directives that would cause
	it to be interpreted differently than if it were the main module.

Try 3

go version
rm -rf tarp
git clone https://github.com/tmbdev/tarp.git
cd tarp
make bin/tarp
make install
go version go1.17.2 linux/amd64
cd tarp && make tarp
make[1]: Entering directory '/content/tarp/tarp'
go clean
Makefile:5: recipe for target 'tarp' failed
make[1]: Leaving directory '/content/tarp/tarp'
Makefile:5: recipe for target 'bin/tarp' failed
+ pwd
+ go version
+ rm -rf tarp
+ git clone https://github.com/tmbdev/tarp.git
Cloning into 'tarp'...
+ cd tarp
+ make bin/tarp
go: github.com/DataDog/[email protected]: missing go.sum entry; to add it:
	go mod download github.com/DataDog/zstd
go: github.com/DataDog/[email protected]: missing go.sum entry; to add it:
	go mod download github.com/DataDog/zstd
make[1]: *** [tarp] Error 1
make: *** [bin/tarp] Error 2

Subset support

Hello,

Does tarp currently support re-sharding with a subset of the data?

For instance, suppose I have data/ containing many tar files (let's say 100 tar files with 100 samples in each file), and a binary string in subset.txt that specifies which of these 100*100 samples I want to keep. Is it possible to use tarp to "re-shard" data/ into newdata/, also containing tar files with 100 samples in each file, but including only the samples that subset.txt marks for inclusion?

Go error for a very large tar file

I tried to split a 2.3TB tar file into many 1GB tar files. With tarp I got this error after 55 shards:

panic: runtime error: index out of range [1] with length 0

goroutine 19 [running]:
github.com/tmbdev/tarp/dpipes.FnameSplit({0xc0002201e0?, 0xc0005dc030?})
        /root/tarp/dpipes/rawtario.go:36 +0xd2
github.com/tmbdev/tarp/dpipes.Aggregate(0x0?, 0xc0003ec060)
        /root/tarp/dpipes/rawtario.go:55 +0x137
created by github.com/tmbdev/tarp/dpipes.TarSource.func1
        /root/tarp/dpipes/tario.go:55 +0x9e

At first I thought the tar file was damaged, but using the old tarproc tool in Python (installed with pip install tarproc) I was able to split the big tar. The speed dropped from 32 shards per minute to 18, but I am happy my problem is solved.

Docker support

I think Docker support might be helpful - especially since many Deep Learning practitioners using this might not be familiar with the Go tool chain.

I am imagining an executable image like docker run tarp:latest <tarp-command-here>. Building the image could be part of future CI. The final image could be as lean as just the compiled binary, using a base image such as golang:1.17-alpine.

undefined: msgpack.Decode during installation

Hello,

I encounter the following error when trying to install tarp

webdataset/tarp/tarp/sort.go:44:17: undefined: msgpack.Encode
webdataset/tarp/tarp/sort.go:83:6: undefined: msgpack.Decode                                                                                                                                          

I tried both go get -v github.com/webdataset/tarp/tarp and go get -v github.com/tmbdev/tarp/tarp

Tried with Go version 1.13.8, 1.16.3 and 1.17.

This is a bit weird because I tried to install it a week ago on the same machine (Ubuntu 20.04) and it installed without an issue.
