rand-archive's Introduction

Random Access Archive

A .raa file is essentially a dict header followed by the consecutive bytes of the samples. It was made to facilitate and accelerate deep learning training on large datasets. It is written in Rust and fast, but easily accessible programmatically from Python. Most importantly, it lets you shuffle the data without sacrificing too much sequential-read performance, by shuffling blocks of contiguous data. It also allows for lazy sharding.
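To make the block-shuffling idea concrete, here is a minimal pure-Python sketch. It does not use rand-archive itself; the function name `block_shuffled_indices` and the `block_size` and `seed` parameters are hypothetical, and the library's real shuffling may differ (for example, it may also shuffle within a block).

```python
import random
from typing import Iterator


def block_shuffled_indices(n_samples: int, block_size: int, seed: int = 0) -> Iterator[int]:
    """Yield each index in range(n_samples) exactly once.

    Blocks of contiguous indices are visited in random order, but the
    indices inside a block stay sequential, so reads within a block
    remain contiguous on disk.
    """
    rng = random.Random(seed)
    # Start offset of every block; the last block may be shorter.
    block_starts = list(range(0, n_samples, block_size))
    rng.shuffle(block_starts)
    for start in block_starts:
        yield from range(start, min(start + block_size, n_samples))


order = list(block_shuffled_indices(n_samples=10, block_size=4, seed=1))
```

A larger `block_size` keeps more reads sequential at the cost of a less thorough shuffle; a `block_size` of 1 degenerates to a full per-sample shuffle.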

Comparison

The main advantage of this library is how extensible it is. Other libraries like WebDataset, FFCV, Streaming Dataset, and TFRecord are very batteries-included, which is great for experimentation but sacrifices heavily on extensibility, since they also include data processing. Our philosophy is quite simple: you write string-byte pairs, and you read string-byte pairs. We only implement functionality that NEEDS to be implemented at the reader level for optimization, like shuffling and sharding.
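Because the archive only deals in string-byte pairs, any data processing lives in your own code. The sketch below shows one way that layering might look, using JSON as the payload encoding. The helpers `encode_sample` and `decode_sample` are hypothetical, and a plain list stands in for the archive so the example runs without rand-archive installed.

```python
import json
from typing import Any, Dict, Tuple


def encode_sample(key: str, sample: Dict[str, Any]) -> Tuple[str, bytes]:
    """Turn a structured sample into the (str, bytes) pair a writer would store."""
    return key, json.dumps(sample).encode("utf-8")


def decode_sample(pair: Tuple[str, bytes]) -> Tuple[str, Dict[str, Any]]:
    """Recover the structured sample from a (str, bytes) pair a reader would yield."""
    key, payload = pair
    return key, json.loads(payload.decode("utf-8"))


# Stand-in for an archive: just the pairs that would be written and read back.
pairs = [encode_sample("sample-0", {"label": 3, "pixels": [0, 255, 17]})]
decoded = [decode_sample(pair) for pair in pairs]
```

Swapping JSON for another encoding (pickle, raw image bytes, a tensor format) only changes these two helpers, not how the pairs are stored or shuffled.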

Benchmarks:

!todo

Usage

pip install rand-archive

Writing:

from rand_archive import Writer

with Writer("test.raa") as w:
  # bytes("test") raises TypeError in Python 3 without an encoding; use a bytes literal.
  w.write("test", b"test")

Reading:

from rand_archive import Reader

for _ in Reader().open_file("dummy.raa").with_shuffling():
  pass

rand-archive's People

Contributors

leifu1128, sweep-ai[bot]

rand-archive's Issues

Sweep: Header read/write unit test

Write a unit test that creates a dummy header, writes it, reads it, then ensures they are the same. Do this by creating a header with default, inserting a dummy EntryMetadata, and then write and read it.

Checklist
  • src/archive_test.rs
    • Import the necessary modules and functions from the crate and std library.
    • Write a function named "test_header_read_write" that does the following:
      • Create a dummy Header using the default() function.
      • Insert a dummy EntryMetadata into the Header using the insert() function.
      • Write the Header to a temporary file using the write() function.
      • Read the Header back from the temporary file using the read() function.
      • Compare the original and the read back Header using the assert_eq! macro.
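The round-trip pattern in this checklist can be sketched in Python with a toy header. This is an illustration only: the actual test belongs in Rust against the crate's Header and EntryMetadata types, and the layout used here (an 8-byte length prefix followed by JSON, with each entry as an (offset, length) pair) is an assumption, not the real .raa format. The names `write_header` and `read_header` are hypothetical.

```python
import json
import os
import struct
import tempfile
from typing import BinaryIO, Dict, Tuple

# Toy header: key -> (offset, length). Not the library's real header type.
Header = Dict[str, Tuple[int, int]]


def write_header(f: BinaryIO, header: Header) -> None:
    """Write a length-prefixed JSON encoding of the header."""
    body = json.dumps(header, sort_keys=True).encode("utf-8")
    f.write(struct.pack("<Q", len(body)))  # 8-byte little-endian size
    f.write(body)


def read_header(f: BinaryIO) -> Header:
    """Read back a header produced by write_header."""
    (size,) = struct.unpack("<Q", f.read(8))
    raw = json.loads(f.read(size).decode("utf-8"))
    # JSON has no tuples, so rebuild them from the decoded lists.
    return {key: (offset, length) for key, (offset, length) in raw.items()}


def test_header_read_write() -> Header:
    header: Header = {}           # "default" header
    header["dummy"] = (0, 4)      # insert a dummy entry
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "header.bin")
        with open(path, "wb") as f:
            write_header(f, header)
        with open(path, "rb") as f:
            restored = read_header(f)
    assert restored == header     # round trip must be lossless
    return restored
```

Writing to a temporary directory keeps the test from leaving files behind, which mirrors the checklist's use of a temporary file.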

Sweep: Write unit tests for Block, Collector and Reader

See unit tests for ArchiveWriter and Header, follow the same convention and create a dummy .raa file in the same directory.

Checklist
  • tests/test_block.rs ✅ Commit d90c5e5
    • Write a unit test for each method in the Block class. Make sure to cover all possible edge cases.
    • Create a dummy .raa file in the same directory for testing purposes.
Sandbox Execution Logs
trunk init 1/3 ✅
✔ Downloading Trunk 1.15.0... done
✔ Verifying Trunk sha256... done
✔ Unpacking Trunk... done

✔ 16 linters were enabled (.trunk/trunk.yaml)

  actionlint 1.6.25 (1 github-workflow file)
  bandit 1.7.5 (1 python file)
  black 23.9.1 (1 jupyter, 1 python file)
  checkov 2.4.9 (2 yaml files)
  clippy 1.65.0 (1 rust file)
  git-diff-check (23 files)
  isort 5.12.0 (1 python file) (created .isort.cfg)
  markdownlint 0.36.0 (1 markdown file) (created .markdownlint.yaml)
  osv-scanner 1.3.6 (1 lockfile file)
  prettier 3.0.3 (1 markdown, 2 yaml files)
  ruff 0.0.288 (1 python file) (created ruff.toml)
  rustfmt 1.65.0 (12 rust files)
  taplo 0.8.1 (3 toml files)
  trivy 0.45.0 (1 lockfile, 2 yaml files)
  trufflehog 3.55.1 (24 files)
  yamllint 1.32.0 (2 yaml files) (created .yamllint.yaml)


Next Steps

 1. Read documentation
    Our documentation can be found at https://docs.trunk.io

 2. Get help and give feedback
    Join the Trunk community at https://slack.trunk.io
trunk check tests/test_block.rs 2/3 ✅

  AUTOFIXES  

tests/test_block.rs
 1:1  high  Incorrect formatting  

  1 | #[cfg(test)]
  2 | mod tests {
  3 |     use std::fs;
  4 |     use std::assert_eq;
    |     use std::fs;
  5 | 
  6 |     use rand_archive::block::Block;

→ Apply formatting (Y/n/all/none):   Formatting applied.

Re-checking autofixed files...

Checked 2 files
✔ No issues
trunk fmt tests/test_block.rs 3/3 ✅
Checked 1 file
✔ No issues
  • tests/test_collector.rs ✅ Commit 6c3bee6
    • Write a unit test for each method in the Collector class. Make sure to cover all possible edge cases.
    • Use the dummy .raa file created in the test_block.rs file for testing purposes.
Sandbox Execution Logs
trunk init 1/3 ✅
trunk check tests/test_collector.rs 2/3 ✅

  AUTOFIXES  

tests/test_collector.rs
 1:1  high  Incorrect formatting  

   1 | #[cfg(test)]
   2 | mod tests {
   3 |     use std::fs;
   4 |     use std::assert_eq;
     |     use std::fs;
   5 | 
   6 |     use rand_archive::collector::Collector;
   7 |     use rand_archive::block::Block;
     |     use rand_archive::collector::Collector;
   8 |     use rand_archive::header::{EntryMetadata, Header};
   9 | 

→ Apply formatting (Y/n/all/none):   Formatting applied.

Re-checking autofixed files...

Checked 2 files
✔ No issues
trunk fmt tests/test_collector.rs 3/3 ✅
Checked 1 file
✔ No issues
  • tests/test_reader.rs ✅ Commit 6c3bee6
    • Write a unit test for each method in the Reader class. Make sure to cover all possible edge cases.
    • Use the dummy .raa file created in the test_block.rs file for testing purposes.
Sandbox Execution Logs
trunk init 1/3 ✅
trunk check tests/test_reader.rs 2/3 ✅

  AUTOFIXES  

tests/test_reader.rs
 1:1  high  Incorrect formatting  

  1 | #[cfg(test)]
  2 | mod tests {
  3 |     use std::fs;
  4 |     use rand_archive::reader::Reader;
  5 |     use rand_archive::block::Block;
  6 |     use crate::utils::setup;
    |     use rand_archive::block::Block;
    |     use rand_archive::reader::Reader;
    |     use std::fs;
  7 | 
  8 |     #[test]

→ Apply formatting (Y/n/all/none):   Formatting applied.

Re-checking autofixed files...

Checked 2 files
✔ No issues
trunk fmt tests/test_reader.rs 3/3 ✅
Checked 1 file
✔ No issues
