Arrow2: Transmute-free Arrow

A Rust crate for working with Apache Arrow. It is the most feature-complete implementation of the Arrow format after the C++ implementation.

Check out the guide for a general introduction to using this crate, and the API docs for detailed documentation of each of its APIs.

Features

  • Most feature-complete implementation of Apache Arrow after the reference implementation (C++)
    • Decimal 256 unsupported (not a Rust native type)
  • C data interface supported for all Arrow types (read and write)
  • C stream interface supported for all Arrow types (read and write)
  • Full interoperability with Rust's Vec
  • MutableArray API to work with bitmaps and arrays in-place
  • Full support for timestamps with timezones, including arithmetic that takes timezones into account
  • Support to read from, and write to:
    • CSV
    • Apache Arrow IPC (all types)
    • Apache Arrow Flight (all types)
    • Apache Parquet (except deep nested types)
    • Apache Avro (all types)
    • NJSON
    • ODBC (some types)
  • Extensive suite of compute operations
    • aggregations
    • arithmetic
    • cast
    • comparison
    • sort and merge-sort
    • boolean (AND, OR, etc) and boolean kleene
    • filter, take
    • hash
    • if-then-else
    • nullif
    • temporal (day, month, week day, hour, etc.)
    • window
    • ... and more ...
  • Extensive set of cargo feature flags to reduce compilation time and binary size
  • Fully decoupled IO between CPU-bound and IO-bound tasks, allowing this crate both to be used in async contexts without blocking and to leverage parallelism
  • Fastest known implementation of Avro and Parquet (e.g. faster than the official C++ implementations)
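
To make the "boolean kleene" entry above concrete, here is a minimal sketch of the difference between null-propagating AND and Kleene (three-valued) AND. It uses plain slices of `Option<bool>` and only the standard library; the function names `and` and `and_kleene` are chosen for illustration and are not presented as this crate's API.

```rust
/// Kleene (three-valued) AND over nullable booleans, with `None` as null.
/// `false` dominates: `false AND null` is `false`, while `true AND null`
/// stays null because the result depends on the unknown value.
pub fn and_kleene(lhs: &[Option<bool>], rhs: &[Option<bool>]) -> Vec<Option<bool>> {
    assert_eq!(lhs.len(), rhs.len(), "inputs must have equal length");
    lhs.iter()
        .zip(rhs)
        .map(|(l, r)| match (l, r) {
            (Some(false), _) | (_, Some(false)) => Some(false),
            (Some(true), Some(true)) => Some(true),
            _ => None,
        })
        .collect()
}

/// Null-propagating AND, for comparison: any null input gives a null output.
pub fn and(lhs: &[Option<bool>], rhs: &[Option<bool>]) -> Vec<Option<bool>> {
    assert_eq!(lhs.len(), rhs.len(), "inputs must have equal length");
    lhs.iter()
        .zip(rhs)
        .map(|(l, r)| match (l, r) {
            (Some(a), Some(b)) => Some(*a && *b),
            _ => None,
        })
        .collect()
}
```

The two functions differ only when exactly one side is null and the other is `false`: the Kleene variant can still answer `false`, whereas the propagating variant returns null.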

Safety and Security

This crate uses unsafe only when strictly necessary:

  • when the compiler can't prove certain invariants and
  • FFI

We have extensive tests over these, all of which run and pass under Miri. Most uses of unsafe fall into three categories:

  • The Arrow format has invariants over UTF-8 that can't be written in safe Rust
  • TrustedLen and trait specialization are still nightly features
  • FFI
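
The UTF-8 case can be illustrated with a small standard-library sketch. It assumes a simplified string array made of one byte buffer plus offsets; the type `Utf8Values` and its methods are hypothetical and only mirror the general idea, not this crate's types. The bytes are validated once at construction, and the accessor then relies on that invariant through `unsafe`, because the compiler cannot see that every slot was already checked.

```rust
/// A simplified stand-in for an Arrow UTF-8 array: one contiguous byte
/// buffer plus offsets marking where each string starts and ends.
pub struct Utf8Values {
    offsets: Vec<usize>,
    values: Vec<u8>,
}

impl Utf8Values {
    /// Validates once, at construction, that every slot is valid UTF-8.
    pub fn try_new(offsets: Vec<usize>, values: Vec<u8>) -> Result<Self, String> {
        if offsets.is_empty() {
            return Err("offsets must have at least one element".into());
        }
        for w in offsets.windows(2) {
            // `get` returns `None` for decreasing or out-of-bounds offsets.
            let slot = values
                .get(w[0]..w[1])
                .ok_or("offsets out of bounds or not monotonic")?;
            std::str::from_utf8(slot).map_err(|e| e.to_string())?;
        }
        Ok(Self { offsets, values })
    }

    /// Number of strings stored.
    pub fn len(&self) -> usize {
        self.offsets.len() - 1
    }

    /// Returns the `i`-th string without re-validating its bytes.
    /// Panics if `i` is out of bounds.
    pub fn value(&self, i: usize) -> &str {
        let bytes = &self.values[self.offsets[i]..self.offsets[i + 1]];
        // SAFETY: `try_new` checked that this exact slot is valid UTF-8, and
        // the fields are private, so the buffers cannot change afterwards.
        unsafe { std::str::from_utf8_unchecked(bytes) }
    }
}
```

The design choice is to pay for validation once and keep per-element access cheap; the `unsafe` block is sound only because construction is the sole way to create the value.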

We actively monitor Rust's advisory database for vulnerabilities and either patch or mitigate them (see e.g. .cargo/audit.yaml and .github/workflows/security.yaml).

Reading untrusted data may currently panic! on the following formats:

  • Apache Parquet
  • Apache Avro

We are actively addressing this.
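
Until then, one possible caller-side mitigation is to isolate the read behind `std::panic::catch_unwind`. The sketch below is an assumption about how a caller might do this, not something this crate provides: `decode_length_prefixed` is a hypothetical stand-in for a reader that panics on malformed input, and the approach only works when the binary is built with the default `panic = "unwind"` strategy.

```rust
use std::panic;

/// Hypothetical decoder standing in for a reader that may panic on
/// malformed input: it indexes into the buffer without checking lengths.
fn decode_length_prefixed(bytes: &[u8]) -> Vec<u8> {
    let len = bytes[0] as usize; // panics on empty input
    bytes[1..1 + len].to_vec() // panics when the prefix overstates the length
}

/// Runs the decoder behind `catch_unwind`, turning a panic into an `Err`.
pub fn decode_untrusted(bytes: &[u8]) -> Result<Vec<u8>, String> {
    panic::catch_unwind(|| decode_length_prefixed(bytes))
        .map_err(|_| "decoder panicked on malformed input".to_string())
}
```

Note that the default panic hook still prints the panic message to stderr, and that `catch_unwind` does nothing under `panic = "abort"`.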

Integration tests

Our tests include round trips against:

  • Apache Arrow IPC (both little and big endian) generated by C++, Java, Go, C# and JS implementations.
  • Apache Parquet format (in its different configurations) generated by Arrow's C++ and Spark's implementations
  • Apache Avro generated by the official Rust Avro implementation

Check DEVELOPMENT.md for our development practices.

Versioning

We use SemVer 2.0, as used by Cargo and the rest of the Rust ecosystem. We also use 0.x.y versioning, since we are still iterating on the API.
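
As a rough illustration of what 0.x.y versioning means for users, the sketch below models the caret-requirement rule that Cargo documents: the leftmost non-zero component must not change. The `Version` type and `caret_matches` function are hypothetical helpers written for this example, they ignore pre-release tags and partial requirements such as `^0.17`, and the version numbers in the usage note are made up.

```rust
/// A `major.minor.patch` version triple.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Version(pub u64, pub u64, pub u64);

/// Returns true if `candidate` satisfies the caret requirement `^req`:
/// it must be at least `req` and keep the leftmost non-zero component.
pub fn caret_matches(req: Version, candidate: Version) -> bool {
    if candidate < req {
        return false;
    }
    match req {
        // ^0.0.z accepts only that exact patch version.
        Version(0, 0, _) => candidate == req,
        // ^0.y.z accepts patch updates within the same minor version.
        Version(0, minor, _) => candidate.0 == 0 && candidate.1 == minor,
        // ^x.y.z accepts minor and patch updates within the same major version.
        Version(major, _, _) => candidate.0 == major,
    }
}
```

Under this rule, `caret_matches(Version(0, 17, 0), Version(0, 17, 4))` holds while `caret_matches(Version(0, 17, 0), Version(0, 18, 0))` does not, so a 0.x crate can ship breaking changes in a minor bump without affecting users pinned to the previous minor version.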

Design

This repo and crate's primary goal is to offer a safe Rust implementation of the Arrow specification. As such, it

  • MUST NOT implement any logical type other than the ones defined in the Arrow specification, schema.fbs.
  • MUST lay out memory according to the Arrow specification.
  • MUST support reading from and writing to the C data interface without copying.
  • MUST support reading from, and writing to, the IPC specification, which it MUST verify against golden files available here.
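
To give a feel for the memory-layout requirement, here is a standard-library sketch of how a nullable 32-bit integer column is laid out in the Arrow columnar format: a dense values buffer plus a validity bitmap with least-significant-bit numbering, where a set bit marks a non-null slot. The names `Int32Layout`, `to_layout` and `get` are hypothetical, and the sketch leaves out the buffer alignment and padding that the specification also covers.

```rust
/// Values and validity of a nullable `i32` column: a dense values buffer
/// plus a validity bitmap with one bit per slot (1 = valid, 0 = null).
pub struct Int32Layout {
    pub values: Vec<i32>,
    pub validity: Vec<u8>,
}

/// Builds the two buffers from a slice of optional values.
pub fn to_layout(items: &[Option<i32>]) -> Int32Layout {
    let mut validity = vec![0u8; (items.len() + 7) / 8];
    let mut values = Vec::with_capacity(items.len());
    for (i, item) in items.iter().enumerate() {
        if item.is_some() {
            // Slot `i` lives in byte `i / 8`, at bit `i % 8`, counting from the LSB.
            validity[i / 8] |= 1u8 << (i % 8);
        }
        // Null slots still occupy space; this sketch fills them with 0.
        values.push(item.unwrap_or(0));
    }
    Int32Layout { values, validity }
}

/// Reads slot `i` back, consulting the bitmap first.
/// Panics if `i` is out of bounds.
pub fn get(layout: &Int32Layout, i: usize) -> Option<i32> {
    let is_valid = (layout.validity[i / 8] & (1u8 << (i % 8))) != 0;
    is_valid.then(|| layout.values[i])
}
```

Because every slot has a fixed position in both buffers, random access is O(1) and the buffers can be handed to another implementation without being rewritten, which is what makes zero-copy interchange feasible.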

Design documents about each of the parts of this repo are available on their respective READMEs.

FAQ

Any plans to merge with the Apache Arrow project?

Maybe. The primary reason to have this repo and crate is to be able to prototype and mature a fundamentally different design based on a transmute-free implementation. This requires breaking backward compatibility and dropping features, which is impossible to do in the Arrow repo.

Furthermore, the Arrow project currently has a release mechanism that is unsuitable for this type of work:

  • A release of Apache Arrow consists of a release of all implementations of the Arrow format at once, with the same version. It is currently at 5.0.0.

This implies that the crate version is independent of the changelog or its API stability, which violates SemVer. This procedure makes the crate incompatible with Rust's (and many other) ecosystems that rely heavily on SemVer to constrain software versions.

Secondly, this implies the arrow crate is versioned as >0.x. This places expectations about API stability that are incompatible with this effort.

License

Licensed under either of

  • Apache License, Version 2.0
  • MIT license

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.
