
pepa's Introduction

Peeker into Parquets

I spend unhealthy amounts of time getting a parquet file and then doing something trivial: seeing how many rows there are, how many nulls are in a given column, checking the exact name of a particular column, ... Launching a Python interpreter, typing import pandas as pd and df = pd.read_parquet("file.parquet"), and then typing the exact pandas query is too much of a chore, and too slow, for something that is often quite standard.

I'm thus developing this Peeker into Parquets (pepa) tool to cover the most basic cases in a performant manner: to get the shape and schema, we only need to peek at the file metadata in the footer, not decode all the columns. The output is JSON, to allow piping to e.g. jq.
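To illustrate why peeking at the metadata is cheap: a Parquet file ends with the serialized file metadata, followed by a 4-byte little-endian metadata length and the magic bytes PAR1. The sketch below is not pepa's code (pepa is written in Rust); it is a minimal Python illustration, with a hypothetical helper name footer_span, of locating that metadata without touching any column data.

```python
import os
import struct

MAGIC = b"PAR1"

def footer_span(path):
    """Return (offset, length) of the serialized metadata in a Parquet file.

    Hypothetical helper for illustration; it reads only the last 8 bytes.
    """
    size = os.path.getsize(path)
    if size < 12:  # leading magic + length field + trailing magic
        raise ValueError("file too small to be a Parquet file")
    with open(path, "rb") as f:
        f.seek(size - 8)
        tail = f.read(8)
    if tail[4:] != MAGIC:
        raise ValueError("missing trailing PAR1 magic")
    (length,) = struct.unpack("<I", tail[:4])  # little-endian uint32
    offset = size - 8 - length
    if offset < 4:  # metadata cannot overlap the leading magic
        raise ValueError("corrupt footer length")
    return offset, length
```

Decoding the metadata itself (a Thrift structure) is left out here; the point is only that the shape and schema live in a small block at the end of the file.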

And I also do this to gain some Rust practice -- the code itself will thus likely be pleasing to neither eye nor heart.

Quickstart

Install the crate and run pepa <yo-parquet-file>, which by default nets you something like

{
  "shape": {
    "num_cols_leaf": 2,
    "num_rows": 2
  },
  "schema": {
    "columns": {
      "a": "INT64",
      "b": "DOUBLE"
    }
  }
}

with the column types being physical, and the number of columns counted down to the leaves (so a struct column is not counted as one column; each of its leaf fields is counted instead).
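The leaf counting can be pictured with a small sketch. The nested dict below is a made-up stand-in for a Parquet schema (it is not pepa's internal representation), and count_leaves is a hypothetical helper: a group such as a struct contributes its leaf fields, not itself.

```python
def count_leaves(schema):
    """Count leaf columns in a nested schema.

    `schema` maps field names to either a physical type name (a leaf)
    or another dict (a group such as a struct).
    """
    total = 0
    for field in schema.values():
        if isinstance(field, dict):
            total += count_leaves(field)  # a group counts only through its leaves
        else:
            total += 1
    return total

schema = {
    "a": "INT64",
    "point": {"x": "DOUBLE", "y": "DOUBLE"},  # one struct, two leaves
}
print(count_leaves(schema))  # prints 3
```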

For parquets with many columns, run with -l0 instead to get just the counts of columns per physical type. If you are interested in more per-column stats, like nulls vs non-nulls, run with -l2.
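The exact shape of the -l0 output is not shown here, so the following is only a sketch of the idea: it takes the default output from the Quickstart and aggregates it into counts per physical type. In practice the JSON would come from pepa's standard output rather than from a string literal.

```python
import json
from collections import Counter

# Default pepa output, copied from the Quickstart example.
default_output = """
{
  "shape": {"num_cols_leaf": 2, "num_rows": 2},
  "schema": {"columns": {"a": "INT64", "b": "DOUBLE"}}
}
"""

columns = json.loads(default_output)["schema"]["columns"]
per_type = Counter(columns.values())  # physical type -> number of columns
print(dict(per_type))  # prints {'INT64': 1, 'DOUBLE': 1}
```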

Upcoming features

  • adding index stats to l0/l1 (key_value_metadata.pandas -> parse json -> index_columns, partition_columns)
  • adding disk size and memory usage as an option or l2
  • supporting some simple filtering (though this is not supposed to replace any existing analytical engine)
  • per-column stats of most frequent values as an option or l3
  • support partition discovery when processing a folder
  • Python interface for the library (usage: prior to running a batch job on multiple parquets, get stats for all of them to calculate the right batch size)
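The first item refers to the JSON blob that pandas-aware writers store under the pandas key of the file's key-value metadata. Assuming that blob has already been extracted as a string (the sample below is hand-written, not taken from a real file), pulling out the index and partition columns could look like this sketch. index_info is a hypothetical helper name, and this is not how pepa implements it, since the feature is still upcoming.

```python
import json

def index_info(pandas_meta_json):
    """Extract index and partition columns from a pandas metadata blob."""
    meta = json.loads(pandas_meta_json)
    return {
        # Entries are typically column names; some writers store a
        # descriptor object instead (e.g. for a range index).
        "index_columns": meta.get("index_columns", []),
        # Not every writer emits this key, hence the default.
        "partition_columns": meta.get("partition_columns", []),
    }

# Hand-written sample, for illustration only.
sample = json.dumps({
    "index_columns": ["__index_level_0__"],
    "partition_columns": [],
    "columns": [],
})
print(index_info(sample))
```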

Possible bugs

  • non-scalar types could crash things
  • tested on files written by fastparquet and pyarrow, but not by others such as Spark

Internal improvements

  • start breaking up the lib.rs into metadata parser, pandas parser, etc
  • tests
  • error handling
  • build & publish pipeline

Contributors

  • tmi
  • vojtechtuma-imi
