Peeker into Parquets

I'm spending unhealthy amounts of time to get a parquet file and then do something trivial like see how many of rows are there, how many nulls are in a given column, checking what is the exact name name of a particular column, ... Launching a python interpreter, typing the import pandas as pd and df = pd.read_parquet("file.parquet") and type the exact pandas query seems too much a chore and slow for something which is often quite standard.

I'm thus developing this Peeker into Parquets (pepa) tool to capture the most basic cases, and in a performant manner -- to get shape/schema, we need to just peek at the metadata file, not decrypt all the columns and everything. The output is a json, to allow piping to eg jq.

And I also do this to gain some Rust practice -- the code itself will thus likely be pleasing to neither eye nor heart.

Quickstart

Install the crate and pepa <yo-parquet-file>, which by default nets you something like

{
  "shape": {
    "num_cols_leaf": 2,
    "num_rows": 2
  },
  "schema": {
    "columns": {
      "a": "INT64",
      "b": "DOUBLE"
    }
  }
}

with the column types being physical, and number of columns going down to the leafs (thus a structure column is not counted as a 1).

For parquets with many columns, run with -l0 instead to get just a stats of how many columns per physical type are there. If you are interested in more per-column stats like nulls vs non-nulls, run with -l2.

Upcoming features

adding index stats to l0/l1 (key_value_metadata.pandas -> parse json -> index_columns, partition_columns)
adding disk size and memory usage as an option or l2,
supporting some simple filtering (though this is not supposed to replace any existing analytical engine),
per-column stats of most frequent values as an option or l3,
support partition discovery when processing a folder,
python interface for the library (usage: prior to running a batch job on multiple parquets, get stats for all of them to calculate the right batch size),

Possible bugs

non scalar types could crash things
tested on fastparquet and pyarrow, but not on others such as spark

Internal improvements

start breaking up the lib.rs into metadata parser, pandas parser, etc
tests
error handling
build & publish pipeline

tmi / pepa Goto Github PK

pepa's Introduction

Peeker into Parquets

Quickstart

Upcoming features

Possible bugs

Internal improvements

pepa's People

Contributors

Stargazers

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent