Parquet

Load a Parquet file. Only the metadata is read initially; the data is loaded in chunks on demand.

julia> using Parquet

julia> parfile = "customer.impala.parquet"

julia> p = ParFile(parfile)
Parquet file: /home/tan/Work/julia/packages/Parquet/test/parquet-compatibility/parquet-testdata/impala/1.1.1-SNAPPY/customer.impala.parquet
    version: 1
    nrows: 150000
    created by: impala version 1.2-INTERNAL (build a462ec42e550c75fccbff98c720f37f3ee9d55a3)
    cached: 0 column chunks
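
The on-demand loading described above can be pictured with a small cache. The following is only a sketch of the general idea and does not use Parquet.jl or reflect its internals: `LazyChunks`, `chunk`, and the `loader` function are hypothetical names, and the loader merely stands in for reading one column chunk from disk.

```julia
# Hypothetical lazy chunk store; `loader` stands in for reading a chunk from disk.
struct LazyChunks{F}
    loader::F
    cache::Dict{Int,Vector{Int}}
end
LazyChunks(loader) = LazyChunks(loader, Dict{Int,Vector{Int}}())

# Return chunk `i`, invoking the loader only on the first request for it.
chunk(lc::LazyChunks, i::Int) = get!(() -> lc.loader(i), lc.cache, i)

loads = Ref(0)  # counts how often the loader actually runs
lc = LazyChunks(i -> (loads[] += 1; collect(10i:10i+2)))

chunk(lc, 1); chunk(lc, 1); chunk(lc, 2)
println("chunks cached: ", length(lc.cache), ", loader calls: ", loads[])
```

Nothing is loaded when the store is constructed, and asking for the same chunk twice triggers only one load.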

Examine the schema.

julia> nrows(p)
150000

julia> ncols(p)
8

julia> colnames(p)
8-element Array{AbstractString,1}:
 "c_acctbal"   
 "c_mktsegment"
 "c_nationkey" 
 "c_name"      
 "c_address"   
 "c_custkey"   
 "c_phone"     
 "c_comment"   

julia> schema(p)
Schema:
    schema {
      optional INT64 c_custkey
      optional BYTE_ARRAY c_name
      optional BYTE_ARRAY c_address
      optional INT32 c_nationkey
      optional BYTE_ARRAY c_phone
      optional DOUBLE c_acctbal
      optional BYTE_ARRAY c_mktsegment
      optional BYTE_ARRAY c_comment
    }

The Parquet schema can be converted to other forms:

julia> schema(JuliaConverter(STDOUT), p, :Customer)
type Customer
    Customer() = new()
    c_custkey::Int64
    c_name::Vector{UInt8}
    c_address::Vector{UInt8}
    c_nationkey::Int32
    c_phone::Vector{UInt8}
    c_acctbal::Float64
    c_mktsegment::Vector{UInt8}
    c_comment::Vector{UInt8}
end

julia> schema(ThriftConverter(STDOUT), p, :Customer)
struct Customer {
     optional i64 c_custkey
     optional binary c_name
     optional binary c_address
     optional i32 c_nationkey
     optional binary c_phone
     optional double c_acctbal
     optional binary c_mktsegment
     optional binary c_comment
}

julia> schema(ProtoConverter(STDOUT), p, :Customer)
message Customer {
    optional sint64 c_custkey;
    optional bytes c_name;
    optional bytes c_address;
    optional sint32 c_nationkey;
    optional bytes c_phone;
    optional double c_acctbal;
    optional bytes c_mktsegment;
    optional bytes c_comment;
}
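
The Julia output above suggests a simple mapping from Parquet physical types to Julia types. The sketch below restates that mapping for the four types that appear in this schema. It is illustrative only: `PHYSICAL_TO_JULIA` and `field_declarations` are hypothetical names and not part of Parquet.jl.

```julia
# Physical type names and the Julia types shown for them in the converter
# output above. Illustrative table only, not Parquet.jl's internal one.
const PHYSICAL_TO_JULIA = Dict(
    "INT32"      => Int32,
    "INT64"      => Int64,
    "DOUBLE"     => Float64,
    "BYTE_ARRAY" => Vector{UInt8},
)

# Hypothetical helper: render one field declaration per
# (physical type, column name) pair.
function field_declarations(columns)
    return ["$(name)::$(PHYSICAL_TO_JULIA[ptype])" for (ptype, name) in columns]
end

columns = [("INT64", "c_custkey"), ("INT32", "c_nationkey"), ("DOUBLE", "c_acctbal")]
foreach(println, field_declarations(columns))
```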

The generated type can also be injected dynamically into a module, so that further methods can work directly on the Julia type.

julia> schema(JuliaConverter(Main), p, :Customer)

julia> Base.show(io::IO, cust::Customer) = println(io, bytestring(cust.c_name), " Phone#:", bytestring(cust.c_phone))
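
The `show` method above relies on `bytestring`, which was deprecated in Julia 0.5 and is not available in later releases. As a minimal sketch of the same idea on Julia 1.x, assuming a hand-written stand-in for the generated type (`CustomerSketch` and `bytes_to_string` are hypothetical names, and Parquet.jl is not used here):

```julia
# Stand-in for the generated type, reduced to two byte-array fields.
struct CustomerSketch
    c_name::Vector{UInt8}
    c_phone::Vector{UInt8}
end

# String(bytes) takes ownership of the vector and empties it,
# so copy first to leave the record's field intact.
bytes_to_string(bytes::Vector{UInt8}) = String(copy(bytes))

Base.show(io::IO, cust::CustomerSketch) =
    print(io, bytes_to_string(cust.c_name), " Phone#:", bytes_to_string(cust.c_phone))

cust = CustomerSketch(Vector{UInt8}("Customer#000000001"),
                      Vector{UInt8}("25-989-741-2988"))
println(cust)
```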

Create a cursor to iterate over records. In parallel mode, multiple remote cursors can be created and iterated over concurrently.

julia> rc = RecCursor(p, 1:5, colnames(p), JuliaBuilder(p, Customer))
Record Cursor on /home/tan/Work/julia/packages/Parquet/test/parquet-compatibility/parquet-testdata/impala/1.1.1-SNAPPY/customer.impala.parquet
    rows: 1:5
    cols: c_acctbal.c_mktsegment.c_nationkey.c_name.c_address.c_custkey.c_phone.c_comment


julia> i = start(rc);

julia> while !done(rc, i)
        v,i = next(rc, i)
        show(v)
       end
Customer#000000033 Phone#:27-375-391-1280
Customer#000000065 Phone#:33-733-623-5267
Customer#000000001 Phone#:25-989-741-2988
Customer#000000642 Phone#:32-925-597-9911
Customer#000000161 Phone#:17-805-718-2449
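
The loop above uses the `start`/`done`/`next` iteration protocol of early Julia versions. Since Julia 0.7 that protocol has been replaced by a single `iterate` function, which also enables plain `for` loops. The sketch below shows the newer protocol on a hypothetical in-memory cursor. `RowCursor` is an assumption made for illustration and is not Parquet.jl's `RecCursor`.

```julia
# Hypothetical cursor over an in-memory vector of records, limited to a row range.
struct RowCursor{T}
    records::Vector{T}
    rows::UnitRange{Int}
end

# iterate returns (value, next_state), or nothing once the range is exhausted.
function Base.iterate(rc::RowCursor, row::Int = first(rc.rows))
    row > last(rc.rows) && return nothing
    return (rc.records[row], row + 1)
end

Base.length(rc::RowCursor) = length(rc.rows)
Base.eltype(::Type{RowCursor{T}}) where {T} = T

rc = RowCursor(["a", "b", "c", "d", "e", "f"], 2:4)
for v in rc
    println(v)
end
```

The state passed between calls is just the next row index, which plays the role of `i` in the older loop.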

Contributors

tanmaykm
