
corc's Introduction

   O~~~   O~~    O~ O~~~   O~~~
 O~~    O~~  O~~  O~~    O~~   
O~~    O~~    O~~ O~~   O~~    
 O~~    O~~  O~~  O~~    O~~   
   O~~~   O~~    O~~~      O~~~

Use corc to read and write data in the Optimized Row Columnar (ORC) file format in your Cascading applications. The reading of ACID datasets is also supported.

Status ⚠️

This project is no longer in active development.

Start using

You can obtain corc from Maven Central:

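A minimal dependency declaration might look like the following; the com.hotels coordinates are an assumption, so check Maven Central for the current group, artifact, and version:

<dependency>
  <groupId>com.hotels</groupId>           <!-- assumed coordinates; verify on Maven Central -->
  <artifactId>corc-cascading</artifactId>
  <version>${corc.version}</version>      <!-- use the latest released version -->
</dependency>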

Cascading Dependencies

Corc has been built and tested against Cascading 3.3.0.

Hive Dependencies

Corc is built with Hive 2.3.4. Several dependencies will need to be included when using Corc:

<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-exec</artifactId>
  <version>2.3.4</version>
  <classifier>core</classifier>
</dependency>
<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-serde</artifactId>
  <version>2.3.4</version>
</dependency>
<dependency>
  <groupId>com.esotericsoftware.kryo</groupId>
  <artifactId>kryo</artifactId>
  <version>2.22</version>
</dependency>
<dependency>
  <groupId>com.google.protobuf</groupId>
  <artifactId>protobuf-java</artifactId>
  <version>2.5.0</version>
</dependency>

Overview

Corc maps Hive types to the following Cascading/Java types:

Hive        Cascading/Java
----        --------------
STRING      String
BOOLEAN     Boolean
TINYINT     Byte
SMALLINT    Short
INT         Integer
BIGINT      Long
FLOAT       Float
DOUBLE      Double
TIMESTAMP   java.sql.Timestamp
DATE        java.sql.Date
BINARY      byte[]
CHAR        String (HiveChar)
VARCHAR     String (HiveVarchar)
DECIMAL     BigDecimal (HiveDecimal)
ARRAY       List<Object>
MAP         Map<Object, Object>
STRUCT      List<Object>
UNIONTYPE   Sub-type
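As a hypothetical illustration of this mapping (not taken from the project's own examples), a row read from an ORC schema of struct<name:string,tags:array<string>> would surface in a Cascading tuple roughly like this:

void processRow(TupleEntry entry) {
  String name = (String) entry.getObject("name");             // STRING -> String
  @SuppressWarnings("unchecked")
  List<Object> tags = (List<Object>) entry.getObject("tags"); // ARRAY  -> List<Object>
}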

Constructing an OrcFile instance

OrcFile provides two public constructors: one for sourcing and one for sinking. These exist mainly to give flexibility to anyone who wishes to extend the class; in normal use it is advised to construct an instance via the SourceBuilder and SinkBuilder classes.

SourceBuilder

Create a builder:

SourceBuilder builder = OrcFile.source();

Specify the fields that should be read. If the declared schema is a subset of the complete schema, then column projection will occur:

builder.declaredFields(fields);
// or
builder.columns(structTypeInfo);
// or
builder.columns(structTypeInfoString);
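For example, a typed Fields declaration projecting two hypothetical columns, id and message, might look like this (a sketch, not from the project's documentation; Type here is java.lang.reflect.Type):

Fields fields = new Fields(new Comparable[] { "id", "message" },
    new Type[] { Long.class, String.class });
builder.declaredFields(fields);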

Specify the complete schema of the underlying ORC Files. This is only required when reading ORC Files that back a transactional Hive table; otherwise the default behaviour, obtaining the schema from the ORC Files being read, should be used:

builder.schemaFromFile();
// or
builder.schema(fields);
// or
builder.schema(structTypeInfo);
// or
builder.schema(structTypeInfoString);

ORC Files support predicate pushdown. This allows whole row groups to be skipped if they do not contain any rows that match the given SearchArgument:

Fields message = new Fields("message", String.class);
SearchArgument searchArgument = SearchArgumentFactory.newBuilder()
    .startAnd()
    .equals(message, "hello")
    .end()
    .build();

builder.searchArgument(searchArgument);

When passing objects to the SearchArgument.Builder, care should be taken to choose the correct type:

Hive        Java
----        ----
STRING      String
BOOLEAN     Boolean
TINYINT     Byte
SMALLINT    Short
INT         Integer
BIGINT      Long
FLOAT       Float
DOUBLE      Double
TIMESTAMP   java.sql.Timestamp
DATE        org.apache.hadoop.hive.serde2.io.DateWritable
CHAR        String (HiveChar)
VARCHAR     String (HiveVarchar)
DECIMAL     BigDecimal
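For instance, a predicate on a hypothetical DATE column must use a DateWritable rather than a java.sql.Date (a hedged sketch; the column name is illustrative):

Fields eventDate = new Fields("event_date", java.sql.Date.class);
SearchArgument searchArgument = SearchArgumentFactory.newBuilder()
    .startAnd()
    .equals(eventDate, new DateWritable(java.sql.Date.valueOf("2015-01-01")))
    .end()
    .build();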

When reading ORC Files that back a transactional Hive table, include the VirtualColumn#ROWID ("ROW__ID") virtual column. The column will be prepended to the record's Fields:

builder.prependRowId();

Finally, build the OrcFile:

OrcFile orcFile = builder.build();

SinkBuilder

OrcFile orcFile = OrcFile.sink()
    .schema(schema)
    .build();

The schema parameter can be one of Fields, StructTypeInfo, or the String representation of the StructTypeInfo. When providing a Fields instance, take care when deciding how best to specify the types, as there is no one-to-one, bidirectional mapping between Cascading types and Hive types: a TypeInfo can represent richer, more complex types. Consider your ORC File schema and the mappings to Fields types carefully, as illustrated in the sketch below.
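As a hedged illustration of that caveat, with a hypothetical message column: a Cascading String field presumably maps to a plain STRING, whereas a TypeInfo can pin the column down to, say, VARCHAR(100):

// A Fields-based schema: String.class can only describe a plain STRING column.
Fields fields = new Fields("message", String.class);

// A TypeInfo-based schema can express richer types, e.g. VARCHAR(100).
StructTypeInfo structTypeInfo = new StructTypeInfoBuilder()
    .add("message", TypeInfoFactory.getVarcharTypeInfo(100))
    .build();

OrcFile orcFile = OrcFile.sink()
    .schema(structTypeInfo)
    .build();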

Constructing a StructTypeInfo instance

List<String> names = new ArrayList<>();
names.add("col0");
names.add("col1");

List<TypeInfo> typeInfos = new ArrayList<>();
typeInfos.add(TypeInfoFactory.stringTypeInfo);
typeInfos.add(TypeInfoFactory.longTypeInfo);

StructTypeInfo structTypeInfo = (StructTypeInfo) TypeInfoFactory.getStructTypeInfo(names, typeInfos);

or...

String typeString = "struct<col0:string,col1:bigint>";

StructTypeInfo structTypeInfo = (StructTypeInfo) TypeInfoUtils.getTypeInfoFromTypeString(typeString);

or, via the convenience builder...

StructTypeInfo structTypeInfo = new StructTypeInfoBuilder()
    .add("col0", TypeInfoFactory.stringTypeInfo)
    .add("col1", TypeInfoFactory.longTypeInfo)
    .build();

Reading transactional Hive tables

Corc also supports reading the ACID datasets that underpin transactional Hive tables. However, for this to work effectively with an active Hive table you must provide your own lock management; we intend to make this functionality available in the cascading-hive project. When reading the data you may optionally include the virtual RecordIdentifier column, also known as the ROW__ID column, with one of the following approaches:

  1. Add a field named 'ROW__ID' to your Fields definition. This must be of type org.apache.hadoop.hive.ql.io.RecordIdentifier. For convenience you can use the constant OrcFile#ROW__ID with some fields arithmetic: Fields myFields = Fields.join(OrcFile.ROW__ID, myFields);.
  2. Use the OrcFile.source().prependRowId() option. Be sure to exclude the RecordIdentifier column from your typeInfo instance. The ROW__ID field will be added to your tuple stream automatically; see the sketch after this list.
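A minimal sketch of approach 2 (the schema string is hypothetical), assuming the declared columns, which exclude ROW__ID, also form the complete table schema:

StructTypeInfo typeInfo = (StructTypeInfo) TypeInfoUtils
    .getTypeInfoFromTypeString("struct<col0:string,col1:bigint>");

OrcFile orcFile = OrcFile.source()
    .columns(typeInfo)  // ROW__ID deliberately excluded
    .schema(typeInfo)   // complete schema, required for transactional tables
    .prependRowId()     // ROW__ID is prepended to the record's Fields
    .build();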

Usage

OrcFile can be used with Hfs, just like TextDelimited.

OrcFile orcFile = ...
String path = ...
Hfs hfs = new Hfs(orcFile, path);
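Putting it together, a minimal copy job might look like the following sketch; the paths, schema, and the choice of Hadoop2MR1FlowConnector are illustrative assumptions:

StructTypeInfo schema = new StructTypeInfoBuilder()
    .add("col0", TypeInfoFactory.stringTypeInfo)
    .add("col1", TypeInfoFactory.longTypeInfo)
    .build();

// Source and sink taps backed by ORC Files.
Tap<?, ?, ?> source = new Hfs(OrcFile.source().columns(schema).schemaFromFile().build(), "/input/path");
Tap<?, ?, ?> sink = new Hfs(OrcFile.sink().schema(schema).build(), "/output/path");

// A pass-through pipe, connected and run on the Hadoop MR1 platform.
Pipe pipe = new Pipe("copy");
Flow<?> flow = new Hadoop2MR1FlowConnector().connect(source, sink, pipe);
flow.complete();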

Credits

Created by Dave Maughan & Elliot West, with thanks to: Patrick Duin, James Grant & Adrian Woodhead.

Legal

This project is available under the Apache 2.0 License.

Copyright 2015-2020 Expedia, Inc.


corc's Issues

CoGroup issue with Cascading on Tez

Hi,

I opened a thread on the Cascading user group (https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/cascading-user/GDauDn91NtA/BLjWjL7GAwAJ) but it seems to be related to the ORC scheme.

When using the Corc scheme and a CoGroup operation with a different tap (for example, one coming from an Avro file), about half of the lines are lost in the output (it looks like some of the reducers aren't writing anything). More details and a piece of code are contained in the thread above. I might be able to send a test class with data if needed. We don't have this issue with another ORC Scheme (https://github.com/branky/cascading.hive).

Using Predicate Pushdown still reads records that don't match.

To reproduce, add a few more records that don't match the predicate to the writer in the OrcFileTest.readStringPredicatePushdownIncludeStripe test:

@Test
public void readStringPredicatePushdownIncludeStripe() throws IOException {
  TypeInfo typeInfo = TypeInfoFactory.stringTypeInfo;

  try (OrcWriter writer = getOrcWriter(typeInfo)) {
    writer.addRow("hello");
    writer.addRow("goodbye");
  }

  StructTypeInfo structTypeInfo = new StructTypeInfoBuilder().add("a", TypeInfoFactory.stringTypeInfo).build();

  SearchArgument searchArgument = SearchArgumentFactory.newBuilder().startAnd().equals("a", "hello").end().build();

  OrcFile orcFile = OrcFile.source().columns(structTypeInfo).schemaFromFile().searchArgument(searchArgument).build();
  Tap<?, ?, ?> tap = new Hfs(orcFile, path);

  List<Tuple> list = Plunger.readDataFromTap(tap).asTupleList();

  assertThat(list.size(), is(1)); // Fails: "goodbye" sits in the same row group, so it is read too
}

Cannot write Union types

Writing a union type results in:

java.lang.ClassCastException: org.apache.hadoop.hive.serde2.objectinspector.StandardUnionObjectInspector$StandardUnion cannot be cast to org.apache.hadoop.hive.ql.io.orc.OrcUnion

DefaultConverterFactory.UnionConverter.toWritableObjectInternal returns a StandardUnion. However, Corc uses OrcStructInspector, which in turn uses OrcUnionObjectInspector; OrcUnionObjectInspector.getTag casts its parameter to OrcUnion, which results in the ClassCastException above.

OrcUnion is currently package-private, which limits the options. A quick, dirty fix would be to create a public OrcUnion factory in the org.apache.hadoop.hive.ql.io.orc package. However, it should be possible to use TypeInfoUtils.getStandardWritableObjectInspectorFromTypeInfo in Corc for writing, which would allow the continued use of StandardUnion in UnionConverter.

Cannot set primitive value even if type is primitive in SearchArguments

Consider the following failing unit test example for SearchArgumentFactoryTest:

@Test
public void primitiveBoolean() throws Exception {
  Fields booleanField = new Fields("A", boolean.class);
  SearchArgumentFactory.Builder.checkValueTypes(booleanField, true);
}

It fails because the primitive is autoboxed to an Object in all of the operator methods of SearchArgumentFactory, causing checkValueTypes(...) to throw an exception. It works if you use Hive's search args directly (probably because their checking is less strict).

The same holds for the other primitive types.
