expediagroup / corc
An ORC File Scheme for the Cascading data processing platform.
License: Apache License 2.0
To reproduce, add a few more records that do not match the predicate to the writer in the OrcFileTest.readStringPredicatePushdownIncludeStripe test:
    public void readStringPredicatePushdownIncludeStripe() throws IOException {
      TypeInfo typeInfo = TypeInfoFactory.stringTypeInfo;
      try (OrcWriter writer = getOrcWriter(typeInfo)) {
        writer.addRow("hello");
        writer.addRow("goodbye");
      }
      StructTypeInfo structTypeInfo = new StructTypeInfoBuilder().add("a", TypeInfoFactory.stringTypeInfo).build();
      SearchArgument searchArgument = SearchArgumentFactory.newBuilder().startAnd().equals("a", "hello").end().build();
      OrcFile orcFile = OrcFile.source().columns(structTypeInfo).schemaFromFile().searchArgument(searchArgument).build();
      Tap<?, ?, ?> tap = new Hfs(orcFile, path);
      List<Tuple> list = Plunger.readDataFromTap(tap).asTupleList();
      assertThat(list.size(), is(1)); // Fails
    }
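One possible explanation for the failing assertion is that ORC evaluates a search argument against stripe (or row group) statistics in order to skip whole blocks, rather than filtering individual rows. The sketch below is a simplified, self-contained model of that behaviour and is not Corc or Hive code: the class and method names are hypothetical, and a "stripe" is modelled as a plain list of strings with min/max statistics.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical model of statistics-based pushdown; not the Corc or Hive implementation.
final class StripePushdownSketch {

    // A stripe can only be skipped when its [min, max] range excludes the target.
    static boolean stripeMayContain(List<String> stripe, String target) {
        String min = Collections.min(stripe);
        String max = Collections.max(stripe);
        return min.compareTo(target) <= 0 && target.compareTo(max) <= 0;
    }

    // Pushdown alone: every row of every stripe that was not skipped is returned.
    static List<String> readWithPushdown(List<List<String>> stripes, String target) {
        List<String> rows = new ArrayList<>();
        for (List<String> stripe : stripes) {
            if (stripeMayContain(stripe, target)) {
                rows.addAll(stripe);
            }
        }
        return rows;
    }

    // Pushdown followed by a row-level filter that drops non-matching rows.
    static List<String> readWithRowFilter(List<List<String>> stripes, String target) {
        List<String> rows = new ArrayList<>();
        for (String row : readWithPushdown(stripes, target)) {
            if (row.equals(target)) {
                rows.add(row);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        List<List<String>> stripes = Arrays.asList(
            Arrays.asList("hello", "goodbye"),  // range [goodbye, hello] includes "hello"
            Arrays.asList("apple", "banana"));  // range [apple, banana] excludes "hello"
        System.out.println(readWithPushdown(stripes, "hello"));  // prints [hello, goodbye]
        System.out.println(readWithRowFilter(stripes, "hello")); // prints [hello]
    }
}
```

In this model the stripe containing "hello" and "goodbye" is included as a whole, so two rows come back unless a row-level filter is applied afterwards.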
Consider the following failing unit test example for SearchArgumentFactoryTest:
    @Test
    public void primitiveBoolean() throws Exception {
      Fields booleanField = new Fields("A", boolean.class);
      SearchArgumentFactory.Builder.checkValueTypes(booleanField, true);
    }
It fails because the primitive is autoboxed to an Object in all operator methods of SearchArgumentFactory, which makes checkValueTypes(...) throw an exception. It works if you use Hive's search arguments directly (probably because Hive is less strict in its checking).
The same holds for other primitive types.
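The autoboxing mismatch can be shown without any Cascading or Corc classes. The sketch below assumes a strict check that compares the declared field type with the runtime class of the value; the real checkValueTypes implementation may differ, and all names here (PrimitiveTypeCheck, strictMatch, boxedMatch) are hypothetical. It also shows one way such a check could tolerate primitives, by mapping each primitive class to its wrapper before comparing.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; not the Corc implementation of checkValueTypes.
final class PrimitiveTypeCheck {

    // Primitive classes mapped to their wrapper classes.
    private static final Map<Class<?>, Class<?>> WRAPPERS = new HashMap<>();
    static {
        WRAPPERS.put(boolean.class, Boolean.class);
        WRAPPERS.put(byte.class, Byte.class);
        WRAPPERS.put(char.class, Character.class);
        WRAPPERS.put(short.class, Short.class);
        WRAPPERS.put(int.class, Integer.class);
        WRAPPERS.put(long.class, Long.class);
        WRAPPERS.put(float.class, Float.class);
        WRAPPERS.put(double.class, Double.class);
    }

    // Assumed strict check: the declared type must equal the value's runtime class.
    static boolean strictMatch(Class<?> declared, Object value) {
        return declared.equals(value.getClass());
    }

    // Lenient check: primitives are compared via their wrapper class.
    static boolean boxedMatch(Class<?> declared, Object value) {
        Class<?> effective = WRAPPERS.getOrDefault(declared, declared);
        return effective.isInstance(value);
    }

    public static void main(String[] args) {
        Object value = true; // autoboxed to java.lang.Boolean
        System.out.println(strictMatch(boolean.class, value)); // prints false
        System.out.println(boxedMatch(boolean.class, value));  // prints true
    }
}
```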
Hi,
I opened a thread on the Cascading user group (https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/cascading-user/GDauDn91NtA/BLjWjL7GAwAJ) but it seems to be related to the ORC scheme.
When using the Corc scheme and a CoGroup operation with a different tap (reading from an Avro file, for example), about half the lines are lost in the output (it looks like some of the reducers aren't writing anything). More details and a piece of code are in the thread above. I might be able to send a test class with data if needed. We don't have this issue with another ORC scheme (https://github.com/branky/cascading.hive).
I'm having what seems to be this issue: https://issues.apache.org/jira/browse/HIVE-10790. Per the ticket, this should be fixed in Hive 2.0. However, Corc uses Hive 1.0. Is there any way we can work around this issue, or is upgrading to Hive 2.0 strictly required? If so, is that in the works? I would be happy to help with this.
Writing a union type results in:
java.lang.ClassCastException: org.apache.hadoop.hive.serde2.objectinspector.StandardUnionObjectInspector$StandardUnion cannot be cast to org.apache.hadoop.hive.ql.io.orc.OrcUnion
DefaultConverterFactory.UnionConverter.toWritableObjectInternal returns a StandardUnion. However, Corc uses OrcStructInspector, which in turn uses OrcUnionObjectInspector. OrcUnionObjectInspector.getTag casts the parameter to OrcUnion, which results in the ClassCastException above.
OrcUnion is currently package private, which limits options. A quick, dirty fix would be to create a public OrcUnion factory in the org.apache.hadoop.hive.ql.io.orc package. However, it should be possible to use TypeInfoUtils.getStandardWritableObjectInspectorFromTypeInfo in Corc for writing, which would allow the continued use of StandardUnion in UnionConverter.
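The mechanism behind the exception can be reproduced with stand-in types. The sketch below does not use Hive: UnionObject, StandardUnion and OrcUnion are simplified, hypothetical stand-ins that only mirror the names in the report, and the two tag methods are assumptions about how an inspector might read the tag. It contrasts a cast to one concrete union class with a cast to a shared interface.

```java
// Simplified stand-ins for illustration only; these are not Hive's classes.
final class UnionCastSketch {

    interface UnionObject {
        byte getTag();
        Object getObject();
    }

    static final class StandardUnion implements UnionObject {
        private final byte tag;
        private final Object object;
        StandardUnion(byte tag, Object object) { this.tag = tag; this.object = object; }
        public byte getTag() { return tag; }
        public Object getObject() { return object; }
    }

    static final class OrcUnion implements UnionObject {
        private final byte tag;
        private final Object object;
        OrcUnion(byte tag, Object object) { this.tag = tag; this.object = object; }
        public byte getTag() { return tag; }
        public Object getObject() { return object; }
    }

    // Casts to one concrete class: any other union implementation is rejected.
    static byte tagViaConcreteCast(Object union) {
        return ((OrcUnion) union).getTag();
    }

    // Casts to the shared interface: any union implementation is accepted.
    static byte tagViaInterface(Object union) {
        return ((UnionObject) union).getTag();
    }

    public static void main(String[] args) {
        Object union = new StandardUnion((byte) 1, "value");
        System.out.println(tagViaInterface(union)); // prints 1
        try {
            tagViaConcreteCast(union);
        } catch (ClassCastException e) {
            System.out.println("ClassCastException: StandardUnion is not an OrcUnion");
        }
    }
}
```

Under these assumptions, pairing the converter's output with an inspector that only requires the shared interface avoids the cast failure, which is the idea behind switching to a standard writable object inspector for writing.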