
cubert's People

Contributors

mvarshney

cubert's Issues

Compile Error with custom Test UDAF from cubert code

I have a scenario in which I need to pick the first non-null element for each grouping key of the CUBE operator.

Where can I find reference steps for writing a custom aggregation function for the CUBE operator in Cubert?

The reference for it at http://linkedin.github.io/Cubert/userdefined/aggregations.html is empty.

I also tried to implement the custom test UDAF specified at the path below:

https://github.com/linkedin/Cubert/blob/master/src/test/java/com/linkedin/cubert/operator/TestUDAF.java

I created the jar, registered it, and declared the function in the Cubert script that uses the UDAF. However, the script fails with the error below:

Caused by: PreconditionException [MISC_ERROR] com.comscore.cookiecount.cubert.functions.TestUDAF should implement one of these interfaces: AdditiveCubeAggregate, PartitionedAdditiveAggregate, EasyCubeAggregate
at com.linkedin.cubert.operator.CubeOperator.createAggregators(CubeOperator.java:641)
at com.linkedin.cubert.operator.CubeOperator.createOutputSchema(CubeOperator.java:449)
at com.linkedin.cubert.operator.CubeOperator.getPostCondition(CubeOperator.java:413)
at com.linkedin.cubert.analyzer.physical.SemanticAnalyzer.getPostCondition(SemanticAnalyzer.java:811)
at com.linkedin.cubert.analyzer.physical.SemanticAnalyzer.visitOperator(SemanticAnalyzer.java:309)
... 9 more

Is there something I'm missing in this implementation?

Thanks,

Swapnil Salunkhe
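
A note on the error: the exception message names the contract directly — the CUBE operator only accepts aggregators that implement AdditiveCubeAggregate, PartitionedAdditiveAggregate, or EasyCubeAggregate. A UDAF modeled on the linked TestUDAF presumably implements none of these (it targets the ordinary GROUP BY aggregation path), so CUBE rejects it during semantic analysis. A quick way to confirm what a compiled UDAF class actually implements, using plain Java reflection (no Cubert API assumed; the class name comes from the error message):

    // Hypothetical diagnostic: print every interface a class implements.
    public class ListInterfaces {
        public static void main(String[] args) throws Exception {
            Class<?> c = Class.forName(args[0]); // e.g. com.myudfs.cubert.functions.TestUDAF
            for (Class<?> i : c.getInterfaces()) {
                System.out.println(i.getName());
            }
        }
    }

If none of the three cube interfaces appear in the output, the fix is to implement one of them; judging by the names alone, EasyCubeAggregate is the simplest entry point, and its exact method signatures should be taken from the Cubert source.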

Error happened! The demo doesn't provide the .avro files needed for loading n-dims input

Hi Mani,

An error happened:

Exception in thread "main" java.lang.RuntimeException: java.io.IOException: Not a data file.
at com.linkedin.cubert.analyzer.physical.DependencyAnalyzer.exitProgram(DependencyAnalyzer.java:299)
at com.linkedin.cubert.analyzer.physical.PhysicalPlanWalker.walk(PhysicalPlanWalker.java:75)
at com.linkedin.cubert.analyzer.physical.DependencyAnalyzer.rewrite(DependencyAnalyzer.java:93)
at com.linkedin.cubert.ScriptExecutor.rewrite(ScriptExecutor.java:343)
at com.linkedin.cubert.ScriptExecutor.main(ScriptExecutor.java:529)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

This project doesn't include the .avro or .txt files needed when aggregating a cube over n dimensions, so I don't know how to create a .avro file or how to organize the data inside it. Could you upload some demo files that show the inner data structure of the .avro file?

Thanks a lot!

theseus yang
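
For anyone else hitting "Not a data file.": that IOException comes from Avro itself when the input is not an Avro container file, so plain text inputs cannot be loaded USING AVRO. Below is a minimal sketch of creating a valid .avro input with the Avro Java API; the record name and fields (dim1, dim2, metric) are made up for illustration, and the real schema should match the dimensions and measures the demo script expects:

    import java.io.File;
    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    public class WriteDemoAvro {
        public static void main(String[] args) throws Exception {
            // Hypothetical 3-column layout: two dimensions and one measure.
            Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Row\",\"fields\":["
                + "{\"name\":\"dim1\",\"type\":\"string\"},"
                + "{\"name\":\"dim2\",\"type\":\"string\"},"
                + "{\"name\":\"metric\",\"type\":\"long\"}]}");

            // DataFileWriter produces a proper Avro container file (magic header,
            // embedded schema, data blocks) -- exactly what the loader checks for.
            try (DataFileWriter<GenericRecord> writer =
                     new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema))) {
                writer.create(schema, new File("demo.avro"));
                GenericRecord row = new GenericData.Record(schema);
                row.put("dim1", "us");
                row.put("dim2", "mobile");
                row.put("metric", 42L);
                writer.append(row);
            }
        }
    }

Copying the resulting demo.avro to HDFS gives LOAD ... USING AVRO a file with the proper container header, which is what the "Not a data file." check is complaining about.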

Cannot read partitioned avro files

Hi,

I've tried loading Avro files with the following structure:

/path/to/avro/daily/year=2014/month=12/day=05/country=de/de-r-00000.avro

Using the following script:

JOB "job1"
        REDUCERS 50;
        MAP {
                input = LOAD "/path/to/avro" USING AVRO;
        }
...
END

But I get the following error:

[Dependency Analyzer] Program inputs: [/path/to/avro]

Cannot compile cubert script. Exiting.
java.lang.RuntimeException: java.io.IOException: there are no files in /path/to/avro
    at com.linkedin.cubert.analyzer.physical.DependencyAnalyzer.exitProgram(DependencyAnalyzer.java:277)
    at com.linkedin.cubert.analyzer.physical.PhysicalPlanWalker.walk(PhysicalPlanWalker.java:75)
    at com.linkedin.cubert.analyzer.physical.DependencyAnalyzer.rewrite(DependencyAnalyzer.java:91)
    at com.linkedin.cubert.ScriptExecutor.rewrite(ScriptExecutor.java:319)
    at com.linkedin.cubert.ScriptExecutor.main(ScriptExecutor.java:481)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Caused by: java.io.IOException: there are no files in /path/to/avro
    at com.linkedin.cubert.utils.AvroUtils.getSchema(AvroUtils.java:71)
    at com.linkedin.cubert.io.avro.AvroStorage.getPostCondition(AvroStorage.java:109)
    at com.linkedin.cubert.analyzer.physical.DependencyAnalyzer.getPostCondition(DependencyAnalyzer.java:309)
    at com.linkedin.cubert.analyzer.physical.DependencyAnalyzer.exitProgram(DependencyAnalyzer.java:262)
    ... 9 more

Everything works perfectly fine if I load the de-r-00000.avro file directly, but not if I point to the directory containing the partitions.
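
A note on the trace: per the stack, AvroUtils.getSchema lists the files directly under the supplied path to pick up a schema, and with a Hive-style partition layout that top directory contains only subdirectories, hence "there are no files". As a workaround (a guess, assuming a leaf directory behaves like the single-file case that already works), point LOAD at a partition directory that actually contains the .avro files:

    input = LOAD "/path/to/avro/daily/year=2014/month=12/day=05/country=de" USING AVRO;

Truly recursive discovery across all partitions would need support in AvroStorage itself.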

Hadoop compatibility issues? Hadoop 2.5.1 gives Exception in thread "main" java.lang.IncompatibleClassChangeError

Hi, gang,

When I try to run the tutorial, I get the following error message:

Exception in thread "main" java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
at com.linkedin.cubert.io.CubertInputFormat.getSplits(CubertInputFormat.java:74)
at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:493)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:510)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:394)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1285)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1282)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1282)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1303)
at com.linkedin.cubert.plan.physical.JobExecutor.run(JobExecutor.java:148)
at com.linkedin.cubert.plan.physical.ExecutorService.executeJob(ExecutorService.java:229)
at com.linkedin.cubert.plan.physical.ExecutorService.executeJobId(ExecutorService.java:196)
at com.linkedin.cubert.plan.physical.ExecutorService.execute(ExecutorService.java:140)
at com.linkedin.cubert.ScriptExecutor.execute(ScriptExecutor.java:301)
at com.linkedin.cubert.ScriptExecutor.main(ScriptExecutor.java:517)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)

Could this be a Hadoop compatibility issue? I am using Apache Hadoop 2.5.1.

Best,
Charlie
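
A note for anyone hitting this: "Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected" is the classic Hadoop 1.x vs 2.x binary incompatibility. JobContext was a concrete class in the Hadoop 1.x mapreduce API and became an interface in 2.x, so a jar compiled against 1.x fails at link time on a 2.x cluster such as 2.5.1. Rebuilding Cubert from source against the Hadoop 2.x dependencies should resolve it; the exact build settings depend on Cubert's build files.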

Error in bin/cubert

Hi, gang,

There is an error in bin/cubert when running on any case-sensitive OS, such as Linux or Mac OS.

Line 13: CUBERT_JAR=`echo $CUBERT_HOME/lib/cubert-*.jar`

Should be: CUBERT_JAR=`echo $CUBERT_HOME/lib/Cubert-*.jar`

Good job though!

Charlie Zha
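
A hedged alternative for bin/cubert, assuming line 13 uses command substitution around echo as quoted above: a glob that matches either capitalization sidesteps the case-sensitivity problem entirely:

    CUBERT_JAR=$(echo "$CUBERT_HOME"/lib/[Cc]ubert-*.jar)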

COUNT_DISTINCT computes the same value for two different dimension keys

Hi, @mvarshney Maneesh Varshney:

Recently I have been researching Cubert.

I translated a SQL query into a Cubert script using the CUBE operator with grouping sets.

Why do COUNT_DISTINCT(mid) and COUNT_DISTINCT(session_id) have the same result values after computing the cube grouping sets?

For example:

    count_distinct(mid)    count_distinct(session_id)
    500                    500
    200                    200

Can anyone tell me whether I'm writing the script incorrectly?

Here is part of my code:

JOB "job1"
    MAP {
        data = LOAD xxx USING TEXT
    }

    BLOCKGEN data BY SIZE 1000000 PARTITIONED ON mid, session_id;

    STORE data INTO "/cubert/temp" USING RUBIX("overwrite": "true");
END

JOB "job2"
    MAP {
        data = LOAD "" USING RUBIX
    }

    CUBE data BY
        columns...
        INNER mid, session_id
        AGGREGATES SUM(pv) AS pv,
                   COUNT_DISTINCT(mid) AS uv,
                   COUNT_DISTINCT(session_id) AS visits,
                   SUM(bounce) AS bounce
        GROUPING SETS
            (log_date, app_name, app_platform),
            (log_date, app_name, app_platform, is_new) ......
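
One guess at the cause, offered as speculation rather than a confirmed diagnosis: Cubert's exact COUNT_DISTINCT inside CUBE depends on the input blocks being partitioned on the column being distinct-counted, which is what the BLOCKGEN ... PARTITIONED ON clause arranges. With two different distinct-counted columns in one cube, a single partitioning cannot be correct for both at once, so one of the counts may be computed over the wrong block boundaries. It would be worth checking whether each COUNT_DISTINCT returns the expected value when it is the only one in the cube.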

BLOCKGEN causes Java heap space error

Hi, @suvodeep-pyne @mparkhe
When I perform a BLOCKGEN operation, a Java heap space exception is thrown during the final reduce. Increasing the number of REDUCERS did not seem to help.

2015-06-26 16:14:56,215 INFO [main] org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl: finalMerge called with 4 in-memory map-outputs and 5 on-disk map-outputs
2015-06-26 16:14:56,217 INFO [main] org.apache.hadoop.mapred.Merger: Merging 4 sorted segments
2015-06-26 16:14:56,217 INFO [main] org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 4 segments left of total size: 142558662 bytes
2015-06-26 16:14:57,234 INFO [main] org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl: Merged 4 segments, 142558706 bytes to disk to satisfy reduce memory limit
2015-06-26 16:14:57,235 INFO [main] org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl: Merging 6 files, 789110010 bytes from disk
2015-06-26 16:14:57,236 INFO [main] org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce
2015-06-26 16:14:57,236 INFO [main] org.apache.hadoop.mapred.Merger: Merging 6 sorted segments
2015-06-26 16:14:57,243 INFO [main] org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor [.snappy]
2015-06-26 16:14:57,243 INFO [main] org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 6 segments left of total size: 3293450894 bytes
2015-06-26 16:14:57,605 INFO [main] org.apache.hadoop.conf.Configuration.deprecation: mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
2015-06-26 16:15:44,841 FATAL [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.OutOfMemoryError: Java heap space
    at com.linkedin.cubert.memory.PagedByteArray.ensureCapacity(PagedByteArray.java:192)
    at com.linkedin.cubert.memory.PagedByteArray.write(PagedByteArray.java:141)
    at com.linkedin.cubert.memory.PagedByteArrayOutputStream.write(PagedByteArrayOutputStream.java:67)
    at java.io.DataOutputStream.write(DataOutputStream.java:107)
    at java.io.DataOutputStream.writeUTF(DataOutputStream.java:401)
    at java.io.DataOutputStream.writeUTF(DataOutputStream.java:323)
    at org.apache.pig.data.utils.SedesHelper.writeChararray(SedesHelper.java:66)
    at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:580)
    at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:462)
    at org.apache.pig.data.utils.SedesHelper.writeGenericTuple(SedesHelper.java:135)
    at org.apache.pig.data.BinInterSedes.writeTuple(BinInterSedes.java:650)
    at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:470)
    at org.apache.pig.data.BinSedesTuple.write(BinSedesTuple.java:40)
    at com.linkedin.cubert.io.DefaultTupleSerializer.serialize(DefaultTupleSerializer.java:41)
    at com.linkedin.cubert.io.DefaultTupleSerializer.serialize(DefaultTupleSerializer.java:28)
    at com.linkedin.cubert.utils.SerializedTupleStore.addToStore(SerializedTupleStore.java:118)
    at com.linkedin.cubert.block.CreateBlockOperator$StoredBlock.<init>(CreateBlockOperator.java:145)
    at com.linkedin.cubert.block.CreateBlockOperator.createBlock(CreateBlockOperator.java:536)
    at com.linkedin.cubert.block.CreateBlockOperator.next(CreateBlockOperator.java:488)
    at com.linkedin.cubert.plan.physical.PhaseExecutor.prepareOperatorChain(PhaseExecutor.java:261)
    at com.linkedin.cubert.plan.physical.PhaseExecutor.<init>(PhaseExecutor.java:111)
    at com.linkedin.cubert.plan.physical.CubertReducer.run(CubertReducer.java:68)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
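
A note on the trace: the OutOfMemoryError is raised while CreateBlockOperator serializes tuples into an in-memory SerializedTupleStore, i.e. each reducer buffers an entire block before writing it out. The heap requirement therefore scales with the block size rather than with the number of reducers, which would explain why raising the reducer count did not help. A hedged sketch of the first lever to try (data and the partition column here are placeholders for the script's own names): shrink the BY SIZE value so each buffered block is smaller,

    BLOCKGEN data BY SIZE 250000 PARTITIONED ON key;

or give the reduce tasks more heap through the standard Hadoop 2 settings (mapreduce.reduce.memory.mb and mapreduce.reduce.java.opts).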

JOIN operator fails to parse

Hi, @mparkhe
Could you give an example of the JOIN operator?
I followed the documentation at http://linkedin.github.io/Cubert/operators/join.html, but it doesn't seem to work.

JOB "Join count words"
    REDUCERS 5;
    MAP {
        a = LOAD "/cubert/words.txt" USING TEXT("schema": "STRING word");
    }

    MAP {
        b = LOAD "/cubert/words.txt" USING TEXT("schema": "STRING word");
    }

    test_joined = HASH-JOIN a BY word, b BY word;

    STORE test_joined INTO "/cubert/woud_count/join_output" USING TEXT();
END

shengli-mac$ cubert join.cmr
Using HADOOP_CLASSPATH=:/Users/shengli/git_repos/cubert/release/lib/*
line 13:9 mismatched input '=' expecting {'.', ID}
line 13:23 mismatched input 'BY' expecting '{'

Cannot parse cubert script. Exiting.
PROGRAM "Join Word Count";

Cannot compile cubert script. Exiting.
Exception in thread "main" java.text.ParseException
at com.linkedin.cubert.plan.physical.PhysicalParser.parsingTask(PhysicalParser.java:197)
at com.linkedin.cubert.plan.physical.PhysicalParser.parseInputStream(PhysicalParser.java:161)
at com.linkedin.cubert.plan.physical.PhysicalParser.parseProgram(PhysicalParser.java:156)
at com.linkedin.cubert.ScriptExecutor.compile(ScriptExecutor.java:304)
at com.linkedin.cubert.ScriptExecutor.main(ScriptExecutor.java:523)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
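
A guess at the parse failure: "line 13" in the parser output corresponds to the test_joined = HASH-JOIN ... statement, which sits at the top level of the JOB rather than inside a MAP block, and the grammar does not appear to accept an operator assignment there. The linked join documentation also shows JOIN running over inputs that were first BLOCKGEN'd and stored as RUBIX, rather than over two raw TEXT loads, so the script likely needs to follow that shape; the exact syntax should be taken from the examples on the join.html page.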
