
cubert's People

Contributors

mvarshney

cubert's Issues

Compile Error with custom Test UDAF from cubert code

I have a scenario in which I need to pick the first non-null element for each grouping key of the CUBE operator.

Where can I find reference steps for writing a custom aggregation function for the CUBE operator in Cubert?

The reference for it at http://linkedin.github.io/Cubert/userdefined/aggregations.html is empty.

I also tried to implement the custom test UDAF specified at the path below:

https://github.com/linkedin/Cubert/blob/master/src/test/java/com/linkedin/cubert/operator/TestUDAF.java

I created the jar, registered it, and declared the function in the Cubert script that uses the UDAF. However, the script fails with the error below:

Caused by: PreconditionException [MISC_ERROR] com.comscore.cookiecount.cubert.functions.TestUDAF should implement one of these interfaces: AdditiveCubeAggregate, PartitionedAdditiveAggregate, EasyCubeAggregate
at com.linkedin.cubert.operator.CubeOperator.createAggregators(CubeOperator.java:641)
at com.linkedin.cubert.operator.CubeOperator.createOutputSchema(CubeOperator.java:449)
at com.linkedin.cubert.operator.CubeOperator.getPostCondition(CubeOperator.java:413)
at com.linkedin.cubert.analyzer.physical.SemanticAnalyzer.getPostCondition(SemanticAnalyzer.java:811)
at com.linkedin.cubert.analyzer.physical.SemanticAnalyzer.visitOperator(SemanticAnalyzer.java:309)
... 9 more

Is there something I'm missing in this implementation?

Thanks,

Swapnil Salunkhe
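
A note on the error: the exception message names the contract directly — the CUBE operator only accepts aggregators that implement AdditiveCubeAggregate, PartitionedAdditiveAggregate, or EasyCubeAggregate. A UDAF modeled on the linked TestUDAF presumably implements none of these (it targets the ordinary GROUP BY aggregation path), so CUBE rejects it during semantic analysis. A quick way to confirm what a compiled UDAF class actually implements, using plain Java reflection (no Cubert API assumed; the class name comes from the error message):

    // Hypothetical diagnostic: print every interface a class implements.
    public class ListInterfaces {
        public static void main(String[] args) throws Exception {
            Class<?> c = Class.forName(args[0]); // e.g. com.myudfs.cubert.functions.TestUDAF
            for (Class<?> i : c.getInterfaces()) {
                System.out.println(i.getName());
            }
        }
    }

If none of the three cube interfaces appear in the output, the fix is to implement one of them; judging by the names alone, EasyCubeAggregate is the simplest entry point, and its exact method signatures should be taken from the Cubert source.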

Error happened! The demo doesn't provide the .avro files needed for loading n-dims input

Hi Mani,

An error happened:

Exception in thread "main" java.lang.RuntimeException: java.io.IOException: Not a data file.
at com.linkedin.cubert.analyzer.physical.DependencyAnalyzer.exitProgram(DependencyAnalyzer.java:299)
at com.linkedin.cubert.analyzer.physical.PhysicalPlanWalker.walk(PhysicalPlanWalker.java:75)
at com.linkedin.cubert.analyzer.physical.DependencyAnalyzer.rewrite(DependencyAnalyzer.java:93)
at com.linkedin.cubert.ScriptExecutor.rewrite(ScriptExecutor.java:343)
at com.linkedin.cubert.ScriptExecutor.main(ScriptExecutor.java:529)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

This project doesn't include the .avro or .txt files needed when aggregating a cube over n dimensions, so I don't know how to create a .avro file or how to organize the data inside it. Could you upload some demo files that show the inner data structure of the .avro file?

Thanks a lot!

theseus yang
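
For anyone else hitting "Not a data file.": that IOException comes from Avro itself when the input is not an Avro container file, so plain text inputs cannot be loaded USING AVRO. Below is a minimal sketch of creating a valid .avro input with the Avro Java API; the record name and fields (dim1, dim2, metric) are made up for illustration, and the real schema should match the dimensions and measures the demo script expects:

    import java.io.File;
    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    public class WriteDemoAvro {
        public static void main(String[] args) throws Exception {
            // Hypothetical 3-column layout: two dimensions and one measure.
            Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Row\",\"fields\":["
                + "{\"name\":\"dim1\",\"type\":\"string\"},"
                + "{\"name\":\"dim2\",\"type\":\"string\"},"
                + "{\"name\":\"metric\",\"type\":\"long\"}]}");

            // DataFileWriter produces a proper Avro container file (magic header,
            // embedded schema, data blocks) -- exactly what the loader checks for.
            try (DataFileWriter<GenericRecord> writer =
                     new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema))) {
                writer.create(schema, new File("demo.avro"));
                GenericRecord row = new GenericData.Record(schema);
                row.put("dim1", "us");
                row.put("dim2", "mobile");
                row.put("metric", 42L);
                writer.append(row);
            }
        }
    }

Copying the resulting demo.avro to HDFS gives LOAD ... USING AVRO a file with the proper container header, which is what the "Not a data file." check is complaining about.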

Cannot read partitioned avro files

Hi,

I've tried loading Avro files with the following structure:

/path/to/avro/daily/year=2014/month=12/day=05/country=de/de-r-00000.avro

Using the following script:

JOB "job1"
        REDUCERS 50;
        MAP {
                input = LOAD "/path/to/avro" USING AVRO;
        }
...
END

But I get the following error:

[Dependency Analyzer] Program inputs: [/path/to/avro]

Cannot compile cubert script. Exiting.
java.lang.RuntimeException: java.io.IOException: there are no files in /path/to/avro
    at com.linkedin.cubert.analyzer.physical.DependencyAnalyzer.exitProgram(DependencyAnalyzer.java:277)
    at com.linkedin.cubert.analyzer.physical.PhysicalPlanWalker.walk(PhysicalPlanWalker.java:75)
    at com.linkedin.cubert.analyzer.physical.DependencyAnalyzer.rewrite(DependencyAnalyzer.java:91)
    at com.linkedin.cubert.ScriptExecutor.rewrite(ScriptExecutor.java:319)
    at com.linkedin.cubert.ScriptExecutor.main(ScriptExecutor.java:481)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Caused by: java.io.IOException: there are no files in /path/to/avro
    at com.linkedin.cubert.utils.AvroUtils.getSchema(AvroUtils.java:71)
    at com.linkedin.cubert.io.avro.AvroStorage.getPostCondition(AvroStorage.java:109)
    at com.linkedin.cubert.analyzer.physical.DependencyAnalyzer.getPostCondition(DependencyAnalyzer.java:309)
    at com.linkedin.cubert.analyzer.physical.DependencyAnalyzer.exitProgram(DependencyAnalyzer.java:262)
    ... 9 more

Everything works perfectly fine if I load the de-r-00000.avro file directly, but not if I point to the directory containing the partitions.
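
A note on the trace: per the stack, AvroUtils.getSchema lists the files directly under the supplied path to pick up a schema, and with a Hive-style partition layout that top directory contains only subdirectories, hence "there are no files". As a workaround (a guess, assuming a leaf directory behaves like the single-file case that already works), point LOAD at a partition directory that actually contains the .avro files:

    input = LOAD "/path/to/avro/daily/year=2014/month=12/day=05/country=de" USING AVRO;

Truly recursive discovery across all partitions would need support in AvroStorage itself.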

Hadoop compatibility issues? Hadoop 2.5.1 gives Exception in thread "main" java.lang.IncompatibleClassChangeError

Hi, gang,

When I try to run the tutorial, I get the following error message:

Exception in thread "main" java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
at com.linkedin.cubert.io.CubertInputFormat.getSplits(CubertInputFormat.java:74)
at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:493)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:510)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:394)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1285)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1282)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1282)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1303)
at com.linkedin.cubert.plan.physical.JobExecutor.run(JobExecutor.java:148)
at com.linkedin.cubert.plan.physical.ExecutorService.executeJob(ExecutorService.java:229)
at com.linkedin.cubert.plan.physical.ExecutorService.executeJobId(ExecutorService.java:196)
at com.linkedin.cubert.plan.physical.ExecutorService.execute(ExecutorService.java:140)
at com.linkedin.cubert.ScriptExecutor.execute(ScriptExecutor.java:301)
at com.linkedin.cubert.ScriptExecutor.main(ScriptExecutor.java:517)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)

Could this be a Hadoop compatibility issue? I am using Apache Hadoop 2.5.1.

Best,
Charlie
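
A note for anyone hitting this: "Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected" is the classic Hadoop 1.x vs 2.x binary incompatibility. JobContext was a concrete class in the Hadoop 1.x mapreduce API and became an interface in 2.x, so a jar compiled against 1.x fails at link time on a 2.x cluster such as 2.5.1. Rebuilding Cubert from source against the Hadoop 2.x dependencies should resolve it; the exact build settings depend on Cubert's build files.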

Error in bin/cubert

Hi, gang,

There is an error in bin/cubert when running on any case-sensitive OS, such as Linux or Mac OS.

Line 13: CUBERT_JAR=`echo $CUBERT_HOME/lib/cubert-*.jar`

Should be: CUBERT_JAR=`echo $CUBERT_HOME/lib/Cubert-*.jar`

Good job though!

Charlie Zha
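
A hedged alternative for bin/cubert, assuming line 13 uses command substitution around echo as quoted above: a glob that matches either capitalization sidesteps the case-sensitivity problem entirely:

    CUBERT_JAR=$(echo "$CUBERT_HOME"/lib/[Cc]ubert-*.jar)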

COUNT_DISTINCT computes the same value for two different dimension keys

Hi, @mvarshney Maneesh Varshney:

Recently I have been researching Cubert.

I translated a SQL query into a Cubert script using the CUBE operator with grouping sets.

Why do COUNT_DISTINCT(mid) and COUNT_DISTINCT(session_id) have the same result values after computing the cube grouping sets?

For example:

    count_distinct(mid)    count_distinct(session_id)
    500                    500
    200                    200

Can anyone tell me whether I'm writing the script incorrectly?

Here is part of my code:

JOB "job1"
    MAP {
        data = LOAD xxx USING TEXT
    }

    BLOCKGEN data BY SIZE 1000000 PARTITIONED ON mid, session_id;

    STORE data INTO "/cubert/temp" USING RUBIX("overwrite": "true");
END

JOB "job2"
    MAP {
        data = LOAD "" USING RUBIX
    }

    CUBE data BY
        columns...
        INNER mid, session_id
        AGGREGATES SUM(pv) AS pv,
                   COUNT_DISTINCT(mid) AS uv,
                   COUNT_DISTINCT(session_id) AS visits,
                   SUM(bounce) AS bounce
        GROUPING SETS
            (log_date, app_name, app_platform),
            (log_date, app_name, app_platform, is_new) ......
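
One guess at the cause, offered as speculation rather than a confirmed diagnosis: Cubert's exact COUNT_DISTINCT inside CUBE depends on the input blocks being partitioned on the column being distinct-counted, which is what the BLOCKGEN ... PARTITIONED ON clause arranges. With two different distinct-counted columns in one cube, a single partitioning cannot be correct for both at once, so one of the counts may be computed over the wrong block boundaries. It would be worth checking whether each COUNT_DISTINCT returns the expected value when it is the only one in the cube.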

BLOCKGEN causes Java heap space error

Hi, @suvodeep-pyne @mparkhe
When I perform a BLOCKGEN operation, a Java heap space exception is thrown during the final reduce. Increasing the number of REDUCERS did not seem to help.

2015-06-26 16:14:56,215 INFO [main] org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl: finalMerge called with 4 in-memory map-outputs and 5 on-disk map-outputs
2015-06-26 16:14:56,217 INFO [main] org.apache.hadoop.mapred.Merger: Merging 4 sorted segments
2015-06-26 16:14:56,217 INFO [main] org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 4 segments left of total size: 142558662 bytes
2015-06-26 16:14:57,234 INFO [main] org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl: Merged 4 segments, 142558706 bytes to disk to satisfy reduce memory limit
2015-06-26 16:14:57,235 INFO [main] org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl: Merging 6 files, 789110010 bytes from disk
2015-06-26 16:14:57,236 INFO [main] org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce
2015-06-26 16:14:57,236 INFO [main] org.apache.hadoop.mapred.Merger: Merging 6 sorted segments
2015-06-26 16:14:57,243 INFO [main] org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor [.snappy]
2015-06-26 16:14:57,243 INFO [main] org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 6 segments left of total size: 3293450894 bytes
2015-06-26 16:14:57,605 INFO [main] org.apache.hadoop.conf.Configuration.deprecation: mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
2015-06-26 16:15:44,841 FATAL [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.OutOfMemoryError: Java heap space
    at com.linkedin.cubert.memory.PagedByteArray.ensureCapacity(PagedByteArray.java:192)
    at com.linkedin.cubert.memory.PagedByteArray.write(PagedByteArray.java:141)
    at com.linkedin.cubert.memory.PagedByteArrayOutputStream.write(PagedByteArrayOutputStream.java:67)
    at java.io.DataOutputStream.write(DataOutputStream.java:107)
    at java.io.DataOutputStream.writeUTF(DataOutputStream.java:401)
    at java.io.DataOutputStream.writeUTF(DataOutputStream.java:323)
    at org.apache.pig.data.utils.SedesHelper.writeChararray(SedesHelper.java:66)
    at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:580)
    at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:462)
    at org.apache.pig.data.utils.SedesHelper.writeGenericTuple(SedesHelper.java:135)
    at org.apache.pig.data.BinInterSedes.writeTuple(BinInterSedes.java:650)
    at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:470)
    at org.apache.pig.data.BinSedesTuple.write(BinSedesTuple.java:40)
    at com.linkedin.cubert.io.DefaultTupleSerializer.serialize(DefaultTupleSerializer.java:41)
    at com.linkedin.cubert.io.DefaultTupleSerializer.serialize(DefaultTupleSerializer.java:28)
    at com.linkedin.cubert.utils.SerializedTupleStore.addToStore(SerializedTupleStore.java:118)
    at com.linkedin.cubert.block.CreateBlockOperator$StoredBlock.<init>(CreateBlockOperator.java:145)
    at com.linkedin.cubert.block.CreateBlockOperator.createBlock(CreateBlockOperator.java:536)
    at com.linkedin.cubert.block.CreateBlockOperator.next(CreateBlockOperator.java:488)
    at com.linkedin.cubert.plan.physical.PhaseExecutor.prepareOperatorChain(PhaseExecutor.java:261)
    at com.linkedin.cubert.plan.physical.PhaseExecutor.<init>(PhaseExecutor.java:111)
    at com.linkedin.cubert.plan.physical.CubertReducer.run(CubertReducer.java:68)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
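
A note on the trace: the OutOfMemoryError is raised while CreateBlockOperator serializes tuples into an in-memory SerializedTupleStore, i.e. each reducer buffers an entire block before writing it out. The heap requirement therefore scales with the block size rather than with the number of reducers, which would explain why raising the reducer count did not help. A hedged sketch of the first lever to try (data and the partition column here are placeholders for the script's own names): shrink the BY SIZE value so each buffered block is smaller,

    BLOCKGEN data BY SIZE 250000 PARTITIONED ON key;

or give the reduce tasks more heap through the standard Hadoop 2 settings (mapreduce.reduce.memory.mb and mapreduce.reduce.java.opts).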

JOIN operator fails to parse

Hi, @mparkhe
Could you give an example of the JOIN operator?
I followed the documentation at http://linkedin.github.io/Cubert/operators/join.html, but it doesn't seem to work.

JOB "Join count words"
    REDUCERS 5;
    MAP {
        a = LOAD "/cubert/words.txt" USING TEXT("schema": "STRING word");
    }

    MAP {
        b = LOAD "/cubert/words.txt" USING TEXT("schema": "STRING word");
    }

    test_joined = HASH-JOIN a BY word, b BY word;

    STORE test_joined INTO "/cubert/woud_count/join_output" USING TEXT();
END

shengli-mac$ cubert join.cmr
Using HADOOP_CLASSPATH=:/Users/shengli/git_repos/cubert/release/lib/*
line 13:9 mismatched input '=' expecting {'.', ID}
line 13:23 mismatched input 'BY' expecting '{'

Cannot parse cubert script. Exiting.
PROGRAM "Join Word Count";

Cannot compile cubert script. Exiting.
Exception in thread "main" java.text.ParseException
at com.linkedin.cubert.plan.physical.PhysicalParser.parsingTask(PhysicalParser.java:197)
at com.linkedin.cubert.plan.physical.PhysicalParser.parseInputStream(PhysicalParser.java:161)
at com.linkedin.cubert.plan.physical.PhysicalParser.parseProgram(PhysicalParser.java:156)
at com.linkedin.cubert.ScriptExecutor.compile(ScriptExecutor.java:304)
at com.linkedin.cubert.ScriptExecutor.main(ScriptExecutor.java:523)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
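
A guess at the parse failure: "line 13" in the parser output corresponds to the test_joined = HASH-JOIN ... statement, which sits at the top level of the JOB rather than inside a MAP block, and the grammar does not appear to accept an operator assignment there. The linked join documentation also shows JOIN running over inputs that were first BLOCKGEN'd and stored as RUBIX, rather than over two raw TEXT loads, so the script likely needs to follow that shape; the exact syntax should be taken from the examples on the join.html page.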
