
hbutani / spark-druid-olap

285 stars · 50 watchers · 96 forks · 129.89 MB

Sparkline BI Accelerator provides fast ad-hoc query capability over Logical Cubes. This has been folded into our SNAP Platform (http://bit.ly/2oBJSpP), an integrated BI platform on Apache Spark.

Home Page: http://sparklinedata.com/

License: Apache License 2.0

Scala 93.80% Shell 6.20%
spark business-intelligence olap-cube sparksql query-optimization

spark-druid-olap's Introduction

Sparkline BI Accelerator

Latest release: 0.4.0
Documentation: Overview, Quick Start Guide, User Guide, Dev. Guide
Mailing List: User Mailing List
License: Apache 2.0
Continuous Integration: Build Status
Company: Sparkline Data

The Sparkline BI Accelerator is a Spark-native Business Intelligence stack geared towards providing fast ad-hoc querying over a Logical Cube (aka Star Schema). It simplifies how enterprises can provide an ad-hoc query layer on top of a Hadoop/Spark (Big Open Data) stack.

  • We provide the ad-hoc query capability by extending the Spark SQL layer, through SQL extensions and an extended Optimizer (both logical and physical optimizations); see the sketch after this list.
  • We use OLAP indexing rather than pre-materialization as the technique to achieve query performance. OLAP indexing is a well-known technique that is far superior to materialized views for supporting ad-hoc querying. We utilize another open-source, Apache-licensed Big Data component for the OLAP indexing capability.
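
As an illustration of how the extended SQL layer is used, here is a minimal registration-and-query sketch in the style of the Quick Start Guide (the Druid datasource "tpch", the base table "orderLineItemPartSupplierBase", and the host/port are assumptions taken from the quick-start examples that appear later on this page):

// Register a Druid-backed logical table over a base (raw) Spark table.
sql("""CREATE TEMPORARY TABLE orderLineItemPartSupplier
      USING org.sparklinedata.druid
      OPTIONS (sourceDataframe "orderLineItemPartSupplierBase",
               timeDimensionColumn "l_shipdate",
               druidDatasource "tpch",
               druidHost "localhost",
               druidPort "8082",
               starSchema '{ "factTable" : "orderLineItemPartSupplier", "relations" : [] }')""")

// Ad-hoc aggregations against this table are rewritten by the extended
// optimizer into Druid queries where possible.
sql("""select l_returnflag, sum(l_extendedprice) as s
      from orderLineItemPartSupplier
      group by l_returnflag""").show()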

Overall Picture

spark-druid-olap's People

Contributors

hbutani, jpullokkaran, rrbutani, sdesikan6, zimingwang


spark-druid-olap's Issues

Followup on UnresolvedException in Spark for aggregation of the form fnX(fnY(..),..)

Followup on the following Spark SQL issue: if there is an aggregation expression of the form fnX(fnY(..),..) (the top-level function has a function invocation as its child expression), then:

  • The ResolveFunctions rule skips the top-level invocation; the expression is left as an UnresolvedFunction expression.
  • In the ResolveGroupingAnalytics rule, the UnresolvedFunction is handled as an Alias.
  • In the Expand operator, when the projections are computed and this expression is masked as a null, a call is made to get its dataType (basicOperators:291). This causes an UnresolvedException to be thrown.

See DruidRewriteCubeTest::ShipDateYearAggCube for an example.

This can be reproduced by replacing the shipDtYrGroup groupByExpr with "concat(concat(l_linestatus, 'a'), 'b')"; more than one level of function invocation is needed, so "concat(l_linestatus, 'a')" on its own works fine. A sketch of the failing shape follows.
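
A hedged sketch of the shape described above, reusing the TPCH table from the other examples on this page (the exact setup in DruidRewriteCubeTest is an assumption):

// Grouping-set/cube query whose grouping expression is a nested function
// invocation (a concat of a concat); analysis fails with an UnresolvedException.
sql("""select concat(concat(l_linestatus, 'a'), 'b') as ls, count(*) as cnt
      from orderLineItemPartSupplier
      group by concat(concat(l_linestatus, 'a'), 'b') with cube""").show()

// A single level of function invocation analyzes fine:
sql("""select concat(l_linestatus, 'a') as ls, count(*) as cnt
      from orderLineItemPartSupplier
      group by concat(l_linestatus, 'a') with cube""").show()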

aggregation when joining with another table

I have an in-memory table click_cached, and I try to join it with a Druid table cl_events_test and aggregate through Druid, like this:

select count(1), cast(cl_events_test.timestamp as date) as theday
from cl_events_test, click_cached
where click_cached.customerId = cl_events_test.customerId
group by cast(cl_events_test.timestamp as date)

But I found that the Druid index is not used in this case.

explain select count(1),cast(cl_events_test.timestamp as date) as theday from cl_events_test, click_cached where click_cached.customerId=cl_events_test.customerId group by cast(cl_events_test.timestamp as date);
== Physical Plan ==
TungstenAggregate(key=[cast(timestamp#318 as date)#473], functions=[(count(1),mode=Final,isDistinct=false)], output=[_c0#456L,theday#448])
+- TungstenExchange hashpartitioning(cast(timestamp#318 as date)#473,200), None
   +- TungstenAggregate(key=[cast(timestamp#318 as date) AS cast(timestamp#318 as date)#473], functions=[(count(1),mode=Partial,isDistinct=false)], output=[cast(timestamp#318 as date)#473,count#475L])
      +- Project [timestamp#318]
         +- BroadcastHashJoin [customerId#316L], [customerId#453L], BuildRight
            :- Project [timestamp#318,customerId#316L]
            :  +- Scan DruidRelationInfo(fullName = DruidRelationName(cl_events_test,10.25.2.91,cl_events_test), sourceDFName = cl_events_base, timeDimensionCol = timestamp, options = DruidRelationOptions(1000000,100000,true,true,true,30000,true,/druid,true,false,1,true,None))[event#313,targetId#314,targetName#315,customerId#316L,source#317,timestamp#318]
            +- InMemoryColumnarTableScan [customerId#453L], InMemoryRelation [_c0#323L,theday#322,customerId#453L], true, 10000, StorageLevel(true, true, false, true, 1), Project [alias-2#325L AS _c0#323L,cast(alias-1#324 as date) AS theday#322,cast(customerId#316 as bigint) AS customerId#316L], Some(click_cached)

count(distinct(dimension)) cannot translate to hyperUnique

I created a batch ingestion spec and defined a hyperUnique metric in it. When I query data from the dataSource using Sparkline, I found that count(distinct(dimension)) does not translate into a hyperUnique aggregation. Is this a bug, a misuse on my part, or is this feature simply not supported yet?
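
For context, a hedged sketch of the query shape in question (the table and dimension names here are illustrative, not taken from the report):

// The reporter's expectation: with a hyperUnique metric defined over user_id in the
// batch ingestion spec, this distinct count would be rewritten into Druid's
// hyperUnique (approximate cardinality) aggregator rather than an exact count.
sql("""select count(distinct user_id) from events""").show()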

Enable rewrites against the underlying Fact table

For example, for TPCH queries: allow the query to be written against lineitembase (an illustration follows below).
The original thought was to add table properties to the underlying table, recording the fact that it is associated with a Druid Index/Datasource. But this will not work, because these table properties are not exposed in BaseRelation, so at the time of the query rewrite we have lost the association between a LogicalRelation operator and the underlying table.
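
A hedged illustration of the desired behavior, reusing TPCH column names from other queries on this page: a query written directly against the raw fact table should still be rewritten into a Druid query.

// Written against the raw fact table (lineitembase), not the Druid-backed table;
// the requested enhancement is that this, too, gets rewritten to Druid.
sql("""select l_returnflag, sum(l_extendedprice) as s
      from lineitembase
      group by l_returnflag""").show()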

Druid Query Plan not generated when raw dataset is JSON

I have a raw dataset that contains JSON objects. I am able to load it into Druid and query it using Druid queries. However, when I try to run a "groupBy" command using the accelerator (using 2.0), the Druid query plan does not get generated; instead, the query runs against the raw dataset:
sql("""CREATE TEMPORARY TABLE click_summary USING org.apache.spark.sql.json OPTIONS (path '/tmp/test/part-r-00000')""".stripMargin).printSchema() sql("""SELECT count(*) from click_summary""".stripMargin).show() sql(""" CREATE TEMPORARY TABLE clicksummarized USING org.sparklinedata.druid OPTIONS (sourceDataframe "click_summary", timeDimensionColumn "processingTime", druidDatasource "clickenhanced", druidHost "localhost", zkQualifyDiscoveryNames "true", queryHistoricalServers "true", numProcessingThreadsPerHistorical '1', starSchema ' { "factTable" : "clicksummarized", "relations" : [] } ')""".stripMargin) sql("""SELECT adIdChain.advertiser_guid, sum(clickCounters.total_click_count) as clicks from clicksummarized group by adIdChain.advertiser_guid""".stripMargin).show()

I tried specifying a columnMapping, but that doesn't help either:
With field names in Druid matching the nested naming of JSON
columnMapping '{"adIdChain.advertiser_guid" : "adIdChain.advertiser_guid","clickCounters.total_click_count" : "clickCounters.total_click_count"}',

With field names in Druid different from the nested naming of JSON
.option("columnMapping", "{\"adIdChain.advertiser_guid\" : \"adIdChain__advertiser_guid\"," + "\"clickCounters.total_click_count\" : \"clickCounters__total_click_count\"}")

handle spark datetime expressions in where clause on time dimension

"""
|SELECT Sum(lineitem.l_extendedprice) AS
| sum_l_extendedprice_ok,
| Cast(Concat(To_date(Cast(
| Concat(To_date(lineitem.l_shipdate), ' 00:00:00') AS
| TIMESTAMP)), ' 00:00:00') AS TIMESTAMP) AS
| tdy_l_shipdate_ok
|FROM (SELECT *
| FROM lineitem) lineitem
|
|WHERE ( ( Cast(Concat(To_date(lineitem.l_shipdate), ' 00:00:00') AS TIMESTAMP)
| >= Cast(
| '1993-05-19 00:00:00' AS TIMESTAMP) )
| AND ( Cast(Concat(To_date(lineitem.l_shipdate), ' 00:00:00') AS
| TIMESTAMP) <=
| Cast(
| '1998-08-02 00:00:00' AS TIMESTAMP) ) )
|GROUP BY Cast(Concat(To_date(Cast(
| Concat(To_date(lineitem.l_shipdate), ' 00:00:00') AS
| TIMESTAMP)), ' 00:00:00') AS TIMESTAMP)
""".stripMargin

Non-SQL df queries failing

Group by on columns of a DataFrame is not working.

scala> q1OLAP.groupBy("l_returnflag").count().show()

16/01/15 11:21:59 ERROR Executor: Exception in task 0.0 in stage 16.0 (TID 1285)
org.sparklinedata.druid.DruidDataSourceException: Unexpected response status: HTTP/1.1 500 Internal Server Error
at org.sparklinedata.druid.client.DruidClient$$anonfun$3$$anonfun$apply$1.apply(DruidClient.scala:86)
at org.sparklinedata.druid.client.DruidClient$$anonfun$3$$anonfun$apply$1.apply(DruidClient.scala:81)
at scala.util.Try$.apply(Try.scala:161)
at org.sparklinedata.druid.client.DruidClient$$anonfun$3.apply(DruidClient.scala:81)
at org.sparklinedata.druid.client.DruidClient$$anonfun$3.apply(DruidClient.scala:70)
at scala.util.Success.flatMap(Try.scala:200)
at org.sparklinedata.druid.client.DruidClient.perform(DruidClient.scala:70)
at org.sparklinedata.druid.client.DruidClient.post(DruidClient.scala:101)
at org.sparklinedata.druid.client.DruidClient.executeQuery(DruidClient.scala:150)
at org.sparklinedata.druid.DruidRDD.compute(DruidRDD.scala:46)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
16/01/15 11:21:59 WARN ThrowableSerializationWrapper: Task exception could not be deserialized
java.lang.ClassNotFoundException: org.sparklinedata.druid.DruidDataSourceException
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:270)
at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at org.apache.spark.ThrowableSerializationWrapper.readObject(TaskEndReason.scala:167)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:72)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:98)
at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$2.apply$mcV$sp(TaskResultGetter.scala:108)
at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$2.apply(TaskResultGetter.scala:105)
at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$2.apply(TaskResultGetter.scala:105)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
at org.apache.spark.scheduler.TaskResultGetter$$anon$3.run(TaskResultGetter.scala:105)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
16/01/15 11:21:59 ERROR TaskResultGetter: Could not deserialize TaskEndReason: ClassNotFound with classloader org.apache.spark.repl.SparkIMain$TranslatingClassLoader@41a64f33
16/01/15 11:21:59 WARN TaskSetManager: Lost task 0.0 in stage 16.0 (TID 1285, localhost): UnknownReason
16/01/15 11:21:59 ERROR TaskSetManager: Task 0 in stage 16.0 failed 1 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 16.0 failed 1 times, most recent failure: Lost task 0.0 in stage 16.0 (TID 1285, localhost): UnknownReason
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1824)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1837)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1850)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:215)
at org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:207)
at org.apache.spark.sql.DataFrame$$anonfun$collect$1.apply(DataFrame.scala:1385)
at org.apache.spark.sql.DataFrame$$anonfun$collect$1.apply(DataFrame.scala:1385)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
at org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1903)
at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1384)
at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1314)
at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1377)
at org.apache.spark.sql.DataFrame.showString(DataFrame.scala:178)
at org.apache.spark.sql.DataFrame.show(DataFrame.scala:401)
at org.apache.spark.sql.DataFrame.show(DataFrame.scala:362)
at org.apache.spark.sql.DataFrame.show(DataFrame.scala:370)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:23)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:28)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:30)
at $iwC$$iwC$$iwC$$iwC$$iwC.(:32)
at $iwC$$iwC$$iwC$$iwC.(:34)
at $iwC$$iwC$$iwC.(:36)
at $iwC$$iwC.(:38)
at $iwC.(:40)
at (:42)
at .(:46)
at .()
at .(:7)
at .()
at $print()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1340)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:674)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Executing the same query multiple times fetches the same result, even though new data has been ingested into Druid.

Hi Team,

I am executing the same query multiple times within the same session (same Spark context), while new data is simultaneously being ingested into Druid, but I am not getting updated results. Is there a way to clear the cache and fetch updated results from Druid?

If I kill the current session and start a new one, I do get the updated results.

Let me know if any fix or workaround is available for this issue.

Thanks,
Senthil

No such Table exception

When I run multiple Tableau users with many sheets, I get the following error. The table exists and the error is intermittent; the same query, executed 10-15 seconds after the error, goes through.

org.spark-project.guava.util.concurrent.UncheckedExecutionException: org.apache.spark.sql.catalyst.analysis.NoSuchTableException
at org.spark-project.guava.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4882)
at org.spark-project.guava.cache.LocalCache$LocalLoadingCache.apply(LocalCache.java:4898)
at org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:387)
at org.apache.spark.sql.hive.sparklinedata.SparklineMetastoreCatalog.org$apache$spark$sql$hive$sparklinedata$SparklineMetastoreCatalog$$super$lookupRelation(SparklineDataContext.scala:86)
at org.apache.spark.sql.hive.sparklinedata.SparklineMetastoreCatalog$$anonfun$lookupRelation$2.apply(SparklineDataContext.scala:86)
at org.apache.spark.sql.hive.sparklinedata.SparklineMetastoreCatalog$$anonfun$lookupRelation$2.apply(SparklineDataContext.scala:86)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.hive.sparklinedata.SparklineMetastoreCatalog.lookupRelation(SparklineDataContext.scala:86)
at org.apache.spark.sql.hive.sparklinedata.SparklineDataContext$$anon$1.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(SparklineDataContext.scala:74)
at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:161)
at org.apache.spark.sql.hive.sparklinedata.SparklineDataContext$$anon$1.lookupRelation(SparklineDataContext.scala:74)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.getTable(Analyzer.scala:303)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$9.applyOrElse(Analyzer.scala:315)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$9.applyOrElse(Analyzer.scala:310)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:56)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:54)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:54)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:265)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:305)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:54)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:54)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:54)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:265)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:305)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:54)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:310)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:300)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:83)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:80)
at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
at scala.collection.immutable.List.foldLeft(List.scala:84)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:80)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:72)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:72)
at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:36)
at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:36)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
at org.apache.spark.sql.DataFrame.(DataFrame.scala:133)
at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:211)
at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:154)
at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:151)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:164)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.sql.catalyst.analysis.NoSuchTableException
at org.apache.spark.sql.hive.client.ClientInterface$$anonfun$getTable$1.apply(ClientInterface.scala:122)
at org.apache.spark.sql.hive.client.ClientInterface$$anonfun$getTable$1.apply(ClientInterface.scala:122)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.hive.client.ClientInterface$class.getTable(ClientInterface.scala:122)
at org.apache.spark.sql.hive.client.ClientWrapper.getTable(ClientWrapper.scala:60)
at org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:384)
at org.apache.spark.sql.hive.sparklinedata.SparklineMetastoreCatalog.org$apache$spark$sql$hive$sparklinedata$SparklineMetastoreCatalog$$super$lookupRelation(SparklineDataContext.scala:86)
at org.apache.spark.sql.hive.sparklinedata.SparklineMetastoreCatalog$$anonfun$lookupRelation$2.apply(SparklineDataContext.scala:86)
at org.apache.spark.sql.hive.sparklinedata.SparklineMetastoreCatalog$$anonfun$lookupRelation$2.apply(SparklineDataContext.scala:86)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.hive.sparklinedata.SparklineMetastoreCatalog.lookupRelation(SparklineDataContext.scala:86)
at org.apache.spark.sql.hive.sparklinedata.SparklineDataContext$$anon$1.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(SparklineDataContext.scala:74)
at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:161)
at org.apache.spark.sql.hive.sparklinedata.SparklineDataContext$$anon$1.lookupRelation(SparklineDataContext.scala:74)
at org.apache.spark.sql.SQLContext.table(SQLContext.scala:831)
at org.apache.spark.sql.SQLContext.table(SQLContext.scala:827)
at org.sparklinedata.druid.DefaultSource.createRelation(DefaultSource.scala:41)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:158)
at org.apache.spark.sql.hive.HiveMetastoreCatalog$$anon$1.load(HiveMetastoreCatalog.scala:180)
at org.apache.spark.sql.hive.HiveMetastoreCatalog$$anon$1.load(HiveMetastoreCatalog.scala:124)
at org.spark-project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
at org.spark-project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
at org.spark-project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
at org.spark-project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
at org.spark-project.guava.cache.LocalCache.get(LocalCache.java:4000)
at org.spark-project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
at org.spark-project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
at org.spark-project.guava.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4880)
... 79 more
16/06/05 08:41:26 ERROR SparkExecuteStatementOperation: Error running hive query:

handle spark datetime functions as grouping expressions

  1. Show by YEAR in Tableau:
    SELECT YEAR(CAST(CONCAT(TO_DATE(lineitem.l_shipdate),' 00:00:00') AS TIMESTAMP)) AS yr_l_shipdate_ok FROM ( select * from lineitembase ) lineitem JOIN ( select * from orders ) orders ON (lineitem.l_orderkey = orders.o_orderkey) JOIN ( select * from customer ) customer ON (orders.o_custkey = customer.c_custkey) JOIN ( select * from custnation ) custnation ON (customer.c_nationkey = custnation.cn_nationkey) JOIN ( select * from custregion ) custregion ON (custnation.cn_regionkey = custregion.cr_regionkey) GROUP BY YEAR(CAST(CONCAT(TO_DATE(lineitem.l_shipdate),' 00:00:00') AS TIMESTAMP))

  2. SELECT CAST(FROM_UNIXTIME(UNIX_TIMESTAMP(CAST(CONCAT(TO_DATE(lineitem.l_shipdate),' 00:00:00') AS TIMESTAMP)), 'yyyy-MM-01 00:00:00') AS TIMESTAMP) AS tmn_l_shipdate_ok FROM ( select * from lineitembase ) lineitem JOIN ( select * from orders ) orders ON (lineitem.l_orderkey = orders.o_orderkey) JOIN ( select * from customer ) customer ON (orders.o_custkey = customer.c_custkey) JOIN ( select * from custnation ) custnation ON (customer.c_nationkey = custnation.cn_nationkey) JOIN ( select * from custregion ) custregion ON (custnation.cn_regionkey = custregion.cr_regionkey) GROUP BY CAST(FROM_UNIXTIME(UNIX_TIMESTAMP(CAST(CONCAT(TO_DATE(lineitem.l_shipdate),' 00:00:00') AS TIMESTAMP)), 'yyyy-MM-01 00:00:00') AS TIMESTAMP)

  3. SELECT SUM(lineitem.l_extendedprice) AS sum_l_extendedprice_ok, CAST(CONCAT(TO_DATE(CAST(CONCAT(TO_DATE(lineitem.l_shipdate),' 00:00:00') AS TIMESTAMP)), ' 00:00:00') AS TIMESTAMP) AS tdy_l_shipdate_ok FROM ( select * from lineitembase ) lineitem JOIN ( select * from orders ) orders ON (lineitem.l_orderkey = orders.o_orderkey) JOIN ( select * from customer ) customer ON (orders.o_custkey = customer.c_custkey) JOIN ( select * from custnation ) custnation ON (customer.c_nationkey = custnation.cn_nationkey) JOIN ( select * from custregion ) custregion ON (custnation.cn_regionkey = custregion.cr_regionkey) GROUP BY CAST(CONCAT(TO_DATE(CAST(CONCAT(TO_DATE(lineitem.l_shipdate),' 00:00:00') AS TIMESTAMP)), ' 00:00:00') AS TIMESTAMP)

Generating Denormalized TPCH Dataset

This is the command I used, following https://github.com/SparklineData/spark-druid-olap/wiki/Generating-Denormalized-TPCH-Dataset:

spark yingyang$ bin/spark-submit --packages com.databricks:spark-csv_2.10:1.1.0,SparklineData:spark-datetime:0.0.2,SparklineData:spark-druid-olap:0.0.2 --class org.sparklinedata.tpch.TpchGenMain /Users/yingyang/Downloads/tpch-spark-druid-master/tpchData/target/scala-2.10/tpchdata_2.10-0.0.1.jar /Users/yingyang/Downloads/data_dbgen --scale 1

I got an error:
Ivy Default Cache set to: /Users/yingyang/.ivy2/cache
The jars for the packages stored in: /Users/yingyang/.ivy2/jars
:: loading settings :: url = jar:file:/Users/yingyang/Downloads/spark/lib/spark-assembly-1.6.2-hadoop2.6.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.databricks#spark-csv_2.10 added as a dependency
SparklineData#spark-datetime added as a dependency
SparklineData#spark-druid-olap added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
confs: [default]
found com.databricks#spark-csv_2.10;1.1.0 in list
found org.apache.commons#commons-csv;1.1 in list
found com.univocity#univocity-parsers;1.5.1 in list
found SparklineData#spark-datetime;0.0.2 in spark-packages
found com.github.nscala-time#nscala-time_2.10;1.6.0 in list
found joda-time#joda-time;2.5 in list
found org.joda#joda-convert;1.2 in list
found SparklineData#spark-druid-olap;0.0.2 in spark-packages
found org.apache.httpcomponents#httpclient;4.5 in central
found org.apache.httpcomponents#httpcore;4.4.1 in central
found commons-logging#commons-logging;1.2 in central
found commons-codec#commons-codec;1.9 in central
found org.json4s#json4s-ext_2.10;3.2.10 in central
found org.joda#joda-convert;1.6 in central
found com.github.scopt#scopt_2.10;3.3.0 in list
downloading http://dl.bintray.com/spark-packages/maven/SparklineData/spark-datetime/0.0.2/spark-datetime-0.0.2.jar ...
[SUCCESSFUL ] SparklineData#spark-datetime;0.0.2!spark-datetime.jar (426ms)
downloading http://dl.bintray.com/spark-packages/maven/SparklineData/spark-druid-olap/0.0.2/spark-druid-olap-0.0.2.jar ...
[SUCCESSFUL ] SparklineData#spark-druid-olap;0.0.2!spark-druid-olap.jar (501ms)
downloading https://repo1.maven.org/maven2/org/apache/httpcomponents/httpclient/4.5/httpclient-4.5.jar ...
[SUCCESSFUL ] org.apache.httpcomponents#httpclient;4.5!httpclient.jar (99ms)
downloading https://repo1.maven.org/maven2/org/json4s/json4s-ext_2.10/3.2.10/json4s-ext_2.10-3.2.10.jar ...
[SUCCESSFUL ] org.json4s#json4s-ext_2.10;3.2.10!json4s-ext_2.10.jar (19ms)
downloading https://repo1.maven.org/maven2/org/apache/httpcomponents/httpcore/4.4.1/httpcore-4.4.1.jar ...
[SUCCESSFUL ] org.apache.httpcomponents#httpcore;4.4.1!httpcore.jar (75ms)
downloading https://repo1.maven.org/maven2/commons-logging/commons-logging/1.2/commons-logging-1.2.jar ...
[SUCCESSFUL ] commons-logging#commons-logging;1.2!commons-logging.jar (18ms)
downloading https://repo1.maven.org/maven2/commons-codec/commons-codec/1.9/commons-codec-1.9.jar ...
[SUCCESSFUL ] commons-codec#commons-codec;1.9!commons-codec.jar (69ms)
downloading https://repo1.maven.org/maven2/org/joda/joda-convert/1.6/joda-convert-1.6.jar ...
[SUCCESSFUL ] org.joda#joda-convert;1.6!joda-convert.jar (21ms)
:: resolution report :: resolve 4244ms :: artifacts dl 1239ms
:: modules in use:
SparklineData#spark-datetime;0.0.2 from spark-packages in [default]
SparklineData#spark-druid-olap;0.0.2 from spark-packages in [default]
com.databricks#spark-csv_2.10;1.1.0 from list in [default]
com.github.nscala-time#nscala-time_2.10;1.6.0 from list in [default]
com.github.scopt#scopt_2.10;3.3.0 from list in [default]
com.univocity#univocity-parsers;1.5.1 from list in [default]
commons-codec#commons-codec;1.9 from central in [default]
commons-logging#commons-logging;1.2 from central in [default]
joda-time#joda-time;2.5 from list in [default]
org.apache.commons#commons-csv;1.1 from list in [default]
org.apache.httpcomponents#httpclient;4.5 from central in [default]
org.apache.httpcomponents#httpcore;4.4.1 from central in [default]
org.joda#joda-convert;1.6 from central in [default]
org.json4s#json4s-ext_2.10;3.2.10 from central in [default]
:: evicted modules:
org.joda#joda-convert;1.2 by [org.joda#joda-convert;1.6] in [default]
joda-time#joda-time;2.3 by [joda-time#joda-time;2.5] in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 17 | 8 | 8 | 2 || 14 | 8 |
---------------------------------------------------------------------

:: problems summary ::
:::: WARNINGS
module not found: com.github.SparklineData#spark-datetime;bf5693a575a1dea5b663e4e8b30a0ba94c21d62d

==== local-m2-cache: tried

  file:/Users/yingyang/.m2/repository/com/github/SparklineData/spark-datetime/bf5693a575a1dea5b663e4e8b30a0ba94c21d62d/spark-datetime-bf5693a575a1dea5b663e4e8b30a0ba94c21d62d.pom

  -- artifact com.github.SparklineData#spark-datetime;bf5693a575a1dea5b663e4e8b30a0ba94c21d62d!spark-datetime.jar:

  file:/Users/yingyang/.m2/repository/com/github/SparklineData/spark-datetime/bf5693a575a1dea5b663e4e8b30a0ba94c21d62d/spark-datetime-bf5693a575a1dea5b663e4e8b30a0ba94c21d62d.jar

==== local-ivy-cache: tried

  /Users/yingyang/.ivy2/local/com.github.SparklineData/spark-datetime/bf5693a575a1dea5b663e4e8b30a0ba94c21d62d/ivys/ivy.xml

==== central: tried

  https://repo1.maven.org/maven2/com/github/SparklineData/spark-datetime/bf5693a575a1dea5b663e4e8b30a0ba94c21d62d/spark-datetime-bf5693a575a1dea5b663e4e8b30a0ba94c21d62d.pom

  -- artifact com.github.SparklineData#spark-datetime;bf5693a575a1dea5b663e4e8b30a0ba94c21d62d!spark-datetime.jar:

  https://repo1.maven.org/maven2/com/github/SparklineData/spark-datetime/bf5693a575a1dea5b663e4e8b30a0ba94c21d62d/spark-datetime-bf5693a575a1dea5b663e4e8b30a0ba94c21d62d.jar

==== spark-packages: tried

  http://dl.bintray.com/spark-packages/maven/com/github/SparklineData/spark-datetime/bf5693a575a1dea5b663e4e8b30a0ba94c21d62d/spark-datetime-bf5693a575a1dea5b663e4e8b30a0ba94c21d62d.pom

  -- artifact com.github.SparklineData#spark-datetime;bf5693a575a1dea5b663e4e8b30a0ba94c21d62d!spark-datetime.jar:

  http://dl.bintray.com/spark-packages/maven/com/github/SparklineData/spark-datetime/bf5693a575a1dea5b663e4e8b30a0ba94c21d62d/spark-datetime-bf5693a575a1dea5b663e4e8b30a0ba94c21d62d.jar

    ::::::::::::::::::::::::::::::::::::::::::::::

    ::          UNRESOLVED DEPENDENCIES         ::

    ::::::::::::::::::::::::::::::::::::::::::::::

    :: com.github.SparklineData#spark-datetime;bf5693a575a1dea5b663e4e8b30a0ba94c21d62d: not found

    ::::::::::::::::::::::::::::::::::::::::::::::

:::: ERRORS
unknown resolver null

unknown resolver null

unknown resolver null

unknown resolver null

unknown resolver null

unknown resolver null

unknown resolver null

unknown resolver null

unknown resolver null

unknown resolver null

unknown resolver sbt-chain

unknown resolver null

unknown resolver sbt-chain

unknown resolver null

unknown resolver null

unknown resolver null

unknown resolver null

unknown resolver null

unknown resolver null

unknown resolver null

unknown resolver null

:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
Exception in thread "main" java.lang.RuntimeException: [unresolved dependency: com.github.SparklineData#spark-datetime;bf5693a575a1dea5b663e4e8b30a0ba94c21d62d: not found]
at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1068)
at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:287)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:154)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Setting up the dataset as part of the quick start on spark-druid throws a JSON parser error

I have implemented the steps listed in the quick start guide to test Spark SQL with Druid. While executing the command to set up the index over the raw data, I get a JSON object error. Could you please help identify what is causing the issue? I have imported Jackson to parse JSON in Spark. Below is the error message.

scala> sql("""
| CREATE TEMPORARY TABLE orderLineItemPartSupplier
| USING org.sparklinedata.druid
| OPTIONS (sourceDataframe "orderLineItemPartSupplierBase",
| timeDimensionColumn "l_shipdate",
| druidDatasource "tpch",
| druidHost "localhost",
| druidPort "8082",
| columnMapping '{ "l_quantity" : "sum_l_quantity", "ps_availqty" : "sum_ps_availqty", "cn_name" : "c_nation", "cr_name" : "c_region", "sn_name" : "s_nation", "sr_name" : "s_region" } ',
| functionalDependencies '[ {"col1" : "c_name", "col2" : "c_address", "type" : "1-1"}, {"col1" : "c_phone", "col2" : "c_address", "type" : "1-1"}, {"col1" : "c_name", "col2" : "c_mktsegment", "type" : "n-1"}, {"col1" : "c_name", "col2" : "c_comment", "type" : "1-1"}, {"col1" : "c_name", "col2" : "c_nation", "type" : "n-1"}, {"col1" : "c_nation", "col2" : "c_region", "type" : "n-1"} ] ',
| starSchema ' { "factTable" : "orderLineItemPartSupplier", "relations" : [] } ')
| """.stripMargin
| )
org.json4s.package$MappingException: Do not know how to convert JObject(List()) into class java.lang.String
at org.json4s.Extraction$.convert(Extraction.scala:559)
at org.json4s.Extraction$.extract(Extraction.scala:331)
at org.json4s.Extraction$.extract(Extraction.scala:42)
at org.json4s.ExtractableJsonAstNode.extract(ExtractableJsonAstNode.scala:21)
at org.sparklinedata.druid.client.DruidClient.timeBoundary(DruidClient.scala:122)
at org.sparklinedata.druid.client.DruidClient.metadata(DruidClient.scala:130)
at org.sparklinedata.druid.metadata.DruidRelationInfo$.apply(DruidRelationInfo.scala:62)
at org.sparklinedata.druid.DefaultSource.createRelation(DefaultSource.scala:89)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:125)
at org.apache.spark.sql.execution.datasources.CreateTempTableUsing.run(ddl.scala:93)
at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933)
at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933)
at org.apache.spark.sql.DataFrame.(DataFrame.scala:144)
at org.apache.spark.sql.DataFrame.(DataFrame.scala:129)
at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:725)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:30)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:47)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:49)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:51)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:53)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:55)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:57)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:59)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:61)
at $iwC$$iwC$$iwC$$iwC$$iwC.(:63)
at $iwC$$iwC$$iwC$$iwC.(:65)
at $iwC$$iwC$$iwC.(:67)
at $iwC$$iwC.(:69)
at $iwC.(:71)
at (:73)
at .(:77)
at .()
at .(:7)
at .()
at $print()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1340)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:875)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:875)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:875)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:875)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:875)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:875)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:875)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:875)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:875)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:875)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:875)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:875)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:674)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Supporting next_day function for druid push down

The following SQL needs to get pushed down to Druid; currently it does not.

SELECT v.campaign_name AS campaign_name,
sum(v.conversions) AS sum_conversions_ok,
sum(v.impressions) AS sum_impressions_ok,
cast(date_add(next_day(cast(v.date_string AS timestamp),'SU'),-7) AS timestamp) AS twk_date_string_ok
FROM (
SELECT *
FROM sparkline_viewability_2_4 ) v
WHERE (
v.advertiser_name = 'AMEX Personal Savings')
GROUP BY v.campaign_name,
cast(date_add(next_day(cast(v.date_string AS timestamp),'SU'),-7) AS timestamp)

Rewrite to Druid not happening when tables are cached in Spark

The query rewrite to Druid is not happening when the Spark tables are cached.

explain select
  c_mktsegment,
  sum(l_extendedprice) as price
from customer,
     orders,
     lineitem
where dateIsBefore(dateTime(`o_orderdate`), dateTime("1995-03-15"))
  and dateIsAfter(dateTime(`l_shipdate`), dateTime("1995-03-15"))
  and c_custkey = o_custkey
  and l_orderkey = o_orderkey
group by c_mktsegment

== Physical Plan ==
[screenshot of the physical plan attached to the original issue; not reproduced here]

handle spark datetime expressions in where clause on non-time columns

"""SELECT
|sn_name,
|sum(l_extendedprice) as revenue
|FROM
|customer,
|orders,
|lineitem,
|partsupp,
|supplier,
|suppnation,
|suppregion
|WHERE
|c_custkey = o_custkey
|AND l_orderkey = o_orderkey
|and l_suppkey = ps_suppkey
|and l_partkey = ps_partkey
|and ps_suppkey = s_suppkey
|AND s_nationkey = sn_nationkey
|AND sn_regionkey = sr_regionkey
|AND sr_name = 'ASIA'
|AND o_orderdate >= date '1994-01-01'
|AND o_orderdate < date '1994-01-01' + interval '1' year
|GROUP BY
|sn_name
|ORDER BY
|revenue desc""".stripMargin

Sparklinedata connector errors while using with Spark1.6.0

We are doing a POC with Sparkline Data, running queries against TPCH data. Using the Sparklinedata connector with Spark 1.6.0 causes the following error.

scala> sql("""
| CREATE TEMPORARY TABLE orderLineItemPartSupplier
| USING org.sparklinedata.druid
| OPTIONS (sourceDataframe "orderLineItemPartSupplierBase",
| timeDimensionColumn "l_shipdate",
| druidDatasource "tpch",
| druidHost "10.100.1.57",
| druidPort "8082",
| columnMapping '{ "l_quantity" : "sum_l_quantity", "ps_availqty" : "sum_ps_availqty", "cn_name" : "c_nation", "cr_name" : "c_region", "sn_name" : "s_nation", "sr_name" : "s_region" } ',
| functionalDependencies '[ {"col1" : "c_name", "col2" : "c_address", "type" : "1-1"}, {"col1" : "c_phone", "col2" : "c_address", "type" : "1-1"}, {"col1" : "c_name", "col2" : "c_mktsegment", "type" : "n-1"}, {"col1" : "c_name", "col2" : "c_comment", "type" : "1-1"}, {"col1" : "c_name", "col2" : "c_nation", "type" : "n-1"}, {"col1" : "c_nation", "col2" : "c_region", "type" : "n-1"} ] ',
| starSchema ' { "factTable" : "orderLineItemPartSupplier", "relations" : [] } ')
| """.stripMargin
| )
res3: org.apache.spark.sql.DataFrame = []

scala> sql("""
| select l_returnflag as r, l_linestatus as ls,
| count(*), sum(l_extendedprice) as s, max(ps_supplycost) as m, avg(ps_availqty) as a
| from orderLineItemPartSupplier
| group by l_returnflag, l_linestatus
| order by s, ls, r
| limit 3""".stripMargin
| ).show()
java.lang.NoClassDefFoundError: org/apache/spark/sql/catalyst/expressions/AggregateExpression
at org.apache.spark.sql.sources.druid.AggregateTransform$$anonfun$4$$anonfun$apply$1.applyOrElse(AggregateTransform.scala:87)
at org.apache.spark.sql.sources.druid.AggregateTransform$$anonfun$4$$anonfun$apply$1.applyOrElse(AggregateTransform.scala:87)
at scala.PartialFunction$Lifted.apply(PartialFunction.scala:218)
at scala.PartialFunction$Lifted.apply(PartialFunction.scala:214)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$collect$1.apply(TreeNode.scala:136)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$collect$1.apply(TreeNode.scala:136)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreach(TreeNode.scala:95)
at org.apache.spark.sql.catalyst.trees.TreeNode.collect(TreeNode.scala:136)
at org.apache.spark.sql.sources.druid.AggregateTransform$$anonfun$4.apply(AggregateTransform.scala:87)
at org.apache.spark.sql.sources.druid.AggregateTransform$$anonfun$4.apply(AggregateTransform.scala:87)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at org.apache.spark.sql.sources.druid.AggregateTransform$class.org$apache$spark$sql$sources$druid$AggregateTransform$$transformSingleGrouping(AggregateTransform.scala:87)
at org.apache.spark.sql.sources.druid.AggregateTransform$$anonfun$7$$anonfun$apply$10.apply(AggregateTransform.scala:202)
at org.apache.spark.sql.sources.druid.AggregateTransform$$anonfun$7$$anonfun$apply$10.apply(AggregateTransform.scala:201)
at scala.collection.GenTraversableViewLike$FlatMapped$$anonfun$foreach$3.apply(GenTraversableViewLike.scala:90)
at scala.collection.GenTraversableViewLike$FlatMapped$$anonfun$foreach$3.apply(GenTraversableViewLike.scala:89)
at scala.collection.GenTraversableViewLike$FlatMapped$$anonfun$foreach$3$$anonfun$apply$1.apply(GenTraversableViewLike.scala:91)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.GenTraversableViewLike$FlatMapped$$anonfun$foreach$3.apply(GenTraversableViewLike.scala:90)
at scala.collection.GenTraversableViewLike$FlatMapped$$anonfun$foreach$3.apply(GenTraversableViewLike.scala:89)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.SeqLike$$anon$2.foreach(SeqLike.scala:635)
at scala.collection.GenTraversableViewLike$FlatMapped$class.foreach(GenTraversableViewLike.scala:89)
at scala.collection.SeqViewLike$$anon$4.foreach(SeqViewLike.scala:79)
at scala.collection.GenTraversableViewLike$FlatMapped$class.foreach(GenTraversableViewLike.scala:89)
at scala.collection.SeqViewLike$$anon$4.foreach(SeqViewLike.scala:79)
at scala.collection.GenTraversableViewLike$FlatMapped$$anonfun$foreach$3.apply(GenTraversableViewLike.scala:90)
at scala.collection.GenTraversableViewLike$FlatMapped$$anonfun$foreach$3.apply(GenTraversableViewLike.scala:89)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.SeqLike$$anon$2.foreach(SeqLike.scala:635)
at scala.collection.GenTraversableViewLike$FlatMapped$class.foreach(GenTraversableViewLike.scala:89)
at scala.collection.SeqViewLike$$anon$4.foreach(SeqViewLike.scala:79)
at scala.collection.GenTraversableViewLike$Mapped$class.foreach(GenTraversableViewLike.scala:80)
at scala.collection.SeqViewLike$$anon$3.foreach(SeqViewLike.scala:78)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:176)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:45)
at scala.collection.TraversableLike$class.to(TraversableLike.scala:629)
at scala.collection.SeqViewLike$AbstractTransformed.to(SeqViewLike.scala:43)
at scala.collection.TraversableOnce$class.toList(TraversableOnce.scala:257)
at scala.collection.SeqViewLike$AbstractTransformed.toList(SeqViewLike.scala:43)
at org.apache.spark.sql.sources.druid.LimitTransfom$$anonfun$1.apply(DruidTransforms.scala:54)
at org.apache.spark.sql.sources.druid.LimitTransfom$$anonfun$1.apply(DruidTransforms.scala:40)
at org.apache.spark.sql.sources.druid.DruidPlanner$$anonfun$plan$1.apply(DruidPlanner.scala:41)
at org.apache.spark.sql.sources.druid.DruidPlanner$$anonfun$plan$1.apply(DruidPlanner.scala:41)
at scala.collection.GenTraversableViewLike$FlatMapped$$anonfun$foreach$3.apply(GenTraversableViewLike.scala:90)
at scala.collection.GenTraversableViewLike$FlatMapped$$anonfun$foreach$3.apply(GenTraversableViewLike.scala:89)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.SeqLike$$anon$2.foreach(SeqLike.scala:635)
at scala.collection.GenTraversableViewLike$FlatMapped$class.foreach(GenTraversableViewLike.scala:89)
at scala.collection.SeqViewLike$$anon$4.foreach(SeqViewLike.scala:79)
at scala.collection.GenTraversableViewLike$Mapped$class.foreach(GenTraversableViewLike.scala:80)
at scala.collection.SeqViewLike$$anon$3.foreach(SeqViewLike.scala:78)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:176)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:45)
at scala.collection.TraversableLike$class.to(TraversableLike.scala:629)
at scala.collection.SeqViewLike$AbstractTransformed.to(SeqViewLike.scala:43)
at scala.collection.TraversableOnce$class.toList(TraversableOnce.scala:257)
at scala.collection.SeqViewLike$AbstractTransformed.toList(SeqViewLike.scala:43)
at org.apache.spark.sql.sources.druid.LimitTransfom$$anonfun$1.apply(DruidTransforms.scala:61)
at org.apache.spark.sql.sources.druid.LimitTransfom$$anonfun$1.apply(DruidTransforms.scala:40)
at org.apache.spark.sql.sources.druid.DruidPlanner$$anonfun$plan$1.apply(DruidPlanner.scala:41)
at org.apache.spark.sql.sources.druid.DruidPlanner$$anonfun$plan$1.apply(DruidPlanner.scala:41)
at scala.collection.GenTraversableViewLike$FlatMapped$$anonfun$foreach$3.apply(GenTraversableViewLike.scala:90)
at scala.collection.GenTraversableViewLike$FlatMapped$$anonfun$foreach$3.apply(GenTraversableViewLike.scala:89)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.SeqLike$$anon$2.foreach(SeqLike.scala:635)
at scala.collection.GenTraversableViewLike$FlatMapped$class.foreach(GenTraversableViewLike.scala:89)
at scala.collection.SeqViewLike$$anon$4.foreach(SeqViewLike.scala:79)
at scala.collection.GenTraversableViewLike$FlatMapped$class.foreach(GenTraversableViewLike.scala:89)
at scala.collection.SeqViewLike$$anon$4.foreach(SeqViewLike.scala:79)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:176)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:45)
at scala.collection.TraversableLike$class.to(TraversableLike.scala:629)
at scala.collection.SeqViewLike$AbstractTransformed.to(SeqViewLike.scala:43)
at scala.collection.TraversableOnce$class.toList(TraversableOnce.scala:257)
at scala.collection.SeqViewLike$AbstractTransformed.toList(SeqViewLike.scala:43)
at org.apache.spark.sql.sources.druid.DruidStrategy.apply(DruidStrategy.scala:84)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:47)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:45)
at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:52)
at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:52)
at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2134)
at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1413)
at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1495)
at org.apache.spark.sql.DataFrame.showString(DataFrame.scala:171)
at org.apache.spark.sql.DataFrame.show(DataFrame.scala:394)
at org.apache.spark.sql.DataFrame.show(DataFrame.scala:355)
at org.apache.spark.sql.DataFrame.show(DataFrame.scala:363)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:40)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:45)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:47)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:49)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:51)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:53)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:55)
at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:57)
at $iwC$$iwC$$iwC$$iwC.<init>(<console>:59)
at $iwC$$iwC$$iwC.<init>(<console>:61)
at $iwC$$iwC.<init>(<console>:63)
at $iwC.<init>(<console>:65)
at <init>(<console>:67)
at .<init>(<console>:71)
at .<clinit>(<console>)
at .<init>(<console>:7)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:875)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:875)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:875)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:875)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:875)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:875)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:875)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.catalyst.expressions.AggregateExpression
at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 171 more

handle tableau pattern for quarter aggregation

Show by QUARTER in Tableau

SELECT CAST(CONCAT(YEAR(CAST(CONCAT(TO_DATE(lineitem.l_shipdate),' 00:00:00') AS TIMESTAMP)), (CASE WHEN MONTH(CAST(CONCAT(TO_DATE(lineitem.l_shipdate),' 00:00:00') AS TIMESTAMP))<4 THEN '-01' WHEN MONTH(CAST(CONCAT(TO_DATE(lineitem.l_shipdate),' 00:00:00') AS TIMESTAMP))<7 THEN '-04' WHEN MONTH(CAST(CONCAT(TO_DATE(lineitem.l_shipdate),' 00:00:00') AS TIMESTAMP))<10 THEN '-07' ELSE '-10' END), '-01 00:00:00') AS TIMESTAMP) AS tqr_l_shipdate_ok FROM ( select * from lineitembase ) lineitem JOIN ( select * from orders ) orders ON (lineitem.l_orderkey = orders.o_orderkey) JOIN ( select * from customer ) customer ON (orders.o_custkey = customer.c_custkey) JOIN ( select * from custnation ) custnation ON (customer.c_nationkey = custnation.cn_nationkey) JOIN ( select * from custregion ) custregion ON (custnation.cn_regionkey = custregion.cr_regionkey) GROUP BY CAST(CONCAT(YEAR(CAST(CONCAT(TO_DATE(lineitem.l_shipdate),' 00:00:00') AS TIMESTAMP)), (CASE WHEN MONTH(CAST(CONCAT(TO_DATE(lineitem.l_shipdate),' 00:00:00') AS TIMESTAMP))<4 THEN '-01' WHEN MONTH(CAST(CONCAT(TO_DATE(lineitem.l_shipdate),' 00:00:00') AS TIMESTAMP))<7 THEN '-04' WHEN MONTH(CAST(CONCAT(TO_DATE(lineitem.l_shipdate),' 00:00:00') AS TIMESTAMP))<10 THEN '-07' ELSE '-10' END), '-01 00:00:00') AS TIMESTAMP)
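For reference, the quarter-start timestamp that this Tableau-generated CASE expression computes can be written directly; a minimal Scala sketch of the semantics (illustration only, not project code):

import java.time.LocalDate

// Quarter start that the CASE WHEN MONTH(...) < 4/7/10 ... pattern above computes.
def quarterStart(d: LocalDate): LocalDate = {
  val firstMonthOfQuarter = ((d.getMonthValue - 1) / 3) * 3 + 1
  LocalDate.of(d.getYear, firstMonthOfQuarter, 1)
}

// quarterStart(LocalDate.parse("1995-08-17")) == LocalDate.parse("1995-07-01")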

/ by zero when running a query on the sample retail dataset

I got an error when I run "select count(*) from sp_demo_retail;" in beeline.

The error message is:

Error: java.lang.ArithmeticException: / by zero (state=,code=0)
java.sql.SQLException: java.lang.ArithmeticException: / by zero
at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:296)
at org.apache.hive.beeline.Commands.execute(Commands.java:848)
at org.apache.hive.beeline.Commands.sql(Commands.java:713)
at org.apache.hive.beeline.BeeLine.dispatch(BeeLine.java:973)
at org.apache.hive.beeline.BeeLine.execute(BeeLine.java:813)
at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:771)
at org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:484)
at org.apache.hive.beeline.BeeLine.main(BeeLine.java:467)

Here is my ddl:

CREATE TABLE sp_demo_retail_base (
invoiceno string
,stockcode string
,description string
, quantity bigint
, invoicedate string
, unitprice double
, customerid string
, country string
, count int
)
USING com.databricks.spark.csv
OPTIONS (path "/opt/retails.csv",
header "false", delimiter ",")

CREATE TABLE sp_demo_retail
USING org.sparklinedata.druid
OPTIONS (
sourceDataframe "sp_demo_retail_base",
timeDimensionColumn "invoicedate",
druidDatasource "retail",
druidHost "10.25.2.91",
zkQualifyDiscoveryNames "false",
queryHistoricalServers "true",
numSegmentsPerHistoricalQuery "1",
columnMapping '{ } ',
functionalDependencies '[] ',
starSchema ' { "factTable" : "sp_demo_retail_base", "relations" : [] } ')

0: jdbc:hive2://localhost:10000/> explain select * from sp_demo_retail limit 10;
Getting log thread is interrupted, since query is done!
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
| plan |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
| == Physical Plan == |
| Limit 10 |
| +- ConvertToSafe |
| +- Project [invoiceno#9,stockcode#10,description#11,quantity#12L,invoicedate#13,unitprice#14,customerid#15,country#16,count#17] |
| +- Scan DruidRelationInfo(fullName = DruidRelationName(sp_demo_retail_base,10.25.2.91,retail), sourceDFName = sp_demo_retail_base, |
| timeDimensionCol = invoicedate, |
| options = DruidRelationOptions(1000000,100000,true,true,true,30000,true,/druid,true,false,1,None))[invoiceno#9,stockcode#10,description#11,quantity#12L,invoicedate#13,unitprice#14,customerid#15,country#16,count#17] |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
7 rows selected (0.144 seconds)
0: jdbc:hive2://localhost:10000/> explain select count(1) from sp_demo_retail;
Getting log thread is interrupted, since query is done!
+-------------------------------------------+--+
| plan |
+-------------------------------------------+--+
| == Physical Plan == |
| java.lang.ArithmeticException: / by zero |
+-------------------------------------------+--+
2 rows selected (0.069 seconds)
0: jdbc:hive2://localhost:10000/>

Support GBy queries with no Aggregates

The following query errors out. Query as seen on the Spark UI monitoring page (port 4040):
SELECT customer.c_mktsegment AS c_mktsegment FROM ( select * from lineitemindexed ) lineitem JOIN ( select * from orders ) orders ON (lineitem.l_orderkey = orders.o_orderkey) JOIN ( select * from customer ) customer ON (orders.o_custkey = customer.c_custkey) JOIN ( select * from custnation ) custnation ON (customer.c_nationkey = custnation.cn_nationkey) JOIN ( select * from custregion ) custregion ON (custnation.cn_regionkey = custregion.cr_regionkey) GROUP BY customer.c_mktsegment

Job aborted due to stage failure: Task 0 in stage 18.0 failed 1 times, most recent failure: Lost task 0.0 in stage 18.0 (TID 1061, localhost): UnknownReason
Driver stacktrace:
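A GROUP BY with no aggregates is just a distinct over the grouping keys, so in principle it can still be answered from the index; a minimal Scala sketch of the equivalence (illustration only, not project code):

// GROUP BY c_mktsegment with no aggregates returns one row per distinct key,
// i.e. the same result as SELECT DISTINCT c_mktsegment.
val segments    = Seq("BUILDING", "MACHINERY", "BUILDING", "AUTOMOBILE")
val viaGroupBy  = segments.groupBy(identity).keys.toSeq.sorted
val viaDistinct = segments.distinct.sorted
assert(viaGroupBy == viaDistinct)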

why do we need sourceDataframe?

hi:
As I understand it, Druid has already indexed the data, so why do we still need a sourceDataframe like orderLineItemPartSupplierBase that provides a data source path?
Creating a table schema that maps to the Druid internal schema should be enough. Could somebody explain this?

Additional aggregation cols produce exception

select o_orderstatus as x, cast(o_orderdate as date) as y, count(*) as z
from orderLineItemPartSupplierBase
where o_orderdate <= '1993-06-30'
group by o_orderstatus, cast(o_orderdate as date)
order by x, y, z

Detailed logging

Sparkline should write to a log showing the SQL and the related Druid queries, and also record whether a query used the index or not.
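Until such logging exists, one stop-gap is to raise the log level for the planner packages from the driver; a minimal sketch using the log4j 1.x API that Spark 1.x ships with. The package names are an assumption taken from the class names in the stack traces above, not a documented logging contract:

import org.apache.log4j.{Level, Logger}

// Package names below are assumptions based on the stack traces in this issue list.
Logger.getLogger("org.sparklinedata.druid").setLevel(Level.DEBUG)
Logger.getLogger("org.apache.spark.sql.sources.druid").setLevel(Level.DEBUG)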

Filter on decimal data type causes java.util.NoSuchElementException: None.get

Filtering on a decimal data type causes a java.util.NoSuchElementException: None.get error.

For example:
select name, type, count(1) from table where decimal = 10 group by name, type; -> java.util.NoSuchElementException: None.get error message, but
select name, type, count(1) from table where decimal in (10) group by name, type; -> OK

This is true for the following operators: =, >=, <=, >, <

With the Tungsten engine, all of the queries run without error.

Spark version: 1.6.1

Using longsum vs count for queries.

select date_int, count(*) from MyTable group by date_int order by date_int produces the following JSON
The aggregation type should be longSum (over the "count" metric) to get the exact number of raw rows in the table.

......
"aggregations" : [ {
"jsonClass" : "FunctionAggregationSpec",
"type" : "count",
"name" : "alias-1",
"fieldName" : "count"
} ],
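The distinction matters because of Druid roll-up: each Druid row can stand for several raw rows, and the ingested "count" metric records how many. A small Scala illustration (not project code):

// Each element models the "count" metric of one rolled-up Druid row.
val rolledUpCounts   = Seq(3L, 5L, 2L)
val countAggregator  = rolledUpCounts.size // Druid "count": counts Druid rows -> 3
val longSumOverCount = rolledUpCounts.sum  // "longSum" on "count": raw rows   -> 10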

Is there an option available to use the Druid index for any kind of query (including plain select * queries)?

I have created a Spark underlying table and a Druid datasource table. I am planning not to store the raw data for the underlying table, and I am looking for an option to use the Druid index data for all kinds of queries. Please let me know if there is such an option to fetch data from Druid for all queries.

Currently a 'select * from dds_table' query fetches 0 results (since no raw data is stored), while
'select os, count(*) from dds_table group by os' fetches the actual result (from Druid).

Please suggest.

Query on the Last segment of a datasource is not returning the expected results

: jdbc:hive2://spl08.dev.dw.sc.gwallet.com:1> explain select count(*) from moat_daily where cast(ts_local as timestamp) >= cast ("2016-08-19 00:00:00Z" as timestamp);
+-----------------------------------------------------------------------------------------------------------------------+--+
| plan |
+-----------------------------------------------------------------------------------------------------------------------+--+
| == Physical Plan == |
| Project [alias-3#740L AS _c0#739L] |
| +- SortBasedAggregate(key=[], functions=[(sum(alias-3#740L),mode=Complete,isDistinct=false)], output=[alias-3#740L]) |
| +- ConvertToSafe |
| +- TungstenExchange SinglePartition, None |
| +- ConvertToUnsafe |
| +- Scan DruidQuery(19982889): { |
| "q" : { |
| "jsonClass" : "TimeSeriesQuerySpec", |
| "queryType" : "timeseries", |
| "dataSource" : "moat_daily", |
| "intervals" : [ "2016-08-19T00:00:00.000Z/2016-08-19T00:00:01.000Z" ], |
| "granularity" : "all", |
| "aggregations" : [ { |
| "jsonClass" : "FunctionAggregationSpec", |
| "type" : "longSum", |
| "name" : "alias-3", |
| "fieldName" : "count" |
| } ] |
| }, |
| "useSmile" : true, |
| "queryHistoricalServer" : true, |
| "numSegmentsPerQuery" : 2, |
| "intervalSplits" : [ { |
| "start" : 1471564800000, |
| "end" : 1471564801000 |
| } ], |
| "outputAttrSpec" : [ { |
| "exprId" : { |
| "id" : 740, |
| "jvmId" : { } |
| }, |
| "name" : "alias-3", |
| "dataType" : { }, |
| "tf" : "toLong" |
| } ] |
| }[alias-3#740L] |
+-----------------------------------------------------------------------------------------------------------------------+--+
37 rows selected (0.049 seconds)

The interval end does not include the last segment (it should be 2016-08-20T00:00:00.000Z)
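A minimal sketch of the expected versus produced scan interval, using joda-time; variable names are illustrative, not project code:

import org.joda.time.{DateTime, DateTimeZone, Interval}

// The filter only bounds the start of the scan; the end should come from the
// datasource's index interval, not from start + 1 second.
val filterStart = new DateTime("2016-08-19T00:00:00.000Z", DateTimeZone.UTC)
val indexEnd    = new DateTime("2016-08-20T00:00:00.000Z", DateTimeZone.UTC)
val produced    = new Interval(filterStart, filterStart.plusSeconds(1)) // what the plan shows
val expected    = new Interval(filterStart, indexEnd)                   // covers the last segment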

Using spark-druid-olap in an external project

I couldn't find documentation on how to use this custom DataSource implementation and optimizations in an external project.

(a) Is the jar file published to an external repo such that it can be referred to via the build.sbt file in my project?

OR

(b) Do I need to clone the repo and manually build the jar myself?

Thanks,
Jithin
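For option (b), the usual sbt route is to build the jar locally and treat it as an unmanaged dependency; a hedged build.sbt sketch (the Scala/Spark versions below match the 1.6.x-era issues in this list and are assumptions, as is the exact jar name sbt produces):

// build.sbt sketch for option (b): clone https://github.com/hbutani/spark-druid-olap,
// run `sbt package`, and drop the produced jar into this project's lib/ directory --
// sbt picks up jars in lib/ automatically as unmanaged dependencies.
scalaVersion := "2.10.6"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.6.1" % "provided"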

handle tableau pattern for spark datetime filters on non-time columns

----Date filters----
SELECT SUM(lineitem.l_extendedprice) AS sum_l_extendedprice_ok, CAST(CONCAT(TO_DATE(CAST(CONCAT(TO_DATE(lineitem.l_shipdate),' 00:00:00') AS TIMESTAMP)), ' 00:00:00') AS TIMESTAMP) AS tdy_l_shipdate_ok FROM ( select * from lineitembase ) lineitem JOIN ( select * from orders ) orders ON (lineitem.l_orderkey = orders.o_orderkey) JOIN ( select * from customer ) customer ON (orders.o_custkey = customer.c_custkey) JOIN ( select * from custnation ) custnation ON (customer.c_nationkey = custnation.cn_nationkey) JOIN ( select * from custregion ) custregion ON (custnation.cn_regionkey = custregion.cr_regionkey) WHERE ((CAST(CONCAT(TO_DATE(orders.o_orderdate),' 00:00:00') AS TIMESTAMP) >= CAST('1993-05-19 00:00:00' AS TIMESTAMP)) AND (CAST(CONCAT(TO_DATE(orders.o_orderdate),' 00:00:00') AS TIMESTAMP) <= CAST('1998-08-02 00:00:00' AS TIMESTAMP))) GROUP BY CAST(CONCAT(TO_DATE(CAST(CONCAT(TO_DATE(lineitem.l_shipdate),' 00:00:00') AS TIMESTAMP)), ' 00:00:00') AS TIMESTAMP)
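The CAST(CONCAT(TO_DATE(col), ' 00:00:00') AS TIMESTAMP) pattern is just day-level truncation, so the two predicates above amount to a closed date range on o_orderdate; a minimal Scala sketch of the semantics (illustration only, not project code):

import java.time.{LocalDate, LocalDateTime}

// CAST(CONCAT(TO_DATE(col), ' 00:00:00') AS TIMESTAMP) == start of the day of col
def dayStart(d: LocalDate): LocalDateTime = d.atStartOfDay

// The WHERE clause above is equivalent to this range test on o_orderdate:
def inRange(orderDate: LocalDate): Boolean =
  !dayStart(orderDate).isBefore(LocalDateTime.parse("1993-05-19T00:00:00")) &&
    !dayStart(orderDate).isAfter(LocalDateTime.parse("1998-08-02T00:00:00"))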

Support regex functions like RLIKE, as in the query below.

SELECT viewability_5.campaign_name AS campaign_name, viewability_5.country AS country, viewability_5.creative_size AS creative_size FROM viewability2.viewability_5 viewability_5 WHERE ((viewability_5.advertiser_name = 'XXX') AND (CAST(viewability_5.date_string AS TIMESTAMP) >= CAST('2016-03-01 16:00:00' AS TIMESTAMP)) AND (CAST(viewability_5.date_string AS TIMESTAMP) <= CAST('2016-03-21 23:59:59' AS TIMESTAMP)) AND LOWER(viewability_5.line_name) RLIKE CONCAT('.', 'YYY', '.')) GROUP BY viewability_5.campaign_name, viewability_5.country, viewability_5.creative_size
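RLIKE is an unanchored Java-regex match, which maps naturally onto Druid's regex dimension filter; a minimal Scala sketch of the SQL semantics (illustration only, not project code):

// value RLIKE pattern <=> the Java regex `pattern` matches somewhere inside `value`
val rlike: (String, String) => Boolean =
  (value, pattern) => pattern.r.findFirstIn(value).isDefined

// e.g. a predicate like LOWER(line_name) RLIKE CONCAT('.', 'yyy', '.'):
rlike("my yyy line".toLowerCase, "." + "yyy" + ".")   // true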

Allow mapping a metric column to different metrics in Druid

For example, suppose the raw data captures wind_speed and the index is at hourly grain. At the hourly level the index captures min, max, sum, and count. The translation from SQL must use these metrics appropriately.

Druid's metrics are:
{
  "type" : "doubleMax",
  "name" : "?",
  "fieldName" : "wind_speed"
},
{
  "type" : "doubleMin",
  "name" : "?",
  "fieldName" : "wind_speed"
},
{
  "type" : "doubleSum",
  "name" : "?",
  "fieldName" : "wind_speed"
},
{
  "type" : "longSum",
  "name" : "?",
  "fieldName" : "wind_speed"
}
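A minimal sketch of the kind of mapping the SQL-to-Druid translation would need; the metric names are hypothetical, since the names in the issue are left as "?":

// Route a SQL aggregate on the logical column wind_speed to the hourly metric
// that pre-aggregates it (metric names are hypothetical).
val metricFor: PartialFunction[String, (String, String)] = {
  case "max"   => ("doubleMax", "wind_speed_max")
  case "min"   => ("doubleMin", "wind_speed_min")
  case "sum"   => ("doubleSum", "wind_speed_sum")
  case "count" => ("longSum",   "wind_speed_count")
}
// avg(wind_speed) has no direct metric; it must be rewritten as
// sum(wind_speed_sum) / sum(wind_speed_count).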

Avoid Druid Broker Bottleneck

Eliminate the broker as a bottleneck in cases where a large amount of data needs to be pulled out of Druid for subsequent processing in Spark. One possible solution is to talk directly to the Historical nodes (see the sketch after the example below).

For example:
SELECT c_name,
bal,
sales_prospects_amount
FROM (SELECT c_name,
Sum(c_acctbal) bal
FROM orderlineitempartsupplier
GROUP BY c_name
HAVING Sum(c_acctbal) > 1000)r1
JOIN (SELECT c_name AS cname,
Sum(sales_prospects_amount) AS sales_prospects_amount
FROM sales_leads
GROUP BY c_name) r2
ON r1.c_name = r2.cname
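A conceptual Scala sketch of the broker-bypass idea referenced above: fan the Druid query out over the Historical servers and combine the partial results in Spark, rather than funneling everything through the single Broker. queryHistorical and the host/segment assignments are stand-ins, not project code:

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Stand-in for a per-segment query against a Historical server.
def queryHistorical(host: String, segment: String): Seq[(String, Double)] = Seq.empty

// One task per (Historical, segment) assignment; results are unioned afterwards.
val assignments = Seq(("hist-1", "seg-a"), ("hist-1", "seg-b"), ("hist-2", "seg-c"))
val partials    = assignments.map { case (h, s) => Future(queryHistorical(h, s)) }
val combined    = Await.result(Future.sequence(partials), 5.minutes).flatten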

handle generic grouping expressions on a single dimension

SELECT avg(lineitem.l_extendedprice) AS avg_l_extendedprice_ok,
customers.c_mktsegment AS c_mktsegment,
custnation.cn_name AS cn_name,
custregion.cr_name AS cr_name,
(((year(cast(lineitem.l_commitdate AS timestamp)) * 10000) + (month(cast(lineitem.l_commitdate AS timestamp)) * 100)) + day(cast(lineitem.l_commitdate AS timestamp))) AS md_l_commitdate_ok,
cast((month(cast(lineitem.l_commitdate AS timestamp)) - 1) / 3 + 1 AS BIGINT) AS qr_l_commitdate_ok
FROM (
SELECT *
FROM lineitemindexed ) lineitem
JOIN
(
SELECT *
FROM orders ) orders
ON (
lineitem.l_orderkey = orders.o_orderkey)
JOIN
(
SELECT *
FROM customer ) customers
ON (
orders.o_custkey = customers.c_custkey)
JOIN
(
SELECT *
FROM custnation ) custnation
ON (
customers.c_nationkey = custnation.cn_nationkey)
JOIN
(
SELECT *
FROM custregion ) custregion
ON (
custnation.cn_regionkey = custregion.cr_regionkey)
GROUP BY customers.c_mktsegment,
custnation.cn_name,
custregion.cr_name,
(((year(cast(lineitem.l_commitdate AS timestamp)) * 10000) + (month(cast(lineitem.l_commitdate AS timestamp)) * 100)) + day(cast(lineitem.l_commitdate AS timestamp))),
cast((month(cast(lineitem.l_commitdate AS timestamp)) - 1) / 3 + 1 AS BIGINT)
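Both grouping expressions are functions of the single dimension l_commitdate; a minimal Scala sketch of what they compute (illustration only, not project code):

import java.time.LocalDate

// (year*10000 + month*100 + day): a yyyymmdd key for the commit date.
def mdKey(d: LocalDate): Int = d.getYear * 10000 + d.getMonthValue * 100 + d.getDayOfMonth

// (month - 1) / 3 + 1: the quarter of the year (the SQL casts it to BIGINT).
def quarter(d: LocalDate): Int = (d.getMonthValue - 1) / 3 + 1

// mdKey(LocalDate.parse("1995-08-17")) == 19950817; quarter of that date == 3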
