
rumbledb / rumble Goto Github PK


⛈️ RumbleDB 1.21.0 "Hawthorn blossom" 🌳 for Apache Spark | Run queries on your large-scale, messy JSON-like data (JSON, text, CSV, Parquet, ROOT, AVRO, SVM...) | No install required (just a jar to download) | Declarative Machine Learning and more

Home Page: http://rumbledb.org/

License: Other

Java 83.52% ANTLR 1.53% JSONiq 6.82% HTML 0.11% Jupyter Notebook 3.58% jq 4.44%
avro azure csv data-science dataframes hdfs json jsoniq machine-learning nested parquet query query-engine s3 scale schemaless spark svm text yaml

rumble's People

Contributors

andrearinaldi1, canberker, codingkaiser, darioackermann, daviddao, dependabot[bot], ghislainfourny, ingomueller-net, ioanas96, lulunac27a, mstevan, pierremotard, thadguidry, wscsprint3r


rumble's Issues

ANTLR runtime version collision

Spark 2.1.1 uses ANTLR 4.5.3, which leads to an error if version 4.6 is used to generate our code. We need either to ask users to use the same version as Spark, or (better) to find a way to prioritize our version on the classpath.

count(json-file("...")) returns wrong result

In my case, it always returns 100 for a file that has at least millions of records. I suspect that 100 is the number of lines displayed when the result is larger; this "display filter" seems to be applied before the count.

Bubble-up mechanism for exceptions

In the future (beta, RC, release), the clean approach will be to catch exceptions in the closure and, if there is an exception or error, have the Spark task return a special internal sequence of items or tuple stream that encapsulates the exception or error. The caller side can then test for that special value and, if it is an exception, "unwrap" it and print it nicely, as if the query had been executed locally (which is how the engine should ideally feel to the user).

The detection of these special values can be done lazily, meaning that sequences need not be consumed eagerly. As soon as, upon materializing a sequence of items, you notice that there is an encapsulated exception at any depth and stage, you can bubble it up all the way, even across multiple levels of Spark jobs, to the main part of the program (via this "special item" mechanism), where the exception is finally printed.

Of course, fatal errors will still be fatal errors. But everything non-fatal we should be able to catch, encapsulate and print nicely.
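The mechanism above can be sketched in a few lines of Python (all names here are hypothetical, not RumbleDB code): tasks catch non-fatal errors and return a sentinel item, and the consumer detects sentinels lazily while materializing and re-raises at the top.

```python
class EncapsulatedError:
    """Special item carrying an exception raised inside a task."""
    def __init__(self, exc):
        self.exc = exc

def run_task(f, item):
    """Run one task; encapsulate non-fatal errors instead of crashing."""
    try:
        return f(item)
    except Exception as e:
        return EncapsulatedError(e)

def materialize(items):
    """Lazily yield items, bubbling up the first encapsulated error."""
    for item in items:
        if isinstance(item, EncapsulatedError):
            raise item.exc  # "unwrap" and surface as if executed locally
        yield item

results = (run_task(lambda x: 10 // x, v) for v in [5, 2, 0, 1])
try:
    print(list(materialize(results)))
except ZeroDivisionError:
    print("error bubbled up to the caller")
```

Because `materialize` is a generator, the error surfaces only when the failing item is actually reached, which matches the lazy detection described above.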

Dynamic object construction issue

This query

{|
   for $i in 1 to 10
   return { concat("Square of ", $i) : $i * $i }
 |}

raises the error

[ERROR] sparksoniq.spark.iterator.flowr.FlworExpressionSparkRuntimeIterator cannot be cast to sparksoniq.jsoniq.runtime.iterator.primary.ObjectConstructorRuntimeIterator

Ctrl+D causes null pointer exception

If I press Ctrl+D in the Sparksoniq shell, I get the following error:

Exception in thread "main" java.lang.NullPointerException
        at sparksoniq.io.shell.JiqsJLineShell.handleException(JiqsJLineShell.java:122)
        at sparksoniq.io.shell.JiqsJLineShell.launch(JiqsJLineShell.java:77)
        at sparksoniq.ShellStart.main(ShellStart.java:65)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:782)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

I would expect the shell to quit without error.

Regex functions

Implement:

matches#2, matches#3, replace#3, replace#4, tokenize#3

Top-level arithmetic expression returns result of first FLWOR expression

The following query (run against the language game dataset) produces the wrong result:

(for $o in json-file("wasb:///sample.json")
where $o.choices[[1]] = $o.target
return $o) div count(json-file("wasb:///sample.json"))

I am not sure what it should do (I forgot to call count on the result of the first FLWOR expression), but I don't think it should do what it does, namely return only the result of the first FLWOR expression.

div should return a decimal

4 div 3

should return a decimal (1.333333...), not an integer. "div" is the only operation that does not return a value with the same type as one of its operands; "idiv", on the other hand, does return an integer.
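For comparison, Python's decimal module illustrates the intended distinction between decimal division (what "div" should do) and truncating integer division (what "idiv" does); this is an illustration only, not JSONiq semantics.

```python
from decimal import Decimal, getcontext

getcontext().prec = 10                  # 10 significant digits

div_result = Decimal(4) / Decimal(3)    # decimal division, like "div" should be
idiv_result = 4 // 3                    # truncating division, like "idiv"

print(div_result)   # 1.333333333
print(idiv_result)  # 1
```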

String functions

Functions left to implement:

codepoints-to-string#1, string-to-codepoints#1, compare#2, compare#3, codepoint-equal#2, string-length#0, string-length#1, normalize-space#0, normalize-space#1, normalize-unicode#1, normalize-unicode#2, upper-case#1, lower-case#1, translate#3, contains#2, substring-before#2, substring-before#3, substring-after#2, substring-after#3

Later (needs collations):
ends-with#3
contains#3
starts-with#3

Support non-Spark-enabled FLWOR expressions

Allow for "smaller" FLWOR expressions executed locally.

Now that we have refactored the iterators, we need to:

  • Convert the ReturnClauseSparkIterator to a HybridIterator and add the local iteration functions (reading through the local iteration API of the child clause iterator).

  • Add support for local iteration (also reading via the local iteration API of the child clause iterator) to all other clauses (suggested order: first let, then for, then where, then group by, then order by). For them, we may or may not use a hybrid iterator model (mostly for code consistency), but this doesn't change anything, because a potential materialization will always happen at the return clause and never in the middle of a FLWOR expression.

  • For a let clause, isRDD() returns always false if it is the first clause, otherwise it forwards isRDD() from its child iterator. A let iterator always materializes its expression's result sequence (another issue exists to tackle further push-downs).

  • For a for clause, isRDD() returns true if the expression's iterator's isRDD() returns true OR if the child iterator exists and its isRDD() call returns true.

  • For all other clauses (including the return item iterator), isRDD() forwards the result of the child's isRDD() call.
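The isRDD() forwarding rules in the bullets above can be sketched as follows (hypothetical Python classes, not the actual iterator hierarchy):

```python
class Clause:
    """Default clause: forward the child's isRDD() answer (where, group by, ...)."""
    def __init__(self, child=None):
        self.child = child
    def is_rdd(self):
        return self.child is not None and self.child.is_rdd()

class LetClause(Clause):
    def is_rdd(self):
        # a first let clause is always local; otherwise forward the child
        return False if self.child is None else self.child.is_rdd()

class ForClause(Clause):
    def __init__(self, expr_is_rdd, child=None):
        super().__init__(child)
        self.expr_is_rdd = expr_is_rdd
    def is_rdd(self):
        # RDD if the binding expression is an RDD OR the child already is
        return self.expr_is_rdd or (self.child is not None and self.child.is_rdd())

# return <- let <- for $x in json-file(...)
pipeline = Clause(child=LetClause(child=ForClause(expr_is_rdd=True)))
print(pipeline.is_rdd())  # True
```

A single RDD-backed for clause thus makes the whole downstream pipeline RDD-based, while a purely local pipeline stays local.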

Sequence value functions

Implement:
distinct-values#1, distinct-values#2, index-of#2, index-of#3, deep-equal#2, deep-equal#3

Multiplicative Operation - Cosmetic Improvement

The results below show a noticeable number of trailing zeroes in the decimal part of multiplicative operations, due to the use of BigDecimal in these operations:

1e0 * 2.0 -> 2.00
1.0000 * 2e0 -> 2.00000
1.0 * 2e4 -> 20000.00

4 div 2 -> 2
4 div 2.0 -> 2.0000000000
4 div 2e0 -> 2.0
4.0 div 2e0 -> 2.0000000000
4.0 div 2 -> 2.0000000000
4e0 div 2 -> 2.0
4e0 div 2.0 -> 2.0000000000
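Python's Decimal type exhibits the same trailing-zero behavior and suggests a possible cosmetic fix: normalize before printing. This is only a sketch; normalize() can switch to exponent notation (e.g. 20000.00 becomes 2E+4), so a real fix would need extra formatting.

```python
from decimal import Decimal

raw = Decimal("1.0000") * Decimal("2")  # scales add up: 2.0000
cleaned = raw.normalize()               # trailing zeros stripped: 2

print(raw)      # 2.0000
print(cleaned)  # 2
```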

Dynamic key bug

This query

{ concat("Integer ", 2) : 2 * 2 }

returns

[ERROR] Error [err: XPDY0130]LINE:1:COLUMN:0:Object must have string keys!

Error returned instead of returning empty sequence

The following query (where array access [[0]] is mistakenly written as array constructor [0]) causes a job abortion:

jiqs$ for $o in json-file("wasb:///sample.json")
>>> where $o.choices[0] eq $o.target
>>> return $o
>>>
>>>
[ERROR] Job aborted due to stage failure: Task 0 in stage 25.0 failed 4 times, most recent failure: Lost task 0.3 in stage 25.0 (TID 29, wn0-testsp.mjhvh3c1nveurdkkutux1zeupg.cx.internal.cloudapp.net, executor 2): sparksoniq.exceptions.IteratorFlowException: Error [err: XPDY0130]LINE:2:COLUMN:6:Invalid next() call;

I guess a syntax error (or at least some more high-level error) would be more appropriate.

Issue with nested Spark actions

let $p := count(
for $i in json-file("hdfs:///data/small.json")
where $i.guess=$i.choices[[1]] and $i.guess=$i.target
return $i)
let $t := count(
for $j in json-file("hdfs:///data/small.json")
where $j.guess=$j.choices[[1]]
return $j)
return {"res": ($p+0.0) div $t}

Job aborts because of stage failure.

max() does not work on strings, head() function todo

When I run the following query (against the language game dataset), Sparksoniq enters an invalid state:

jiqs$ for $o in json-file("wasb:///sample.json")
>>> where $o.date eq max(for $o in json-file("wasb:///sample.json") return $o.date)
>>> return $o

After that, all queries fail with an error similar to the following one:

[ERROR] Job aborted due to stage failure: Task 0 in stage 57.0 failed 4 times, most recent failure: Lost task 0.3 in stage 57.0 (TID 72, wn0-testsp.mjhvh3c1nveurdkkutux1zeupg.cx.internal.cloudapp.net, executor 3): java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_70_piece0 of broadcast_70

Note that the max expression cannot be evaluated. This is signaled when it is run at top level:

jiqs$ max(for $o in json-file("wasb:///sample.json") return $o.date)
>>>
>>>
[ERROR] Error [err: XPTY0004]LINE:1:COLUMN:0:Max expression has non numeric args 2013-08-19

Stage failure

This query (from the JSONiq tutorial) fails. It is worth investigating, but may be complex to solve (it may come down to the nested for, which we can discuss).

let $stores :=
[
{ "store number" : 1, "state" : "MA" },
{ "store number" : 2, "state" : "MA" },
{ "store number" : 3, "state" : "CA" },
{ "store number" : 4, "state" : "CA" }
]
let $sales := [
{ "product" : "broiler", "store number" : 1, "quantity" : 20 },
{ "product" : "toaster", "store number" : 2, "quantity" : 100 },
{ "product" : "toaster", "store number" : 2, "quantity" : 50 },
{ "product" : "toaster", "store number" : 3, "quantity" : 50 },
{ "product" : "blender", "store number" : 3, "quantity" : 100 },
{ "product" : "blender", "store number" : 3, "quantity" : 150 },
{ "product" : "socks", "store number" : 1, "quantity" : 500 },
{ "product" : "socks", "store number" : 2, "quantity" : 10 },
{ "product" : "shirt", "store number" : 3, "quantity" : 10 }
]
let $join :=
for $store in $stores[], $sale in $sales[]
where $store."store number" eq $sale."store number"
return {
"nb" : $store."store number",
"state" : $store.state,
"sold" : $sale.product
}
return [$join]

Parsing fails

let $person := {
"name" : "Sarah",
"age" : 13,
"gender" : "female",
"friends" : [ "Jim", "Mary", "Jennifer"]
}
return { "how many friends" : size($person.friends)) }

Try-catch

Support try-catch expressions.

Note: depends on, and integrates with, the bubble-up mechanism.

A few useful sequence functions

The following functions would be very useful because they are very commonly used:

empty((2, 3)) -> false
exists((2, 3)) -> true
tail((2, 3, 4)) -> (3, 4)
head((2, 3)) -> 2

as well as

insert-before#3
remove#2
reverse#1
subsequence#2
subsequence#3

empty(), exists() and head() can be optimized, because only the first value (if any) needs to be accessed and the rest ignored.
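A sketch of that lazy optimization: each of these probes only looks at the first item, so even an infinite sequence poses no problem (illustrative Python, not the engine's iterator API).

```python
from itertools import islice

_MISSING = object()  # sentinel, so None items are handled correctly

def head(seq):
    item = next(iter(seq), _MISSING)
    return None if item is _MISSING else item

def exists(seq):
    return next(iter(seq), _MISSING) is not _MISSING

def empty(seq):
    return not exists(seq)

def tail(seq):
    return list(islice(iter(seq), 1, None))

def naturals():          # an infinite sequence
    n = 1
    while True:
        yield n
        n += 1

print(head(naturals()))  # 1; the infinite rest is never evaluated
```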

Report missing function correctly

If a function doesn't exist, the current error message is misleading. I do not have a cluster running at the moment, so I can't copy the output, but it is along the lines of "The function has the wrong arity." Instead, it should say something like "The function does not exist."

Double parameters on CLI

Some parameters such as --master local[*] need to be written both before and after the JAR file in the spark-submit command. There must be something to fine-tune in the Java code to avoid this redundancy and consume all parameters before the JAR file.

Object lookup only returns value of first object

({"foo" : "bar"}, {"foo" : "foobar"}).foo

only returns "bar"
but should return the sequence "bar", "foobar"

Likewise,

().foo

should return the empty sequence and currently returns an error.
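The expected semantics can be sketched as mapping the lookup over the whole sequence, skipping items without the key, with () naturally yielding () (illustrative Python only, with dicts standing in for JSON objects):

```python
def lookup(sequence, key):
    """Apply .key to every object in the sequence, like JSONiq object lookup."""
    return [obj[key] for obj in sequence
            if isinstance(obj, dict) and key in obj]

print(lookup([{"foo": "bar"}, {"foo": "foobar"}], "foo"))  # ['bar', 'foobar']
print(lookup([], "foo"))                                   # []
```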

Improve test suite and take over relevant W3C and Zorba tests

Many tests from XQuery (including 3.1 if we rewrite them) and from Zorba can be taken over to our test suite. We may need to make the testcase syntax more flexible to take them over (e.g., with outputs made of large sequences of JSON objects).

Conflicting versions of ANTLR

I am trying to run Sparksoniq on an HDInsight 3.6 cluster with Spark 2.2.0. I can start the shell, but running any query results in the following error:

ANTLR Tool version 4.6 used for code generation does not match the current runtime version 4.5.3
ANTLR Runtime version 4.6 used for parser compilation does not match the current runtime version 4.5.3
ANTLR Tool version 4.6 used for code generation does not match the current runtime version 4.5.3
ANTLR Runtime version 4.6 used for parser compilation does not match the current runtime version 4.5.3

If I add some tracing options via spark-submit --conf 'spark.driver.extraJavaOptions=-verbose:class' --conf 'spark.executor.extraJavaOptions=-verbose:class ..., I get many lines similar to the following:

[Loaded org.antlr.v4.runtime.misc.ParseCancellationException from file:/usr/hdp/2.6.3.61-4/spark2/jars/antlr4-runtime-4.5.3.jar]

This resembles this issue of another Spark application. It seems like the Sparksoniq version I am using (the precompiled jsoniq-spark-app-0.9.1-jar-with-dependencies.jar from GitHub) expects ANTLR 4.6, but Spark loads 4.5.3. If Spark 2.3.0 uses ANTLR 4.7, as that other issue claims, changing the Spark version might fix the problem, but I haven't tried.

Shading the dependencies might help.
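A common way to resolve such conflicts, assuming a Maven build, is to relocate the ANTLR packages with the maven-shade-plugin so the bundled 4.6 runtime cannot collide with Spark's 4.5.3. This is a sketch, not the project's actual build configuration:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <relocations>
      <relocation>
        <!-- move org.antlr classes to a private namespace in the fat JAR -->
        <pattern>org.antlr</pattern>
        <shadedPattern>shaded.org.antlr</shadedPattern>
      </relocation>
    </relocations>
  </configuration>
</plugin>
```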

Support for variables wrapped around RDD

In the medium to remote future, we will want to bind a variable to an "RDD wrapper" acting as a proxy to Spark in local expressions, instead of to a materialized sequence of items. This requires adapting the code of the dynamic context.

Example:

let $a := json-text("hdfs://.../file.json")
let $b := json-text("hdfs://.../file2.json")
return { a: count($a), b: count($b) }

The above FLWOR expression is local (i.e., the let clauses are executed locally, but wrap the RDD returned by json-text as if it were a local value in a black box), so a prerequisite is that local FLWORs are supported.

Note that this feature will be incompatible with FLWORs running on Spark, i.e., only "materialized" dynamic contexts can be used as RDDs because Spark forbids nesting.
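The idea can be sketched as a proxy object in the dynamic context whose methods trigger Spark actions on demand (hypothetical Python, with a plain function standing in for the deferred Spark job):

```python
class RDDProxy:
    """Stands in for a Spark RDD bound to a variable in the dynamic context."""
    def __init__(self, compute):
        self._compute = compute          # deferred computation (a Spark job)
    def count(self):
        return len(self._compute())      # would delegate to rdd.count() in Spark

# let $a := json-text(...)  /  let $b := json-text(...)   (hypothetical data)
context = {
    "a": RDDProxy(lambda: ["r1", "r2", "r3"]),
    "b": RDDProxy(lambda: ["r1", "r2"]),
}

result = {"a": context["a"].count(), "b": context["b"].count()}
print(result)  # {'a': 3, 'b': 2}
```

Local expressions such as count() consume the proxy without the data ever being materialized into the dynamic context.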

Support FLWOR count clauses with zipWithIndex()

If I understand correctly, count clauses are not supported in the outermost FLWOR expression. I think there is a relatively easy way to get a trivial implementation, namely using Spark's zipWithIndex(). I do not urgently need this feature now, but I want to make sure this idea doesn't get lost.
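Python's enumerate() mimics what zipWithIndex() would provide: each tuple in the stream is paired with its 1-based position, which a count clause then binds to its variable (a sketch of the idea, not Spark code):

```python
tuples = [{"x": "a"}, {"x": "b"}, {"x": "c"}]

# conceptually: for $t in (...) count $c return ...
counted = [dict(t, c=i) for i, t in enumerate(tuples, start=1)]

print(counted)  # [{'x': 'a', 'c': 1}, {'x': 'b', 'c': 2}, {'x': 'c', 'c': 3}]
```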

Logic should return error

boolean() exposes the "effective boolean value" to the user. It already exists internally and can be exposed as a function call.

A fix has to be made for the following logical evaluation. "If the input sequence has more than one item, and the first item is not an object or array, an error is raised.":
( 1, 2, 3 ) or false

returns true in Sparksoniq, but returns an error in Zorba:
(no URI):1,2: dynamic error [err:FORG0006]: invalid argument type for function fn:boolean(): effective boolean value not defined for sequence of more than one item that starts with "xs:integer"

( 1, 2, 3 ) needs to be converted to its effective boolean value, which is what the boolean() call computes, and which throws an error here.
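The rule quoted above can be sketched as follows (Python dicts and lists standing in for JSON objects and arrays; FORG0006 is the error code Zorba reports):

```python
def effective_boolean_value(seq):
    """Effective boolean value per the rule quoted above (a sketch)."""
    if len(seq) == 0:
        return False
    first = seq[0]
    if isinstance(first, (dict, list)):   # object or array: EBV is true
        return True
    if len(seq) > 1:
        raise ValueError("FORG0006: EBV undefined for this sequence")
    if isinstance(first, bool):           # check bool before int
        return first
    if isinstance(first, (int, float)):
        return first != 0
    if isinstance(first, str):
        return first != ""
    raise ValueError("FORG0006: no EBV for this item type")

try:
    effective_boolean_value([1, 2, 3])    # the case from the issue
except ValueError as e:
    print(e)
```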

Mathematical functions

Implement:

abs#1, ceiling#1, floor#1, round#1, round-half-to-even#1
pi#0, exp#1, exp10#1, log#1, log10#1, pow#2, sqrt#1, sin#1, cos#1, tan#1, asin#1, acos#1, atan#1, atan2#2

idiv returns decimals

4 idiv 2 -> 2
4 idiv 2.0 -> 2
4 idiv 2e0 -> 2.0
4.0 idiv 2e0 -> 2
4.0 idiv 2 -> 2.0
4e0 idiv 2 -> 2.0
4e0 idiv 2.0 -> 2

Bug with if-then-else

for $x in 1 to 10
return if ($x lt 5) then $x
               else -$x

This returns an error

[ERROR] Job aborted due to stage failure: Task 0 in stage 10.0 failed 1 times, most recent failure: Lost task 0.0 in stage 10.0 (TID 10, localhost, executor driver): sparksoniq.exceptions.IteratorFlowException: Error [err: XPDY0130]Invalid next() call; If expr
