Comments (5)
Solved in a26b8af
from db-benchmark.
To clarify @MichaelChirico's concerns: the first run on Spark does not include the start-up time of the cluster (in this case a single-node cluster). The cluster is already started, and the data have been read into it and cached in memory.
I already had a short discussion about that with @st-pasha, and the problem is not trivial to resolve.
The non-trivial parts are:
- How deep the warm-up should be. Different solutions might benefit differently from warm-up: a larger warm-up might give a bigger improvement for one solution and a smaller one for another. For example, data.table has a minor overhead, as described in "speed up first [.data.table call".
- What the subject of the benchmark is. If we want to simulate a user workflow where the first query suffers from that overhead, then it makes perfect sense to keep it. If we agree that a user always runs some warm-up on their node before doing their processing, then of course it makes sense to do the warm-up.
For now I do not see reasons good enough, or a strategy fair enough, to include warming up solutions for the "groupby" task.
What looks like the proper way to address this is to add a new task for grouping on warmed-up/analyzed/sorted/indexed data.
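One way to keep the first-query overhead visible without committing to a warm-up policy is to time consecutive runs of the same query and report them separately. A minimal sketch in plain Python, assuming a hypothetical `time_runs` helper and a toy query with a simulated one-time cost (neither is taken from the db-benchmark scripts):

```python
import time


def time_runs(query, n_runs=2):
    """Run `query` n_runs times and return the elapsed seconds of each run."""
    timings = []
    for _ in range(n_runs):
        start = time.perf_counter()
        query()
        timings.append(time.perf_counter() - start)
    return timings


# Toy stand-in for a solution with one-time overhead on its first call
# (hypothetical; real overheads differ per solution).
_state = {"warmed_up": False}


def toy_query():
    if not _state["warmed_up"]:
        time.sleep(0.05)  # simulated first-call overhead
        _state["warmed_up"] = True
    return sum(range(1000))


first, second = time_runs(toy_query)
print(f"first run: {first:.4f}s, second run: {second:.4f}s")
```

Reporting both numbers side by side leaves it to the reader to decide whether the first-run overhead matters for their workflow.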
The dataset is being loaded from file, I think. Could it be that Spark is very fast at file load but isn't materializing the data? Then when the first group by comes along, that's when it actually does the load from file. (Adding load times to the report was on the todo list regardless.) If lazy data ingest doesn't explain it, can an issue be raised under the Spark tag on SO or on a code-review site to see if they know?
@mattdowle I'm not sure whether this is what's going on, but yes, operations are generally lazy in Spark.
This code will be almost instant:
spark.read.parquet('s3://path/to/folder')
Even adding some filtering and other basic operations will do nothing.
You can force past the lazy evaluation by doing something inexpensive like:
SDF = spark.read.parquet('s3://path/to/folder')
SDF.count()
It is open to debate whether something like SDF.cache() is legit for the comparison.
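The effect described here can be imitated without Spark. The sketch below is a toy model in plain Python, not Spark's API: `LazyFrame`, `read`, `count`, `group_sum`, and `load_rows` are hypothetical names, chosen only to show how a lazy reader moves the load cost into whichever action runs first.

```python
class LazyFrame:
    """Toy lazily-loaded table: nothing is read until an action needs rows."""

    def __init__(self, loader):
        self._loader = loader   # callable that produces the rows
        self._rows = None       # filled in by the first action
        self.loads = 0          # how many times the loader actually ran

    def _materialize(self):
        if self._rows is None:
            self._rows = list(self._loader())
            self.loads += 1
        return self._rows

    def count(self):
        # Cheap action that forces the load.
        return len(self._materialize())

    def group_sum(self):
        # Group by key and sum values; also forces the load if needed.
        totals = {}
        for key, value in self._materialize():
            totals[key] = totals.get(key, 0) + value
        return totals


def read(loader):
    # Returns immediately, like a lazy reader: no rows are loaded yet.
    return LazyFrame(loader)


def load_rows():
    # Hypothetical data source standing in for a file read.
    return [("a", 1), ("b", 2), ("a", 3)]


frame = read(load_rows)
loads_after_read = frame.loads   # still 0: "reading" did no work
frame.count()                    # the load happens here
totals = frame.group_sum()       # the grouping no longer pays for it
```

Unlike this toy, which keeps the rows after the first load, Spark recomputes from the source on each action unless the data is cached or persisted, so a count alone would not be enough there.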
- It is likely that the Spark csv reading time is included in the first grouping time. I will add .count() before grouping as suggested by Michael, but it has to be added for all tools, as this already amounts to "collecting statistics" about the data.
- It is desirable to use .cache, otherwise Spark would be re-reading the csv on each query (?, according to its design). It is even more desirable to use .cache for the results of queries: other tools cache the answer on the side so that it can be accessed later on, unlike, AFAIR, Impala and Presto, where you needed to use CREATE TABLE AS SELECT to actually keep query results.
- Instead of the .cache method we use .persist(pyspark.StorageLevel.MEMORY_ONLY), as in recent versions of Spark .cache only wraps .persist but does not let you choose MEMORY_ONLY. This has to be adjusted to .persist to memory and disk when we go for the 1e10 grouping benchmark (500GB).