pola-rs / tpch
License: MIT License
Each query is run through a Python subprocess (subprocess.run).
Doesn't this add a lot of overhead, since Python has to initialize its environment before running any logic?
Line 70 in e636fa9
Polars can run them for sure. Do you want a contribution?
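To put a rough number on the overhead being discussed, here is a minimal sketch (not from the repo) that times how long a bare Python subprocess takes to start and exit, which approximates the fixed per-query cost of `subprocess.run`:

```python
import subprocess
import sys
import time

# Time how long an empty Python subprocess takes to start and exit.
# This approximates the fixed per-query overhead of subprocess.run.
def interpreter_startup_seconds(runs: int = 5) -> float:
    start = time.perf_counter()
    for _ in range(runs):
        subprocess.run([sys.executable, "-c", "pass"], check=True)
    return (time.perf_counter() - start) / runs

overhead = interpreter_startup_seconds()
print(f"~{overhead * 1000:.0f} ms of startup overhead per subprocess")
```

On most machines this is tens of milliseconds per invocation, which matters for the fast queries but is small relative to large scale factors.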
Following the README.md, we executed the code. However, we get errors saying that Parquet files are missing, for example:
No such file or directory: '/tpch/tables_scale_1/supplier.parquet'
Note: tables_scale_1 folder does not exist.
With that, I get empty plots.
Is there any further documentation on how to generate/populate the tables_scale folder or any other required data?
In both the pandas and modin queries:
lineitem_filtered["l_year"] = lineitem_filtered["l_shipdate"].apply(
lambda x: x.year
)
should be
lineitem_filtered["l_year"] = lineitem_filtered["l_shipdate"].dt.year
The polars_queries version uses the analogous idiom. This made a pretty big difference locally.
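A small self-contained comparison (with an illustrative frame standing in for the filtered lineitem table) shows the two idioms produce the same values, with the accessor running vectorized instead of calling a Python lambda per row:

```python
import pandas as pd

# Illustrative stand-in for the filtered lineitem table.
lineitem_filtered = pd.DataFrame(
    {"l_shipdate": pd.to_datetime(["1994-03-15", "1995-07-01", "1996-12-31"])}
)

# Row-by-row Python call: invokes the lambda once per element.
years_apply = lineitem_filtered["l_shipdate"].apply(lambda x: x.year)

# Vectorized datetime accessor: extracts the year over the whole column.
years_dt = lineitem_filtered["l_shipdate"].dt.year

# Same values either way (dtypes may differ between the two paths).
print(years_dt.tolist())  # [1994, 1995, 1996]
```

The speedup grows with column length, since `apply` pays Python-call overhead per row.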
Hi there,
It would be worth having the queries executed in SQL as well, both to test Polars' SQL coverage of TPC-H and to make the tests easier to write.
Here's the line:
Line 24 in 6c5bbe9
If these benchmarks are being run on a single node, we should probably set the shuffle partitions to be like 1-4 instead of 200 (which is the default).
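A sketch of the suggested change, assuming the session is built via `SparkSession.builder` (the app name below is illustrative):

```python
from pyspark.sql import SparkSession

# Single-node session with a small shuffle partition count
# instead of Spark's default of 200.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("tpch-bench")
    .config("spark.sql.shuffle.partitions", "4")
    .getOrCreate()
)

print(spark.conf.get("spark.sql.shuffle.partitions"))  # 4
```

With 200 partitions on one machine, each shuffle stage schedules far more tasks than there are cores, adding per-task overhead for no parallelism gain.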
~/tpch$ make tables
make -C tpch-dbgen all
make[1]: Entering directory '/home/anatoly/tpch/tpch-dbgen'
make[1]: Nothing to be done for 'all'.
make[1]: Leaving directory '/home/anatoly/tpch/tpch-dbgen'
cd tpch-dbgen && ./dbgen -vf -s 10 && cd ..
TPC-H Population Generator (Version 2.17.2)
Copyright Transaction Processing Performance Council 1994 - 2010
Generating data for suppliers table
Preloading text ... 100%
done.
Generating data for customers tabledone.
Generating data for orders/lineitem tablesdone.
Generating data for part/partsupplier tablesdone.
Generating data for nation tabledone.
Generating data for region tabledone.
mkdir -p "data/tables/scale-10"
mv tpch-dbgen/*.tbl data/tables/scale-10/
.venv/bin/python scripts/prepare_data.py 10
Traceback (most recent call last):
File "/home/anatoly/tpch/scripts/prepare_data.py", line 4, in <module>
import polars as pl
ModuleNotFoundError: No module named 'polars'
make: *** [Makefile:39: tables] Error 1
It looks like the virtual environment is not being activated correctly.
make version: GNU Make 4.3
Perhaps the problem is in the name of the .venv prerequisite.
It would be great to have a comparison to datafusion as well!
q2 on new release
Q4, Q18, and Q20 contain the EXISTS keyword, which should be translated to a semi-join. This should speed up the queries.
Hi,
Is there a way to see the results without cloning and running the code :-)?
If not, could a GitHub Pages site be deployed automatically after the code has been run with GitHub Actions?
When trying to compile the tpch-dbgen tool on macOS, users may encounter the documented error related to the malloc.h header file. This is a known issue and is mentioned in the README.
To save time and effort for macOS users who encounter this issue, and to improve usability, would it be beneficial to provide this command as a shortcut?
sed -i.bak 's/#include <malloc.h>/#include <sys\/malloc.h>/g' tpch-dbgen/bm_utils.c tpch-dbgen/varsub.c
This command replaces the #include <malloc.h> line with #include <sys/malloc.h> in the bm_utils.c and varsub.c files.
Trying this out, make run_polars shows:
TypeError: scan_csv() got an unexpected keyword argument 'sep'
[...]
TypeError: LazyFrame.sort() got an unexpected keyword argument 'reverse'
[...]
AttributeError: 'LazyFrame' object has no attribute 'with_column'. Did you mean: 'with_columns'?
[...]
Please advise. I'm a polars newbie.
When using polars 0.8.3, plot_results failed.
I fixed it by changing line 65 from
.with_columns(pl.col("labels").arr.join(",\n"))
to
.with_columns(pl.col("labels").list.join(",\n"))
There are some hand optimizations that should not be in there (e.g. filters before joins).
Hi, I noticed that the generated Parquet files are extremely fragmented in terms of rowgroups. This likely indicates a bug/issue in the Polars Parquet writer, but definitely also affects the results of the benchmarks.
For a SCALE_FACTOR=10 table generation, the Parquet files have a staggering 20,000 rowgroups!
Each rowgroup only has about 3,400 rows and a size of 117kB. For reference, Parquet rowgroups are often suggested to be in the range of about 128MB. Because we have so many rowgroups, the Parquet metadata itself is 27MB, and it likely introduces a ton of hops in the process of reading the file.
Writing this instead with PyArrow (I amended the code in prepare_data.py), we get much more well-behaved rowgroups:
Still fairly small as rowgroups go, but I think it's much more reasonable and represents Parquet data in the wild a little better!
When running query 2 separately from the other queries, it errors saying it is unable to find the column s_acctbal. I believe the problem lies in the join order in the final query, which drops the desired columns from the final output. I think this is also why the reported times for query 2 are much lower than for the other queries: the query isn't executed completely because it throws an error.
In the interest of clear communication around benchmarks, I'd suggest explaining what TPC-H stands for in the first line of the README.