pola-rs / tpch
License: MIT License
Each query is run through a Python subprocess (subprocess.run).
Doesn't this add a lot of overhead, since Python has to initialize its environment before running any logic?
Line 70 in e636fa9
Polars can run them for sure. Do you want a contribution?
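To put a rough number on the overhead being discussed, here is a minimal sketch (not from the repo) that times how long a bare Python subprocess takes to start and exit, which approximates the fixed per-query cost of `subprocess.run`:

```python
import subprocess
import sys
import time

# Time how long an empty Python subprocess takes to start and exit.
# This approximates the fixed per-query overhead of subprocess.run.
def interpreter_startup_seconds(runs: int = 5) -> float:
    start = time.perf_counter()
    for _ in range(runs):
        subprocess.run([sys.executable, "-c", "pass"], check=True)
    return (time.perf_counter() - start) / runs

overhead = interpreter_startup_seconds()
print(f"~{overhead * 1000:.0f} ms of startup overhead per subprocess")
```

On most machines this is tens of milliseconds per invocation, which matters for the fast queries but is small relative to large scale factors.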
Following the README.md, we executed the code. However, we get errors saying that Parquet files are missing, for example:
No such file or directory: '/tpch/tables_scale_1/supplier.parquet'
Note: tables_scale_1 folder does not exist.
With that, I get empty plots.
Is there any further documentation on how to generate/populate the tables_scale folder or any other required data?
In both the pandas and modin queries:
lineitem_filtered["l_year"] = lineitem_filtered["l_shipdate"].apply(
lambda x: x.year
)
should be
lineitem_filtered["l_year"] = lineitem_filtered["l_shipdate"].dt.year
The polars_queries version uses the analogous idiom. This made a pretty big difference locally.
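A small self-contained comparison (with an illustrative frame standing in for the filtered lineitem table) shows the two idioms produce the same values, with the accessor running vectorized instead of calling a Python lambda per row:

```python
import pandas as pd

# Illustrative stand-in for the filtered lineitem table.
lineitem_filtered = pd.DataFrame(
    {"l_shipdate": pd.to_datetime(["1994-03-15", "1995-07-01", "1996-12-31"])}
)

# Row-by-row Python call: invokes the lambda once per element.
years_apply = lineitem_filtered["l_shipdate"].apply(lambda x: x.year)

# Vectorized datetime accessor: extracts the year over the whole column.
years_dt = lineitem_filtered["l_shipdate"].dt.year

# Same values either way (dtypes may differ between the two paths).
print(years_dt.tolist())  # [1994, 1995, 1996]
```

The speedup grows with column length, since `apply` pays Python-call overhead per row.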
Hi there,
It would be worth having the queries executed in SQL as well, both to test Polars' SQL coverage of TPC-H and to make the tests easier to write.
Here's the line:
Line 24 in 6c5bbe9
If these benchmarks are being run on a single node, we should probably set the shuffle partitions to be like 1-4 instead of 200 (which is the default).
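A sketch of the suggested change, assuming the session is built via `SparkSession.builder` (the app name below is illustrative):

```python
from pyspark.sql import SparkSession

# Single-node session with a small shuffle partition count
# instead of Spark's default of 200.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("tpch-bench")
    .config("spark.sql.shuffle.partitions", "4")
    .getOrCreate()
)

print(spark.conf.get("spark.sql.shuffle.partitions"))  # 4
```

With 200 partitions on one machine, each shuffle stage schedules far more tasks than there are cores, adding per-task overhead for no parallelism gain.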
~/tpch$ make tables
make -C tpch-dbgen all
make[1]: Entering directory '/home/anatoly/tpch/tpch-dbgen'
make[1]: Nothing to be done for 'all'.
make[1]: Leaving directory '/home/anatoly/tpch/tpch-dbgen'
cd tpch-dbgen && ./dbgen -vf -s 10 && cd ..
TPC-H Population Generator (Version 2.17.2)
Copyright Transaction Processing Performance Council 1994 - 2010
Generating data for suppliers table
Preloading text ... 100%
done.
Generating data for customers tabledone.
Generating data for orders/lineitem tablesdone.
Generating data for part/partsupplier tablesdone.
Generating data for nation tabledone.
Generating data for region tabledone.
mkdir -p "data/tables/scale-10"
mv tpch-dbgen/*.tbl data/tables/scale-10/
.venv/bin/python scripts/prepare_data.py 10
Traceback (most recent call last):
File "/home/anatoly/tpch/scripts/prepare_data.py", line 4, in <module>
import polars as pl
ModuleNotFoundError: No module named 'polars'
make: *** [Makefile:39: tables] Error 1
It looks like the virtual environment is not being activated correctly.
make version: GNU Make 4.3
Perhaps the problem is in the name of the .venv prerequisite.
It would be great to have a comparison to datafusion as well!
q2 on new release
Q4, Q18, and Q20 contain the EXISTS keyword, which should be translated to a semi-join. This should speed up the queries.
Hi,
Is there a way to see the results without cloning and running the code :-)?
If not, could a GitHub Pages site be deployed automatically after the code has been run with GitHub Actions?
When trying to compile the tpch-dbgen tool on macOS, users may encounter the documented error related to the malloc.h header file. This is a known issue and is mentioned in the README.
To save time and effort for macOS users who encounter this issue, and to improve usability, would it be beneficial to provide this command as a shortcut?
sed -i.bak 's/#include <malloc.h>/#include <sys\/malloc.h>/g' tpch-dbgen/bm_utils.c tpch-dbgen/varsub.c
This command replaces the #include <malloc.h> line with #include <sys/malloc.h> in the bm_utils.c and varsub.c files.
Trying this out, make run_polars shows:
TypeError: scan_csv() got an unexpected keyword argument 'sep'
[...]
TypeError: LazyFrame.sort() got an unexpected keyword argument 'reverse'
[...]
AttributeError: 'LazyFrame' object has no attribute 'with_column'. Did you mean: 'with_columns'?
[...]
Please advise. I'm a polars newbie.
When using polars 0.8.3, plot_results failed.
I fixed it by changing line 65 from
.with_columns(pl.col("labels").arr.join(",\n"))
to
.with_columns(pl.col("labels").list.join(",\n"))
There are some hand optimizations that should not be in there (e.g. filters before joins).
Hi, I noticed that the generated Parquet files are extremely fragmented in terms of rowgroups. This likely indicates a bug/issue in the Polars Parquet writer, but definitely also affects the results of the benchmarks.
For a SCALE_FACTOR=10 table generation, the Parquet files have a staggering 20,000 rowgroups!
Each rowgroup only has about 3,400 rows and a size of 117kB. For reference, Parquet rowgroups are often suggested to be in the range of about 128MB. Because we have so many rowgroups, the Parquet metadata itself is 27MB, and it likely introduces a ton of hops in the process of reading the file.
Writing this instead with PyArrow (I amended the code in prepare_data.py), we get much more well-behaved rowgroups:
Still fairly small as rowgroups go, but I think it's much more reasonable and represents Parquet data in the wild a little better!
When running query 2 separately from the other queries, it errors saying it is unable to find the column s_acctbal. I believe the problem lies in the join order in the final query, which drops the desired columns from the final output. I think this is also why the reported times for query 2 are much lower than for the other queries: the query isn't executed completely because it throws an error.
In the interest of clear communication around benchmarks, I'd suggest explaining what TPC-H stands for in the first line of the README.