gregrahn / join-order-benchmark Goto Github PK

Join Order Benchmark (JOB)

join-order-benchmark's Introduction

join-order-benchmark

This package contains the Join Order Benchmark (JOB) queries from:
"How Good Are Query Optimizers, Really?"
by Viktor Leis, Andrey Gubichev, Atans Mirchev, Peter Boncz, Alfons Kemper, Thomas Neumann
PVLDB Volume 9, No. 3, 2015
http://www.vldb.org/pvldb/vol9/p204-leis.pdf

IMDB Data Set

The CSV files used in the paper, which are from May 2013, can be found at http://homepages.cwi.nl/~boncz/job/imdb.tgz

The license and links to the current version IMDB data set can be found at http://www.imdb.com/interfaces

Step-by-step instructions

download *gz files (unpacking not necessary)

wget ftp://ftp.fu-berlin.de/misc/movies/database/frozendata/*gz

download and unpack imdbpy and the imdbpy2sql.py script

wget https://bitbucket.org/alberanid/imdbpy/get/5.0.zip

create PostgreSQL database (e.g., name imdbload):

createdb imdbload

transform *gz files to relational schema (takes a while)

imdbpy2sql.py -d PATH_TO_GZ_FILES -u postgres://username:password@hostname/imdbload

Now you should have a PostgreSQL database named imdbload with the imdb data. Note that this database has some secondary indexes (but not on all foreign key attributes). You can export all tables to CSV:

\copy aka_name to 'PATH/aka_name.csv' csv
\copy aka_title to 'PATH/aka_title.csv' csv
\copy cast_info to 'PATH/cast_info.csv' csv
\copy char_name to 'PATH/char_name.csv' csv
\copy comp_cast_type to 'PATH/comp_cast_type.csv' csv
\copy company_name to 'PATH/company_name.csv' csv
\copy company_type to 'PATH/company_type.csv' csv
\copy complete_cast to 'PATH/complete_cast.csv' csv
\copy info_type to 'PATH/info_type.csv' csv
\copy keyword to 'PATH/keyword.csv' csv
\copy kind_type to 'PATH/kind_type.csv' csv
\copy link_type to 'PATH/link_type.csv' csv
\copy movie_companies to 'PATH/movie_companies.csv' csv
\copy movie_info to 'PATH/movie_info.csv' csv
\copy movie_info_idx to 'PATH/movie_info_idx.csv' csv
\copy movie_keyword to 'PATH/movie_keyword.csv' csv
\copy movie_link to 'PATH/movie_link.csv' csv
\copy name to 'PATH/name.csv' csv
\copy person_info to 'PATH/person_info.csv' csv
\copy role_type to 'PATH/role_type.csv' csv
\copy title to 'PATH/title.csv' csv

To import the CSV files to another database, create all tables (see schema.sql and optionally fkindexes.sql) and run the same copy as above statements but replace the keyword "to" by "from".

Questions

Contact Viktor Leis ([email protected]) if you have any questions.

join-order-benchmark's People

Contributors

Stargazers

Watchers

Forkers

rjnn ngaut shenli waltercai mattperron stanxii maahl kgtdbx mrzzzrm yingjunwu tomasrelai spetrunia sandeep06011991 chunhui-shi qsguo robert-s-lee fenggeorgeyu dqmdev nimorun queryproc wukongjiuwoa qw4990 greatji for0nething yerzhaisang karlarao gabtakt zhengtongyan icepigzdb zhihuifan bensanhuan danielmao1 issam28 cgkbfrd mickeysclubhouse rebelice denis0x0d errno1024 khalefa-phd pngu23 wangt1xiuyi jaystarshot hughzz gaoyusong zhengchencs leonshi-hub zhangysh1995 jetmew songenjie galler-alexander makai-china avinashpandey0702 yunjiazhang siddharth9820 hannes jackwener iamfork yikangping valteruo xil-db qvad ayoubmekk zhouxh19 geelen bajinsheng happenlee andygrove fisherkkk bouncner wmsj2000 alkim0 fsfish dehasara ethicalsecurity-agency dothebart xiaoshao dedic247 zixiao-ye peimu lmwnshn jamesryan7 xhevx wangpatrick57 course-files lilelr

join-order-benchmark's Issues

Simple queries

Hello @gregrahn,

I was wondering if a "light" version of the JOB benchmark might be available. By this I mean queries that only use simple predicates such as a = between two attributes. I'm writing a cardinality estimation method and for the while I don't want to handle complex predicates such as LIKE.

Is something like this available? I thought I read about it in some paper but I seem to have forgotten...

Help

When I run the imdbpy2sql.py, I get some problems.
AttributeError: 'module' object has no attribute 'getLogger'

Can't transform *gz files to postgresql

Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.9/bin/imdbpy2sql.py", line 505, in
conn = setConnection(URI, DB_TABLES)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/imdb/parser/sql/alchemyadapter.py", line 455, in setConnection
engine = create_engine(uri, **params)
File "", line 2, in create_engine
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/sqlalchemy/util/deprecations.py", line 277, in warned
return fn(*args, **kwargs) # type: ignore[no-any-return]
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/sqlalchemy/engine/create.py", line 693, in create_engine
raise TypeError(
TypeError: Invalid argument(s) 'encoding' sent to create_engine(), using configuration PGDialect_psycopg2/QueuePool/Engine. Please check that the keyword arguments are appropriate for this combination of components.

Availability of a Bigger Dataset

The current Dataset is only 2-3Gb. Is there a bigger data set for 2TB or plans to make a new one

wget Link not working !

The link in step 1
(wget ftp://ftp.fu-berlin.de/pub/misc/movies/database/frozendata/*gz)

isn't working. I believe its been removed.
Is there any place else where its hosted?

Step 4 taking too long

My code has been running at step 4: "transform *gz files to relational schema (takes a while)" for 12 consecutive hours. Is this normal and expected? Thank you

Question on running all queries automatically or Pem File for Apache JMeter

Is there a .Pem File available for using with Apache JMeter or is there any suggested/reference tools for running all queries automatically?

Data set mismatch: empty queries results and reproducibility issues

We execute the join order benchmark as one of our "default" benchmarks in Hyrise. We recently found that several queries yield empty results and wondered if this might be a problem with Hyrise, the data set, or if it is part of the JOB by intention.

One example is query 32a, which selects k.keyword ='10,000-mile-club'. For our data set (we generated the data using the "frozen data set"), there is no such keyword but rather 10000-mile-club (see also here). Interestingly, this keyword with the comma exists in the CWI data set.

As a query's performance can thus differ vastly between the two data sets due to the empty results, we would consider it a bug in either the CWI data set or the query's SQL. Should this be fixed by adjusting the dataset generation from frozen data or by adjusting the query's SQL by removing the , from 10,000...?

For other queries, we found that inner joins following several filters lead to empty result sets (e.g., query 5a). We are again curious if this is done by intention? As far as we can tell, this is the case for both data sets (CWI and frozen).

Pinging @Bensk1 here as well as he did the data generation test.

cannot download file from step 2

Malformed CSVs

Downloaded CSV tarball. Trying to upload to SQL Azure using bcp proved to be really hard as CSVs are malformed.

Sample CSV row in aka_name.csv
220222,538021,""Borolas", Joaquín García Vargas",,B6425,J2526,B642,6526774f1ce04414f56476409ce59060

CSV expects quotation marks to be escaped as "", not "
220222,538021,"""Borolas"", Joaquín García Vargas",,B6425,J2526,B642,6526774f1ce04414f56476409ce59060

Can not load data to duckdb

Hi,

Using the files from the link, I found several tables, there were some errors in the data. And I can't load them to DuckDB. What should I do?
Eg. Error: Invalid Input Error: CSV Error on Line: 610 Value with unterminated quote found.

Thanks!