Giter VIP home page Giter VIP logo

join-order-benchmark's Introduction

join-order-benchmark

This package contains the Join Order Benchmark (JOB) queries from:
"How Good Are Query Optimizers, Really?"
by Viktor Leis, Andrey Gubichev, Atans Mirchev, Peter Boncz, Alfons Kemper, Thomas Neumann
PVLDB Volume 9, No. 3, 2015
http://www.vldb.org/pvldb/vol9/p204-leis.pdf

IMDB Data Set

The CSV files used in the paper, which are from May 2013, can be found at http://homepages.cwi.nl/~boncz/job/imdb.tgz

The license and links to the current version IMDB data set can be found at http://www.imdb.com/interfaces

Step-by-step instructions

  1. download *gz files (unpacking not necessary)
wget ftp://ftp.fu-berlin.de/misc/movies/database/frozendata/*gz
  1. download and unpack imdbpy and the imdbpy2sql.py script
wget https://bitbucket.org/alberanid/imdbpy/get/5.0.zip
  1. create PostgreSQL database (e.g., name imdbload):
createdb imdbload
  1. transform *gz files to relational schema (takes a while)
imdbpy2sql.py -d PATH_TO_GZ_FILES -u postgres://username:password@hostname/imdbload

Now you should have a PostgreSQL database named imdbload with the imdb data. Note that this database has some secondary indexes (but not on all foreign key attributes). You can export all tables to CSV:

\copy aka_name to 'PATH/aka_name.csv' csv
\copy aka_title to 'PATH/aka_title.csv' csv
\copy cast_info to 'PATH/cast_info.csv' csv
\copy char_name to 'PATH/char_name.csv' csv
\copy comp_cast_type to 'PATH/comp_cast_type.csv' csv
\copy company_name to 'PATH/company_name.csv' csv
\copy company_type to 'PATH/company_type.csv' csv
\copy complete_cast to 'PATH/complete_cast.csv' csv
\copy info_type to 'PATH/info_type.csv' csv
\copy keyword to 'PATH/keyword.csv' csv
\copy kind_type to 'PATH/kind_type.csv' csv
\copy link_type to 'PATH/link_type.csv' csv
\copy movie_companies to 'PATH/movie_companies.csv' csv
\copy movie_info to 'PATH/movie_info.csv' csv
\copy movie_info_idx to 'PATH/movie_info_idx.csv' csv
\copy movie_keyword to 'PATH/movie_keyword.csv' csv
\copy movie_link to 'PATH/movie_link.csv' csv
\copy name to 'PATH/name.csv' csv
\copy person_info to 'PATH/person_info.csv' csv
\copy role_type to 'PATH/role_type.csv' csv
\copy title to 'PATH/title.csv' csv

To import the CSV files to another database, create all tables (see schema.sql and optionally fkindexes.sql) and run the same copy as above statements but replace the keyword "to" by "from".

Questions

Contact Viktor Leis ([email protected]) if you have any questions.

join-order-benchmark's People

Contributors

gregrahn avatar maahl avatar maxhalford avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar

join-order-benchmark's Issues

Simple queries

Hello @gregrahn,

I was wondering if a "light" version of the JOB benchmark might be available. By this I mean queries that only use simple predicates such as a = between two attributes. I'm writing a cardinality estimation method and for the while I don't want to handle complex predicates such as LIKE.

Is something like this available? I thought I read about it in some paper but I seem to have forgotten...

Help

When I run the imdbpy2sql.py, I get some problems.
AttributeError: 'module' object has no attribute 'getLogger'
A`TXE(GWZHYY@RG)W~HB4)W

Can't transform *gz files to postgresql

Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.9/bin/imdbpy2sql.py", line 505, in
conn = setConnection(URI, DB_TABLES)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/imdb/parser/sql/alchemyadapter.py", line 455, in setConnection
engine = create_engine(uri, **params)
File "", line 2, in create_engine
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/sqlalchemy/util/deprecations.py", line 277, in warned
return fn(*args, **kwargs) # type: ignore[no-any-return]
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/sqlalchemy/engine/create.py", line 693, in create_engine
raise TypeError(
TypeError: Invalid argument(s) 'encoding' sent to create_engine(), using configuration PGDialect_psycopg2/QueuePool/Engine. Please check that the keyword arguments are appropriate for this combination of components.

wget Link not working !

The link in step 1
(wget ftp://ftp.fu-berlin.de/pub/misc/movies/database/frozendata/*gz)

isn't working. I believe its been removed.
Is there any place else where its hosted?

Step 4 taking too long

My code has been running at step 4: "transform *gz files to relational schema (takes a while)" for 12 consecutive hours. Is this normal and expected? Thank you

Data set mismatch: empty queries results and reproducibility issues

We execute the join order benchmark as one of our "default" benchmarks in Hyrise. We recently found that several queries yield empty results and wondered if this might be a problem with Hyrise, the data set, or if it is part of the JOB by intention.

One example is query 32a, which selects k.keyword ='10,000-mile-club'. For our data set (we generated the data using the "frozen data set"), there is no such keyword but rather 10000-mile-club (see also here). Interestingly, this keyword with the comma exists in the CWI data set.

As a query's performance can thus differ vastly between the two data sets due to the empty results, we would consider it a bug in either the CWI data set or the query's SQL. Should this be fixed by adjusting the dataset generation from frozen data or by adjusting the query's SQL by removing the , from 10,000...?

For other queries, we found that inner joins following several filters lead to empty result sets (e.g., query 5a). We are again curious if this is done by intention? As far as we can tell, this is the case for both data sets (CWI and frozen).

Pinging @Bensk1 here as well as he did the data generation test.

Malformed CSVs

Downloaded CSV tarball. Trying to upload to SQL Azure using bcp proved to be really hard as CSVs are malformed.

Sample CSV row in aka_name.csv
220222,538021,""Borolas", Joaquín García Vargas",,B6425,J2526,B642,6526774f1ce04414f56476409ce59060

CSV expects quotation marks to be escaped as "", not "
220222,538021,"""Borolas"", Joaquín García Vargas",,B6425,J2526,B642,6526774f1ce04414f56476409ce59060

Can not load data to duckdb

Hi,

Using the files from the link, I found several tables, there were some errors in the data. And I can't load them to DuckDB. What should I do?
Eg. Error: Invalid Input Error: CSV Error on Line: 610 Value with unterminated quote found.

Thanks!

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.