This repository has moved to the Hortonworks GitHub.
Make pull requests against that repository.
Testbench for experimenting with Apache Hive at any data scale.
According to the TPC-DS specification (section 7.4.3), data generation time should not be measured. Right now in hive-testbench it is impossible to separate generation from loading into the database, so it is difficult to measure only the data load (generation and load are both done by running the single tpcds-setup.sh script).
Correct me if I haven't understood the specification or if I don't know how to use hive-testbench.
EDIT: After a moment of consideration I have this doubt: does hive-testbench do an "in-line load" (as described in section 7.4.3.2 of the TPC-DS specification), in which case generation should actually contribute to the load time?
The directory structure of the tpcds data has changed from DS Tools to TPC-DS v1.3.0.
I'll send a pull request once fixed.
hive-testbench]$ ./tpcds-build.sh
Maven not found, automatically installing it.
Building TPC-DS Data Generator
curl --output tpcds_kit.zip http://www.tpc.org/tpcds/dsgen/dsgen-download-files.asp?download_key=NaN
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 1805k 100 1805k 0 0 664k 0 0:00:02 0:00:02 --:--:-- 715k
mkdir -p target/
cp tpcds_kit.zip target/tpcds_kit.zip
test -d target/tools/ || (cd target; unzip tpcds_kit.zip)
Archive: tpcds_kit.zip
creating: TPC-DS v1.3.0/
[... snipped ...]
inflating: TPC-DS v1.3.0/tools/y.tab.h
test -d target/tools/ || (cd target; mv "DS Tools/tools" tools)
mv: cannot stat `DS Tools/tools': No such file or directory
make: *** [target/tools/dsdgen] Error 1
TPC-DS Data Generator built, you can now use tpcds-setup.sh to generate data.
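One way the move step could tolerate both archive layouts is sketched below. This is not the fix from the pending pull request; the function name relocate_tools is hypothetical, and the two directory names are taken from the log above.

```shell
# Sketch: relocate the extracted tools directory regardless of archive layout.
# relocate_tools is a hypothetical helper; the candidate directory names
# ("DS Tools" and "TPC-DS v1.3.0") come from the build log in this report.
relocate_tools() {
    target_dir="$1"
    # Nothing to do if an earlier run already produced <target_dir>/tools.
    if [ -d "$target_dir/tools" ]; then
        return 0
    fi
    for candidate in "DS Tools" "TPC-DS v1.3.0"; do
        if [ -d "$target_dir/$candidate/tools" ]; then
            mv "$target_dir/$candidate/tools" "$target_dir/tools"
            return 0
        fi
    done
    echo "no tools directory found under $target_dir" >&2
    return 1
}
# Usage (after unzipping tpcds_kit.zip into target/): relocate_tools target
```

Because the function returns non-zero when neither layout is present, a calling Makefile rule would stop at this step instead of going on to report a successful build.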
hive-testbench is not Hive 0.13.1 compatible as it stands in trunk.
See http://stackoverflow.com/questions/24316492/unable-to-configure-hive-exec-hooks-due-to-missing-jar for more info.
What is the purpose of these settings, and are they a requirement?
Error: java.lang.InterruptedException: Process failed with status code 1
./dsdgen: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by ./dsdgen)
at org.notmysock.tpcds.GenTable$DSDGen.map(GenTable.java:194)
I tried to build on Mac OS X and got the error below.
Some googling showed that values.h is deprecated and limits.h should be used instead. If the include is changed in porting.h, the build proceeds.
In file included from mkheader.c:37:
./porting.h:46:10: fatal error: 'values.h' file not found
^
1 error generated.
make[1]: *** [mkheader.o] Error 1
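The workaround described above (changing the include in porting.h) could be scripted roughly as follows. This is a sketch only: patch_values_include is a hypothetical helper name, and the exact form of the include line in porting.h is assumed to be `#include <values.h>`.

```shell
# Sketch: swap the <values.h> include for <limits.h> in a header file.
# patch_values_include is a hypothetical helper name; the include line's
# exact spelling in porting.h is an assumption.
patch_values_include() {
    header="$1"
    # Write to a temporary file and move it into place. This avoids
    # "sed -i", whose syntax differs between GNU sed and BSD (macOS) sed.
    sed 's|#include <values.h>|#include <limits.h>|' "$header" > "$header.tmp" &&
        mv "$header.tmp" "$header"
}
# Usage (path is an assumption): patch_values_include target/tools/porting.h
```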
Hello!
Please help me run hive-testbench.
We have a YARN scheduler config which splits the queue into two (dev1 70% and dev2 30%).
When I tried to run $ ./tpcds-setup.sh 5, it gave me this error:
Exception in thread "main" java.io.IOException: Failed to run job : Application application_1444364220514_0016 submitted by user hive to unknown queue: default
I've tried to add the user hive to the queue in the YARN scheduler, but that failed.
Where do I need to set the queue in the hive-testbench settings?
Any help is highly appreciated! Thanks!
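One possible direction, offered as an unconfirmed sketch rather than an answer from the maintainers: mapreduce.job.queuename and tez.queue.name are the standard Hadoop and Tez properties for choosing a queue, and they could be appended to whichever Hive settings file the scripts load. The helper name add_queue_settings is hypothetical, and whether the data generation job started by tpcds-setup.sh honors these settings is an assumption; that job may instead need the queue passed on its own command line.

```shell
# Sketch: append queue settings to a Hive settings file.
# add_queue_settings is a hypothetical helper; the queue name and the
# settings file path are supplied by the caller.
add_queue_settings() {
    queue="$1"
    settings_file="$2"
    {
        # Queue used by MapReduce jobs.
        printf 'set mapreduce.job.queuename=%s;\n' "$queue"
        # Queue used when Hive runs on Tez.
        printf 'set tez.queue.name=%s;\n' "$queue"
    } >> "$settings_file"
}
# Usage (file path is an assumption): add_queue_settings dev1 testbench.settings
```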
For example, the tables nation and region are generated not once, but as many times as the scale factor, e.g. at 300, there is 300x the data generated for each. See commit t3rmin4t0r/tpch-gen@ec46191
Can you provide some clarity about what branch should be used for what version of Hive/Tez?
The hive13 branch README.md references testbench.settings but the file is init.settings in that branch. That settings file also sets hive.execution.engine=tez.
The master branch uses testbench.settings and has a second file, not in the README - testbench-withATS.settings, but those settings use hive.execution.engine=mr.
Hi, please help me!
Running Hive query24.sql gives this error:
FAILED: ParseException line 46:23 cannot recognize input near 'select' '0.05' '*' in expression specification
Hive version: 1.1.0
Running tpcds_build.sh gives:
In file included from w_store_returns.c:40:
./w_store_sales.h:36:9: warning: 'W_STORE_SALES_H' is used as a header guard here, followed by #define of a different macro [-Wheader-guard]
^~~~~~~~~~~~~~~
./w_store_sales.h:37:9: note: 'W_STORE_SLAES_H' is defined here; did you mean 'W_STORE_SALES_H'?
^~~~~~~~~~~~~~~
W_STORE_SALES_H
1 warning generated.
gcc -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -DYYDEBUG -DLINUX -g -Wall -c -o w_store_sales.o w_store_sales.c
In file included from w_store_sales.c:39:
./w_store_sales.h:36:9: warning: 'W_STORE_SALES_H' is used as a header guard here, followed by #define of a different macro [-Wheader-guard]
^~~~~~~~~~~~~~~
./w_store_sales.h:37:9: note: 'W_STORE_SLAES_H' is defined here; did you mean 'W_STORE_SALES_H'?
^~~~~~~~~~~~~~~
W_STORE_SALES_H
In your blog post Benchmarking Apache Hive 13 for Enterprise Hadoop you cite this repo as the source, but Hive 0.10 requires ANSI SQL-92 join syntax and the hive13 branch contains only ANSI SQL-89 versions. In keeping with complete openness, can you add the ANSI SQL-92 versions of the queries that were used for the Hive 0.10 benchmark? I'll also point out that there are ANSI SQL-92 versions of TPC-DS queries in the 1.1 branch, but the filters do not match the versions in the hive13 or master branch, so they cannot be used for comparison without modifications.