cartershanklin / hive-testbench

Testbench for experimenting with Apache Hive at any data scale.

Perl 6 15.12% Perl 5.89% Shell 22.43% Makefile 5.11% Java 51.45%

hive-testbench's Introduction

RETIRED

This repository has moved to the Hortonworks GitHub.

Make pull requests against that repository.

hive-testbench's Issues

Shouldn't data generation and loading be separated?

According to the TPC-DS specification (section 7.4.3), data generation time should not be measured. Right now hive-testbench cannot separate generation from loading into the database, because both are done by the single tpcds-setup.sh script, so it is difficult to measure the data load alone.

Correct me if I haven't understood the specification or if I don't know how to use hive-testbench.

EDIT: After a moment of consideration I have this doubt: does hive-testbench do an "in-line load" (as described in section 7.4.3.2 of the TPC-DS specification), in which case the generation should actually count toward the load time?
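If generation and load could be run as separate steps, timing them independently is straightforward. The following is a minimal sketch, not something hive-testbench provides: `generate_data` and `load_data` are hypothetical stand-ins for the two steps, and `time_phases` is an illustrative helper.

```python
import time
from typing import Callable, Dict, List, Tuple


def time_phases(phases: List[Tuple[str, Callable[[], None]]]) -> Dict[str, float]:
    """Run each phase in order and return its wall-clock duration in seconds."""
    durations: Dict[str, float] = {}
    for name, run in phases:
        start = time.perf_counter()
        run()
        durations[name] = time.perf_counter() - start
    return durations


# Hypothetical placeholders: in practice these would invoke the real
# generation step and the real load step.
def generate_data() -> None:
    time.sleep(0.01)


def load_data() -> None:
    time.sleep(0.01)


timings = time_phases([("generate", generate_data), ("load", load_data)])
print(f"load only: {timings['load']:.3f}s (generation reported separately)")
```

With an in-line load the two phases overlap, so a split like this would not apply and only the combined time is meaningful.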

tpcds-build.sh fails due to change in data structure

The top-level directory inside the TPC-DS kit archive has changed from "DS Tools" to "TPC-DS v1.3.0", so the mv step in the build no longer finds "DS Tools/tools".

I'll send a pull request once fixed.

hive-testbench]$ ./tpcds-build.sh
Maven not found, automatically installing it.
Building TPC-DS Data Generator
curl --output tpcds_kit.zip http://www.tpc.org/tpcds/dsgen/dsgen-download-files.asp?download_key=NaN
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1805k  100 1805k    0     0   664k      0  0:00:02  0:00:02 --:--:--  715k
mkdir -p target/
cp tpcds_kit.zip target/tpcds_kit.zip
test -d target/tools/ || (cd target; unzip tpcds_kit.zip)
Archive:  tpcds_kit.zip
   creating: TPC-DS v1.3.0/
[... snipped ...]
  inflating: TPC-DS v1.3.0/tools/y.tab.h
test -d target/tools/ || (cd target; mv "DS Tools/tools" tools)
mv: cannot stat `DS Tools/tools': No such file or directory
make: *** [target/tools/dsdgen] Error 1
TPC-DS Data Generator built, you can now use tpcds-setup.sh to generate data.
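One way to make the build step independent of the archive's top-level directory name is to look for whichever unpacked directory contains a tools subdirectory. The sketch below only illustrates that idea in Python; it is not the fix that was submitted, and `find_tools_dir` is a hypothetical helper name.

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_tools_dir(target: Path) -> Optional[Path]:
    """Return the first '<top-level dir>/tools' directory under target, or None."""
    if not target.is_dir():
        return None
    for top in sorted(target.iterdir()):
        candidate = top / "tools"
        if top.is_dir() and candidate.is_dir():
            return candidate
    return None


# Demo on a throwaway directory that mimics the layout from the log above.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp)
    (target / "TPC-DS v1.3.0" / "tools").mkdir(parents=True)
    found = find_tools_dir(target)
    print(found.relative_to(target) if found else "tools directory not found")
```

The same lookup works for the older "DS Tools" layout, since it never hard-codes the top-level name.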

properties in testbench settings make hive-testbench require Hive 0.14.0-SNAPSHOT jars

hive-testbench is not Hive 0.13.1 compatible as it stands in trunk.
See http://stackoverflow.com/questions/24316492/unable-to-configure-hive-exec-hooks-due-to-missing-jar for more info.

What is the purpose of these settings, and are they a requirement?

Generate Data Error: ./dsdgen: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by ./dsdgen)

Error: java.lang.InterruptedException: Process failed with status code 1
./dsdgen: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by ./dsdgen)
at org.notmysock.tpcds.GenTable$DSDGen.map(GenTable.java:194)
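This message typically means the dsdgen binary was built on a machine with glibc 2.14 or newer and is running on a node with an older glibc. A rough way to confirm this is to compare the version named in the error with the host's own. The sketch below assumes it is run on the node where the map task failed; `required_glibc` and `host_glibc` are illustrative names.

```python
import platform
import re
from typing import Optional, Tuple

Version = Tuple[int, ...]


def required_glibc(error_line: str) -> Optional[Version]:
    """Extract the glibc version named in a 'GLIBC_x.y not found' message."""
    match = re.search(r"GLIBC_(\d+(?:\.\d+)*)", error_line)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def host_glibc() -> Optional[Version]:
    """Return this host's glibc version, or None if it cannot be determined."""
    lib, version = platform.libc_ver()
    parts = re.findall(r"\d+", version)
    if lib != "glibc" or not parts:
        return None
    return tuple(int(part) for part in parts)


line = "./dsdgen: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by ./dsdgen)"
needed = required_glibc(line)
have = host_glibc()
if needed and have and have < needed:
    print(f"host glibc {have} is older than required {needed}; rebuild dsdgen on this platform")
else:
    print(f"required: {needed}, host: {have}")
```

If the host is older, rebuilding dsdgen on a machine with the same (or an older) glibc as the cluster nodes is the usual remedy.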

values.h should be limits.h

I tried to build on Mac OS X and got the error below.
Some googling showed that values.h is a legacy, non-standard header and that limits.h should be used instead. With that change made in porting.h, the build proceeds.

In file included from mkheader.c:37:
./porting.h:46:10: fatal error: 'values.h' file not found
#include <values.h>
1 error generated.
make[1]: *** [mkheader.o] Error 1

Failed to run job: user <user> to unknown queue: default

Hello!

Please help me run hive-testbench.
We have a YARN scheduler config that splits the queue into two (dev1 at 70% and dev2 at 30%).
When I tried to run $ ./tpcds-setup.sh 5,
it gave me this error:
Exception in thread "main" java.io.IOException: Failed to run job : Application application_1444364220514_0016 submitted by user hive to unknown queue: default

I tried adding the user hive to a queue in the YARN scheduler, but that failed.
Where do I need to set the queue in the hive-testbench settings?

Any help is highly appreciated! Thanks!

hive13 branch of tpch-gen lags t3rmin4t0r/tpch-gen

For example, the nation and region tables are generated not once but once per unit of scale factor; e.g. at scale factor 300, 300x the data is generated for each. See commit t3rmin4t0r/tpch-gen@ec46191

clarity on branches and Hive/Tez versions

Can you provide some clarity about what branch should be used for what version of Hive/Tez?

The hive13 branch README.md references testbench.settings but the file is init.settings in that branch. That settings file also sets hive.execution.engine=tez.

The master branch uses testbench.settings and has a second file, not in the README - testbench-withATS.settings, but those settings use hive.execution.engine=mr.

hive run query24.sql error

Hi, please help me.
Running query24.sql in Hive fails with this error:
FAILED: ParseException line 46:23 cannot recognize input near 'select' '0.05' '*' in expression specification

Hive version: 1.1.0

misspelled defines

Running tpcds-build.sh gives:

In file included from w_store_returns.c:40:
./w_store_sales.h:36:9: warning: 'W_STORE_SALES_H' is used as a header guard here, followed by #define of a different macro [-Wheader-guard]
#ifndef W_STORE_SALES_H
        ^~~~~~~~~~~~~~~
./w_store_sales.h:37:9: note: 'W_STORE_SLAES_H' is defined here; did you mean 'W_STORE_SALES_H'?
#define W_STORE_SLAES_H
        ^~~~~~~~~~~~~~~
        W_STORE_SALES_H
1 warning generated.
gcc -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -DYYDEBUG -DLINUX -g -Wall -c -o w_store_sales.o w_store_sales.c
In file included from w_store_sales.c:39:
./w_store_sales.h:36:9: warning: 'W_STORE_SALES_H' is used as a header guard here, followed by #define of a different macro [-Wheader-guard]
#ifndef W_STORE_SALES_H
        ^~~~~~~~~~~~~~~
./w_store_sales.h:37:9: note: 'W_STORE_SLAES_H' is defined here; did you mean 'W_STORE_SALES_H'?
#define W_STORE_SLAES_H
        ^~~~~~~~~~~~~~~
        W_STORE_SALES_H
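The warning means the #ifndef and the #define on the next line name different macros, so the guard never takes effect. A small script can flag the same pattern in other headers. This is a simplified sketch that assumes the first #ifndef in a file is its include guard; `mismatched_guard` is an illustrative name and not part of any toolchain.

```python
import re
from typing import Optional, Tuple

_IFNDEF = re.compile(r"^\s*#\s*ifndef\s+(\w+)")
_DEFINE = re.compile(r"^\s*#\s*define\s+(\w+)")


def mismatched_guard(header_text: str) -> Optional[Tuple[str, str]]:
    """Return (ifndef_name, define_name) if the first #ifndef is directly
    followed by a #define of a different macro; otherwise return None."""
    lines = [ln for ln in header_text.splitlines() if ln.strip()]
    for i, line in enumerate(lines[:-1]):
        guard = _IFNDEF.match(line)
        if guard:
            define = _DEFINE.match(lines[i + 1])
            if define and define.group(1) != guard.group(1):
                return guard.group(1), define.group(1)
            return None  # only the first #ifndef is treated as the guard
    return None


# The guard pattern reported in the compiler output above.
sample = "#ifndef W_STORE_SALES_H\n#define W_STORE_SLAES_H\n/* declarations */\n#endif\n"
print(mismatched_guard(sample))
```

The fix in the header itself is to make the #define match the #ifndef, i.e. define W_STORE_SALES_H.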

ERROR:Build 30 TB of text formatted TPC-DS data: FORMAT=textfile ./tpcds-setup 30000

ANSI SQL-92 version of hive13 branch queries

In your blog post Benchmarking Apache Hive 13 for Enterprise Hadoop you cite this repo as the source, but Hive 0.10 requires ANSI SQL-92 join syntax and the hive13 branch contains only ANSI SQL-89 versions. In keeping with complete openness, can you add the ANSI SQL-92 versions of the queries that were used for the Hive 0.10 benchmark? I'll also point out that there are ANSI SQL-92 versions of the TPC-DS queries in the 1.1 branch, but their filters do not match the versions in the hive13 or master branch, so they cannot be used for comparison without modification.
