Community Articles

Find and share helpful community-sourced technical articles.
Labels (1)
avatar

The hive testbench consists of a data generator and a standard set of queries typically used for benchmarking hive performance. This article describes how to generate data and run a query in using beeline and Hive 2.0 with and without LLAP. It also shows how to use explain to see the difference in query plans.

If you don't have a cluster already configured for LLAP, you can provision one in AWS using Hortonworks Cloud. See this article for instructions on how to provision a 2.5 tech preview with LLAP enabled.

1. Log into the master node in you cluster where Hive is installed. If you used Hortonworks Cloud to create your instance, locate the node with a name ending in master. The ssh command is shown next to the master instance. If you are logging in from a linux host, click on the icon to the right of the ssh command to select the command text and copy the command. In the linux shell, change to the directory containing your AWS key .pem file and the run the copied command.

If you are logging in from Windows, consult the AWS user guide for instructions on how to log in using putty with the user name cloudbreak and authenticating with the key file.

7208-000-locate-master.png

2. Sudo to the hdfs user to begin generating data. Change to the home directory for the hdfs user:

sudo -u hdfs -s 
cd /home/hdfs

3. Download the testbench utilities from Github and unzip them:

wget https://github.com/hortonworks/hive-testbench/archive/hive14.zip  
unzip hive14.zip

4. Open the load_partitioned.sql file in an editor:

vi hive-testbench-hive14/settings/load-partitioned.sql

5. Correct the hive.tez.java.opts setting:

Comment out the line below by adding -- at the beginning of the line:

-- set hive.tez.java.opts=-XX:+PrintGCDetails -verbose:gc -XX:+PrintGCTimeStamps -XX:+UseNUMA -XX:+UseG1GC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/;

Add the line below:

set hive.tez.java.opts=-XX:+PrintGCDetails -verbose:gc -XX:+PrintGCTimeStamps -XX:+UseNUMA -XX:+UseParallelGC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/;

Save the file and exit.

6. Generate 30G of test data:

/* In case GCC is not installed */
yum install gcc

/* If javac is not found */
export JAVA_HOME=/usr/jdk64/jdk1.8.0_77
export PATH=$JAVA_HOME/bin:$PATH

cd hive-testbench-hive14/ 
sudo ./tpcds-build.sh 
./tpcds-setup.sh 30

7. A map reduce job runs to create the data and load the data into hive. This will take some time to complete. The last line in the script is:

Data loaded into database tpcds_bin_partitioned_orc_30.

8. Choose a query to run for benchmarking. For example query55.sql. Copy the query of of your choice and make an explain version of the query. The explain query will be helpful later on to see how hive is planning the query.

cd sample-queries-tpcds 
cp query55.sql explainquery55.sql 
vi explainquery55.sql

Add the keyword explain before the query. For example the first line of the explain of query 55:

explain select i_brand_id brand_id, i_brand brand,

Save and quit out of the file.

9. You are now ready to issue a benchmark query. Start the beeline hive2 cli.

beeline -i testbench.settings -u jdbc:hive2://localhost:10500/tpcds_bin_partitioned_orc_30

10. To try a query without LLAP, set hive.llap.execution.mode=none and run a query. For example, the command line below will run benchmark query 55:

set hive.llap.execution.mode=none; 
!run query55.sql

7214-screen-shot-2016-08-30-at-63155-pm.png

Note the completion time at the end of the query is 18.984 without LLAP:

7213-screen-shot-2016-08-31-at-61159-pm.png

11. Now try the query with LLAP, set hive.llap.execution.mode=all and run the query again:

set hive.llap.execution.mode=all; 
!run query55.sql

7215-screen-shot-2016-08-31-at-61303-pm.png

12. Notice that the query with LLAP completes much more quickly. If you don’t see a significant speed up at first, try the same query again. As the LLAP cache fills with data, queries respond more quickly. Below are the results of the next two runs of the same query with LLAP set to all. The second query returned in 8.455 seconds and a subsequent query in 2.745 seconds. If your cluster has been up and you have been doing LLAP queries on this data your performance my be in the 2 second range on the first try:

7216-screen-shot-2016-08-31-at-61303-pm.png

7217-screen-shot-2016-08-31-at-61357-pm.png

13. To see the difference between the query plans, use the explain query to show the plan for a query with no LLAP. Take note of the vectorized outlined in red in the screen shot below:

set hive.llap.execution.mode=none;
!run explainquery55.sql

7209-screen-shot-2016-08-30-at-65117-pm.png

14. Try the explain again, with LLAP enabled:

set hive.llap.execution.mode=all; 
!run explainquery55.sql

15. Notice in the explain plan for the LLAP query, LLAP is shown after the vectorized keyword.

7210-screen-shot-2016-08-30-at-65450-pm.png

References:

Beeline

https://github.com/hortonworks/hive-testbench

https://community.hortonworks.com/questions/51333/hive-testbench-error-when-generating-data.html

https://community.hortonworks.com/questions/23988/not-able-to-run-hive-benchmark-test.html


screen-shot-2016-08-31-at-61159-pm.pngscreen-shot-2016-08-30-at-63155-pm.png
20,842 Views
Comments
avatar
Super Guru

@cduby

At step 6 you need to be sudo to execute ./tpcds-build.sh

Data generation at the next step is fine with hdfs user.

avatar
Super Guru

@cduby

At step 5, you changed the GC mode. The recommended for LLAP is G1GC. Is there a reason for the change you proposed?

@Constantin Stanca The change to the gc parameters is to work around this problem in the data generation script:

https://community.hortonworks.com/questions/51333/hive-testbench-error-when-generating-data.html

The LLAP daemon process is running on the server with the correct GC parameters.

@Constantin Stanca I added a sudo before the build command. Thanks for the correction.

avatar
New Contributor

step #15: I do not see llap. looks like "set hive.llap.execution.mode=all;" does not work as set hive.llap.execution.mode returns hive.llap.execution.mode is undefined. Please advice.

@Serge Kazarin Check the port in the hive connection string and make sure the port is 10500 (hive2 interface). The hive1 interface will not understand the LLAP configuration settings.

avatar
Super Collaborator

Hi! May I ask what's the content of testbench.settings file? I can't find the file.. 😞

beeline -i testbench.settings -u jdbc:hive2://localhost:10500/tpcds_bin_partitioned_orc_30

Thanks!

avatar
Super Collaborator

Already found the file 🙂 sorry for the trouble...

just had some error running testbench

Error: Error while compiling statement: FAILED: SemanticException [Error 10001]: Line 3:6 Table not found 'date_dim' (state=42S02,code=10001)

Error: Error while compiling statement: FAILED: SemanticException [Error 10001]: Line 13:1 Table not found 'lineitem' (state=42S02,code=10001)

@Michael Dennis "MD" Danang I think there may have been an issue generating the test data. It looks like some of the benchmark tables did not get generated correctly. Try the generation again.

avatar
Expert Contributor

[hdfs@c****-node* hive-testbench-hive14]$ ./tpcds-build.sh 
Building TPC-DS Data Generator
make: Nothing to be done for `all’.
TPC-DS Data Generator built, you can now use tpcds-setup.sh to generate data.
[hdfs@c4237-node2 hive-testbench-hive14]$ ./tpcds-setup.sh 2
TPC-DS text data generation complete.
Loading text data into external tables.
make: *** [time_dim] Error 1
make: *** Waiting for unfinished jobs....
make: *** [date_dim] Error 1
Data loaded into database tpcds_bin_partitioned_orc_2.

 

INFO  : OK
+---------------------+
| database_name |
+---------------------+
| default |
| information_schema |
| sys |
+---------------------+
3 rows selected (1.955 seconds)
0: jdbc:hive2://c4237-node2.coelab.cloudera.c>

tpcds_bin_partitioned_orc_2 database is not created, I have some issues in testing the tpcds queries

 

sudo -u hdfs -s

   13  cd /home/hdfs

   14  wget https://github.com/hortonworks/hive-testbench/archive/hive14.zip 

   15  unzip hive14.zip

  17  export JAVA_HOME=/usr/jdk64/jdk1.8.0_77

   18  export PATH=$JAVA_HOME/bin:$PATH

./tpcds-build.sh

beeline -i testbench.settings -u "jdbc:hive2://c****-node9.coe***.*****.com:10500/tpcds_bin_partitioned_orc_2"

 

I'm not able to test the tpcds queries, any help would be appreciated.