Created on 09-01-2016 01:26 AM - edited 08-17-2019 10:25 AM
The Hive testbench consists of a data generator and a standard set of queries typically used for benchmarking Hive performance. This article describes how to generate data and run a query using Beeline and Hive 2.0, with and without LLAP. It also shows how to use explain to see the difference in query plans.
If you don't have a cluster already configured for LLAP, you can provision one in AWS using Hortonworks Cloud. See this article for instructions on how to provision a 2.5 tech preview cluster with LLAP enabled.
1. Log in to the master node in your cluster where Hive is installed. If you used Hortonworks Cloud to create your instance, locate the node with a name ending in master.
The ssh command is shown next to the master instance. If you are logging in from a Linux host, click the icon to the right of the ssh command to select the command text, then copy the command.
In the Linux shell, change to the directory containing your AWS key .pem file and then run the copied command.
If you are logging in from Windows, consult the AWS user guide for instructions on how to log in using PuTTY with the user name cloudbreak, authenticating with the key file.
2. Sudo to the hdfs user to begin generating data. Change to the home directory for the hdfs user:
sudo -u hdfs -s
3. Download the testbench utilities from GitHub and unzip them:
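The download commands for this step are not preserved in the text. A minimal sketch follows, assuming the utilities live in the hortonworks/hive-testbench repository on GitHub and that the hive14 branch archive is the one you want. Both are assumptions, so adjust the URL to your environment. The helper name fetch_testbench is hypothetical.

```shell
# Assumed archive location; not confirmed by this article. Override as needed.
TESTBENCH_URL="${TESTBENCH_URL:-https://github.com/hortonworks/hive-testbench/archive/hive14.zip}"

# Hypothetical helper: download the archive into the current directory and unpack it.
fetch_testbench() {
  archive="${TESTBENCH_URL##*/}"          # file name part of the URL
  wget -O "$archive" "$TESTBENCH_URL" &&  # stop here if the download fails
    unzip -o "$archive"                   # -o overwrites files from an earlier attempt
}
```

Run fetch_testbench from the hdfs user's home directory so the unpacked files end up where the following steps expect them.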
4. Open the load_partitioned.sql file in an editor:
5. Correct the hive.tez.java.opts setting:
Comment out the line below by adding -- at the beginning of the line:
-- set hive.tez.java.opts=-XX:+PrintGCDetails -verbose:gc -XX:+PrintGCTimeStamps -XX:+UseNUMA -XX:+UseG1GC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/;
Add the line below:
set hive.tez.java.opts=-XX:+PrintGCDetails -verbose:gc -XX:+PrintGCTimeStamps -XX:+UseNUMA -XX:+UseParallelGC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/;
Save the file and exit.
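If you would rather script the edit than make it by hand, the change in step 5 can be sketched as below. The helper name fix_tez_java_opts is hypothetical. It assumes the setting sits on a single line that begins with set hive.tez.java.opts= and contains -XX:+UseG1GC, as in the line quoted above.

```shell
# Hypothetical helper: comment out the G1GC opts line and add a ParallelGC variant after it.
fix_tez_java_opts() {
  file="$1"
  tmp="${file}.tmp"
  awk '
    /^set hive\.tez\.java\.opts=.*-XX:\+UseG1GC/ {
      print "-- " $0                               # keep the original line, commented out
      sub(/-XX:\+UseG1GC/, "-XX:+UseParallelGC")   # swap the collector flag
      print                                        # emit the corrected setting
      next
    }
    { print }                                      # all other lines pass through unchanged
  ' "$file" > "$tmp" && mv "$tmp" "$file"
}

# Usage (pass the path to the settings file you opened in step 4):
#   fix_tez_java_opts load_partitioned.sql
```

The function rewrites the file through a temporary copy, so a failed run leaves the original untouched.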
6. Generate 30 GB of test data:
If GCC is not installed, install it first:
yum install gcc
If javac is not found, make sure a JDK is installed and that its bin directory is on your PATH.
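The prerequisite checks and the generation step can be sketched together as follows. The script names tpcds-build.sh and tpcds-setup.sh are assumptions based on the hive-testbench project layout, and the helper names missing_tools and generate_data are hypothetical. The scale factor 30 matches the database name shown in the next step.

```shell
# Print the name of each listed tool that is not on the PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Hypothetical wrapper around the build and load scripts (script names are assumed).
# Run it from the unpacked testbench directory.
generate_data() {
  scale="${1:-30}"                      # scale factor in GB
  missing=$(missing_tools gcc javac)
  if [ -n "$missing" ]; then
    echo "install first: $missing" >&2  # report what is missing and stop
    return 1
  fi
  ./tpcds-build.sh && ./tpcds-setup.sh "$scale"
}
```

Calling generate_data 30 stops early with a message if gcc or javac is missing, rather than failing partway through the build.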
7. A MapReduce job runs to create the data and load it into Hive. This will take some time to complete. The last line of the script's output is:
Data loaded into database tpcds_bin_partitioned_orc_30.
8. Choose a query to run for benchmarking, for example query55.sql. Copy the query of your choice and make an explain version of it. The explain version will be helpful later on to see how Hive plans the query.
cp query55.sql explainquery55.sql
Add the keyword explain before the query. For example, the first line of the explain version of query 55:
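As an alternative to copying and editing by hand, the explain copy can be generated with a small helper. This is a sketch only. The name make_explain is hypothetical, and it assumes the query file contains a single statement, possibly preceded by blank lines or -- comment lines.

```shell
# Hypothetical helper: write a copy of a query file with "explain" prefixed
# to the first line that is neither blank nor a "--" comment.
make_explain() {
  src="$1"; dst="$2"
  awk '
    !seen && $0 !~ /^[ \t]*$/ && $0 !~ /^[ \t]*--/ {
      print "explain " $0   # first real line of the statement
      seen = 1
      next
    }
    { print }               # everything else is copied as-is
  ' "$src" > "$dst"
}

# Usage:
#   make_explain query55.sql explainquery55.sql
```

The source query file is left unchanged, so the same file can still be used for the timed runs.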
10. To try a query without LLAP, set hive.llap.execution.mode=none and run a query. For example, the command line below will run benchmark query 55:
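One way to apply the setting is to put it at the top of a copy of the query file and run that copy with Beeline. The sketch below takes that approach. The helper name with_llap_mode is hypothetical, and the JDBC URL, port 10500, and user hive in the usage comment are assumptions about your cluster, not values confirmed by this article.

```shell
# Hypothetical helper: write a copy of a query file that first sets the LLAP execution mode.
with_llap_mode() {
  mode="$1"; src="$2"; dst="$3"
  {
    echo "set hive.llap.execution.mode=${mode};"  # none = no LLAP, all = run in LLAP
    cat "$src"
  } > "$dst"
}

# Usage (connection details are assumptions; substitute your own):
#   with_llap_mode none query55.sql query55_nollap.sql
#   beeline -u "jdbc:hive2://localhost:10500/tpcds_bin_partitioned_orc_30" -n hive -f query55_nollap.sql
```

Passing all instead of none produces the file for the LLAP run, so both variants come from the same unmodified query.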
Note that the completion time reported at the end of the query is 18.984 seconds without LLAP:
11. Now try the query with LLAP. Set hive.llap.execution.mode=all and run the query again:
12. Notice that the query with LLAP completes much more quickly. If you don’t see a significant speedup at first, try the same query again. As the LLAP cache fills with data, queries respond more quickly. Below are the results of the next two runs of the same query with LLAP set to all. The second query returned in 8.455 seconds and a subsequent query in 2.745 seconds. If your cluster has been up and you have been running LLAP queries on this data, your performance may be in the 2 second range on the first try:
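To observe the cache warm-up on your own cluster, you can repeat the same command several times and compare the elapsed time of each run. A minimal sketch follows. The helper name time_runs is hypothetical, and it measures wall-clock seconds for the whole command, including client startup, so its numbers will be somewhat higher than the query time Beeline reports.

```shell
# Hypothetical helper: run a command N times and print the elapsed wall-clock seconds of each run.
time_runs() {
  n="$1"; shift
  i=1
  while [ "$i" -le "$n" ]; do
    start=$(date +%s)
    "$@" > /dev/null 2>&1            # discard the command output; only timing is reported
    end=$(date +%s)
    echo "run $i: $((end - start))s"
    i=$((i + 1))
  done
}

# Usage (file name and URL are placeholders for your own values):
#   time_runs 3 beeline -u "$JDBC_URL" -n hive -f your_llap_query.sql
```

If the cache is filling as described, the later runs should report shorter times than the first.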
13. To see the difference between the query plans, use the explain query to show the plan for a query with no LLAP. Take note of the vectorized keyword, outlined in red in the screenshot below:
14. Try the explain again, with LLAP enabled:
15. Notice that in the explain plan for the LLAP query, LLAP is shown after the vectorized keyword.
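If you save each explain output to a file, the relevant lines can be pulled out for a side-by-side comparison instead of scanning the whole plan. This is a sketch only. The helper name vectorized_lines and the plan file names are hypothetical, and it assumes the keyword appears in the plan text as vectorized, as described above.

```shell
# Hypothetical helper: list the distinct plan lines that mention "vectorized",
# with leading indentation removed.
vectorized_lines() {
  grep -i 'vectorized' "$1" | sed 's/^[[:space:]]*//' | sort -u
}

# Usage (file names are placeholders for saved explain output):
#   vectorized_lines plan_nollap.txt
#   vectorized_lines plan_llap.txt
```

Comparing the two listings should show the extra LLAP marker only in the plan produced with hive.llap.execution.mode=all.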