
Overview

TPC-DS (Transaction Processing Performance Council - Decision Support) is a benchmark suite for evaluating the performance of data warehouse and decision support workloads. This guide provides a step-by-step approach to setting up and executing TPC-DS on a Cloudera Datahub Spark cluster, covering data generation, query execution, and performance analysis.

Note: Required files for implementation can be found in the attached GitHub repository:
👉tpcds-cloudera GitHub Repository

Datagen

This section provides step-by-step instructions to set up and run TPC-DS data generation on a Cloudera Datahub Spark cluster.

Step 1: Create spark-sql-perf Directory

  • Create the directory on the master node:
    mkdir -p /home/<username>/spark-sql-perf
  • The directory structure and name are not mandatory, but if you change them, update the commands in the following steps accordingly.

Step 2: Setup Storage Bucket Directory

  • Create the TPC-DS directory inside the data bucket at the following location (one way to create it is sketched below):
    s3a://<aws-cloud-storage-bucket>/data/tpcds/
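
The article does not show the exact command for this step; as a minimal sketch, assuming s3a access to the bucket is already configured for the cluster, the location can be created from the master node with the Hadoop filesystem client:

    # Create the target directory in the cloud storage bucket (assumption: s3a credentials are already in place)
    hadoop fs -mkdir -p s3a://<aws-cloud-storage-bucket>/data/tpcds/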

Step 3: Copy the Datagen JAR to Master Node

  • Copy the spark-tpcds-datagen-2.12.jar from the local system to the Datahub master node:
    scp spark-tpcds-datagen-2.12.jar <username>@<Master-Public/PrivateIP>:/home/<username>/
  • Copy the JAR file to the HDFS user directory:
    hdfs dfs -copyFromLocal spark-tpcds-datagen-2.12.jar /user/<username>/
  • Verify the copied file:
    hdfs dfs -ls hdfs://<Master-Hostname>:8020/user/<username>/spark-tpcds-datagen-2.12.jar

Step 4: Copy hive-site.xml to spark-sql-perf Directory

  • Copy the hive-site.xml configuration file:
    sudo cp /etc/hive/conf/hive-site.xml /home/<username>/spark-sql-perf/
  • Change permissions to restrict access:
    chmod -R 700 /home/<username>/spark-sql-perf/

Step 5: Copy hive-site.xml to HDFS user directory

  • Copy the hive-site.xml file from the spark-sql-perf directory to the HDFS user directory:
    hdfs dfs -copyFromLocal /home/<username>/spark-sql-perf/hive-site.xml /user/<username>
  • Verify the copied file:
    hdfs dfs -ls hdfs://<Master-Hostname>:8020/user/<username>/hive-site.xml

Step 6: Copy and Extract TPC-DS Tools to All Nodes

  • Distribute tpcds_tools.zip to all nodes (Master, Compute, and Worker); a loop sketch for several nodes follows this step:
    scp tpcds_tools.zip <username>@<Master-Node-IP>:/home/<username>/
    scp tpcds_tools.zip <username>@<Compute-Node-IP>:/home/<username>/
    scp tpcds_tools.zip <username>@<Worker-Node-IP>:/home/<username>/
  • Unzip the tools on each node (use the script available in the GitHub repository, or copy tpcds_tools.zip and unzip it manually):
    unzip /home/<username>/tpcds_tools.zip -d /home/<username>/
  • Ensure the same directory structure is maintained across all nodes.
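
To avoid repeating the copy and unzip commands on every node, a small loop like the following sketch can be run from the master node (the node list and passwordless SSH access are assumptions):

    # Hypothetical list of node IPs; replace with your actual Compute and Worker hosts
    NODES="<Compute-Node-IP> <Worker-Node-IP>"
    for node in $NODES; do
        scp /home/<username>/tpcds_tools.zip <username>@"$node":/home/<username>/
        ssh <username>@"$node" "unzip -o /home/<username>/tpcds_tools.zip -d /home/<username>/"
    done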

Step 7: Run the Data Generation Command

Before running, verify that:

  • hive-site.xml exists in HDFS: /user/<username>/hive-site.xml
  • spark-tpcds-datagen-2.12.jar exists in HDFS: /user/<username>/spark-tpcds-datagen-2.12.jar
  • tools directory is correctly placed: /home/<username>/tools
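
A quick way to confirm these prerequisites from the master node (a convenience sketch, not part of the provided tooling):

    # Each check prints OK only if the corresponding file or directory exists
    hdfs dfs -test -e /user/<username>/hive-site.xml && echo "hive-site.xml: OK"
    hdfs dfs -test -e /user/<username>/spark-tpcds-datagen-2.12.jar && echo "datagen JAR: OK"
    [ -d /home/<username>/tools ] && echo "local tools directory: OK"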

Modify the dataset size using -s:

  • -s 1000 (1TB)
  • -s 100 (100GB)
  • -s 1 (1GB)

Datagen JAR Parameters

The following parameters are passed to the datagen JAR:

| Name | Description |
|------|-------------|
| -p | HDFS path to generate data |
| -s | Scale, in GB |
| --skipDatagen | Set to true to set up the tables on already-generated data. Useful when you only want to drop and recreate the tables. |
| --toolsDir | TPC-DS toolkit path (local path on the NodeManager nodes, not an HDFS path) |

Run the spark3-submit Command

Notes:

  1. Validate file paths before running:
    • hive-site.xml in HDFS
    • spark-tpcds-datagen-2.12.jar in HDFS
    • tools directory on all nodes
  2. Adjust dataset size (-s 1000 for 1TB, -s 1 for 1GB)
    nohup spark3-submit \
        --master yarn \
        --deploy-mode cluster \
        --files hdfs://<Master-hostname>:8020/user/<username>/hive-site.xml \
        --num-executors 5 \
        --executor-cores 4 \
        --executor-memory 32G \
        --driver-memory 16G \
        --driver-cores 2 \
        --queue default \
        --class com.databricks.spark.sql.perf.TPCDSDataGen \
        --conf spark.locality.wait=0s \
        hdfs://<Master-hostname>:8020/user/<username>/spark-tpcds-datagen-2.12.jar \
        --path "s3a://<aws-cloud-storage-bucket>/data/tpcds/" \
        -s 1000 \
        --skipDatagen false \
        --toolsDir /home/<username>/tools > spark-job.log 2>&1 &
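
Because the job is submitted in the background with nohup, progress can be followed through the submission log and YARN (the application ID below is a placeholder):

    # Follow the driver/submission output captured by nohup
    tail -f spark-job.log

    # Find the YARN application ID of the running job
    yarn application -list -appStates RUNNING

    # Inspect aggregated logs after the job finishes
    yarn logs -applicationId <application-id>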

 

 


Query Execution

Query Execution JAR Parameters:

| Name | Description |
|------|-------------|
| --hdfs | HDFS scratch path; use /tmp |
| --path | Path to the query files in HDFS |
| --database | Database name (copy from the datagen step) |
| --iterations | Number of times to execute each query |
| --warmupIterations | Warm-up runs used to start the executors and keep them running for the subsequent query runs |
| -q | Query name (without the .sql extension). This option can be passed multiple times to run multiple queries one by one. |
| --metricsPath | Location where the query execution metrics are stored |
| --metricsFs | Always set to HDFS |

Step 1: Copy the Query Execution JAR to Master Node

  • Copy the spark-query-executor-3.2.0-1.0.jar from the local system to the Datahub master node:
    scp spark-query-executor-3.2.0-1.0.jar <username>@<Master-Public/PrivateIP>:/home/<username>/
  • Copy the JAR file to the HDFS user directory:
    hdfs dfs -copyFromLocal spark-query-executor-3.2.0-1.0.jar /user/<username>/
  • Verify the copied file:
    hdfs dfs -ls hdfs://<Master-Hostname>:8020/user/<username>/spark-query-executor-3.2.0-1.0.jar
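
Step 2 reads the TPC-DS query files from HDFS (the --path parameter). A minimal sketch for staging them, assuming the .sql files are available locally in a queries/ directory (for example, from the attached repository):

    # Create the queries directory in HDFS and upload the TPC-DS .sql files
    hdfs dfs -mkdir -p /user/<username>/queries
    hdfs dfs -put -f queries/*.sql /user/<username>/queries/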

 

 

Step 2: Run the TPC-DS Queries

nohup spark3-submit \
    --master yarn \
    --deploy-mode cluster \
    --num-executors 5 \
    --executor-cores 4 \
    --executor-memory 32G \
    --driver-memory 16G \
    --driver-cores 2 \
    --conf spark.driver.memoryOverhead=2G \
    --conf spark.executor.memoryOverhead=2G \
    --queue default \
    --class org.hw.qe.hcube.BatchSparkSQLExecutor \
    --files hdfs://<spark-master-host>:8020/user/<username>/hive-site.xml \
    --conf spark.dynamicAllocation.enabled=false \
    --conf spark.dynamicAllocation.shuffleTracking.enabled=false \
    --conf spark.locality.wait=0s \
    --conf spark.yarn.access.hadoopFileSystems=hdfs://<spark-master-host>:8020/ \
    --conf spark.sql.optimizer.dynamicPartitionPruning.enabled=true \
    --conf spark.sql.shuffle.partitions=300 \
    --conf spark.sql.autoBroadcastJoinThreshold=128M \
    --conf spark.sql.files.maxPartitionBytes=256MB \
    hdfs://<spark-master-host>:8020/user/<username>/spark-query-executor-3.2.0-1.0.jar \
    --hdfs hdfs://<spark-master-host>:8020/tmp \
    --path hdfs://<spark-master-host>:8020/user/<username>/queries \
    --metricsPath hdfs://<spark-master-host>:8020/tmp/metrics-test1-results/ \
    --metricsFs HDFS \
    --database dex_tpcds_sf10_withdecimal_withdate_withnulls \
    --sequential \
    --iterations 5 \
    --warmupIterations 0 \
    -q query1 -q query2 -q query3 -q query4 -q query5 \
    -q query6 -q query7 -q query8 -q query9 -q query10 > metrics-test-results.log 2>&1 &
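
Once the run finishes, the output under --metricsPath can be checked directly in HDFS:

    # Confirm that per-query metrics files were produced
    hdfs dfs -ls hdfs://<spark-master-host>:8020/tmp/metrics-test1-results/

    # The spark3-submit output itself is captured in the nohup log
    tail metrics-test-results.log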

 

 

Query Execution Results

Query execution metrics, such as execution time, are stored in HDFS at:

```hdfs://<spark-master-host>:8020/tmp/metrics-test1-results/```

This location is specified using the --metricsPath parameter.

The query execution results are generated in JSON format. To extract the exact query runtime, use the collect_query_exec_time.py script.
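
The command below reads from a local metrics-test1-results/ directory, so the metrics likely need to be pulled out of HDFS first (an assumption based on the local-looking path; adjust if the script reads HDFS directly):

    # Copy the metrics directory from HDFS to the current working directory
    hdfs dfs -get hdfs://<spark-master-host>:8020/tmp/metrics-test1-results/ .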

Command to Run collect_query_exec_time.py:

 

python collect_query_exec_time.py metrics-test1-results/ metrics-test1-results1.csv

Output:
Query_Name, QET1, QET2, QET3, QET4, QET5
query8,5.011,4.219,3.857,3.601,3.586
query1,39.316,19.436,18.397,19.069,17.789
query9,53.364,52.683,53.938,57.91
query10,10.834,8.696,8.755,9.032,8.522
query2,108.595,86.667,82.888,85.525,86.256
query3,10.781,4.195,4.12,4.153,4.408
query4,71.095,67.605,66.681,66.521,68.228
query5,78.931,63.523,64.148,63.177,62.556
query6,3.62,2.808,3.025,2.966,2.761
query7,14.525,10.978,10.667,11.209,10.836
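
As a convenience (not part of the provided tooling), the per-iteration runtimes in the CSV can be averaged per query with a short awk one-liner:

    # Average the runtime columns (QET1..QETn) for each query, skipping the header row
    awk -F, 'NR>1 { s=0; for (i=2; i<=NF; i++) s+=$i; printf "%s,%.3f\n", $1, s/(NF-1) }' metrics-test1-results1.csv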

 

 

Comments
Cloudera Employee

Great article, thank you!

Small typo in the "Datagen jar parameters" section --  Scale option is `-s`, not `-p`.

  

Cloudera Employee

@Alex Dedov Thank you for your feedback! I appreciate you catching that typo. You're absolutely right; the Scale option should be -s, not -p. I'll get that corrected. Thanks again for your keen eye!