Created on 03-05-2025 10:55 AM - edited on 03-11-2025 12:31 AM by VidyaSargur
TPC-DS (Transaction Processing Performance Council - Decision Support) is a benchmark suite for evaluating the performance of data warehouse and decision support workloads. This guide provides a step-by-step approach to setting up and executing TPC-DS on Cloudera Datahub Spark Cluster, covering data generation, query execution, and performance analysis.
Note: Required files for implementation can be found in the attached tpcds-cloudera GitHub repository.
This document provides step-by-step instructions to set up and run TPC-DS Data Generation on a Cloudera Datahub Spark Cluster.
mkdir -p /home/<username>/spark-sql-perf
Target data path: s3a://<aws-cloud-storage-bucket>/data/tpcds/
scp spark-tpcds-datagen-2.12.jar <username>@<Master-Public/PrivateIP>:/home/<username>/
hdfs dfs -copyFromLocal spark-tpcds-datagen-2.12.jar /user/<username>/
hdfs dfs -ls hdfs://<Master-Hostname>:8020/user/<username>/spark-tpcds-datagen-2.12.jar
sudo cp /etc/hive/conf/hive-site.xml /home/<username>/spark-sql-perf/
chmod -R 700 /home/<username>/spark-sql-perf/
hdfs dfs -copyFromLocal /home/<username>/spark-sql-perf/hive-site.xml /user/<username>
hdfs dfs -ls hdfs://<Master-Hostname>:8020/user/<username>/hive-site.xml
scp tpcds_tools.zip <username>@<Master-Node-IP>:/home/<username>/
scp tpcds_tools.zip <username>@<Compute-Node-IP>:/home/<username>/
scp tpcds_tools.zip <username>@<Worker-Node-IP>:/home/<username>/
unzip /home/<username>/tpcds_tools.zip -d /home/<username>/
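Copying and extracting the toolkit on every node can also be scripted. The sketch below is a dry run under assumed host names (the `NODES` list and the `echo` prefix are illustrative; substitute your actual hosts and drop `echo` to execute):

```shell
# Hypothetical host list; replace with your actual master/compute/worker hosts.
NODES="master-host compute-host worker-host"
for node in $NODES; do
  # Prefixed with echo as a dry run; remove echo to execute.
  echo scp tpcds_tools.zip "<username>@$node:/home/<username>/"
  echo ssh "<username>@$node" "unzip -o /home/<username>/tpcds_tools.zip -d /home/<username>/"
done
```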
Before running, verify that:
- spark-tpcds-datagen-2.12.jar is available in HDFS under /user/<username>/
- hive-site.xml is available in HDFS under /user/<username>/
- tpcds_tools.zip has been extracted on every NodeManager node under /home/<username>/
Modify the dataset size using -s:
The following parameters are passed to the datagen jar:
| Name | Description |
| --- | --- |
| -p | HDFS path to generate data |
| -s | Scale, in GB |
| --skipDatagen | Set to true to set up tables on already existing data; helpful for deleting the tables and creating them again |
| --toolsDir | TPC-DS toolkit path (local path on NodeManager nodes, not an HDFS path) |
nohup spark3-submit \
--master yarn \
--deploy-mode cluster \
--files hdfs://<Master-hostname>:8020/user/<username>/hive-site.xml \
--num-executors 5 \
--executor-cores 4 \
--executor-memory 32G \
--driver-memory 16G \
--driver-cores 2 \
--queue default \
--class com.databricks.spark.sql.perf.TPCDSDataGen \
--conf spark.locality.wait=0s \
hdfs://<Master-hostname>:8020/user/<username>/spark-tpcds-datagen-2.12.jar \
--path "s3a://<aws-cloud-storage-bucket>/data/tpcds/" \
-s 1000 \
--skipDatagen false \
--toolsDir /home/<username>/tools > spark-job.log 2>&1 &
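Because the job runs in the background with nohup, the driver output lands in spark-job.log; the YARN application ID can be pulled from that log to track the job with `yarn application -status`. A small sketch (the sample log line below is fabricated for illustration):

```shell
# Simulate the submission line that yarn.Client writes to the log (illustrative only).
echo "INFO yarn.Client: Submitted application application_1741161300000_0042" > spark-job.log
# Extract the application ID; pass it to `yarn application -status <id>` on the cluster.
grep -oE 'application_[0-9]+_[0-9]+' spark-job.log | head -n 1
```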
Dataset:
Query Execution Jar Parameters:
| Name | Description |
| --- | --- |
| --hdfs | HDFS path; please use /tmp |
| --path | Query files path in HDFS |
| --database | Database name (copy from datagen) |
| --iterations | Number of times to execute each query |
| --warmupIterations | Starts executors and keeps them running for the following query runs |
| -q | Query name (without the .sql extension); this option can be passed multiple times to run multiple queries one by one |
| --metricsPath | Location where query execution metrics will be stored |
| --metricsFs | Always 'HDFS' |
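Since -q is passed once per query, the flag list can be generated from the query files themselves rather than typed by hand. A minimal sketch, assuming a local queries/ directory with the hypothetical file names below:

```shell
# Create a sample queries directory (hypothetical; normally these are your TPC-DS .sql files).
mkdir -p queries && touch queries/query1.sql queries/query2.sql
QARGS=""
for f in queries/*.sql; do
  q=$(basename "$f" .sql)   # strip the path and the .sql extension
  QARGS="$QARGS -q $q"
done
echo "$QARGS"
```

The resulting string (e.g. ` -q query1 -q query2`) can be appended to the spark3-submit invocation.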
scp spark-query-executor-3.2.0-1.0.jar <username>@<Master-Public/PrivateIP>:/home/<username>/
hdfs dfs -copyFromLocal spark-query-executor-3.2.0-1.0.jar /user/<username>/
hdfs dfs -ls hdfs://<Master-Hostname>:8020/user/<username>/spark-query-executor-3.2.0-1.0.jar
nohup spark3-submit \
--master yarn \
--deploy-mode cluster \
--num-executors 5 \
--executor-cores 4 \
--executor-memory 32G \
--driver-memory 16G \
--driver-cores 2 \
--conf spark.driver.memoryOverhead=2G \
--conf spark.executor.memoryOverhead=2G \
--queue default \
--class org.hw.qe.hcube.BatchSparkSQLExecutor \
--files hdfs://<spark-master-host>:8020/user/<username>/hive-site.xml \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.dynamicAllocation.shuffleTracking.enabled=false \
--conf spark.locality.wait=0s \
--conf spark.yarn.access.hadoopFileSystems=hdfs://<spark-master-host>:8020/ \
--conf spark.sql.optimizer.dynamicPartitionPruning.enabled=true \
--conf spark.sql.shuffle.partitions=300 \
--conf spark.sql.autoBroadcastJoinThreshold=128M \
--conf spark.sql.files.maxPartitionBytes=256MB \
hdfs://<spark-master-host>:8020/user/<username>/spark-query-executor-3.2.0-1.0.jar \
--hdfs hdfs://<spark-master-host>:8020/tmp \
--path hdfs://<spark-master-host>:8020/user/<username>/queries \
--metricsPath hdfs://<spark-master-host>:8020/tmp/metrics-test1-results/ \
--metricsFs HDFS \
--database dex_tpcds_sf10_withdecimal_withdate_withnulls \
--sequential \
--iterations 5 \
--warmupIterations 0 \
-q query1 -q query2 -q query3 -q query4 -q query5 \
-q query6 -q query7 -q query8 -q query9 -q query10 > metrics-test-results.log 2>&1 &
Query execution metrics, such as execution time, are stored in HDFS at:
```hdfs://<spark-master-host>:8020/tmp/metrics-test1-results/```
This location is specified using the --metricsPath parameter.
The query execution results are generated in JSON format. To extract the exact query runtime, use the collect_query_exec_time.py script.
python collect_query_exec_time.py metrics-test1-results/ metrics-test1-results1.csv
Output:
Query_Name, QET1, QET2, QET3, QET4, QET5
query8,5.011,4.219,3.857,3.601,3.586
query1,39.316,19.436,18.397,19.069,17.789
query9,53.364,52.683,53.938,57.91
query10,10.834,8.696,8.755,9.032,8.522
query2,108.595,86.667,82.888,85.525,86.256
query3,10.781,4.195,4.12,4.153,4.408
query4,71.095,67.605,66.681,66.521,68.228
query5,78.931,63.523,64.148,63.177,62.556
query6,3.62,2.808,3.025,2.966,2.761
query7,14.525,10.978,10.667,11.209,10.836
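From a CSV in the format above, the average runtime per query can be computed with a short awk one-liner. The sketch below recreates a sample of the output (one row copied from the table) before averaging; since the column count is taken from each row, rows with fewer iterations still average correctly:

```shell
# Recreate a sample of the CSV output (header plus one row from the table above).
cat > metrics-sample.csv <<'EOF'
Query_Name, QET1, QET2, QET3, QET4, QET5
query6,3.62,2.808,3.025,2.966,2.761
EOF
# Average columns 2..NF for each data row, printing "query,avg".
awk -F, 'NR>1 { s=0; for (i=2; i<=NF; i++) s+=$i; printf "%s,%.3f\n", $1, s/(NF-1) }' metrics-sample.csv
# → query6,3.036
```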
Created on 03-07-2025 06:52 AM
Great article, thank you!
Small typo in the "Datagen jar parameters" section -- Scale option is `-s`, not `-p`.
Created on 03-07-2025 09:17 AM
@Alex Dedov Thank you for your feedback! I appreciate you catching that typo. You're absolutely right: the Scale option should be -s, not -p. I'll get that corrected. Thanks again for your keen eye!