03-07-2025
09:17 AM
@Alex Dedov Thank you for your feedback! I appreciate you catching that typo. You're absolutely right: the Scale option should be -s, not -p. I'll get that corrected. Thanks again for your keen eye!
03-05-2025
10:55 AM
Overview
TPC-DS (Transaction Processing Performance Council - Decision Support) is a benchmark suite for evaluating the performance of data warehouse and decision support workloads. This guide provides a step-by-step approach to setting up and executing TPC-DS on Cloudera Datahub Spark Cluster, covering data generation, query execution, and performance analysis.
Note: The files required for this implementation can be found in the GitHub repository: 👉 tpcds-cloudera GitHub Repository
Datagen
This section provides step-by-step instructions to set up and run TPC-DS data generation on a Cloudera Datahub Spark cluster.
Step 1: Create the spark-sql-perf Directory
Create the directory on the master node:
```
mkdir -p /home/<username>/spark-sql-perf
```
The directory structure and name are not mandatory, but if you change them, update the commands below accordingly.
Step 2: Set Up the Storage Bucket Directory
Create the TPC-DS directory inside the data bucket:
```
s3a://<aws-cloud-storage-bucket>/data/tpcds/
```
Step 3: Copy the Datagen JAR to the Master Node
Copy spark-tpcds-datagen-2.12.jar from the local system to the Datahub master node:
```
scp spark-tpcds-datagen-2.12.jar <username>@<Master-Public/PrivateIP>:/home/<username>/
```
Copy the JAR file to the HDFS user directory:
```
hdfs dfs -copyFromLocal spark-tpcds-datagen-2.12.jar /user/<username>/
```
Verify the copied file:
```
hdfs dfs -ls hdfs://<Master-Hostname>:8020/user/<username>/spark-tpcds-datagen-2.12.jar
```
Step 4: Copy hive-site.xml to the spark-sql-perf Directory
Copy the hive-site.xml configuration file:
```
sudo cp /etc/hive/conf/hive-site.xml /home/<username>/spark-sql-perf/
```
Change permissions to restrict access:
```
chmod -R 700 /home/<username>/spark-sql-perf/
```
Step 5: Copy hive-site.xml to the HDFS User Directory
```
hdfs dfs -copyFromLocal /home/<username>/spark-sql-perf/hive-site.xml /user/<username>
```
Verify the copied file:
```
hdfs dfs -ls hdfs://<Master-Hostname>:8020/user/<username>/hive-site.xml
```
Step 6: Copy and Extract the TPC-DS Tools on All Nodes
Distribute tpcds_tools.zip to all nodes (Master, Compute, and Worker):
```
scp tpcds_tools.zip <username>@<Master-Node-IP>:/home/<username>/
scp tpcds_tools.zip <username>@<Compute-Node-IP>:/home/<username>/
scp tpcds_tools.zip <username>@<Worker-Node-IP>:/home/<username>/
```
Unzip the tools on each node (use the script available in the GitHub repository, or copy tpcds_tools.zip and unzip it manually):
```
unzip /home/<username>/tpcds_tools.zip -d /home/<username>/
```
Ensure the same directory structure is maintained across all nodes; a quick check is sketched below.
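To confirm that every node ended up with the same layout, a quick loop over the node IPs can help. This is only a sketch: it reuses the host and user placeholders from the scp commands above and assumes SSH access from the master node.
```
# Sketch: check that the extracted tools directory is present on each node.
for host in <Master-Node-IP> <Compute-Node-IP> <Worker-Node-IP>; do
  echo "== $host =="
  ssh <username>@$host 'ls -ld /home/<username>/tools'
done
```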
Step 7: Run the Data Generation Command
Before running, verify the following (a quick pre-flight check is sketched after the lists below):
- hive-site.xml exists in HDFS: /user/<username>/hive-site.xml
- spark-tpcds-datagen-2.12.jar exists in HDFS: /user/<username>/spark-tpcds-datagen-2.12.jar
- The tools directory is correctly placed on the nodes: /home/<username>/tools

Modify the dataset size using -s:
- -s 1000 (1 TB)
- -s 100 (100 GB)
- -s 1 (1 GB)
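For these prerequisites, a minimal check along these lines can catch a missing file before the Spark job is submitted. This is a sketch: the paths are the placeholders used throughout this guide, and it is meant to be run on the master node.
```
# Pre-flight check (sketch): each prerequisite prints OK or MISSING.
hdfs dfs -test -e /user/<username>/hive-site.xml && echo "hive-site.xml: OK" || echo "hive-site.xml: MISSING"
hdfs dfs -test -e /user/<username>/spark-tpcds-datagen-2.12.jar && echo "datagen JAR: OK" || echo "datagen JAR: MISSING"
[ -d /home/<username>/tools ] && echo "tools directory: OK" || echo "tools directory: MISSING"
```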
Datagen JAR Parameters
The following parameters are passed to the datagen JAR:

| Name | Description |
|---|---|
| --path | Path (HDFS or cloud storage) where the data is generated |
| -s | Scale, in GB (e.g., -s 1000 generates 1 TB) |
| --skipDatagen | Set to true to set up the tables on already existing data; helpful when you only want to drop and recreate the tables |
| --toolsDir | TPC-DS toolkit path (local path on the NodeManager nodes, not an HDFS path) |
Run Spark3-submit Command
Notes:
- Validate the file paths before running:
  - hive-site.xml in HDFS
  - spark-tpcds-datagen-2.12.jar in HDFS
  - tools directory on all nodes
- Adjust the dataset size as needed (-s 1000 for 1 TB, -s 1 for 1 GB)
```
nohup spark3-submit \
  --master yarn \
  --deploy-mode cluster \
  --files hdfs://<Master-hostname>:8020/user/<username>/hive-site.xml \
  --num-executors 5 \
  --executor-cores 4 \
  --executor-memory 32G \
  --driver-memory 16G \
  --driver-cores 2 \
  --queue default \
  --class com.databricks.spark.sql.perf.TPCDSDataGen \
  --conf spark.locality.wait=0s \
  hdfs://<Master-hostname>:8020/user/<username>/spark-tpcds-datagen-2.12.jar \
  --path "s3a://<aws-cloud-storage-bucket>/data/tpcds/" \
  -s 1000 \
  --skipDatagen false \
  --toolsDir /home/<username>/tools > spark-job.log 2>&1 &
```
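The job runs in the background via nohup, so progress is easiest to follow from the driver log and from YARN; once it completes, the generated data should be visible under the bucket path passed to --path. A sketch using standard Hadoop/YARN commands:
```
# Follow the driver-side output captured by nohup.
tail -f spark-job.log

# List running YARN applications to find the datagen job and its application ID.
yarn application -list -appStates RUNNING

# After completion, confirm the TPC-DS data landed in the bucket.
hadoop fs -ls s3a://<aws-cloud-storage-bucket>/data/tpcds/
```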
Dataset: [screenshot of the generated TPC-DS dataset]
Query Execution
Query Execution JAR Parameters:

| Name | Description |
|---|---|
| --hdfs | HDFS path; use /tmp |
| --path | Path to the query files in HDFS |
| --database | Database name (copy it from the datagen run) |
| --iterations | Number of times to execute each query |
| --warmupIterations | Number of warm-up runs, used to start the executors and keep them running for the subsequent query runs |
| -q | Query name (without the .sql extension); can be passed multiple times to run several queries one by one |
| --metricsPath | Location where the query execution metrics are stored |
| --metricsFs | Always 'HDFS' |
Step 1: Copy the Query Execution JAR to Master Node
Copy spark-query-executor-3.2.0-1.0.jar from the local system to the Datahub master node:
```
scp spark-query-executor-3.2.0-1.0.jar <username>@<Master-Public/PrivateIP>:/home/<username>/
```
Copy the JAR file to the HDFS user directory:
```
hdfs dfs -copyFromLocal spark-query-executor-3.2.0-1.0.jar /user/<username>/
```
Verify the copied file:
```
hdfs dfs -ls hdfs://<Master-Hostname>:8020/user/<username>/spark-query-executor-3.2.0-1.0.jar
```
Step 2: Run the TPC-DS Queries
```
nohup spark3-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 5 \
  --executor-cores 4 \
  --executor-memory 32G \
  --driver-memory 16G \
  --driver-cores 2 \
  --conf spark.driver.memoryOverhead=2G \
  --conf spark.executor.memoryOverhead=2G \
  --queue default \
  --class org.hw.qe.hcube.BatchSparkSQLExecutor \
  --files hdfs://<spark-master-host>:8020/user/<username>/hive-site.xml \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.dynamicAllocation.shuffleTracking.enabled=false \
  --conf spark.locality.wait=0s \
  --conf spark.yarn.access.hadoopFileSystems=hdfs://<spark-master-host>:8020/ \
  --conf spark.sql.optimizer.dynamicPartitionPruning.enabled=true \
  --conf spark.sql.shuffle.partitions=300 \
  --conf spark.sql.autoBroadcastJoinThreshold=128M \
  --conf spark.sql.files.maxPartitionBytes=256MB \
  hdfs://<spark-master-host>:8020/user/<username>/spark-query-executor-3.2.0-1.0.jar \
  --hdfs hdfs://<spark-master-host>:8020/tmp \
  --path hdfs://<spark-master-host>:8020/user/<username>/queries \
  --metricsPath hdfs://<spark-master-host>:8020/tmp/metrics-test1-results/ \
  --metricsFs HDFS \
  --database dex_tpcds_sf10_withdecimal_withdate_withnulls \
  --sequential \
  --iterations 5 \
  --warmupIterations 0 \
  -q query1 -q query2 -q query3 -q query4 -q query5 \
  -q query6 -q query7 -q query8 -q query9 -q query10 > metrics-test-results.log 2>&1 &
```
Query Execution Results
Query execution metrics, such as execution time, are stored in HDFS at:
```hdfs://<spark-master-host>:8020/tmp/metrics-test1-results/```
This location is specified using the --metricsPath parameter.
The query execution results are generated in JSON format. To extract the exact query runtime, use the collect_query_exec_time.py script.
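If the script is pointed at a local copy of the results (as the relative path in the command below suggests), the JSON files can be pulled out of HDFS first. A sketch using the --metricsPath value from the previous step:
```
# Copy the query-execution metrics from HDFS to the local working directory.
hdfs dfs -get hdfs://<spark-master-host>:8020/tmp/metrics-test1-results/ .

# The local directory should now contain the per-query JSON metrics files.
ls metrics-test1-results/
```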
Command to run collect_query_exec_time.py:
```
python collect_query_exec_time.py metrics-test1-results/ metrics-test1-results1.csv
```
Output:
```
Query_Name,QET1,QET2,QET3,QET4,QET5
query8,5.011,4.219,3.857,3.601,3.586
query1,39.316,19.436,18.397,19.069,17.789
query9,53.364,52.683,53.938,57.91
query10,10.834,8.696,8.755,9.032,8.522
query2,108.595,86.667,82.888,85.525,86.256
query3,10.781,4.195,4.12,4.153,4.408
query4,71.095,67.605,66.681,66.521,68.228
query5,78.931,63.523,64.148,63.177,62.556
query6,3.62,2.808,3.025,2.966,2.761
query7,14.525,10.978,10.667,11.209,10.836
```
Each QETn column is the execution time recorded for the nth iteration of that query.
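To summarize the CSV, a small awk script can average the recorded times per query while tolerating rows with missing values. This is a sketch; it assumes the CSV produced above, with the query name in the first column.
```
# Average the QET columns per query, skipping the header and any missing fields.
awk -F',' 'NR > 1 {
  sum = 0; n = 0
  for (i = 2; i <= NF; i++) if ($i != "") { sum += $i; n++ }
  if (n > 0) printf "%s,%.3f\n", $1, sum / n
}' metrics-test1-results1.csv
```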