Support Questions

Could you please help me evaluate my TPC-DS benchmark numbers for HAWQ & SparkSQL?

Explorer

Hello everyone,

We ran a subset of 19 TPC-DS queries for HAWQ & SparkSQL on Hadoop.

Ref:

  1. https://github.com/databricks/spark-sql-perf
  2. https://github.com/pivotalguru/TPC-DS

I would like to ask the community to help me validate the numbers. I am attaching the results and the system setup. Please share your comments.

Thanks,

Tania

1 ACCEPTED SOLUTION

Contributor
  • Why did you cherry-pick the queries instead of running all 99?
  • The Databricks queries are not dynamic, as the TPC-DS benchmark calls for. Instead, they use constant values, which may help them in tuning their queries.
  • 500GB and 1TB are rather small. I guess the goal here is to keep the data in memory? You have enough RAM to keep the largest table in memory, which would definitely skew things towards Spark.
  • If you are only concerned with single-user query performance, there are some configuration changes you can make to get HAWQ to perform better.
  • How did the concurrency tests perform?
  • I noticed in the Databricks DDL that they don't have a single bigint datatype listed. ss_ticket_number in store_sales, for instance, will need to be a bigint when you move to the 3TB scale. I'm assuming they don't intend to test at scale?
  • Databricks is using a fork of dsdgen to generate the data. Why is that? What are they changing? They also label it in GitHub as "1.1". Did they fork from the older, deprecated version of TPC-DS? The HAWQ benchmark is using the unmodified TPC-DS toolkit version 2.1.0.
  • I've seen benchmarks from other vendors misrepresent the execution time. They capture the time reported by the database rather than the time it takes to start the "xxxxxx-shell" program, execute the query, and present the results. I've seen reported execution times of 1-2 seconds while the wall-clock time was at least 10 seconds longer. You may want to verify Spark is really executing as fast as reported (see the timing sketch below). For the HAWQ benchmark, it is the actual time it takes from start to finish, just as the TPC-DS benchmark calls for.
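One rough way to check this is to time the entire client invocation from the shell and compare it with the engine-reported query time. The spark-sql invocation and the query03.sql file name below are placeholders, not part of the original benchmark scripts; adjust them to however you are submitting the queries.

# Wall-clock time for the whole run, including shell start-up and result output
/usr/bin/time -f "wall-clock: %e seconds" spark-sql --master yarn -f query03.sql > query03.out 2>&1

If the wall-clock number is much larger than the time Spark reports for the query, the reported number is not the start-to-finish time the TPC-DS benchmark calls for.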


5 REPLIES


Explorer

@Jon Roberts

Thanks, Jon, for your valuable input. I too was skeptical about the results, so I wanted some feedback.

As you mentioned, "If you are only concerned with single user query performance, there are some configuration changes you can make to get HAWQ to perform better." Could you please help me tune HAWQ for a single user?

How do you run concurrency tests for HAWQ? I was not able to find any query set or stream_map.txt for HAWQ under the TPC-DS/07_multi_user folder. Could you please guide me on how to perform this test?

Again, thanks a ton! I will rerun the tests, taking the above input into account and running as many of the queries as possible.

Contributor

Make sure you have RANDOM_DISTRIBUTION="true" in the tpcds_variables.sh file.
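A quick way to confirm the flag is set (the repository path is an assumption; use wherever you cloned pivotalguru/TPC-DS):

# Should print: RANDOM_DISTRIBUTION="true"
grep RANDOM_DISTRIBUTION ~/TPC-DS/tpcds_variables.sh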

Make sure you have memory set correctly. With 128GB of RAM, you should set the following:

hawq_rm_memory_limit_perseg = 121

In Ambari, this parameter is set in the "Ambari Memory Usage Limit" textbox.

Set hawq_rm_stmt_vseg_memory to '16gb' in the hawq-site.xml file. This is done in the "Custom hawq-site" section in Ambari.
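Once HAWQ has been restarted, you can confirm that both memory settings landed in hawq-site.xml on the segment hosts. The path below is an assumption based on a typical install; adjust it for your environment.

# Show the two memory parameters and the line following each one
grep -A 1 -E "hawq_rm_memory_limit_perseg|hawq_rm_stmt_vseg_memory" /usr/local/hawq/etc/hawq-site.xml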

In the /etc/sysctl.conf file, set:

vm.overcommit_memory=2

vm.overcommit_ratio=100
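These can be applied without a reboot. A minimal sketch, run as root on every node:

# Reload /etc/sysctl.conf and confirm the two settings
sysctl -p
sysctl vm.overcommit_memory vm.overcommit_ratio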

Create a 4GB swap file on each node.
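A standard way to do that on Linux, assuming /swapfile as the location:

# Create and enable a 4GB swap file
dd if=/dev/zero of=/swapfile bs=1M count=4096
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
# Add "/swapfile swap swap defaults 0 0" to /etc/fstab to persist across reboots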

Make sure you have a temp directory for each drive. By default, there will only be one, but you should use all of the disks you have available. This is just a comma-delimited list of directories on each node, and you will have to create the directories manually.
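A sketch of what that might look like with four data disks per node. The mount points and the gpadmin service account are assumptions; adjust them to your layout, then set the resulting comma-delimited list as the HAWQ segment temporary directories in Ambari.

# Create a temp directory on every data disk, on every node
mkdir -p /data1/hawq/tmp /data2/hawq/tmp /data3/hawq/tmp /data4/hawq/tmp
chown -R gpadmin:gpadmin /data1/hawq/tmp /data2/hawq/tmp /data3/hawq/tmp /data4/hawq/tmp
# Comma-delimited value to use: /data1/hawq/tmp,/data2/hawq/tmp,/data3/hawq/tmp,/data4/hawq/tmp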

Change hawq_rm_nvseg_perquery_perseg_limit from the default of 6 to 8, 10, 12, 14, or 16 to improve performance. This specifies how many virtual segments (vsegs) are allowed to be created when a query executes. Each vseg consumes resources, so there will be a point of diminishing returns as your disk, memory, or CPU becomes the bottleneck.

The concurrency test runs by default. It is set to 5 users in the tpcds_variables.sh file.

Explorer

I will verify these parameters and run my tests. Thanks, Jon.