We ran a subset of 19 TPC-DS queries for HAWQ and Spark SQL on Hadoop.
I would like to ask the community to help me validate the numbers; I am attaching the results and the system setup. Please share your comments.
Thanks, Jon, for your valuable input. I too was skeptical about the results, so I wanted some feedback.
As you mentioned: "If you are only concerned with single user query performance, there are some configuration changes you can make to get HAWQ to perform better." Could you please help me tune HAWQ for single-user performance?
How do you run concurrency tests for HAWQ? I was not able to find any query set or stream_map.txt for HAWQ under the TPC-DS/07_multi_user folder. Could you please guide me on how to perform this test?
Again, thanks a ton! I will rerun the tests with the above inputs, using the maximum number of queries possible.
Make sure you have RANDOM_DISTRIBUTION="true" in the tpcds_variables.sh file.
Make sure you have memory set correctly. With 128GB of RAM, you should set the following:
hawq_rm_memory_limit_perseg = 121
In Ambari, this parameter is set in the "Ambari Memory Usage Limit" textbox.
Set hawq_rm_stmt_vseg_memory to '16gb' in the hawq-site.xml file. This is done in the "Custom hawq-site" section in Ambari.
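For reference, both memory settings can also be applied from the master with the `hawq config` utility instead of editing hawq-site.xml by hand. This is a sketch; verify the option syntax and value units against your HAWQ version, and restart for the changes to take effect:

```shell
# Sketch, assuming the `hawq config` CLI is available on the master.
# Values match the recommendations above (121 GB per segment, 16gb per vseg).
hawq config -c hawq_rm_memory_limit_perseg -v 121
hawq config -c hawq_rm_stmt_vseg_memory -v 16gb
# Restart the cluster so the new settings are picked up (-a skips the prompt).
hawq restart cluster -a
```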
Create a 4GB swap file on each node.
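Creating the swap file can be scripted per node along these lines. A sketch: it requires root, and the /swapfile path is just the common convention, not something the toolkit mandates:

```shell
# Sketch: create, format, and enable a 4GB swap file (run as root on each node).
fallocate -l 4G /swapfile      # or: dd if=/dev/zero of=/swapfile bs=1M count=4096
chmod 600 /swapfile            # swap files must not be world-readable
mkswap /swapfile               # write the swap signature
swapon /swapfile               # enable it immediately
echo '/swapfile none swap sw 0 0' >> /etc/fstab   # persist across reboots
```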
Make sure you have a temp directory on each drive. By default, there will be only one, but you should use all of the disks you have available. The setting is just a comma-delimited list of directories on each node, and you will have to create the directories manually.
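As a sketch, creating the per-disk directories and building the comma-delimited list might look like this. The /tmp/hawq_demo/data* paths are placeholders standing in for your real disk mounts (e.g. /data1 ... /dataN), and the property name the list goes into should be checked against your hawq-site.xml:

```shell
# Sketch: create one temp directory per data disk and build the
# comma-delimited list for the temp-directory setting.
# /tmp/hawq_demo/data* are placeholders for real mounts like /data1 ... /dataN.
TEMP_DIRS=""
for disk in /tmp/hawq_demo/data1 /tmp/hawq_demo/data2 /tmp/hawq_demo/data3; do
  mkdir -p "${disk}/hawq/tmp"
  TEMP_DIRS="${TEMP_DIRS:+${TEMP_DIRS},}${disk}/hawq/tmp"
done
echo "$TEMP_DIRS"
```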
Change hawq_rm_nvseg_perquery_perseg_limit from the default of 6 to 8, 10, 12, 14, or 16 to improve performance. This specifies how many virtual segments (vsegs) are allowed to be created per segment when a query is executed. Each vseg consumes resources, so there is a point of diminishing returns once your disk, memory, or CPU becomes the bottleneck.
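One way to find that sweet spot is to sweep the limit at the session level and time a representative query. A sketch only: the database name and query file are placeholders, and it assumes psql access to the HAWQ master and that the GUC is settable per session:

```shell
# Sketch: time one query under increasing vseg limits.
# Placeholders: database "tpcds", query file "query01.sql".
for n in 8 10 12 14 16; do
  echo "=== hawq_rm_nvseg_perquery_perseg_limit = ${n} ==="
  psql -d tpcds <<SQL
SET hawq_rm_nvseg_perquery_perseg_limit = ${n};
\timing on
\i query01.sql
SQL
done
```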
The concurrency test runs by default. It is set to 5 users in the tpcds_variables.sh file.