04-13-2017 04:40 PM
I would like to know if there are any tools or best practices for benchmarking the query performance of Impala.
For example, if I had a specific SQL query that was representative of the types of queries typically run on the system, how could I quantify the amount of time it takes to run this query when there is a single request, five concurrent requests, 10 concurrent requests, etc.
The only thing that jumps to mind is to make 1, 5, and 10 calls to impala-shell, sending each call to the background, and to measure the total time it takes the script to finish in each case:
impala-shell -q "select foo,bar from baz where foo = 2" &
Do I need to worry about Impala caching the results and throwing off the time measurements in this case? I wasn't sure if Impala actually cached results.
Is there a better way to submit these query requests simultaneously?
Thanks in advance!
04-14-2017 04:31 PM
Your impala-shell approach is reasonable for a quick benchmark. The numbers will include the time taken for impala-shell to start up and connect, so they may end up being a little too high.
We have various infrastructure for running performance workloads in the Impala codebase, but it may be overkill for what you're trying to do.
Impala doesn't cache results. Running the query multiple times will warm up the file system and metadata caches, so usually the second and later runs are faster. Often the faster runs are more representative of real-world performance, since the same data does tend to be queried repeatedly.
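For reference, here is a minimal sketch of the background-and-wait approach with one warm-up run before the timed runs. The helper name run_concurrent is made up for illustration, the query is the placeholder from the question, and the timing assumes GNU date (for %N). It also assumes impala-shell can reach an impalad with its default connection settings; add -i <host> otherwise.

```shell
#!/usr/bin/env bash
# Sketch only: QUERY is the placeholder query from the question.
QUERY='select foo,bar from baz where foo = 2'

# Hypothetical helper: start N copies of a command in the background,
# wait for all of them, and print the elapsed wall-clock time in ms.
# Assumes GNU date, which supports %N (nanoseconds).
run_concurrent() {
  local n="$1"; shift
  local start end i pids=()
  start=$(date +%s%N)
  for ((i = 0; i < n; i++)); do
    "$@" > /dev/null 2>&1 &
    pids+=("$!")
  done
  wait "${pids[@]}"
  end=$(date +%s%N)
  echo $(( (end - start) / 1000000 ))
}

# Only run the benchmark when impala-shell is actually installed.
if command -v impala-shell > /dev/null 2>&1; then
  # Warm-up run so the file system and metadata caches are populated.
  impala-shell -q "$QUERY" > /dev/null 2>&1
  for n in 1 5 10; do
    echo "concurrency=$n elapsed_ms=$(run_concurrent "$n" impala-shell -q "$QUERY")"
  done
fi
```

Each number still includes impala-shell startup and connect time for every request, so treat the results as an upper bound rather than pure query latency.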
04-14-2017 04:41 PM
The numbers will include the time taken for impala-shell to start up and connect so may end up being a little too high.
Thanks for your answer, Tim. With regard to the impala-shell overhead you mention, would connecting to Impala with a database driver such as ODBC negate this overhead? I'm assuming this is the preferred method for production usage, then.
Do you know offhand if the ODBC/JDBC drivers are available when using the community edition, or are they reserved for enterprise customers? I ask because the download page seems to suggest they are for enterprise only: https://www.cloudera.com/downloads/connectors/impala/jdbc/2-5-5.html.
04-14-2017 05:04 PM
The startup overhead is not a big deal if the query runs for a while, or if you keep impala-shell open to run multiple queries. It's just something to be aware of if you're driving it from a script.
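One way to amortize the startup cost from a script is to run several statements in a single impala-shell session with the -f (query file) option. The sketch below is only an illustration: make_query_file is a hypothetical helper, and the query is again the placeholder from the question.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write N copies of a query, one statement per line.
make_query_file() {
  local query="$1" n="$2" out="$3" i
  : > "$out"                      # create or truncate the output file
  for ((i = 0; i < n; i++)); do
    printf '%s;\n' "$query" >> "$out"
  done
}

# Only run when impala-shell is actually installed.
if command -v impala-shell > /dev/null 2>&1; then
  qfile=$(mktemp)
  make_query_file 'select foo,bar from baz where foo = 2' 5 "$qfile"
  # -f runs every statement in the file within one impala-shell session,
  # so startup and connect time is paid once rather than once per query.
  time impala-shell -f "$qfile" > /dev/null
  rm -f "$qfile"
fi
```

This runs the queries one after another, so it measures sequential latency without repeated startup cost, not concurrent throughput.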
The driver works fine with the community edition (there is no difference in the version of Impala shipped). I'm not a lawyer or a legal representative of Cloudera, but I don't believe the license terms require you to be a customer.