Running TPC-H on Impala

We have tried running TPC-H on Impala on Amazon EC2. This is part of a performance profiling effort for OpenLink Virtuoso, our scale-out SQL column store.

The Virtuoso numbers on Amazon EC2 can be found at http://www.openlinksw.com/weblogs/oerling/

We are aware of a set of 30 TPC-DS queries that were used to benchmark Actian Vortex against Impala. Do you have instructions for running these?

We plan to publish a set of comparisons of Virtuoso against other column stores, but to make this relevant we would first like to have all contestants properly configured.

Our initial experiment with Impala went as follows:
- Get 2 EC2 R3.8 instances, the same type that was used for the Virtuoso experiments.
- Copy the 100G TPC-H data to HDFS, link it as CSV tables, then copy these to Parquet tables.
- Run stats.
- Run a single stream of queries.
- About 10 queries return 0 rows, which is incorrect. The remaining queries return correct results.
- All CPU utilization seems to go to one of the boxes, with almost nothing on the other.
- Many simple scans with GROUP BY run at only 200% CPU even though they are trivially parallel.
- Some queries use the full machine on one of the boxes, but most are around the 200% mark.
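
For readers who do not want to dig through the repository, the load steps in the list above amount to roughly the following per table. This is only a minimal sketch, not the actual scripts: the `load_statements` helper, the `_csv` table suffix, and the HDFS path `/user/tpch/100g` are hypothetical names chosen for illustration. It generates the three Impala statements (external text table over the dbgen files, Parquet copy, `COMPUTE STATS`) for the small `region` table.

```python
# Sketch of the per-table load sequence: CSV external table -> Parquet -> stats.
# Assumptions (not taken from the actual scripts): the helper name, the "_csv"
# suffix, and the HDFS directory layout are illustrative only.

def load_statements(table, columns, hdfs_dir):
    """Return the Impala SQL statements that load one TPC-H table."""
    cols = ", ".join(f"{name} {ctype}" for name, ctype in columns)
    return [
        # dbgen writes '|'-delimited text files, so link them as a text table.
        f"CREATE EXTERNAL TABLE {table}_csv ({cols}) "
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' "
        f"LOCATION '{hdfs_dir}/{table}'",
        # Copy the text table into a Parquet table.
        f"CREATE TABLE {table} STORED AS PARQUET AS SELECT * FROM {table}_csv",
        # Gather table and column statistics for the planner.
        f"COMPUTE STATS {table}",
    ]


if __name__ == "__main__":
    region = [("r_regionkey", "INT"), ("r_name", "STRING"), ("r_comment", "STRING")]
    for stmt in load_statements("region", region, "/user/tpch/100g"):
        print(stmt + ";")
```

The printed statements could then be fed to `impala-shell`; the same pattern would apply to the other seven TPC-H tables with their own column lists.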

We conclude that we must be doing something wrong.

We would therefore appreciate any hints on how to make this a meaningful benchmark and what behavior to expect from Impala for this type of workload.

The complete schema, scripts, and other details can be found in the Virtuoso v7fasttrack git repository at
https://github.com/v7fasttrack/virtuoso-opensource/tree/feature/analytics/binsrc/tests/tpc-h/im
