
Slow SQL Access using Lookup Table through Hive and Phoenix


Hi,

We have set up a Phoenix table which we access through an external Hive table using org.apache.phoenix.hive.PhoenixStorageHandler.
We are using HBase version 1.1.2.2.6 and Hive version 1.2.1.2.6 (HDP 2.6.0.3).
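
For reference, the external Hive table is defined roughly like this (the table name, column list, and Zookeeper settings below are simplified placeholders rather than our exact definition, and the composite row key has more components than shown):

CREATE EXTERNAL TABLE data_table (
  mk0 INT,
  mk1 INT,
  md1 STRING
)
STORED BY 'org.apache.phoenix.hive.PhoenixStorageHandler'
TBLPROPERTIES (
  "phoenix.table.name" = "DATA_TABLE",
  "phoenix.zookeeper.quorum" = "zk-host",
  "phoenix.zookeeper.client.port" = "2181",
  "phoenix.zookeeper.znode.parent" = "/hbase-unsecure",
  "phoenix.rowkeys" = "mk0, ..."
);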

When we access the data in data_table (through Hive or Phoenix (sqlline)) using an SQL statement with the first element of the Phoenix composite row key (mk0) in the where clause (mk0 = 12345), the performance is fine.
However, we now have an additional requirement for queries based on a different column (mk1), which is not part of the composite Phoenix row key.

Therefore we created a lookup table (lookup_table) in Phoenix (also exposed through an external Hive table) with only a composite row key consisting of mk1 and mk0, in that order.
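
The lookup table in Phoenix is essentially just that composite key, roughly like this (types simplified for illustration):

CREATE TABLE lookup_table (
    mk1 INTEGER NOT NULL,
    mk0 INTEGER NOT NULL,
    CONSTRAINT pk PRIMARY KEY (mk1, mk0)
);
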
Using that lookup table, the query in the Phoenix shell

select <md1> from <data_table> where <mk0> in (select <mk0> from <lookup_table> where <mk1> = 12345);

has acceptable performance, which is not really surprising since the queries should access the Phoenix/HBase tables using a known row-key prefix.

The Phoenix query plan:

CLIENT 22955-CHUNK 12756848943 ROWS 7200887136211 BYTES PARALLEL 1-WAY ROUND ROBIN FULL SCAN OVER <data_table>
    SKIP-SCAN-JOIN TABLE 0
        CLIENT 1-CHUNK 0 ROWS 0 BYTES PARALLEL 1-WAY RANGE SCAN OVER <lookup_table> [12345]
            SERVER FILTER BY FIRST KEY ONLY
            SERVER AGGREGATE INTO ORDERED DISTINCT ROWS BY [<mk0>]
        CLIENT MERGE SORT
    DYNAMIC SERVER FILTER BY <data_table>.<mk0> IN ($1.$3)

However, when we use the same table setup through Hive and the PhoenixStorageHandler, the query gets executed as a map-reduce job, which takes way too long (using DISTINCT and ORDER BY does not change query performance in a noticeable way).

Hive query:

select b.<md1> from <data_table> as b where <mk0> in (select distinct t.<mk0> from <lookup_table> as t where t.<mk1> = 12345 order by mk0);

The Hive query plan:

Plan not optimized by CBO.

Vertex dependency in root stage
Map 1 <- Reducer 4 (BROADCAST_EDGE)
Reducer 3 <- Map 2 (SIMPLE_EDGE)
Reducer 4 <- Reducer 3 (SIMPLE_EDGE)

Stage-0
   Fetch Operator
      limit:-1
      Stage-1
         Map 1
         File Output Operator [FS_19]
            compressed:false
            Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
            table:{"input format:":"org.apache.hadoop.mapred.TextInputFormat","output format:":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat","serde:":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"}
            Select Operator [SEL_18]
               outputColumnNames:["_col0"]
               Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
               Map Join Operator [MAPJOIN_24]
               |  condition map:[{"":"Left Semi Join 0 to 1"}]
               |  HybridGraceHashJoin:true
               |  keys:{"Map 1":"<mk0> (type: int)","Reducer 4":"_col0 (type: int)"}
               |  outputColumnNames:["_col11"]
               |  Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
               |<-Reducer 4 [BROADCAST_EDGE]
               |  Reduce Output Operator [RS_15]
               |     key expressions:_col0 (type: int)
               |     Map-reduce partition columns:_col0 (type: int)
               |     sort order:+
               |     Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
               |     Group By Operator [GBY_11]
               |        keys:_col0 (type: int)
               |        outputColumnNames:["_col0"]
               |        Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
               |        Select Operator [SEL_9]
               |        |  outputColumnNames:["_col0"]
               |        |  Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
               |        |<-Reducer 3 [SIMPLE_EDGE] vectorized
               |           Reduce Output Operator [RS_27]
               |              key expressions:_col0 (type: int)
               |              sort order:+
               |              Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
               |              Group By Operator [OP_26]
               |              |  keys:KEY._col0 (type: int)
               |              |  outputColumnNames:["_col0"]
               |              |  Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
               |              |<-Map 2 [SIMPLE_EDGE]
               |                 Reduce Output Operator [RS_5]
               |                    key expressions:_col0 (type: int)
               |                    Map-reduce partition columns:_col0 (type: int)
               |                    sort order:+
               |                    Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
               |                    Group By Operator [GBY_4]
               |                       keys:<mk0> (type: int)
               |                       outputColumnNames:["_col0"]
               |                       Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
               |                       Filter Operator [FIL_23]
               |                          predicate:<mk0> is not null (type: boolean)
               |                          Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
               |                          TableScan [TS_1]
               |                             alias:t
               |                             Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
               |<-Filter Operator [FIL_22]
                     predicate:<mk0> is not null (type: boolean)
                     Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                     TableScan [TS_0]
                        alias:b
                        Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE


Since the Hive table sits on top of the Phoenix table, I had expected Hive to hand the query directly over to Phoenix for execution rather than running a map-reduce job. Also, according to the Hive query plan, the equals condition in the where clause is not used as a filter for HBase.
We have set Tez as the execution engine and enabled CBO.
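
For reference, the relevant session settings are roughly:

set hive.execution.engine=tez;
set hive.cbo.enable=true;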

Is there a way to speed up the Hive query so that it basically does data_table(lookup_table(rowKey)) instead of running a map-reduce job?

Maybe indexing is another option, but would queries from Hive actually use a Phoenix index?

Alternatively, maybe an index in Hive would be an option.


Re: Slow SQL Access using Lookup Table through Hive and Phoenix

I don't believe Hive's CBO is ever consulted for StorageHandlers (aside from maybe trivial data size computations). Similarly, I don't believe Hive supports indexes on (external) tables from a StorageHandler. Phoenix should be pushing down the predicate in your inner query (as you can see in the Phoenix explain plan above).

If you create a Phoenix index, I believe the PhoenixStorageHandler will pick it up (assuming Phoenix considers that index table able to satisfy the query that Hive generates).
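
For example, a covered index on the non-key column might look roughly like this in Phoenix (the index and column names here are just illustrative):

CREATE INDEX data_table_mk1_idx ON data_table (mk1) INCLUDE (md1);

With such an index, Phoenix may be able to serve the mk1 predicate from the index table instead of a full scan, which could also benefit the query that Hive generates.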

I like to describe the PhoenixStorageHandler as a "bulldozer" (big, slow-moving, but able to push large amounts of data). It can help run some queries, but Phoenix "natively" is nearly always going to perform better than the Hive integration will.