Created on 10-14-201611:47 PM - edited 08-17-201908:48 AM
Phoenix secondary indexes are often misunderstood. Those coming from the relational world mistakenly apply the same prinples to apache phoenix.
A simple data model will be used for this article. Airplane table with attributes carrier ID, Tail Number, Origin airport code, destination airport code, and flight date.
The physical data model has been identified with primary access path of carreirID and TailNum, essentially the rowkey.
note - it is important to understand the order the "primary access path" in which fields will be accessed. Here I have identified the first key in the access path is carrierID. If that is not the case, benefits from the underline database capabilities, HBase may be realized. Think of primary access path not as primary key but more as the core identified access pattern for reads & writes. Secondary indexes enrich & extend this functionality.
What are Apache Phoenix secondary indexes? "Secondary indexes are an orthogonal way to access data from its primary access path." Orthogonal is key here. Think of this as an intersection. Personally I would argue this is different then RDBMS as RDBMS adheres to relational theory. HBase/Phoenix does not. So start training your mind to think of intersections when it comes to secondary indexes
Use case example:
For the airline table, origin airport code is starting to emerge as alternate intersection pattern. Meaning frequently the core access path + origin airport code are used for various processing and/or access. Options are either create a new phoenix table using this core access path pattern or create/apply secondary. Lets go with secondary index
So what are my options?
Lets start with global. Global indexes are used for read heavy use cases. why? Global indexes are not co-located (region server) with the primary table. Therefore with global indexes are dispersing the read load by have the main and secondary index table on different region servers and serving different set of access patterns. Think of it as load balancing.
Simply create a secondary index on origin airport code
CREATE INDEX indexname ON airline(origin_airport_code);
This new secondary index is orthogonal. Meaning an intersection of the primary row key and secondary key (s). Now the data model will support this query
SELECT * FROM AIRLINE WHERE CARRIERID = 'A12' AND TAILNUM = '123' AND ORGIN_AIRPORT_CODE = 'ORD'
Perfect point lookup
SELECT * FROM AIRLINE WHERE CARRIERID = 'A12' AND TAILNUM = '123' AND ORGIN_AIRPORT_CODE = 'ORD' AND DEST_AIRPORT_CODE = 'DFW'
Perfect point lookup with server filter on DFW
Notice the secondary index is an INTERSECTION of the primary key.
What if I ran this:
SELECT * FROM AIRLINE WHERE ORGIN_AIRPORT_CODE = 'ORD'
This would run a full table scan. Why? This is not an intersection of primary row key with secondary row key. To solve this challenge you have options such as covered index or using a hint.
SELECT /*+ /*+ INDEX(AIRLINE indexname) */ * FROM AIRLINE WHERE ORGIN_AIRPORT_CODE = 'ORD'
This will cause each data row to be retrieved when the index is traversed to find the missing column values. Use this with care as you may find performance is better with covered index. You can always force hints to the optimizer to use the index of your choice.
Covered index is a way to bundle data based on alternative access path. If the index can "cover" all fields in your select statement then only the index will be hit during the query. To continue from previous example, I would create covered index as follows
CREATE INDEX indexname ON airline(origin_airport_code) INCLUDE (ALL THE FIELDS YOU WILL COVER IN YOUR SELECT STATEMENT)
Issuing SELECT * FROM AIRLINE WHERE ORGIN_AIRPORT_CODE = 'ORD' will only hit the index table.
Local indexes are used for write heavy use cases. why? Local indexes are co-located (Region server) with the primary table. "Unlike global indexes, local indexes will use an index even when all columns referenced in the query are not contained in the index. This is done by default for local indexes because we know that the table and index data co-reside on the same region server thus ensuring the lookup is local."
CREATE LOCAL INDEX indexname ON airline(origin_airport_code)
Running SELECT * FROM AIRLINE WHERE ORGIN_AIRPORT_CODE = 'ORD' should take advantage of the secondary index
That is a ton of info. It is important to understand secondary indexes on NoSQL databases do not adhere to relational theory. There is no substitute for understanding the principles. Now go create some smart secondary indexes 🙂