I think I still tend to look at HBase as if it were a SQL database. To clarify my understanding, I was hoping someone could explain how a rowkey is found with such good lookup performance. For example, I have a table that stores requests to a REST API endpoint in the form of ipaddress, url and a few other columns. Using the ipaddress as the unique identifier (rowkey), how does the matching occur? Is it a direct string match lookup, left to right?
Your IP address is a string, which is treated as a sequence of bytes. There is a natural ordering to these byte sequences that we call "sorted order" (e.g. "alice" comes before "bob" but after "adam"). HBase tables are sorted in this manner by key (first the row, then the family, then the qualifier, etc).
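To make the byte ordering concrete, here is a small sketch in plain Python (not HBase code). The sample IP addresses are made up for illustration. Python compares `bytes` values byte by byte, which is the same kind of ordering described above, so it shows that the keys sort as text rather than as numbers.

```python
# Hypothetical sample rowkeys; any IP strings would do.
ips = ["9.0.0.1", "10.0.0.2", "10.0.0.10", "192.168.1.5"]

# A rowkey is just the bytes of the string.
row_keys = [ip.encode("utf-8") for ip in ips]

# bytes objects compare lexicographically, one byte at a time, left to right.
sorted_keys = sorted(row_keys)

for key in sorted_keys:
    print(key.decode("utf-8"))
```

Note that "10.0.0.10" lands before "10.0.0.2", and "9.0.0.1" lands last, because the comparison stops at the first byte that differs.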
A table is partitioned into one to many regions. Each region has a start rowkey and an end rowkey. The start rowkey is inclusive (contained in this region) and the end rowkey is exclusive (not contained). This effectively allows HBase to prune many regions down to the single region which contains the data you have asked for.
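The region pruning can be sketched as a binary search over the sorted start rowkeys. This is a simplified model in plain Python, not the actual HBase client code; the boundaries, the region names and the `find_region` helper are all hypothetical.

```python
from bisect import bisect_right

# Hypothetical region boundaries, sorted by start rowkey.
# The first region starts at the empty key; each region ends where the next begins.
region_start_keys = [b"", b"10.5", b"172.", b"200."]
region_names = ["region-0", "region-1", "region-2", "region-3"]

def find_region(row_key: bytes) -> str:
    """Return the region whose [start, end) range contains row_key."""
    # bisect_right counts the start keys that are <= row_key, so the last of
    # those is the owning region. A key equal to a start key belongs to that
    # region, because the start rowkey is inclusive.
    idx = bisect_right(region_start_keys, row_key) - 1
    return region_names[idx]

print(find_region(b"10.0.0.2"))
print(find_region(b"192.168.1.5"))
```

Because the start key is inclusive and the end key is exclusive, every rowkey falls into exactly one region, so one search over the boundaries is enough.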
Within a region, there are one to many column families, each of which has zero to many files associated with it. The families can be pruned immediately based on your query (if you fetch a set of families, we access only the families you asked for). Inside each file there is an index structure and a number of blocks. The index allows access to the specific block that contains the rowkey your query asked for (finding a block in a file is similar to finding the region for a table). This index gives an O(log n) lookup to find the block in the HFile containing the data you asked about, followed by an O(n) scan inside that (small) block to find the actual row data.
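The two-step lookup inside a file can be sketched the same way. This is a toy model in plain Python and not the real HFile format: the `blocks` data, the `block_index` list and the `get` helper are assumptions made for illustration. The index holds the first rowkey of each block, so a binary search picks the block and a short linear scan finds the row.

```python
from bisect import bisect_right

# Hypothetical file contents: blocks of (rowkey, value) pairs, sorted by rowkey.
blocks = [
    [(b"10.0.0.10", "GET /a"), (b"10.0.0.2", "GET /b")],
    [(b"192.168.1.5", "POST /c"), (b"192.168.1.7", "GET /d")],
    [(b"9.0.0.1", "GET /e")],
]

# Toy index: the first rowkey stored in each block.
block_index = [block[0][0] for block in blocks]

def get(row_key: bytes):
    """Return the value stored for row_key, or None if it is absent."""
    # Step 1, O(log n): binary search the index for the last block whose
    # first key is <= row_key.
    i = bisect_right(block_index, row_key) - 1
    if i < 0:
        return None  # row_key sorts before everything in the file
    # Step 2, O(n) in the block size: scan the small block in order.
    for key, value in blocks[i]:
        if key == row_key:
            return value
        if key > row_key:
            break  # keys are sorted, so the row cannot appear later
    return None

print(get(b"192.168.1.7"))
print(get(b"10.0.0.3"))
```

The linear scan stays cheap because a block is small, while the index keeps the number of blocks that have to be read down to one.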