Support Questions

mbigelow · ‎01-16-2017

I struggled to scope this response so it wouldn't balloon out. Let's separate the discussion into reads and writes. HBase isn't terribly great at data ingestion. That has been the Achilles' heel of the system and is probably where most large orgs eventually start looking elsewhere. So I am not surprised that MySQL did better at ingesting data.

When it comes to large data sets, HBase shines in a few ways. First, it lives on HDFS, which is an excellent distributed file system built specifically for large datasets. HBase doesn't store empty columns. HBase also splits the data into regions. This makes it efficient and effective to either fetch a single row or column, or scan through and grab many. For what it is worth, the latency of HBase in retrieving a single record is comparable to that of an RDBMS; I don't think it is necessarily faster. So think of it in terms of "hey, I have this petabyte of data but I want sub-second latency to get one row or a set of rows". HBase can do that while MySQL cannot.

I thought of using an example but maybe a summary is better.

HBase writes are comparable to most RDBMS for transactions per second.

HBase has comparable latency on single row lookup to most RDBMS.

HBase can efficiently and effectively retrieve a large chunk of rows better than any RDBMS.

HBase, thanks to HDFS, can operate on petabyte-scale and larger datasets.

So if you need to scale to TBs and PBs of data and want performance and latency comparable to an RDBMS, then go HBase.

There is something about how HBase stores the data that makes it efficient for pulling a single column, but I don't recall the details off the top of my head and am not digging through the HBase docs tonight. The gist is that, due to the storage format, HBase can find the specific row in the region's index file, comparable to MySQL, and then fetch the specific column and return only that. MySQL, on the other hand, can do a fast index search, find the row, and return the entire row.
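
To make the gist a bit more concrete, here is a toy Python sketch of the idea. It is not HBase's actual storage format or client API; `ToyCellStore` and its methods are made-up names for illustration. The assumption is simply that each existing cell is stored on its own, sorted by row key and then column, so an empty column takes no space and a single column can be located and returned without touching the rest of the row. Real HBase cells carry more than this (a timestamp, for example), which the sketch leaves out.

```python
from bisect import bisect_left


class ToyCellStore:
    """Toy model (hypothetical): cells sorted by (row key, column)."""

    def __init__(self, cells):
        # cells: iterable of (row_key, column, value).
        # Only cells that actually exist are stored; a missing column has no entry.
        self._cells = sorted(cells, key=lambda c: (c[0], c[1]))
        self._keys = [(row, col) for row, col, _ in self._cells]

    def get_column(self, row, column):
        """Binary-search to one cell and return only its value, or None."""
        i = bisect_left(self._keys, (row, column))
        if i < len(self._keys) and self._keys[i] == (row, column):
            return self._cells[i][2]
        return None

    def get_row(self, row):
        """Return every stored column of one row as a dict."""
        i = bisect_left(self._keys, (row, ""))
        result = {}
        while i < len(self._keys) and self._keys[i][0] == row:
            result[self._keys[i][1]] = self._cells[i][2]
            i += 1
        return result

    def scan(self, start_row, stop_row):
        """Yield cells with start_row <= row key < stop_row, in key order."""
        i = bisect_left(self._keys, (start_row, ""))
        while i < len(self._keys) and self._keys[i][0] < stop_row:
            yield self._cells[i]
            i += 1


store = ToyCellStore([
    ("user1", "info:name", "Ann"),
    ("user1", "info:email", "ann@example.com"),
    ("user2", "info:name", "Bob"),  # user2 has no email cell at all
    ("user3", "info:name", "Cy"),
])

print(store.get_column("user1", "info:name"))   # one cell, not the whole row
print(store.get_column("user2", "info:email"))  # missing cell: nothing was stored
print(store.get_row("user2"))
print([row for row, _, _ in store.scan("user1", "user3")])
```

Because everything is kept in key order, the single-cell lookup, the whole-row fetch, and the range scan are all the same operation: seek to a position, then read forward for as long as needed.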
