Created on 01-10-2017 08:47 AM - edited 09-16-2022 03:53 AM
I ran a test with more than 800 records stored in both HBase and MySQL, because I wanted to compare the performance of the two systems at that data size. The result:
the data stored in the HBase table was slower to query than in MySQL. There was a lot of external noise when I installed Hadoop and HBase; for example, neither the RAM nor the hard disk was sufficient, so I installed Hadoop in an environment below the stated requirements.
But what I want to ask is: why is HBase faster than MySQL when the data is large?
I understand if you tell me that HBase has MapReduce through Hadoop, but I don't know how HBase reads the columns inside a table when we query it.
As for MySQL, based on what I have read, it uses row-based storage to search for data. So why is HBase, with column-based storage, faster than row-based storage?
Can anyone explain why?
Thank you very much.
Created 01-16-2017 09:53 PM
I struggled to scope this response so it wouldn't balloon out. Let's separate the discussion into reads and writes. HBase isn't terribly great at data ingestion. It has been the Achilles' heel of the system and is probably where most large organizations eventually start looking elsewhere. So I am not surprised that MySQL did better at ingesting data. When it comes to large data sets, HBase shines in a few ways. First, it lives on HDFS, which is an excellent distributed file system built specifically for large datasets. HBase doesn't store empty columns. HBase also splits the data into regions. This makes it efficient and effective to either fetch a single row or column, or to scan through and grab many. For what it is worth, the latency of HBase in retrieving a single record is comparable to other RDBMSs; I don't think it is necessarily faster. So think of it in terms of "hey, I have this petabyte of data but I want sub-second latency to get one row or a set of rows." HBase can do that while MySQL cannot.
I thought of using an example but maybe a summary is better.
HBase writes are comparable to most RDBMS for transactions per second.
HBase has comparable latency on single row lookup to most RDBMS.
HBase can efficiently and effectively retrieve a large chunk of rows better than any RDBMS.
HBase, thanks to HDFS, can operate on petabyte-scale and larger datasets.
So if you need to scale to TBs and PBs of data and want comparable performance and latency to an RDBMS, then go with HBase.
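The region splitting mentioned above can be pictured with a toy Python sketch. This is purely illustrative, not HBase's real implementation: the split points and helper names are made up. The idea is that because row keys are kept globally sorted and partitioned into regions, locating the region for a single row or for a contiguous scan range is a binary search rather than a full scan.

```python
import bisect

# Hypothetical split points: region 0 holds ["", "row400"), region 1 holds
# ["row400", "row800"), region 2 holds ["row800", end). Illustrative only.
REGION_START_KEYS = ["", "row400", "row800"]

def region_for(row_key):
    """Return the index of the region whose key range contains row_key."""
    return bisect.bisect_right(REGION_START_KEYS, row_key) - 1

def regions_for_scan(start_key, stop_key):
    """Return the region indexes a scan over [start_key, stop_key) must touch."""
    return list(range(region_for(start_key), region_for(stop_key) + 1))

print(region_for("row123"))                  # -> 0 (first region)
print(regions_for_scan("row350", "row900"))  # -> [0, 1, 2] (spans all regions)
```

A range scan only ever touches the regions that overlap the requested key range, which is why grabbing a large contiguous chunk of rows stays cheap even over a huge table.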
There is something about how HBase stores the data that makes it efficient for pulling a single column, but I don't recall it off the top of my head and am not digging through the HBase docs tonight. The gist of it is that, due to the storage format, HBase can find the specific row in the region's index file, comparably to MySQL, and then fetch the specific column and return only that. MySQL, on the other hand, can do a fast index search, find the row, and return the entire row.
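The single-column-versus-whole-row contrast above can be sketched with two toy in-memory stores. This is a conceptual model only, not either engine's real on-disk format, and all the row keys, column names, and values are made up: a column-oriented store keeps each cell under a (row, column) key, so fetching one column touches only that cell, while a simple row store reads the whole row and projects the column afterwards.

```python
# Toy "column store": one entry per cell, keyed by (row, family:qualifier).
column_store = {
    ("row1", "info:name"): "alice",
    ("row1", "info:email"): "alice@example.com",
    ("row1", "stats:visits"): "42",
}

# Toy "row store": one entry per row, holding all of its columns together.
row_store = {
    "row1": {"name": "alice", "email": "alice@example.com", "visits": "42"},
}

def column_store_get(row, column):
    # Reads exactly one cell; the row's other columns are never touched.
    return column_store[(row, column)]

def row_store_get(row, column):
    # Reads the entire row first, then picks out the requested column.
    whole_row = row_store[row]
    return whole_row[column]

print(column_store_get("row1", "info:email"))  # -> alice@example.com
print(row_store_get("row1", "email"))          # -> alice@example.com
```

Both calls return the same value, but in the column-store sketch only the requested cell is read, which matters when rows are wide and you want one column from many rows.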
Created 10-01-2020 04:29 AM
HBase stores data as a map sorted by keys. HBase is considered a persistent, multidimensional, sorted map, where each cell is indexed by a row key and a column key (family and qualifier). A row key, which is immutable and uniquely identifies a row, usually spans multiple HFiles. Row keys are treated as byte arrays (byte[]) and are stored in sorted order in the multidimensional sorted map. If you look up a row key, HBase is able to identify the node where that data is present. Hadoop runs its computation on the same node where the key is present, and hence performance with technologies like Spark is really good. This is called data locality.
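The data locality point can be sketched in a few lines of Python. This is a toy model with made-up node names and split points, not the real HBase/ZooKeeper region assignment: because row keys are sorted and partitioned into regions, and each region is hosted by a known node, the node holding any row key can be computed, so a framework like Spark can ship the computation to that node instead of moving the data.

```python
import bisect

# Hypothetical layout: two regions split at key "m", each hosted by one node.
REGION_START_KEYS = ["", "m"]
REGION_TO_NODE = {0: "node-a", 1: "node-b"}

def node_for(row_key):
    """Return the (hypothetical) node hosting the region that contains row_key."""
    region = bisect.bisect_right(REGION_START_KEYS, row_key) - 1
    return REGION_TO_NODE[region]

print(node_for("apple"))  # -> node-a ("apple" sorts before "m")
print(node_for("zebra"))  # -> node-b ("zebra" sorts after "m")
```

Once the scheduler knows a row lives on node-b, it can run the task there, which is the locality win described above.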