Created 05-12-2021 03:08 AM
I am wondering what difference in I/O performance can be expected for HBase with storage in the cloud versus storage on HDFS.
I would expect retrieving data from HDFS to be a lot faster than from cloud storage (such as ADLS; in my specific case ADLS Gen2, accessed via ABFS).
Is there somewhere where I can test this?
Or find a previous study for this?
If this is the case, would one then expect current HBase read performance to be somewhat lower than some years ago, when everything was on premises and HDFS was typically used?
Maybe I am missing something obvious here, so any insight is appreciated!
Created 07-04-2021 05:06 AM
Hello @JB0000000000001
Thanks for using Cloudera Community. This is an old post, so I am unsure whether you have already found the details shared below.
Having said that, [1] by the Cloudera HBase team offers a few details on S3 performance in its final paragraph. As mentioned in that paragraph, BucketCache is critical for performance, and AWS suggests the same in [2] under the "Operational considerations" section.
In short, there are definitely areas of performance degradation, yet there are a few links on LinkedIn SlideShare about Microsoft, Airbnb, and Huawei HBase usage on cloud, and they offer a lot of detail on the observations made and the optimisations performed. The optimisations aren't always centred on HBase and pivot towards cloud infrastructure as well. Such SlideShare decks should offer additional insight to your team.
As this is an old post, we wish to check whether your team has come across findings that can be shared with us. This would allow the post to be used by fellow community users who may be considering the cloud storage latency aspect as well.
- Smarak
[1] https://blog.cloudera.com/how-hbase-in-cdp-can-leverage-amazons-s3/
[2] https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hbase-s3.html
Created 07-23-2021 02:31 AM
Hello @JB0000000000001
We wish to follow up with you on the post and confirm whether you have any additional observations to share with respect to studying or implementing HBase on cloud storage, or whether our response to your post was helpful in finding the possible cloud storage latencies.
- Smarak
Created 08-06-2021 08:05 PM
Hello @JB0000000000001
As we haven't heard from you, we assume the queries you posted have been addressed and are marking the post as solved. When you have the time, feel free to share your observations with respect to studying or implementing HBase on cloud storage.
Thanks again for sharing your thoughts on Cloudera Community.
- Smarak
Created on 08-11-2021 07:02 AM - edited 08-11-2021 07:05 AM
Thank you for this very valuable input!
(I had somehow missed the response).
I do indeed see increased latencies, but understand that they should be negligible for hot data.
I have observed this, but think there is a limit to how much data you can keep 'hot'. This depends on a combination of settings at the level of the HBase catalog properties and the HBase cluster. We have also discussed this in the following thread: https://community.cloudera.com/t5/Support-Questions/simplest-method-to-read-a-full-hbase-table-so-it...
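To illustrate why the limit on 'hot' data matters, here is a rough back-of-the-envelope sketch in Python. It assumes a simple two-tier model (a read is served either from cache or from the backing storage) and reads spread evenly over the working set. All latency and size figures are hypothetical placeholders, not measurements of HDFS, ABFS, or S3, and the function names are made up for this example.

```python
def expected_read_latency_ms(hit_ratio, cache_ms, storage_ms):
    """Average read latency in a two-tier model: cache hit or storage read."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return hit_ratio * cache_ms + (1.0 - hit_ratio) * storage_ms


def hit_ratio_for_uniform_access(cache_gb, working_set_gb):
    """Best-case hit ratio if reads are spread evenly over the working set."""
    if cache_gb < 0 or working_set_gb <= 0:
        raise ValueError("cache_gb must be >= 0 and working_set_gb must be > 0")
    return min(1.0, cache_gb / working_set_gb)


if __name__ == "__main__":
    # Hypothetical figures, chosen only to show the shape of the trade-off.
    CACHE_MS = 1.0          # assumed latency of a cached block read
    STORAGE_MS = 50.0       # assumed latency of a cloud storage read
    WORKING_SET_GB = 500.0  # assumed size of the data being queried

    for cache_gb in (50.0, 200.0, 500.0):
        ratio = hit_ratio_for_uniform_access(cache_gb, WORKING_SET_GB)
        latency = expected_read_latency_ms(ratio, CACHE_MS, STORAGE_MS)
        print(f"cache={cache_gb:.0f} GB  hit ratio={ratio:.2f}  "
              f"avg latency={latency:.1f} ms")
```

Under these assumptions the average latency only approaches the cached figure once nearly the whole working set fits in cache, which is why large scans and aggregations feel the storage latency much more than point reads on hot rows.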
It would be very interesting if a more in-depth study were ever conducted and reported, as this is very relevant for applications with HBase as a back end that require more advanced querying of the data (in my case, aggregations to compute a heatmap from a high volume of data points).