Support Questions


Can I use HBase as a data lake?

Rising Star

Hello,

In short: can I use HBase over HDFS as a data lake?

In detail: Hadoop was designed to store massive amounts of data as large files, so I was wondering whether HBase would be more suitable for my use case, which is storing lots of small files. Of course, HBase data is ultimately stored in HDFS, but what about the metadata? And what happens when HBase runs into operational issues like compaction, node rebuilds, and load distribution?

Thanks in advance.

tazimehdi.com

8 REPLIES

Master Mentor
@Mehdi TAZI

Hi Mehdi,

You can definitely store all your data in HBase, but it will require a lot of work on your end. You have to format the data to fit the column families (CFs) defined in the HBase table.
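
For illustration, here is a minimal sketch of that "one row per file" idea using the HBase 1.x Java client. The table name (small_files), column family (f), row key, and input path are made-up examples, not something prescribed in this thread:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class StoreSmallFile {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            TableName name = TableName.valueOf("small_files");   // hypothetical table
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Admin admin = conn.getAdmin()) {
                // One column family ("f") holding the raw file bytes.
                HTableDescriptor desc = new HTableDescriptor(name);
                desc.addFamily(new HColumnDescriptor("f"));
                admin.createTable(desc);

                try (Table table = conn.getTable(name)) {
                    byte[] content = Files.readAllBytes(Paths.get("/data/in/doc-0001.xml"));
                    Put put = new Put(Bytes.toBytes("doc-0001"));   // row key = file id
                    put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("content"), content);
                    table.put(put);
                }
            }
        }
    }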

I believe you should also research HDFS vs. HBase.

Link

Rising Star

No worries there; I'm really asking about read performance. I know HBase performs well on range scans, but does that hold with huge amounts of data, once it runs into operational issues like compaction, node rebuilds, and load distribution?

tazimehdi.com

Master Mentor

@Mehdi TAZI If you have a plan in place to load data into HBase and are looking for fast response times, then HBase is a good solution. HBase resides on HDFS.
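
To make "fast response time" concrete, here is a short sketch of the two read patterns HBase serves well: a point Get by row key, and a bounded range Scan. It assumes the hypothetical small_files table and f column family from the sketch earlier in the thread:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ReadSmallFiles {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("small_files"))) {

                // Point lookup by row key: the access pattern HBase is built for.
                Result row = table.get(new Get(Bytes.toBytes("doc-0001")));
                byte[] content = row.getValue(Bytes.toBytes("f"), Bytes.toBytes("content"));
                System.out.println("read " + content.length + " bytes");

                // Bounded range scan over a contiguous slice of row keys.
                Scan scan = new Scan(Bytes.toBytes("doc-0001"), Bytes.toBytes("doc-0100"));
                scan.setCaching(100);   // rows fetched per RPC; tune for scan-heavy loads
                try (ResultScanner scanner = table.getScanner(scan)) {
                    for (Result r : scanner) {
                        System.out.println(Bytes.toString(r.getRow()));
                    }
                }
            }
        }
    }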


@Mehdi TAZI

Having lots of small files in HDFS creates problems: the NameNode's memory fills up quickly, and the blocks are too small. There are a number of ways to combine the files into right-sized files, such as packing them into a single SequenceFile. You can also check whether HAR (Hadoop Archives) is an option.
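
As a sketch of that "combine the files" approach, the snippet below packs a directory of small local files into one SequenceFile, with the file name as the key and the raw bytes as the value; both paths are made up for illustration:

    import java.io.File;
    import java.nio.file.Files;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class PackSmallFiles {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(new Path("/data/packed.seq")),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(BytesWritable.class))) {
                File[] smallFiles = new File("/data/small-files").listFiles();
                for (File f : smallFiles) {
                    byte[] bytes = Files.readAllBytes(f.toPath());
                    // key = original file name, value = raw file content
                    writer.append(new Text(f.getName()), new BytesWritable(bytes));
                }
            }
        }
    }

One big SequenceFile costs the NameNode a handful of block entries instead of one entry per small file.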

But HBase can be an option. The row key design will be critical. You can also look at OpenTSDB if it is time-series data. And yes, you will have to deal with HBase compaction, node rebuilds, etc.
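
Since key design is called out as critical, here is one common pattern sketched as a hypothetical helper: a hashed salt prefix to spread writes across regions, plus a reversed timestamp so the newest data for an entity sorts first. The bucket count and key layout are illustrative assumptions, not a recommendation from this thread:

    import java.nio.charset.StandardCharsets;

    public class RowKeys {
        static final int SALT_BUCKETS = 16;   // assumed bucket count

        // Builds a key like "0a-sensor42-9223370..." : salt, entity id, then
        // (Long.MAX_VALUE - timestamp) so newer rows sort before older ones.
        static byte[] rowKey(String entityId, long epochMillis) {
            int salt = Math.floorMod(entityId.hashCode(), SALT_BUCKETS);
            String key = String.format("%02x-%s-%019d",
                    salt, entityId, Long.MAX_VALUE - epochMillis);
            return key.getBytes(StandardCharsets.UTF_8);
        }
    }

The trade-off: salting prevents a single hot region during sequential writes, but a range scan then has to fan out across all SALT_BUCKETS prefixes.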

Master Guru (Accepted Solution)

Hi @Mehdi TAZI, I cannot recommend using HBase as a data lake. It's not designed for that; it's designed to provide quick access to stored data. If your total data size grows into the hundreds of terabytes or into the petabyte range, it won't work well. It simply cannot replace a file system. You can combine small files into SequenceFiles or something similar, but the best solution would be a kind of object store for Hadoop/HDFS. And indeed there is such a solution, called Ozone. It's under active development and is supposed to appear soon. More details can be found here.

Rising Star

Hello again! Sorry for the six-day latency in my answer.

Also, I couldn't find out how Ozone stores data on HDFS, which I wanted to see in order to understand how it handles small files. Do you have any idea? Thanks a lot 🙂

tazimehdi.com

Master Guru

@Mehdi TAZI AFAIK, Ozone is a key-object store like AWS S3. Keys/objects are organized into buckets, each with a unique set of keys. Bucket data and Ozone metadata are stored in Storage Containers (SCs), which coexist with HDFS blocks on DataNodes in a separate block pool. Ozone metadata is distributed across SCs; there is no central NameNode. Buckets can be huge and are divided into partitions, also stored in SCs. Read and write are supported; append and update are not. The SC implementation is planned to use LevelDB or RocksDB. The Ozone architecture doc and all details are here. So it's not on top of HDFS; it's going to coexist with HDFS and share DataNodes with it.
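
To illustrate just the model described above (only the model; the real Ozone API had not shipped when this was written, so everything below is a hypothetical stand-in), each small file becomes one immutable key/value pair in a bucket, with no per-file NameNode entry or HDFS block:

    import java.util.HashMap;
    import java.util.Map;

    // HYPOTHETICAL in-memory stand-in for a bucket of keys; not the Ozone API.
    class SketchBucket {
        private final Map<String, byte[]> objects = new HashMap<>();

        // Whole-object write; per the reply above, append and update are not
        // supported, so a key is written once and afterwards only read.
        void putKey(String key, byte[] value) {
            if (objects.containsKey(key)) {
                throw new IllegalStateException("update not supported: " + key);
            }
            objects.put(key, value.clone());
        }

        byte[] getKey(String key) {
            byte[] v = objects.get(key);
            return (v == null) ? null : v.clone();
        }
    }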


I don't think HBase should be a data lake (storing many files of different sizes and formats), but you can certainly use HBase to store the content of your small files (depending on the content; what's in those files?).

HBase is massively scalable; look at this example: https://www.facebook.com/notes/facebook-engineering/the-underlying-technology-of-messages/4549916089... Facebook stores billions of messages in its HBase (HydraBase) setup, and Bloomberg uses HBase to store terabytes of data and serve about 5 billion requests per day (http://www.slideshare.net/HBaseCon/case-studies-session-4a-35937605).