Created 01-20-2016 06:12 PM
Hello,
in short: Can I use HBase over HDFS as a data lake?
in detail: Since Hadoop was designed to store massive amounts of data as big files, I was wondering whether, for my use case (storing lots of small files), HBase would be more suitable. Of course, data in HBase is stored on HDFS, but what about the metadata, and what happens when HBase runs into operational issues like compaction, node rebuilds and load distribution?
thanks in advance.
Created 01-20-2016 06:47 PM
Hi Mehdi,
You can definitely store all your data in HBase, but it will require a lot of work on your end. You have to format the data according to the column families (CFs) defined in the HBase table.
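For instance, here is a rough sketch of what that looks like with the HBase 1.x client API, storing one small file's bytes as a single cell. The table name "files" and column family "f" are just placeholders, not anything standard:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import java.nio.file.Files;
import java.nio.file.Paths;

public class SmallFileToHBase {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Assumes a pre-created table "files" with column family "f".
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("files"))) {
            byte[] content = Files.readAllBytes(Paths.get(args[0]));
            // Row key = file path; the raw bytes go in one cell under "f".
            Put put = new Put(Bytes.toBytes(args[0]));
            put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("content"), content);
            table.put(put);
        }
    }
}

That is the "formatting" work: deciding on the row key, the column families, and how file content maps onto cells.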
I believe you should also research HDFS vs. HBase for your use case.
Created 01-20-2016 10:24 PM
No worries about that; I'm more concerned about read performance. I know HBase performs well on range scans, but is that still true with huge amounts of data, once it runs into operational issues like compaction, node rebuilds and load distribution?
Created 01-21-2016 12:38 PM
@Mehdi TAZI If you have a plan in place to load data into HBase and are looking for fast response times, then HBase is a good solution. HBase resides on HDFS.
Created 01-20-2016 11:38 PM
Having small files in HDFS will create issues, with the NameNode filling up quickly and the blocks being too small. There are a number of ways you can combine the files into right-sized files. You can also try and see if HAR (Hadoop Archives) is an option.
But HBase can be an option. The row key design will be critical (see the sketch below). You can also look at OpenTSDB if it is time-series kind of data. Yes, you will have to deal with HBase compaction, node rebuilds, etc.
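To illustrate why key design matters: one common pattern is to prefix keys with a salt bucket so sequential writes spread across pre-split regions instead of hotspotting one region server. A minimal sketch, where the bucket count and key layout are illustrative assumptions, not from this thread:

import org.apache.hadoop.hbase.util.Bytes;

public class SaltedKey {
    private static final int BUCKETS = 16; // one prefix per pre-split region

    // Builds a key like "07|sensor-42|1453334400000":
    // salt bucket, then id, then timestamp.
    static byte[] rowKey(String id, long timestampMillis) {
        int salt = Math.abs(id.hashCode() % BUCKETS);
        return Bytes.toBytes(String.format("%02d|%s|%d", salt, id, timestampMillis));
    }
}

The trade-off is that a range scan now has to fan out across all salt buckets, so you tune the bucket count against your read patterns.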
Created 01-21-2016 04:26 AM
Hi @Mehdi TAZI I cannot recommend using HBase as a data lake. It's not designed for that, but to provide quick access to stored data. If your total data size grows into the hundreds of terabytes or into the petabyte range, it won't work well. I mean, it cannot replace a file system. You can combine small files into SequenceFiles or something similar, but the best solution would be a kind of object store for Hadoop/HDFS. And indeed there is such a solution, called Ozone. It's under active development and it's supposed to appear soon. More details can be found here.
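To illustrate the SequenceFile idea: a minimal sketch that packs a directory of small files into one SequenceFile, with each file's path as the key and its bytes as the value. The input and output paths here are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackSmallFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/data/packed.seq"); // assumed output path
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(out),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            // Assumed input directory full of small files.
            for (FileStatus status : fs.listStatus(new Path("/data/small"))) {
                byte[] content = new byte[(int) status.getLen()];
                try (FSDataInputStream in = fs.open(status.getPath())) {
                    in.readFully(content);
                }
                // Key = original path, value = raw file bytes.
                writer.append(new Text(status.getPath().toString()),
                              new BytesWritable(content));
            }
        }
    }
}

One big SequenceFile like this costs the NameNode a handful of block entries instead of one entry per small file.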
Created 01-27-2016 05:38 PM
Hello back! Sorry for the six-day latency of my answer.
Also, I couldn't find how Ozone stores data on HDFS, so I could see how it handles small files. Do you have any idea? Thanks a lot 🙂
Created 01-28-2016 02:11 AM
@Mehdi TAZI AFAIK, Ozone is a key-object store like AWS S3. Keys/objects are organized into buckets, each with a unique set of keys. Bucket data and Ozone metadata are stored in Storage Containers (SCs), which coexist with HDFS blocks on DataNodes in a separate block pool. Ozone metadata is distributed across SCs; there is no central NameNode. Buckets can be huge and are divided into partitions, also stored in SCs. Read/write is supported; append and update are not. The SC implementation is to use LevelDB or RocksDB. The Ozone architecture doc and all details are here. So, it's not on top of HDFS; it's going to coexist with HDFS and share DataNodes with HDFS.
Created 01-21-2016 06:14 AM
I don't think HBase should be a data lake (storing many files with different sizes and formats), but you can certainly use HBase to store the content of your small files (depending on the content; what's in those files?).
HBase is massively scalable; look at this example: https://www.facebook.com/notes/facebook-engineering/the-underlying-technology-of-messages/4549916089... Facebook is storing billions of messages in their HBase (HydraBase) setup, and Bloomberg is using HBase to store terabytes of data and respond to about 5 billion requests per day (http://www.slideshare.net/HBaseCon/case-studies-session-4a-35937605).