Created 02-07-2017 06:53 AM
We have a requirement to store large data in HDFS and keep a pointer to it in an HBase table. Our understanding is that this is a very common scenario, and that HBase came into the picture for exactly this reason, because HBase cannot handle a large file in a column.
My question: from a Java client, to get hold of a file in HDFS I need to make two calls. The first goes to the HBase table to fetch the URL; the second calls that URL, opens a socket to HDFS, and streams the file.
That makes two calls from my Java app, one to HBase and one to HDFS. Since HBase sits on top of HDFS, is there a way to get the data in a single call? Should we use Hive here? Any suggestions on what is better in terms of performance?
First of all, while this is a common use case, this is not the raison d'être for HBase.
Your solution of making two calls is the right way to do this. Think about it this way. Let's say you have a very large email application running on HBase (Gmail runs on Bigtable, and HBase is based on Google's Bigtable paper). 80 percent of your emails have no attachment, or maybe a very small one. You retrieve email from HBase and your responses come back in milliseconds, which is exactly what your users expect. Small attachments can be stored in HBase itself using MOB (medium object) for faster retrieval. Now imagine there is a 100 MB attachment. The user knows he is trying to open a 100 MB attachment and does not expect an instantaneous result. In this case you make a second call to get hold of the file in HDFS.
Now you may not have emails but your application is similar. This should not require you to use Hive. Just a second call when needed. Now if all you have in HBase are links to these files and nothing else, then I'd like to understand more about your application because it wouldn't make much sense to use HBase to begin with.
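To make the two-call pattern concrete, here is a minimal sketch in plain Java. It is not HBase or HDFS client code: `MetadataStore`, `FileStore`, `Row`, and the path `/data/attachments/big.bin` are hypothetical stand-ins introduced only for illustration. In a real client the first interface would wrap a row-key lookup against the HBase table and the second would wrap opening a stream on the HDFS file. The point is only the control flow: the first call always happens, and the second call happens only when the row holds a pointer instead of inline content.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class TwoCallFetch {

    /** Hypothetical stand-in for one table row: inline bytes OR a pointer to a large file. */
    static final class Row {
        final byte[] inlineContent; // small payload stored in the row itself, or null
        final String filePath;      // location of the large file, or null

        Row(byte[] inlineContent, String filePath) {
            this.inlineContent = inlineContent;
            this.filePath = filePath;
        }
    }

    /** Hypothetical wrapper for call 1 (a row-key lookup in a real client). */
    interface MetadataStore {
        Row get(String rowKey) throws IOException;
    }

    /** Hypothetical wrapper for call 2 (opening a file stream in a real client). */
    interface FileStore {
        InputStream open(String path) throws IOException;
    }

    /** Returns the content for a row key, making the second call only when needed. */
    static InputStream fetch(MetadataStore meta, FileStore files, String rowKey)
            throws IOException {
        Row row = meta.get(rowKey); // call 1: always made, expected to be fast
        if (row == null) {
            throw new IOException("No row for key " + rowKey);
        }
        if (row.inlineContent != null) {
            // Small payload: served straight from the row, no second call.
            return new ByteArrayInputStream(row.inlineContent);
        }
        return files.open(row.filePath); // call 2: only for large payloads
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-ins for the table and the file system (assumptions for the demo).
        Map<String, Row> table = new HashMap<>();
        Map<String, byte[]> fileSystem = new HashMap<>();

        table.put("mail-1", new Row("hello".getBytes(StandardCharsets.UTF_8), null));
        table.put("mail-2", new Row(null, "/data/attachments/big.bin"));
        fileSystem.put("/data/attachments/big.bin", new byte[] {1, 2, 3});

        MetadataStore meta = table::get;
        FileStore files = path -> {
            System.out.println("second call for " + path);
            return new ByteArrayInputStream(fileSystem.get(path));
        };

        try (InputStream small = fetch(meta, files, "mail-1");
             InputStream large = fetch(meta, files, "mail-2")) {
            System.out.println("first byte of mail-1: " + small.read());
            System.out.println("first byte of mail-2: " + large.read());
        }
    }
}
```

Keeping the two stores behind separate interfaces also means the threshold for what counts as "small enough to inline" can change without touching the callers of `fetch`.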
@mqureshi, thanks for your suggestion. The scenario here is that we have the option to store the file in HDFS and stream it from there. Regarding your suggestion to use MOB, do you have any stats on performance: does reading a MOB object perform better, or is reading from HDFS better? I know a row key lookup adds an advantage; I just want to get an opinion on whether extending the storage limit to 10 MB and using MOB will perform better than storing the data in HDFS.