I am working on a POC in which small images (around 1 MB to 5 MB each) are stored in HDFS and their metadata is stored in HBase. Please note that image processing/aggregation is not under consideration as of now.
It would be a simple model in which one record has multiple images stored in HDFS, with the corresponding data in HBase.
My question is: what would be the user interface for this design, to view the images for a given rowkey in a browser?
1. Create a Hive table over HBase and view the images using Hue (does an HDFS URL behave like a hyperlink to the image location?)
2. Use Phoenix over HBase to display the images (will a column that holds an HDFS URL display the image?)
Aggregation and image processing are not under consideration as of now.
I tried option 1 above and found that the image links in Hue are not shown as hyperlinks; even an external URL like google.com appears as plain text. Does this require any setting in Hue? See my result in the image below:
It appears that change is not included in the version of Hue shipped with HDP. At the very least, I have been unable to find any commit message that references HUE-2034.
Maybe things have changed now, but Hue was not meant for production use. I can also understand that internally you might want to use Hue to provide an interface for your business.
I like your approach number 2, but Phoenix is just a SQL tool. Are you building your own interface to run Phoenix over HBase under the hood?
Three years ago I was working on an HBase application that stored emails. The emails themselves were stored directly in HBase cells, but their attachments, which could be as much as 25 GB, were stored in HDFS. The mechanism was very similar to what you are thinking of, but instead of Phoenix we had the REST API and, obviously, a custom email interface: when a user clicks an attachment, the request goes to HDFS to pull the attachment and display it.
Sorry, I don't have code, but you definitely have the right approach and the code part should be easy. It's the scaling and architecture that you need to get right. Also, I would personally avoid using MOBs, to keep the architecture more easily scalable. You can always switch to MOBs in a future release, once you have seen a lot more successful use cases in the industry.
Hi @Mukesh Kumar. Storing the metadata in HBase is a great design.
Whether the content itself should go in HBase or HDFS directly depends on content size. HBase now has medium object support, which means content up to a few MB is fine, particularly if you store the metadata and actual content in separate column families.
On the UI front, if you have files stored in HDFS, you can use string concatenation to embed the filename in a WebHDFS url: <a href="http://<HOST>:<PORT>/webhdfs/v1/user/dev/images/img1.gif?op=OPEN">Link</a>, which will download as a file when clicked. Note, I've done this in Zeppelin, but haven't tried it in the Hive View or in Hue.
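As a rough illustration of the string concatenation described above, here is a minimal Python sketch. The function names, the example host, and port 50070 are assumptions made for illustration only; substitute the values for your own cluster. Only the `/webhdfs/v1/...?op=OPEN` URL shape comes from the link above.

```python
from html import escape
from urllib.parse import quote


def webhdfs_open_url(host: str, port: int, hdfs_path: str) -> str:
    """Build a WebHDFS URL that opens (downloads) the file at hdfs_path.

    host and port are placeholders for your own cluster's WebHDFS endpoint.
    """
    # Percent-encode the path, keeping the slashes that separate directories.
    encoded_path = quote(hdfs_path.lstrip("/"), safe="/")
    return f"http://{host}:{port}/webhdfs/v1/{encoded_path}?op=OPEN"


def image_link(host: str, port: int, hdfs_path: str, label: str = "Link") -> str:
    """Wrap the WebHDFS URL in an HTML anchor, as in the example above."""
    url = webhdfs_open_url(host, port, hdfs_path)
    return f'<a href="{escape(url, quote=True)}">{escape(label)}</a>'


if __name__ == "__main__":
    # Hypothetical host and port, for illustration only.
    print(image_link("namenode.example.com", 50070, "/user/dev/images/img1.gif"))
```

The HDFS path stored in the HBase metadata row would be passed in as `hdfs_path`, so the UI only needs the rowkey lookup plus this concatenation step.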
If you're accessing content from HBase, you'll need a service to front HTTP calls. The Phoenix Query Server may make this possible out of the box, but I haven't tried.
In addition to the answer supplied by @Randy Gelhausen, the REST daemon can be used to access content from HBase. Start it with:
/usr/hdp/current/hbase-client/bin/hbase-daemon.sh start rest -p <PortNum> --infoport <AnotherPortNum>
Access a file using cURL (useful for extracting data):
curl -X GET -H "Accept: application/octet-stream" "http://<host>:<PortNum>/<namespace>:<table>/<key>/<ColFam>:<Col>" -o File.out
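If you would rather make the same request from application code, the cURL call can be sketched in Python roughly as follows. The helper names, example host, port, and table details are hypothetical; the URL layout and the `Accept` header simply mirror the cURL command above.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen


def hbase_rest_cell_url(host, port, namespace, table, row_key, family, qualifier):
    """Build the REST URL for one cell: /<namespace>:<table>/<key>/<ColFam>:<Col>."""
    table_part = quote(f"{namespace}:{table}", safe=":")
    row_part = quote(row_key, safe="")  # encode everything, including slashes
    column_part = quote(f"{family}:{qualifier}", safe=":")
    return f"http://{host}:{port}/{table_part}/{row_part}/{column_part}"


def fetch_cell(url, out_path):
    """Download the raw cell bytes to out_path, like curl with -o."""
    request = Request(url, headers={"Accept": "application/octet-stream"})
    with urlopen(request) as response, open(out_path, "wb") as out_file:
        out_file.write(response.read())


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    url = hbase_rest_cell_url(
        "resthost.example.com", 8080, "default", "images", "row1", "meta", "img"
    )
    print(url)
    # fetch_cell(url, "File.out")  # uncomment when pointing at a running REST daemon
```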
Or via a browser (useful for a quick one-off view of HBase content, or for embedding links in HTML content or apps):