Insert and retrieve PDF from HBase

New Contributor

We need help loading PDFs into HBase and retrieving them again. We would appreciate it if someone could point us to an example with sample scripts.

3 REPLIES

Re: Insert and retrieve PDF from HBase

Super Collaborator

Please take a look at the HBase MOB (medium-sized object) feature, coming in HDP 2.5:

http://hbase.apache.org/book.html#hbase_mob

Re: Insert and retrieve PDF from HBase

New Contributor

Is there a solution for HDP 2.3.2?

Re: Insert and retrieve PDF from HBase

@himanshu gupta There are a couple of options available in HDP 2.3.

Firstly, you can use the REST API. A simple test can be performed as follows:

1. Start a REST server. On the edge node, as root, run:

${HBASE_HOME}/bin/hbase-daemon.sh start rest -p 6080 --infoport 6081

e.g.

/usr/hdp/current/hbase-client/bin/hbase-daemon.sh start rest -p 6080 --infoport 6081

2. Test a simple file post using cURL:

curl -X POST -H "Content-Type: application/octet-stream" --data-binary @<filename> http://<server>:<HBase REST port>/<HBase Namespace>:<HBase Table>/<key>/<Column Family>:<Column>

e.g.

curl -X POST -H "Content-Type: application/octet-stream" --data-binary @file1.pdf http://edgenode:6080/NameSpace:PDFTable/DOCID001/cf1:col

3. Fetch it back with either

curl http://edgenode:6080/NameSpace:PDFTable/DOCID001/cf1:col -o file1.pdf

or, firewalls permitting, point your browser to

http://edgenode:6080/NameSpace:PDFTable/DOCID001/cf1:col
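If you want to confirm that the document survived the round trip, a small shell sketch along these lines may help. It assumes the host, port, table and row key from the example above, and that a well-formed PDF starts with the bytes %PDF-. The function names (is_pdf, same_bytes, fetch_and_verify) are made up for illustration and are not part of HBase or cURL.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical; adjust URLs to your cluster.

# Succeeds if the file starts with the PDF magic bytes "%PDF-".
is_pdf() {
  [ "$(head -c 5 "$1")" = "%PDF-" ]
}

# Succeeds if both files are byte-for-byte identical.
same_bytes() {
  cmp -s "$1" "$2"
}

# Download a cell to a file, then compare it with the original upload.
# $1 = cell URL, $2 = original file, $3 = download target
fetch_and_verify() {
  curl --fail --silent --show-error "$1" -o "$3" || return 1
  is_pdf "$3" && same_bytes "$2" "$3"
}
```

For the example above, that would be called as: fetch_and_verify http://edgenode:6080/NameSpace:PDFTable/DOCID001/cf1:col file1.pdf /tmp/file1.check.pdf && echo OK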

REST and cURL are good for quick testing, and can even handle a moderate workload with the appropriate scripts wrapped around the utilities.
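As a rough idea of what such a wrapper script could look like, here is a minimal sketch that posts every PDF in a directory using the same cURL call as in step 2. The host, table and column are the example values from above. The function names, and the choice of the file name without its .pdf extension as the row key, are my own assumptions for illustration.

```shell
#!/bin/sh
# Sketch only: example values from the post above; change them for your cluster.
REST_BASE="http://edgenode:6080"
TABLE="NameSpace:PDFTable"
COLUMN="cf1:col"

# Row key = file name without directory or .pdf extension (an assumption).
row_key() {
  basename "$1" .pdf
}

# Build the REST URL of the cell for a given row key.
cell_url() {
  printf '%s/%s/%s/%s\n' "$REST_BASE" "$TABLE" "$1" "$COLUMN"
}

# POST one file as the binary value of its cell.
put_pdf() {
  curl --fail --silent --show-error -X POST \
    -H "Content-Type: application/octet-stream" \
    --data-binary @"$1" "$(cell_url "$(row_key "$1")")"
}

# POST every *.pdf in a directory; succeeds only if all uploads succeed.
put_dir() {
  failed=0
  for f in "$1"/*.pdf; do
    [ -e "$f" ] || continue   # skip the unexpanded glob when no PDFs exist
    if put_pdf "$f"; then
      echo "stored $f"
    else
      echo "FAILED $f" >&2
      failed=$((failed + 1))
    fi
  done
  [ "$failed" -eq 0 ]
}
```

Calling put_dir /path/to/pdfs would then load the whole directory. File names containing spaces or other URL-special characters would need to be URL-encoded before being used as row keys, which this sketch does not do.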

For larger loads, Pig can store data in HBase. It's simpler if the LOAD operation is explained first. To load from HBase, do:

PDFs = LOAD 'hbase://NameSpace:PDFTable' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf1:col', '-loadKey true') AS (id:chararray, PDFFile:bytearray);

To store files in HBase, build the PDFs alias as above, i.e. with a key and a bytearray. Storing to HBase is then as simple as issuing the following statement in your script. Note that the key must be the first element in your alias.

STORE PDFs INTO 'hbase://NameSpace:PDFTable' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf1:col');

Hope this helps. Good luck!