Hbase Storing pdf and Retrieval

New Contributor

We are planning to store PDF and Word documents in HBase. The storing part is fine. Retrieval is the part I have questions on.

1. If we need to query this, is there a way to do it using any reporting tools? HBase --> Hive External Table --> JDBC/ODBC --> Excel or any BI tool. However, how will the consumer app know that the field is a PDF file and not just a text field?

2. Is there a way for HBase REST to handle this?

Thanks in advance.

ACCEPTED SOLUTION

Re: Hbase Storing pdf and Retrieval

Hi @Ash Pad, Phoenix has JDBC and REST APIs today. There is an ODBC driver under development which I believe is currently in beta. Thus you can do reporting style queries against whatever document metadata you store in "normal" column types.

To access the PDF object itself, you can use the JDBC/ODBC/REST APIs to read/write the column as raw bytes. See the Phoenix DataTypes page to understand the various column types which support binary values.
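On the question of how a consumer can tell that a column of raw bytes holds a PDF: the most robust option is probably to store an explicit content type next to the document in a metadata column. As a fallback, the first bytes of the value can be checked against well-known file signatures. The following is a minimal sketch of that fallback; the function name and the returned labels are made up for this example and are not part of Phoenix or HBase.

```python
# Sketch only: the byte signatures are standard file magic numbers, but the
# function name and the labels it returns are assumptions for illustration.

PDF_MAGIC = b"%PDF-"                                # start of every PDF file
ZIP_MAGIC = b"PK\x03\x04"                           # ZIP container (.docx is one)
OLE_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"     # OLE2 container (legacy .doc)


def sniff_document_type(data: bytes) -> str:
    """Guess the kind of document from the first bytes of a cell value."""
    if data.startswith(PDF_MAGIC):
        return "pdf"
    if data.startswith(ZIP_MAGIC):
        # Matches any ZIP-based format (docx, xlsx, plain zip), not only Word.
        return "zip"
    if data.startswith(OLE_MAGIC):
        # Matches any OLE2-based format (doc, xls, ppt), not only Word.
        return "ole2"
    return "unknown"
```

Because the ZIP and OLE2 signatures are shared by several Office formats, sniffing alone cannot distinguish a Word document from a spreadsheet, which is one reason to prefer an explicit content-type column.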

Regarding HBase REST: you could use it if desired, though I don't see why you would rather than using the built-in JDBC/ODBC capabilities.
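If HBase REST is used anyway, the client has to decode what comes back. To my knowledge, the JSON representation returned by the REST gateway base64-encodes row keys, column names and cell values; the field names used below (`Row`, `key`, `Cell`, `column`, `$`) are an assumption to verify against your HBase version. The sketch decodes a hand-built sample body rather than talking to a real server, and the row key and column name in it are hypothetical.

```python
import base64
import json


def decode_rest_row(payload: str) -> dict:
    """Return {row_key: {column: raw_bytes}} from a REST-style JSON body."""
    rows = {}
    for row in json.loads(payload).get("Row", []):
        key = base64.b64decode(row["key"]).decode("utf-8")
        cells = {}
        for cell in row.get("Cell", []):
            column = base64.b64decode(cell["column"]).decode("utf-8")
            # The cell value stays as bytes: it may be a PDF, not text.
            cells[column] = base64.b64decode(cell["$"])
        rows[key] = cells
    return rows


def _b64(raw: bytes) -> str:
    """Helper used only to build the sample body below."""
    return base64.b64encode(raw).decode("ascii")


# Hand-built sample; row "doc1" and column "docs:pdf1" are hypothetical.
sample = json.dumps({
    "Row": [{
        "key": _b64(b"doc1"),
        "Cell": [{
            "column": _b64(b"docs:pdf1"),
            "timestamp": 1,
            "$": _b64(b"%PDF-1.4 sample"),
        }],
    }]
})

decoded = decode_rest_row(sample)
```

The decoded value is the raw document, so it can be written to a file or streamed to a browser as it is.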

7 REPLIES

Re: Hbase Storing pdf and Retrieval

Super Collaborator

Please take a look at HBASE-11339, the MOB (medium object) feature:

https://issues.apache.org/jira/browse/HBASE-11339

It reduces the I/O amplification incurred by medium-sized objects.

This feature is in the upcoming HDP 2.5 release.

Re: Hbase Storing pdf and Retrieval

Super Guru

@Ash Pad I personally like @Ted Yu's answer. Until that is released, I don't recommend storing these files in HBase. Instead, a common practice is to save the file on HDFS and store a "pointer" to it in HBase.
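A rough sketch of the pointer pattern follows. It uses stand-ins rather than real clients: a local temporary directory plays the role of HDFS and a plain dict plays the role of the HBase table. The function names, the `meta:` column names and the row key are all hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path

# Stand-ins, not real clients: a local directory plays the role of HDFS and
# a dict keyed by row key plays the role of an HBase table.


def store_document(store_dir: Path, table: dict, row_key: str,
                   name: str, data: bytes) -> str:
    """Write the bytes to the file store and keep only a pointer in the table."""
    digest = hashlib.sha256(data).hexdigest()
    path = store_dir / f"{digest}.bin"
    path.write_bytes(data)
    row = table.setdefault(row_key, {})
    row[f"meta:{name}_path"] = str(path)   # the "pointer"
    row[f"meta:{name}_size"] = len(data)   # small metadata stays queryable
    return str(path)


def load_document(table: dict, row_key: str, name: str) -> bytes:
    """Follow the pointer stored in the table and read the bytes back."""
    return Path(table[row_key][f"meta:{name}_path"]).read_bytes()


with tempfile.TemporaryDirectory() as tmp:
    table = {}
    store_document(Path(tmp), table, "customer42", "invoice", b"%PDF-1.4 sample")
    round_trip = load_document(table, "customer42", "invoice")
```

The table only ever holds the path and a few small attributes, so scans over metadata never touch the document bytes. The cost is that the application has to keep the two stores consistent itself.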

Re: Hbase Storing pdf and Retrieval

@Ash Pad, how big are your PDFs?

As in all things, it depends on your use case. If your PDFs are not in the multi-megabyte range, you may be fine storing them in a second column family today. This has the advantage of letting you query against doc metadata very quickly without needing to load full file contents into RegionServer memory. In most document management systems, this is highly desirable, as there is far more searching/querying than there is actual full content access.

Re: Hbase Storing pdf and Retrieval

New Contributor

PDFs are 50 KB max, and each rowkey can have up to 5 PDFs associated with it. The total volume of records would be in the 500K range. As you suggest, we have 2 column families, one for the metadata and one for the documents. Your suggestion gives a vote of confidence to our thought process.
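For a sense of scale, the worst case implied by these numbers can be worked out directly. This is back-of-the-envelope arithmetic only: it assumes every row holds 5 PDFs of the full 50 KB and ignores metadata, compression and HDFS replication.

```python
MAX_PDF_BYTES = 50 * 1024        # 50 KB upper bound per PDF (from the post)
MAX_PDFS_PER_ROW = 5             # up to 5 PDFs per rowkey (from the post)
ROW_COUNT = 500_000              # roughly 500K records (from the post)

# Worst case: every row carries the maximum number of maximum-size PDFs.
max_row_bytes = MAX_PDF_BYTES * MAX_PDFS_PER_ROW
max_total_bytes = max_row_bytes * ROW_COUNT

print(f"worst case per row: {max_row_bytes / 1024:.0f} KB")
print(f"worst case total:   {max_total_bytes / 1024**3:.1f} GiB")
```

That comes to at most 250 KB of document data per row and about 119 GiB in the documents column family overall, before replication.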

Re: Hbase Storing pdf and Retrieval

New Contributor

Can you please tell me how you stored PDF and Word documents in HBase? @Ash Pad

Re: Hbase Storing pdf and Retrieval

Super Collaborator

HDP 2.5 has been released.

You can use the MOB feature now.