HBase: Storing and Retrieving PDFs

Contributor

We are planning to store PDF and Word documents in HBase. The storing part is fine; retrieval is the part I have questions about.

1. If we need to query this, is there a way to do it using reporting tools? For example: HBase --> Hive external table --> JDBC/ODBC --> Excel or any BI tool. However, how will the consumer app know that the field is a PDF file and not just a text field?

2. Is there a way for HBase REST to handle this?

Thanks in advance.

1 ACCEPTED SOLUTION

Hi @Ash Pad, Phoenix has JDBC and REST APIs today. There is an ODBC driver under development, which I believe is currently in beta. Thus you can run reporting-style queries against whatever document metadata you store in "normal" column types.

To access the PDF object itself, you can use the JDBC/ODBC/REST APIs to read/write the column as raw bytes. See the Phoenix DataTypes page to understand which column types support binary values.
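On the question of how a consumer app can tell a PDF from a plain text field once it has the raw bytes: one option is to check the leading "magic" bytes of the value. The sketch below is only an illustration of that idea, not something Phoenix or HBase does for you; the function name `sniff_document_type` is made up for this example. It relies on three well-known signatures: PDFs start with `%PDF-`, .docx files are ZIP containers starting with `PK\x03\x04`, and legacy .doc files use the OLE2 compound-file signature.

```python
# Minimal sketch: classify a raw cell value by its leading magic bytes.
# sniff_document_type is a hypothetical helper, not part of any HBase/Phoenix API.

PDF_MAGIC = b"%PDF-"                              # every PDF starts with this header
ZIP_MAGIC = b"PK\x03\x04"                         # ZIP local file header (.docx is a ZIP container)
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # OLE2 compound file (legacy .doc and others)


def sniff_document_type(cell: bytes) -> str:
    """Return a coarse type label based only on the first bytes of the value."""
    if cell.startswith(PDF_MAGIC):
        return "pdf"
    if cell.startswith(ZIP_MAGIC):
        # Any ZIP archive matches here, so this is "possibly .docx", not a guarantee.
        return "zip-container"
    if cell.startswith(OLE2_MAGIC):
        # Legacy Office formats (.doc, .xls, .ppt) all share this signature.
        return "ole2-container"
    return "unknown"
```

Since the ZIP and OLE2 signatures are shared by other formats, it is probably more robust to also store the content type (for example `application/pdf`) in a metadata column next to the document and use the byte check only as a sanity test.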

Re: HBase REST: you could use this if desired, though I don't see why you would rather than using the built-in JDBC/ODBC capabilities.
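If you do go the HBase REST route, keep in mind that, as far as I know, the JSON representation returned by the REST gateway base64-encodes row keys, column names, and cell values, so a client has to decode them before it gets the original document bytes back. Here is a rough sketch of that decoding step on an already-fetched response body. The helper name `decode_cells` and the row key and column names in the usage example are assumptions for illustration, and the sketch makes no HTTP call.

```python
import base64
import json


def decode_cells(body: str) -> dict:
    """Decode a REST-gateway-style JSON body into {(row_key, column): value_bytes}.

    Assumes the layout {"Row": [{"key": ..., "Cell": [{"column": ..., "$": ...}]}]}
    with every key, column, and value base64-encoded.
    """
    decoded = {}
    for row in json.loads(body).get("Row", []):
        row_key = base64.b64decode(row["key"])
        for cell in row.get("Cell", []):
            column = base64.b64decode(cell["column"]).decode("utf-8")
            # "$" holds the cell value; decoding yields the original document bytes.
            decoded[(row_key, column)] = base64.b64decode(cell["$"])
    return decoded
```

For example, a cell stored under the hypothetical column `docs:pdf1` would come back as `decode_cells(body)[(b"row1", "docs:pdf1")]`, and those bytes can be written straight to a `.pdf` file.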

7 REPLIES

Master Collaborator

Please take a look at:

https://issues.apache.org/jira/browse/HBASE-11339

which would reduce the I/O amplification incurred by medium-sized objects (MOB).

This feature is in the upcoming HDP 2.5 release.

Master Guru

@Ash Pad I personally like @Ted Yu's answer. Until that is released, I don't recommend storing these files in HBase. Instead, a common practice is to save the file on HDFS and store a "pointer" to it in HBase.
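To make the "pointer" idea concrete, here is a toy sketch of the pattern. It deliberately uses stand-ins: a local directory plays the role of HDFS and a plain dict plays the role of the HBase table, so none of this is real HDFS or HBase client code. The function names, the `docs:` qualifier, and the content-hash file naming are all assumptions for illustration.

```python
import hashlib
from pathlib import Path


def store_with_pointer(blob_root: Path, table: dict, row_key: str,
                       qualifier: str, data: bytes) -> Path:
    """Write the bytes to the file store and record only the path in the table.

    blob_root stands in for an HDFS directory, table for an HBase table.
    row_key is used as a directory name, so it must be filesystem-safe here.
    """
    digest = hashlib.sha256(data).hexdigest()   # content hash gives a stable file name
    path = blob_root / row_key / f"{digest}.bin"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(data)
    # Only the small pointer goes into the "table", not the document itself.
    table.setdefault(row_key, {})[qualifier] = str(path)
    return path


def load_via_pointer(table: dict, row_key: str, qualifier: str) -> bytes:
    """Look up the pointer in the table, then read the document from the file store."""
    return Path(table[row_key][qualifier]).read_bytes()
```

The trade-off is that a read now takes two hops (table lookup, then file read), and the two stores have to be kept consistent when documents are deleted or replaced.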

@Ash Pad, how big are your PDFs?

As in all things, it depends on your use case. If your PDFs are not in the multi-megabyte range, you may be fine storing them in a second column family today. This has the advantage of letting you query against doc metadata very quickly without needing to load full file contents into RegionServer memory. In most document management systems, this is highly desirable, as there is far more searching/querying than there is actual full content access.

Contributor

The PDFs are 50 KB max, and each row key can have up to 5 PDFs associated with it. The total volume of records would be in the 500K range. As you suggest, we have two column families, one for the metadata and one for the documents. Your suggestion gives a vote of confidence to our thought process.
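For a rough sense of scale, the worst case implied by those numbers can be worked out directly. This is a back-of-the-envelope sketch that assumes every row carries 5 PDFs of exactly 50 KB (taken as 50 × 1024 bytes) and ignores HBase key/value overhead, compression, and HDFS replication.

```python
KIB = 1024

MAX_PDF_BYTES = 50 * KIB      # 50 KB upper bound per PDF, from the figures above
MAX_PDFS_PER_ROW = 5          # up to 5 PDFs per row key
ROW_COUNT = 500_000           # roughly 500K records


def worst_case_bytes(max_doc_bytes: int, docs_per_row: int, rows: int) -> int:
    """Upper bound on raw document bytes: every row full, every document at max size."""
    return max_doc_bytes * docs_per_row * rows


total = worst_case_bytes(MAX_PDF_BYTES, MAX_PDFS_PER_ROW, ROW_COUNT)
print(f"per row: {MAX_PDF_BYTES * MAX_PDFS_PER_ROW} bytes")
print(f"total:   {total} bytes ({total / KIB**3:.1f} GiB)")
```

That comes to 256,000 bytes per row at most and 128,000,000,000 bytes overall, roughly 119 GiB of raw document data before replication. The real footprint will likely be lower, since not every row will have 5 documents at the maximum size.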

New Contributor

Can you please tell me how to store PDF and Word Documents in Hbase? @Ash Pad

Master Collaborator

HDP 2.5 has been released.

You can use the MOB feature now.