Created 07-18-2016 07:36 PM
We are planning to store PDF and Word documents in HBase. The storage part is fine; my questions are about retrieval.
1. If we need to query this data, is there a way to do it with a reporting tool? For example: HBase --> Hive external table --> JDBC/ODBC --> Excel or any BI tool. However, how will the consumer app know that the field is a PDF file and not just a text field?
2. Is there a way for HBase REST to handle this?
Thanks in advance.
Created 07-18-2016 07:56 PM
Hi @Ash Pad, Phoenix has JDBC and REST APIs today. There is an ODBC driver under development, which I believe is currently in beta. Thus you can run reporting-style queries against whatever document metadata you store in "normal" column types.
To access the PDF object itself, you can use the JDBC/ODBC/REST APIs to read and write the column as raw bytes. See the Phoenix DataTypes page to understand which column types support binary values.
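As a rough illustration of reading the column as raw bytes over JDBC, here is a minimal sketch. The table name DOCUMENTS and the columns DOC_ID and CONTENT are hypothetical, and CONTENT is assumed to be declared with a binary type such as VARBINARY. Only standard java.sql calls are used.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Hypothetical schema (an assumption, not from this thread): a Phoenix table
// DOCUMENTS with a VARCHAR key DOC_ID and a binary column CONTENT that holds
// the raw file bytes.
class PhoenixDocumentReader {
    static final String SELECT_CONTENT =
        "SELECT CONTENT FROM DOCUMENTS WHERE DOC_ID = ?";

    // Returns the stored bytes, or null when no row matches docId.
    // The caller owns the connection and is responsible for closing it.
    static byte[] readDocument(Connection conn, String docId) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(SELECT_CONTENT)) {
            ps.setString(1, docId);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getBytes(1) : null;
            }
        }
    }
}
```

A caller would typically obtain the connection with `DriverManager.getConnection("jdbc:phoenix:<zookeeper-quorum>")` with the Phoenix client jar on the classpath; the quorum value is a placeholder for your own cluster.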
Re: HBase REST: you could use it if desired, though I don't see why you would instead of using the built-in JDBC/ODBC capabilities.
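Whichever API returns the bytes, the consumer still has to work out what kind of file it received. One option is to keep a content-type value in a normal metadata column. Another is to inspect the leading "magic" bytes, as in the sketch below. The class and enum names are made up for illustration, and the check only identifies the container format: a ZIP signature is consistent with .docx but does not prove it, and the OLE2 signature is shared by legacy .doc, .xls, and .ppt files.

```java
import java.nio.charset.StandardCharsets;

// Illustrative helper (hypothetical names): guesses a document's container
// format from its first bytes.
class DocumentTypeSniffer {
    enum Kind { PDF, OLE2_LEGACY_OFFICE, ZIP_CONTAINER, UNKNOWN }

    // "%PDF-" opens every PDF file.
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    // ZIP local file header; .docx files are ZIP packages.
    private static final byte[] ZIP_MAGIC = {'P', 'K', 3, 4};
    // OLE2 compound file header, used by legacy .doc files.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    static Kind sniff(byte[] data) {
        if (startsWith(data, PDF_MAGIC)) return Kind.PDF;
        if (startsWith(data, OLE2_MAGIC)) return Kind.OLE2_LEGACY_OFFICE;
        if (startsWith(data, ZIP_MAGIC)) return Kind.ZIP_CONTAINER;
        return Kind.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] sample = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff(sample));
    }
}
```

Storing the type explicitly in the metadata column family is usually simpler for BI tools, since they can filter on it without ever touching the binary column.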
Created 07-18-2016 09:01 PM
Please take a look at the HBase MOB (medium object) feature:
https://issues.apache.org/jira/browse/HBASE-11339
It reduces the I/O amplification incurred by medium-sized objects.
This feature is in the upcoming HDP 2.5 release.
Created 07-19-2016 03:24 PM
@Ash Pad, how big are your PDFs?
As in all things, it depends on your use case. If your PDFs are not in the multi-megabyte range, you may be fine storing them in a second column family today. This has the advantage of letting you query against document metadata very quickly without loading full file contents into RegionServer memory. In most document management systems this is highly desirable, as there is far more searching and querying than actual full-content access.
Created 07-19-2016 05:51 PM
PDFs are 50 KB max, and each row key can have up to 5 PDFs associated with it. The total volume would be in the range of 500K records. As you suggest, we have two column families, one for the metadata and one for the documents. Your suggestion gives us a vote of confidence in our approach.
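For a sense of scale, those limits can be turned into a worst-case size for the document column family. This is only a back-of-the-envelope sketch: it assumes 50 KB means 50 × 1024 bytes, that the 500K figure counts row keys, and that every row holds the maximum of 5 full-size PDFs. It ignores HDFS replication, key overhead, and compression.

```java
// Back-of-the-envelope sizing under the assumptions stated above.
// The class and method names are illustrative only.
class DocumentStoreSizing {
    // Worst-case raw payload: every row holds the maximum number of
    // maximum-size documents. multiplyExact throws on long overflow.
    static long worstCaseBytes(long maxDocBytes, int maxDocsPerRow, long rowCount) {
        long perRow = Math.multiplyExact(maxDocBytes, (long) maxDocsPerRow);
        return Math.multiplyExact(perRow, rowCount);
    }

    public static void main(String[] args) {
        long bytes = worstCaseBytes(50L * 1024, 5, 500_000L);
        double gib = bytes / (double) (1L << 30);
        System.out.printf("worst case: %d bytes (about %.0f GiB)%n", bytes, gib);
    }
}
```

With these inputs the upper bound is 128,000,000,000 bytes, roughly 119 GiB of raw document payload before replication.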
Created 12-15-2016 07:13 PM
HDP 2.5 has been released.
You can use the MOB feature now.