Support Questions


Best way to store a PDF from Avro binary in NiFi

Contributor

Hi,

So I figure that with NiFi 1.3 the best way to send PDFs between two systems is to put them inside a binary ("bytes") field in Avro. The data comes in as binary, I can modify some fields with something like UpdateRecord, and then I want to store it in HBase. Here is where the problem comes in: the binary field in Avro works fine, but my route into HBase only takes JSON (same with Solr or Elasticsearch), and when I convert the Avro to JSON, JSON cannot store binary (it ends up as an array of integers).
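For context, the records look roughly like this when I build them on the sending side (the schema and field names here are just illustrative, not my real ones):

    import java.nio.ByteBuffer;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.generic.GenericRecordBuilder;

    public class BuildDocumentRecord {
        public static void main(String[] args) throws Exception {
            // A record with one string field and one "bytes" field holding the PDF
            Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Document\",\"fields\":["
                + "{\"name\":\"filename\",\"type\":\"string\"},"
                + "{\"name\":\"content\",\"type\":\"bytes\"}]}");

            byte[] pdfBytes = Files.readAllBytes(Paths.get("/tmp/report.pdf"));

            GenericRecord record = new GenericRecordBuilder(schema)
                .set("filename", "report.pdf")
                .set("content", ByteBuffer.wrap(pdfBytes))  // Avro "bytes" takes a ByteBuffer
                .build();
        }
    }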

What's the standard way of storing this in a JSON NoSQL database? Would it be smarter to convert the binary to something like Base64 and store that as the JSON field?
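Something like this is what I had in mind for the Base64 route (just a sketch outside NiFi, not a real flow):

    import java.util.Base64;

    public class Base64Sketch {
        public static void main(String[] args) {
            byte[] pdfBytes = {0x25, 0x50, 0x44, 0x46};  // pretend this is the PDF content

            // Encode the binary so it can live in a plain JSON string field
            String encoded = Base64.getEncoder().encodeToString(pdfBytes);

            // On the way back out, decode the field to recover the original bytes
            byte[] decoded = Base64.getDecoder().decode(encoded);
        }
    }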

Additionally, whenever I work with Avro records that contain binary, I have trouble converting them; for example, I get a lot of "ArrayIndexOutOfBounds" exceptions when I try to use UpdateRecord.

I guess the only way is to use SplitAvro or SplitJson and store the binary in the flow file content? That would slow down the process a lot.

1 ACCEPTED SOLUTION

Master Guru

HBase is actually not a JSON store... the row id, column family, column qualifiers, and values are all stored as byte[], so they can be whatever you want.

In NiFi there is PutHBaseJson, which is one way of getting data into HBase because JSON is a convenient way to represent the columns of a row as key/value pairs, but it is NiFi that is choosing to use JSON here, not HBase. There is another processor, PutHBaseCell, which writes the byte content of a flow file to a single cell value in HBase. That would make sense if you had GetFile pick up a PDF from a directory and wanted to store the PDF as the value of a cell in HBase.
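To illustrate that HBase itself just stores bytes, this is roughly what a direct client write of a PDF into a cell looks like (the table, family, and qualifier names are made up for the example):

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PutPdfCell {
        public static void main(String[] args) throws Exception {
            byte[] pdfBytes = Files.readAllBytes(Paths.get("/tmp/report.pdf"));

            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("documents"))) {
                Put put = new Put(Bytes.toBytes("doc-001"));   // row key
                put.addColumn(Bytes.toBytes("f"),              // column family
                              Bytes.toBytes("pdf"),            // column qualifier
                              pdfBytes);                       // raw cell value, no JSON involved
                table.put(put);
            }
        }
    }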

Solr and ES are both text-indexing systems, so they aren't really made to store binary data, although I believe they do have a binary field type. Most likely you would use them to index the text content of the PDF, which could be extracted with something like Tika; here is an example of that:

https://community.hortonworks.com/articles/42210/using-solrs-extracting-request-handler-with-apache....
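As a rough idea of the extraction step itself (plain Tika outside of NiFi; the file path is just for the example):

    import java.io.File;
    import org.apache.tika.Tika;

    public class ExtractPdfText {
        public static void main(String[] args) throws Exception {
            Tika tika = new Tika();
            // Pull the plain text out of the PDF; this is what you would index in Solr/ES
            String text = tika.parseToString(new File("/tmp/report.pdf"));
            System.out.println(text);
        }
    }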

There is a pull request up for an issue that might be related to the ArrayIndexOutOfBounds exceptions...

https://github.com/apache/nifi/pull/2718


2 REPLIES

Contributor

Right, but my goal was to write binary cells along with other string cells using PutHBaseJson. My issue is that instead of a GetFile, I'm receiving a flow file with a binary ("bytes") field in it. I'm having trouble parsing it so that I can put the binary into the flow file content while keeping the other fields as attributes; then I could send the attributes to PutHBaseJson and the flow file content to PutHBaseCell. For that I think I need to split the Avro records into individual flow files, and the tricky part is converting the binary field into flow file content. How exactly would you convert one field of an Avro record into flow file content?
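In case it helps to make the goal concrete, per record this is conceptually what I'm trying to do (plain Avro Java API, again with my made-up "content" field name):

    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.nio.ByteBuffer;
    import org.apache.avro.file.DataFileStream;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;

    public class ExtractBytesField {
        public static void main(String[] args) throws Exception {
            try (InputStream in = new FileInputStream("/tmp/records.avro");
                 DataFileStream<GenericRecord> records =
                     new DataFileStream<>(in, new GenericDatumReader<GenericRecord>())) {
                for (GenericRecord record : records) {
                    // The Avro "bytes" field comes back as a ByteBuffer
                    ByteBuffer buf = (ByteBuffer) record.get("content");
                    byte[] pdfBytes = new byte[buf.remaining()];
                    buf.get(pdfBytes);
                    // pdfBytes is what I'd want as flow file content (-> PutHBaseCell),
                    // while the remaining fields would become attributes (-> PutHBaseJson)
                }
            }
        }
    }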