Encrypted NiFi Content Repository

Expert Contributor

Hi Guys,

I noticed NiFi added an encrypted provenance repository in v1.3.

May I ask about the timeline for releasing the encrypted content repository feature?

We fetch encrypted financial data into NiFi, decrypt it to transform some fields, and then re-encrypt it with another algorithm.

Based on my understanding, the decryption processor will leave a copy of the unencrypted data on disk, which is not acceptable for our compliance requirements.

Any idea about that?

Thanks.

1 ACCEPTED SOLUTION


Alvin,

The encrypted content repository feature is actively being worked on. As a rule, we cannot make claims about delivery dates or versions for features under active development. Hope this helps.

If you have a compliance requirement that sensitive data (such as PII, PCI/payment details, EPHI, etc.) is never stored on disk in plaintext, you can explore using the volatile content repository. Be aware, though, that it carries the risk of data loss in the event of power failure, and this applies to all content objects, not just the sensitive records.
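
For reference, a minimal nifi.properties sketch for switching to the volatile content repository could look like the following; the max.size cap shown is illustrative and would need to be tuned to your heap budget:

  # swap the content repository implementation to the in-memory variant
  nifi.content.repository.implementation=org.apache.nifi.controller.repository.VolatileContentRepository
  # upper bound on memory the volatile repository may consume (illustrative value)
  nifi.volatile.content.repository.max.size=100 MB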

With Apache NiFi 1.3.0, you can also use the RecordReader and RecordSetWriter approach. While the EncryptedRecordReader and EncryptedRecordSetWriter controller services are not yet available, you could use a custom ScriptedRecordReader and ScriptedRecordSetWriter to decrypt and re-encrypt on the fly. The intermediate "record" object is never persisted to disk, so at no point would the plaintext data be written outside of volatile memory, regardless of the content repository implementation.
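
As a standalone illustration of that decrypt-and-re-encrypt-on-the-fly idea (plain JDK code, not the NiFi API), here is a minimal Java sketch; the AES/CBC choice, the key/IV handling, and the transformFields step are placeholder assumptions that a real scripted reader/writer pair would replace:

  import javax.crypto.Cipher;
  import javax.crypto.spec.IvParameterSpec;
  import javax.crypto.spec.SecretKeySpec;
  import java.nio.charset.StandardCharsets;

  public class InMemoryReEncryptionSketch {

      // Decrypt with the inbound key, transform fields, and re-encrypt with a
      // different key. The plaintext exists only as byte[]/String on the heap;
      // nothing here ever touches the filesystem.
      public static byte[] reEncrypt(final byte[] ciphertext,
                                     final SecretKeySpec inboundKey, final IvParameterSpec inboundIv,
                                     final SecretKeySpec outboundKey, final IvParameterSpec outboundIv)
              throws Exception {
          final Cipher decryptCipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
          decryptCipher.init(Cipher.DECRYPT_MODE, inboundKey, inboundIv);
          final byte[] plaintext = decryptCipher.doFinal(ciphertext);

          final byte[] transformed = transformFields(plaintext); // placeholder for field changes

          final Cipher encryptCipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
          encryptCipher.init(Cipher.ENCRYPT_MODE, outboundKey, outboundIv);
          return encryptCipher.doFinal(transformed);
      }

      // Hypothetical field transformation; a real scripted reader/writer would
      // parse records here and modify individual field values.
      private static byte[] transformFields(final byte[] plaintext) {
          final String body = new String(plaintext, StandardCharsets.UTF_8);
          return body.trim().getBytes(StandardCharsets.UTF_8);
      }
  }

The point is simply that the plaintext never leaves the JVM heap between the two doFinal calls.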


8 REPLIES


Expert Contributor

Hi @Andy LoPresto,

Thank you for your work on the encrypted repositories tickets.

We considered the volatile content repository, but it exposes all workflows to the risk of data loss.

"Decrypt and Re-encrypt on the fly" sounds like a better one for us. We can extract the non-sensitive fields as attributes, while leave the sensitive data in payload.

Thanks.

New Contributor

Hello @Andy LoPresto,

After reading here (and on other forum posts), it appears that there's a group of us who are seeking encryption of our "data at rest" (i.e. encryption of the flowfile, content, and provenance repositories). You all at Hortonworks have recently provided an encrypted provenance repository (thank you!), but if we also want the other two to be encrypted, it seems that we currently have five options:

1) Create our own encrypted versions of the pluggable writers/readers for the Content and Flowfile repositories.

2) Wait for you/Hortonworks to finish implementing encrypted versions of the pluggable writers/readers for the Content and Flowfile repositories.

3) Use an encrypted volume. (The OS / disk-driver handles the encryption and decryption transparently)

4) Use a VolatileContentRepository (at our own risk of data loss)

5) Create a ScriptedRecordReader and ScriptedRecordSetWriter (in lieu of the pending EncryptedRecordReader and EncryptedRecordSetWriter controller services).

You alluded to #5 in your response to Alvin, above. My question is: how does #5 differ from #1? More specifically, when does NiFi write content and flowfile data to disk? When queues get full? Would I be correct to assume that #5 only works if we're processing record-oriented data exclusively? For example, if our flow were to assemble record-oriented data from one or more relations (i.e. from sources other than an EncryptedRecordReader), wouldn't there be a possibility of plaintext data being written to disk before it becomes associated/identified as a "record"?


Alex, your question regarding the difference between option 1 and option 5 is a good one. Option 1 concerns the low-level readers and writers that back the repository implementations themselves. These classes serialize and deserialize the repository's contents between Java objects and the byte streams written to the repository files on disk. Option 5 references the record readers and writers, which are used on the NiFi canvas to support the abstract "record" concept: a collection of individual units of data within a single flowfile. In this case, the reader and writer classes convert data between external formats (JSON, CSV, arbitrary via ScriptedRecord*) and the NiFi internal record format. I understand these concepts seem related, but the classes are completely separate and there is no overlap whatsoever.

The as-yet-undeveloped EncryptedRecordReader and EncryptedRecordSetWriter classes you mention would allow you to operate on encrypted flowfile content. Say a flowfile contained 100 lines of customer data, where some of the column values were PII/PCI/PHI and therefore encrypted. If you needed to update these records (for example, add a new property to each record containing the last four digits of a credit card number whose full value was encrypted), you could use an UpdateRecord processor as follows:

  1. EncryptedRecordReader to decrypt the records ephemerally
  2. Add a property "lastFourDigits" which reads the /PAN field and slices the last four digits
  3. EncryptedRecordSetWriter would re-encrypt the sensitive fields

All of these actions happen within the lifecycle of a single onTrigger call of the UpdateRecord processor (even though some of the logic is performed by the controller services), so none of the plaintext "record" data is persisted anywhere on the system; it exists only in RAM. To be clear, in this situation the actual implementation of the record reader and writer would need to combine the crypto capabilities with the format handling, so it would actually be something like EncryptedJsonRecordReader and EncryptedJsonRecordSetWriter.

I haven't done the full architecture work here yet, but obviously it's not ideal to have 2*n implementations just to provide the crypto capabilities. This would likely require architecture changes: either allow multiple, sequentially stacked readers and writers in the processor, or have the Encrypted* implementations accept a "type-specific" record reader/writer in their definitions and perform this task via composition. That way you would maintain the type-conversion flexibility that currently exists (i.e. an EncryptedRecordReader has a JsonRecordReader, an EncryptedRecordSetWriter has a CsvRecordSetWriter, etc.).
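
To make that composition idea concrete, here is a rough Java sketch; SimpleRecordReader and both classes below are simplified, hypothetical stand-ins rather than the actual org.apache.nifi.serialization types:

  import java.util.Iterator;
  import java.util.List;
  import java.util.function.UnaryOperator;

  // Simplified stand-in for the record reader concept; the real NiFi
  // interfaces in org.apache.nifi.serialization are considerably richer.
  interface SimpleRecordReader {
      String nextRecord(); // returns null once the record set is exhausted
  }

  // A format-specific reader, e.g. one record per JSON line.
  class JsonLineRecordReader implements SimpleRecordReader {
      private final Iterator<String> lines;

      JsonLineRecordReader(final List<String> jsonLines) {
          this.lines = jsonLines.iterator();
      }

      @Override
      public String nextRecord() {
          return lines.hasNext() ? lines.next() : null;
      }
  }

  // Hypothetical EncryptedRecordReader built via composition: it wraps any
  // type-specific reader and applies the crypto step to each record it emits,
  // avoiding separate EncryptedJsonRecordReader, EncryptedCsvRecordReader, etc.
  class EncryptedRecordReader implements SimpleRecordReader {
      private final SimpleRecordReader formatReader;
      private final UnaryOperator<String> decryptFields; // placeholder for real field-level decryption

      EncryptedRecordReader(final SimpleRecordReader formatReader,
                            final UnaryOperator<String> decryptFields) {
          this.formatReader = formatReader;
          this.decryptFields = decryptFields;
      }

      @Override
      public String nextRecord() {
          final String record = formatReader.nextRecord();
          return record == null ? null : decryptFields.apply(record);
      }
  }

The decorator shape is what keeps the crypto concern orthogonal to the format concern, which is exactly what avoids the 2*n explosion.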

As for when NiFi persists data to disk, this usually happens during/after the onTrigger phase of the data lifecycle. You can see code like

  flowFile = processSession.putAttribute(flowFile, JMS_SOURCE_DESTINATION_NAME, destinationName);

or

  // stage the message body as the flowfile content via a write callback
  flowFile = processSession.write(flowFile, new OutputStreamCallback() {
      @Override
      public void process(final OutputStream out) throws IOException {
          out.write(response.getMessageBody());
      }
  });

This populates the flowfile attributes or content, respectively; then, on processSession.commit(), that data is persisted to whichever content/flowfile repository implementation is configured (i.e. it could be written to disk, volatile memory, etc.). There is a good document that goes into depth on the write-ahead log implementation, and I wrote extensively about the provenance repository serialization here. I hope this clarifies the system. Please follow up if you have further questions.
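
Putting those fragments in context, a minimal processor skeleton along these lines might look as follows; the relationship name and the pass-through StreamCallback body are illustrative only:

  import java.io.IOException;
  import java.io.InputStream;
  import java.io.OutputStream;
  import org.apache.nifi.flowfile.FlowFile;
  import org.apache.nifi.processor.AbstractProcessor;
  import org.apache.nifi.processor.ProcessContext;
  import org.apache.nifi.processor.ProcessSession;
  import org.apache.nifi.processor.Relationship;
  import org.apache.nifi.processor.exception.ProcessException;
  import org.apache.nifi.processor.io.StreamCallback;
  import org.apache.nifi.stream.io.StreamUtils;

  public class PassthroughProcessor extends AbstractProcessor {

      static final Relationship REL_SUCCESS = new Relationship.Builder()
              .name("success").build();

      @Override
      public void onTrigger(final ProcessContext context, final ProcessSession session)
              throws ProcessException {
          FlowFile flowFile = session.get();
          if (flowFile == null) {
              return; // nothing queued on the incoming connection
          }

          // Content changes are staged against the session; nothing is durable yet.
          flowFile = session.write(flowFile, new StreamCallback() {
              @Override
              public void process(final InputStream in, final OutputStream out) throws IOException {
                  StreamUtils.copy(in, out); // placeholder transformation
              }
          });

          // Only when the session commits (the framework does this after onTrigger
          // for AbstractProcessor subclasses) is the data persisted to whichever
          // content/flowfile repository implementation is configured.
          session.transfer(flowFile, REL_SUCCESS);
      }
  }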

New Contributor

Thanks for your in-depth response, Andy! This is great; it certainly clarified the gray areas for me. It sounds like option 5, then, is simply passing around encrypted data as the flowfile content. If a given processor needs to access the content, it would employ an EncryptedRecordReader (or however you choose to build it; your layered approach sounds good, and I agree it would be better than the 2*n implementations). Since the content is decrypted and re-encrypted as part of the onTrigger phase, plaintext data would never be written to the content repository. It sounds like the attributes would still be unencrypted, so processors that deal with those (e.g. RouteOnAttribute) could still function.

Expert Contributor

Hi @Andy LoPresto,

I am curious to know how risky it is to use the volatile content repository.

My understanding is:

If there is a node failure/restart:

For data that has already been processed and persisted through the flow, there is no impact on our business or downstream systems.

But users cannot view and/or replay content via the provenance UI, since the content is gone after the restart.

For flowfiles whose content is still in the middle of the flow during the node failure/restart, we can't resume them from where they failed once the node is back to normal. Instead, we have to fetch the same files from the source again and reprocess them end to end through the flow.

If the above is correct, I would say that as long as we have the source data permanently persisted somewhere outside NiFi, we can always reprocess it when data in the volatile content repository is lost. The only loss is the ability to view/replay it via the provenance UI.

BTW, what happens when content exceeds the maximum size of the repository?

An out-of-memory exception? Auto-purged from memory? Auto-archived to disk?

If I set nifi.content.repository.implementation=org.apache.nifi.controller.repository.VolatileContentRepository

Does that mean the properties below are automatically disabled?

nifi.content.claim
nifi.content.repository.archive
nifi.content.viewer.url

Any comments are appreciated.

Thanks.

Expert Contributor

Hi @Andy LoPresto,

Just want to follow up on ticket https://issues.apache.org/jira/browse/NIFI-3834

Is it prioritized for this year?

Thanks.


Hi Alvin,

As stated above, I cannot indicate prioritization or scheduling of feature delivery. I am eager to develop this feature, as I am sure many users would like it to be available as well. You can always monitor activity on the Apache NiFi Jira and the mailing lists.