Removing Personally Identifiable Information (PII) from HDFS

In May 2018, new legislation will be introduced in the EU to force organisations to get explicit permission from customers in order to use their data. GDPR presents challenges for the Big Data world.

 

My team faces issues right now and I'm looking for flexible solutions to give our developers a range of tools depending on the type of data and contractual requirements presented to us.

 

The problem is clear - "How can we depersonalise specific columns and rows in Impala tables based on a matrix of rules?"

 

For example, Customer A ended up in HDFS due to a relationship we have with Business Y. Customer B's data was saved as part of a line of business we are running in partnership with Business Z. Customer A never resulted in a sale but Customer B did. After 3 months we should remove Customer A's personally identifiable information (PII). But we're allowed to keep Customer B's data for 6 years for administrative and tax purposes.

 

Now it gets tricky. Both Customer A and Customer B are in the same partition within an Impala table. The file is approximately 100 MB in size.

 

To depersonalise Customer A's data we built a job which runs daily and obfuscates against a defined set of rules. But the job has to scan most of our partitions looking for matching cases and then SELECT the data back into a table, with the PII amended for the customers the routine identified.

 

This job takes a long time to run and is I/O heavy. Every time we have a new contractual relationship the code needs to be altered. It starts looking messy after just a few use cases are added and also causes us a testing headache. We have built a monster that doesn't scale up.

 

I'm interested in how other teams are addressing this issue.

 

The potential solutions floated so far are:

  1. Encrypt the PII but don't allow decryption. (I don't like this and I don't think an auditor would either.)
  2. Flume the data in twice: once with all data available and once with all PII removed. Then delete the version with PII after 3 months. (But in a scenario where we have multiple rules we'd still need to selectively delete some data but not others. And which version do we use? How do we guard against duplication?)
  3. Delete everything after n days rather than risk breaching a contract. (The nuclear option, but we would lose a lot of analytically useful data if we did this.)

None of these solutions are ideal.

 

For reference we're running CDH 5.9 in a 20 node cluster (16 data nodes). We use Flume to capture approx 11 TB of data per month. The data is used in 'almost' real time but after a defined period is needed only for trend analysis. We use Impala predominantly but are making a shift to Spark where appropriate. Our current depersonalisation process uses a daily Oozie job which runs Impala code. This solution won't scale but the contractual requirements seem to be scaling quite fast.

 

Many thanks for any advice in advance,

 

Regards,

 

Gary

1 ACCEPTED SOLUTION

Cloudera Employee

Hi Gary,

 

Thanks for this rather timely post - given, as you pointed out, that many companies are actively working toward fulfilling GDPR requirements.

 

Let me first outline a high-level set of steps that some organizations are using for dealing with record deletions in HDFS:

 

  1. Upon initial write, or via batch jobs executing periodically after arrival, remove or redact external identifiers and replace them with an internal identifier. The fields in these files then contain only pseudonymized (depersonalized) PII; in other words, PII that can't be linked back to an individual by itself. This job should only need to run once on each batch of data (file or partition).
  2. Actual re-identifiable PII fields that need to be retained are stored in a separate (smaller) “PII lookup table” specifically for this purpose.
    • Each row in this lookup table should contain an internal unique identifier which is used to tie it to the data in the larger files in HDFS. This identifier could either be an existing internal customerID or a new identifier token generated for this purpose.
  3. When requests for deletion arrive, add these to a separate “pending deletions” table; this table contains the internal customerID of each record that needs deletion.
  4. On a periodic basis (say, daily or weekly), re-generate the files which contain re-identifiable PII, removing the rows which match entries in the "pending deletions" table. If PII has been confined to a small number of tables (ideally the single PII lookup table), then this operation will be much less expensive.
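
As a rough illustration only, the four steps above could be sketched in plain Python as follows. This is not Impala or Spark code; the column names (`name`, `email`), the use of `email` as the external identifier, the `customer_token` field and the in-memory dictionaries standing in for tables are all assumptions made for the example.

```python
import uuid

PII_FIELDS = {"name", "email"}  # hypothetical PII column names


def pseudonymize(record, pii_lookup, token_index):
    """Steps 1 and 2: strip PII from a record and file it in the lookup table."""
    email = record["email"]  # assumed external identifier
    token = token_index.get(email)
    if token is None:
        # First time we see this customer: mint a random internal token.
        token = uuid.uuid4().hex
        token_index[email] = token
        pii_lookup[token] = {field: record[field] for field in PII_FIELDS}
    event = {k: v for k, v in record.items() if k not in PII_FIELDS}
    event["customer_token"] = token
    return event


def apply_pending_deletions(pii_lookup, token_index, pending_deletions):
    """Step 4: regenerate the small PII structures without the deleted tokens."""
    kept_lookup = {t: row for t, row in pii_lookup.items()
                   if t not in pending_deletions}
    # The email -> token index is itself PII, so it is purged as well.
    kept_index = {e: t for e, t in token_index.items()
                  if t not in pending_deletions}
    return kept_lookup, kept_index


pii_lookup, token_index = {}, {}
raw = [
    {"name": "Customer A", "email": "a@example.com", "source": "Business Y", "page": "/quote"},
    {"name": "Customer B", "email": "b@example.com", "source": "Business Z", "page": "/buy"},
    {"name": "Customer A", "email": "a@example.com", "source": "Business Y", "page": "/exit"},
]
events = [pseudonymize(r, pii_lookup, token_index) for r in raw]

# Step 3: a deletion request for Customer A is queued by internal token.
pending_deletions = {events[0]["customer_token"]}
pii_lookup, token_index = apply_pending_deletions(
    pii_lookup, token_index, pending_deletions)
```

The large `events` data is written once and never rewritten; only the small lookup structures are regenerated when deletions are applied.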

Would something like the flow above work in your case?   Or is this similar to what you are doing?

 

You've stated: "To depersonalise Customer A's data we built a job which runs daily and obfuscates against a defined set of rules. But the job has to scan most of our partitions looking for matching cases..."

 

From this it sounds like you are re-scanning nearly all of your data daily, using this job to both anonymize and remove records that don't match your rules for inclusion. Do I understand that correctly? Or are you just scanning new data daily?

 

-Steve

Thanks for your reply Steve. It got us on the right track in our internal discussions.

 

What you describe here does sound like a distinct improvement on our current process.

 

Separate to our PII project we have a Customer 360 project which generates an internal ID. After mapping this process out I think we may gain from combining the two streams of work.

 

We could:

  1. Separate out PII on ingestion
  2. Process that PII into our C360 each day, then delete all the PII in the ingestion table for that day
  3. Periodically delete PII from our C360 table based on a rules matrix
    1. Enquiries which result in a customer relationship mean data can be retained for longer
    2. Enquiries which do not result in business are deleted after n days
  4. Use views which combine the C360 and ingested data to allow reporting to include PII where appropriate

Under this model we're deleting an entire partition of PII daily and running the more complex rules against a much smaller C360 table, which we could partition by source rather than date. After all, source is more aligned with the rules matrix than date.
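
To make the rules matrix idea concrete, here is a minimal sketch in Python of holding the retention rules as data rather than code, so a new contractual relationship adds a row instead of a code change. The field names (`source`, `became_customer`, `first_seen`), the 90-day and 6-year figures expressed in days, and the fallback to the shortest retention period are all assumptions for illustration.

```python
from datetime import date, timedelta

# Hypothetical rules matrix: (source, became_customer) -> retention in days.
RETENTION_DAYS = {
    ("Business Y", False): 90,       # enquiry only: roughly 3 months
    ("Business Y", True): 6 * 365,   # customer relationship: roughly 6 years
    ("Business Z", False): 90,
    ("Business Z", True): 6 * 365,
}
DEFAULT_RETENTION_DAYS = 90  # assumption: unknown cases get the shortest period


def pii_expired(row, today):
    """Return True when the row has outlived the retention period for its rule."""
    days = RETENTION_DAYS.get(
        (row["source"], row["became_customer"]), DEFAULT_RETENTION_DAYS)
    return today > row["first_seen"] + timedelta(days=days)


def purge_expired(c360_rows, today):
    """Keep only the C360 rows whose PII may still be retained."""
    return [row for row in c360_rows if not pii_expired(row, today)]


c360 = [
    {"customer_id": "c-1", "source": "Business Y",
     "became_customer": False, "first_seen": date(2018, 1, 1)},
    {"customer_id": "c-2", "source": "Business Z",
     "became_customer": True, "first_seen": date(2018, 1, 1)},
]
remaining = purge_expired(c360, today=date(2018, 6, 1))
```

Because the lookup key starts with the source, partitioning the C360 table by source would let each rule be evaluated against only the partitions it applies to.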

 

Regards,

 

Gary.

New Contributor

Hi,

We are in the middle of GDPR as well and wanted to try that kind of approach.

Happy to see that we are not the only ones 🙂

Do you have concrete experience with the implementation and data management processes in place, or is it just conceptual at the moment?

Thanks,

Céline.

 

Hi Celine,

 

We've progressed our work so we're now in a compliant position. Or as compliant as we can be before GDPR cases occur and precedent is set. Our internal compliance team is as happy as they can be.

 

Our approach has been to delete and depersonalise to reduce the risk in the short term. We also added much more process around the uses of PII data, which now require audit and sign-off.

 

Long term we're architecting towards my preferred solution which is to store PII data in a core table and tokenise it in other data structures so queries can refer to the core data. This will enable us to perform any action to delete, depersonalise and audit on just the core data source.
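
A minimal sketch of that core-table idea, in Python, might look like the following. The `CorePiiStore` class, its method names and the `actor` argument are hypothetical and only illustrate how delete and audit can be confined to one place while other data structures hold nothing but tokens.

```python
from datetime import datetime, timezone


class CorePiiStore:
    """Hypothetical single home for PII; everything else stores only tokens."""

    def __init__(self):
        self._rows = {}
        self.audit_log = []

    def _audit(self, action, token, actor):
        # Every touch of the core data is recorded for later audit.
        self.audit_log.append({
            "at": datetime.now(timezone.utc),
            "action": action,
            "token": token,
            "actor": actor,
        })

    def put(self, token, pii, actor):
        self._rows[token] = dict(pii)
        self._audit("put", token, actor)

    def resolve(self, token, actor):
        """Return the PII for a token, or None if it has been deleted."""
        self._audit("read", token, actor)
        return self._rows.get(token)

    def delete(self, token, actor):
        """Delete PII in one place; tokenised data elsewhere is untouched."""
        self._audit("delete", token, actor)
        return self._rows.pop(token, None) is not None


store = CorePiiStore()
store.put("tok-1", {"name": "Customer A", "email": "a@example.com"}, actor="ingest")

# Other data structures refer to the customer by token only.
orders = [{"order_id": 1, "customer_token": "tok-1", "value": 120}]

deleted = store.delete("tok-1", actor="retention-job")
resolved = store.resolve(orders[0]["customer_token"], actor="report")
```

After the delete, the `orders` rows still support trend analysis, but the token no longer resolves to a person through the core store.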

 

Re-architecture is a longer-term aim. We can't stop working on business-focused projects to do a large, complex refactoring of our data storage that would take a couple of months and put our existing data services at risk - the business would not allow this.

 

Instead we're planning all current and future projects with this re-architecture in mind. It might take years to complete but we have risk mitigation in place so it is no longer an urgent problem for us.

 

Regards,

 

Gary