New Contributor
Posts: 3
Registered: 12-02-2015
Accepted Solution

Removing Personally Identifiable Information (PII) from HDFS

In May 2018, new legislation will be introduced in the EU that forces organisations to get explicit permission from customers in order to use their data. The GDPR presents challenges for the Big Data world.


My team faces issues right now and I'm looking for flexible solutions to give our developers a range of tools depending on the type of data and contractual requirements presented to us.


The problem is clear - "How can we depersonalise specific columns and rows in Impala tables based on a matrix of rules?"


For example, Customer A ended up in HDFS due to a relationship we have with Business Y. Customer B's data was saved as part of a line of business we are running in partnership with Business Z. Customer A never resulted in a sale but Customer B did. After 3 months we should remove Customer A's personally identifiable information (PII). But we're allowed to keep Customer B's data for 6 years for administrative and tax purposes.
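To make the example concrete, here is a minimal sketch of how such a rules matrix could be held as data rather than code. The source names, the `(source, resulted_in_sale)` key and the day counts are assumptions made up for illustration; they only mirror the Customer A / Customer B case above.

```python
from datetime import date, timedelta

# Hypothetical rules matrix: (source business, resulted in sale) -> retention in days.
# Keys and periods are illustrative only and mirror the example above.
RETENTION_DAYS = {
    ("business_y", False): 90,       # roughly 3 months when no sale resulted
    ("business_z", True): 6 * 365,   # roughly 6 years for administrative and tax purposes
}

def pii_expiry(source, resulted_in_sale, captured_on):
    """Return the date on which PII should be removed, or None if no rule matches."""
    days = RETENTION_DAYS.get((source, resulted_in_sale))
    if days is None:
        return None
    return captured_on + timedelta(days=days)

def pii_expired(source, resulted_in_sale, captured_on, today):
    """True when a matching rule exists and its retention period has passed."""
    expiry = pii_expiry(source, resulted_in_sale, captured_on)
    return expiry is not None and today >= expiry

# Customer A: via Business Y, no sale. Customer B: via Business Z, sale made.
today = date(2018, 6, 1)
a_expired = pii_expired("business_y", False, date(2018, 1, 1), today)
b_expired = pii_expired("business_z", True, date(2018, 1, 1), today)
```

With the rules in a table like this, a new contractual relationship becomes a new row rather than a code change. Note that this sketch treats a record with no matching rule as not expired, which may not be the safe default for every contract.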


Now it gets tricky. Both Customer A and Customer B are in the same partition within an Impala table. The file is approximately 100 MB in size.


To depersonalise Customer A's data we built a job which runs daily and obfuscates against a defined set of rules. But the job has to scan most of our partitions looking for matching cases and then SELECT the data back into a table, with the PII of the customers identified by the routine amended.


This job takes a long time to run and is I/O heavy. Every time we have a new contractual relationship the code needs to be altered. It starts looking messy after just a few use cases are added and also causes us a testing headache. We have built a monster that doesn't scale up.


I'm interested in how other teams are addressing this issue.


The potential solutions floated so far are:

  1. Encrypt the PII but don't allow decryption. (I don't like this and I don't think an auditor would either.)
  2. Flume the data in twice - once with all data available and once with all PII removed. Then delete the version with PII after 3 months. (But in a scenario where we have multiple rules we'd still need to selectively delete some data but not others. And which version do we use? How do we ensure against duplication?)
  3. Delete everything after n days rather than risk breaching a contract. (The nuclear option, but we would lose a lot of analytically useful data if we did this.)

None of these solutions are ideal.


For reference we're running CDH 5.9 in a 20 node cluster (16 data nodes). We use Flume to capture approx 11 TB of data per month. The data is used in 'almost' real time but after a defined period is needed only for trend analysis. We use Impala predominantly but are making a shift to Spark where appropriate. Our current depersonalisation process uses a daily Oozie job which runs Impala code. This solution won't scale but the contractual requirements seem to be scaling quite fast.


Many thanks in advance for any advice,





Cloudera Employee
Posts: 2
Registered: 01-30-2014

Re: Removing Personally Identifiable Information (PII) from HDFS

Hi Gary,


Thanks for this rather timely post - given, as you pointed out, that many companies are actively working toward fulfilling GDPR requirements.


Let me first outline a high-level set of steps that some organizations are using for dealing with record deletions in HDFS:


  1. Upon initial write, or via batch jobs executing periodically after arrival, remove or redact external identifiers and replace them with an internal identifier. The fields in these files then only contain pseudonymized (depersonalized) PII; in other words, PII that can’t be linked back to an individual by itself. This job should only need to run once on each batch of data (file or partition).
  2. Actual re-identifiable PII fields that need to be retained are stored in a separate (smaller) “PII lookup table” specifically for this purpose.
    • Each row in this lookup table should contain an internal unique identifier which is used to tie it to the data in the larger files in HDFS. This identifier could either be an existing internal customerID or a new identifier token generated for this purpose.
  3. When requests for deletion arrive, add these to a separate “pending deletions” table; this table contains the internal customerID of each record that needs deletion.
  4. On a periodic basis (say, daily or weekly), re-generate files which contain re-identifiable PII, removing the rows which match entries in the "pending deletions" table.  If PII has been confined to a small number of tables (ideally the single PII lookup table), then this operation will be much less expensive.
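The four steps above can be sketched in plain Python. This is only an illustration of the flow, not Impala or Spark code: the field names (`name`, `email`, `page`), the use of the email address as the external identifier, and the helper names are all assumptions made for the example.

```python
import uuid

PII_FIELDS = ("name", "email")  # hypothetical PII column names

def pseudonymize(record, lookup, id_by_email):
    """Steps 1 and 2: move PII into the lookup table, return the event row without it."""
    internal_id = id_by_email.get(record["email"])
    if internal_id is None:
        internal_id = uuid.uuid4().hex  # new identifier token generated for this purpose
        id_by_email[record["email"]] = internal_id
    lookup[internal_id] = {field: record[field] for field in PII_FIELDS}
    event = {k: v for k, v in record.items() if k not in PII_FIELDS}
    event["customer_id"] = internal_id
    return event

def apply_pending_deletions(lookup, id_by_email, pending):
    """Step 4: regenerate the small PII tables without the rows marked for deletion."""
    kept_lookup = {cid: pii for cid, pii in lookup.items() if cid not in pending}
    # The email -> token map is itself re-identifiable, so it is purged as well.
    kept_ids = {email: cid for email, cid in id_by_email.items() if cid not in pending}
    return kept_lookup, kept_ids

# Made-up sample data.
raw = [
    {"name": "Alice Example", "email": "alice@example.com", "page": "/quote"},
    {"name": "Bob Example", "email": "bob@example.com", "page": "/buy"},
    {"name": "Alice Example", "email": "alice@example.com", "page": "/help"},
]

lookup, id_by_email = {}, {}
events = [pseudonymize(r, lookup, id_by_email) for r in raw]  # large, PII-free data

pending = {events[0]["customer_id"]}  # step 3: a deletion request arrives for Alice
lookup, id_by_email = apply_pending_deletions(lookup, id_by_email, pending)
```

The large event data is never rewritten here; only the small lookup structures are regenerated, which is where the cost saving in step 4 comes from.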

Would something like the flow above work in your case?   Or is this similar to what you are doing?


You've stated: "To depersonalise Customer A's data we built a job which runs daily and obfuscates against a defined set of rules. But the job has to scan most of our partitions looking for matching cases..."


From this it sounds like you are re-scanning nearly all of your data daily, using this job to both anonymize and remove records that don't match your rules for inclusion. Do I understand that correctly? Or are you just scanning new data daily?












New Contributor
Posts: 3
Registered: 12-02-2015

Re: Removing Personally Identifiable Information (PII) from HDFS

Thanks for your reply Steve. It got us on the right track in our internal discussions.


What you describe here does sound like a distinct improvement on our current process.


Separate to our PII project we have a Customer 360 project which generates an internal ID. After mapping this process out I think we may gain from combining the two streams of work.


We could:

  1. Separate out PII on ingestion
  2. Process that PII into our C360 each day then delete all the PII in the ingestion table for that day
  3. Periodically delete PII from our C360 table based on a rules matrix
    1. Enquiries which result in a customer relationship mean data can be retained for longer
    2. Enquiries which do not result in business are deleted after n days
  4. Use views which combine the C360 and ingested data to allow reporting to include PII where appropriate

Under this model we're deleting an entire partition of PII daily and running the more complex rules against a much smaller C360 table, which we could partition by source rather than date. After all, source is more aligned with the rules matrix than date.
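A rough sketch of this model, using Python's built-in sqlite3 purely as a stand-in for Impala. The table names, columns, the `age_days` shortcut (instead of a real first-seen date) and the sample rows are all assumptions for illustration, and Impala's SQL dialect and DELETE support differ from SQLite's.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Ingested data holds only the internal ID, no PII (hypothetical schema).
CREATE TABLE ingested (customer_id TEXT, source TEXT, event TEXT);
-- Small C360 table holds the PII, keyed by the same internal ID.
CREATE TABLE c360 (customer_id TEXT PRIMARY KEY, source TEXT, name TEXT,
                   became_customer INTEGER, age_days INTEGER);
-- Rules matrix: one row per source and outcome.
CREATE TABLE rules (source TEXT, became_customer INTEGER, retain_days INTEGER);
-- Reporting view: PII appears only while the C360 row still exists.
CREATE VIEW reporting AS
  SELECT i.customer_id, i.source, i.event, c.name
  FROM ingested i LEFT JOIN c360 c ON c.customer_id = i.customer_id;
""")

conn.executemany("INSERT INTO ingested VALUES (?, ?, ?)",
                 [("A", "business_y", "enquiry"), ("B", "business_z", "sale")])
conn.executemany("INSERT INTO c360 VALUES (?, ?, ?, ?, ?)",
                 [("A", "business_y", "Alice Example", 0, 120),
                  ("B", "business_z", "Bob Example", 1, 120)])
conn.executemany("INSERT INTO rules VALUES (?, ?, ?)",
                 [("business_y", 0, 90), ("business_z", 1, 6 * 365)])

# Periodic purge: the rules run against the small C360 table only.
conn.execute("""
DELETE FROM c360 WHERE EXISTS (
  SELECT 1 FROM rules r
  WHERE r.source = c360.source
    AND r.became_customer = c360.became_customer
    AND c360.age_days > r.retain_days)
""")

rows = conn.execute(
    "SELECT customer_id, name FROM reporting ORDER BY customer_id").fetchall()
```

After the purge, the view still returns both customers' events, but the name for customer A comes back as NULL because the matching C360 row is gone, while customer B's PII remains available for reporting.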




