03-31-2017 07:22 AM
In May 2018 new legislation takes effect in the EU to force organisations to get explicit permission from customers in order to use their data. The GDPR presents challenges for the Big Data world.
My team faces issues right now and I'm looking for flexible solutions to give our developers a range of tools depending on the type of data and contractual requirements presented to us.
The problem is clear - "How can we depersonalise specific columns and rows in Impala tables based on a matrix of rules?"
For example, Customer A ended up in HDFS due to a relationship we have with Business Y. Customer B's data was saved as part of a line of business we are running in partnership with Business Z. Customer A never resulted in a sale but Customer B did. After 3 months we should remove Customer A's personally identifiable information (PII). But we're allowed to keep Customer B's data for 6 years for administrative and tax purposes.
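To make the example concrete, here is a minimal sketch of how such a rules matrix could be held as data rather than code. The dictionary layout, the `pii_expiry` helper and the choice to key on partner plus sale outcome are all assumptions for illustration, not our production schema; only the partner names and the 3 month and 6 year periods come from the example above.

```python
from datetime import date
import calendar

# Hypothetical rules matrix: (partner, resulted_in_sale) -> retention in months.
# Partner names and periods mirror the example; the structure is an assumption.
RETENTION_MONTHS = {
    ("Business Y", False): 3,    # no sale: drop PII after 3 months
    ("Business Z", True): 72,    # sale: keep for 6 years (tax/admin)
}


def add_months(start: date, months: int) -> date:
    """Shift a date by whole months, clamping the day to the month's end."""
    index = start.month - 1 + months
    year = start.year + index // 12
    month = index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def pii_expiry(partner: str, resulted_in_sale: bool, captured: date) -> date:
    """Date on which a record's PII should be depersonalised."""
    months = RETENTION_MONTHS[(partner, resulted_in_sale)]
    return add_months(captured, months)


# Customer A (Business Y, no sale) and Customer B (Business Z, sale)
expiry_a = pii_expiry("Business Y", False, date(2017, 1, 31))
expiry_b = pii_expiry("Business Z", True, date(2017, 3, 31))
```

With the rules in a lookup like this, a new contractual relationship would in principle mean a new entry rather than a new branch in the job.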
Now it gets tricky. Customers A and B are both in the same partition within an Impala table. The file is approximately 100 MB in size.
To depersonalise Customer A's data we built a job which runs daily and obfuscates against a defined set of rules. But the job has to scan most of our partitions looking for matching cases and then SELECT the data back into a table, with the PII of the customers identified by the routine amended.
This job takes a long time to run and is I/O heavy. Every time we have a new contractual relationship the code needs to be altered. It starts looking messy after just a few use cases are added and also causes us a testing headache. We have built a monster that doesn't scale up.
I'm interested in how other teams are addressing this issue.
The potential solutions floated so far are;
None of these solutions are ideal.
For reference we're running CDH 5.9 in a 20 node cluster (16 data nodes). We use Flume to capture approx 11 TB of data per month. The data is used in 'almost' real time but after a defined period is needed only for trend analysis. We use Impala predominantly but are making a shift to Spark where appropriate. Our current depersonalisation process uses a daily Oozie job which runs Impala code. This solution won't scale but the contractual requirements seem to be scaling quite fast.
Many thanks for any advice in advance,
04-02-2017 12:24 PM
Thanks for this rather timely post, given, as you pointed out, that many companies are actively working toward fulfilling GDPR requirements.
Let me first outline a high-level set of steps that some organizations are using for dealing with record deletions in HDFS:
Would something like the flow above work in your case? Or is this similar to what you are doing?
You've stated: "To depersonalise Customer A's data we built a job which runs daily and obfuscates against a defined set of rules. But the job has to scan most of our partitions looking for matching cases..."
From this it sounds like you are re-scanning nearly all of your data daily, using this job to both anonymize and remove records that don't match your rules for inclusion. Do I understand that correctly? Or are you just scanning new data daily?
04-03-2017 02:46 AM
Thanks for your reply Steve. It got us on the right track in our internal discussions.
What you describe here does sound like a distinct improvement on our current process.
Separate from our PII project we have a Customer 360 project which generates an internal ID. After mapping this process out, I think we may gain from combining the two streams of work.
Under this model we're deleting an entire partition of PII daily and running the more complex rules against a much smaller C360 table, which we could partition by source rather than date. After all, source is more aligned with the rules matrix than date.
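As a rough sketch of the daily partition delete, something like the following could build the Impala statement for an Oozie action to run. The table name `pii_by_day`, the partition column `capture_date` and the 90 day window are hypothetical placeholders, and the helper only produces the SQL string; it does not connect to Impala.

```python
from datetime import date, timedelta


def drop_partition_sql(table: str, column: str, today: date,
                       retention_days: int) -> str:
    """Build the statement that drops the partition which has just aged out.

    Assumes one partition per day, with the partition value stored as an
    ISO date string (an assumption, not our actual layout).
    """
    expired = today - timedelta(days=retention_days)
    return (
        f"ALTER TABLE {table} DROP IF EXISTS "
        f"PARTITION ({column}='{expired.isoformat()}')"
    )


# Hypothetical table and column names, 90 day window as a placeholder.
statement = drop_partition_sql("pii_by_day", "capture_date",
                               date(2017, 4, 3), 90)
```

Dropping a whole partition avoids rewriting files, which is where most of the I/O in our current job goes; the per-source rules would then only need to touch the C360 table.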