04-02-2017 12:24 PM · 2 Kudos
Hi Gary,

Thanks for this rather timely post, given, as you pointed out, that many companies are actively working toward fulfilling GDPR requirements. Let me first outline a high-level set of steps that some organizations are using for dealing with record deletions in HDFS (rough sketches of steps 1 and 4 follow at the end of this post):

1. Upon initial write, or via batch jobs executing periodically after arrival, remove or redact external identifiers and replace them with an internal identifier. The fields in these files then contain only pseudonymized (depersonalized) PII; in other words, PII that can't be linked back to an individual by itself. This job should only need to run once on each batch of data (file or partition).
2. Re-identifiable PII fields that need to be retained are stored in a separate (smaller) "PII lookup table" specifically for this purpose. Each row in this lookup table should contain an internal unique identifier which ties it to the data in the larger files in HDFS. This identifier could be either an existing internal customerID or a new identifier token generated for this purpose.
3. When requests for deletion arrive, add them to a separate "pending deletions" table; this table contains the internal customerID of each record that needs deletion.
4. On a periodic basis (say, daily or weekly), re-generate the files which contain re-identifiable PII, removing the rows which match entries in the "pending deletions" table. If PII has been confined to a small number of tables (ideally the single PII lookup table), this operation will be much less expensive.

Would something like the flow above work in your case? Or is this similar to what you are doing? You've stated:

> To depersonalise Customer A's data we built a job which runs daily and obsfucates against defined set of rules. But the job has to scan most of our partitions looking for matching cases...

From this it sounds like you are re-scanning nearly all of your data daily, using this job both to anonymize and to remove records that don't match your rules for inclusion. Do I understand that correctly? Or are you just scanning new data daily?
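To make step 1 a bit more concrete, here is a minimal PySpark sketch. The paths, the column names (`email`, `name`, `phone`), and the salted-hash tokenization are illustrative assumptions, not a reference to your actual schema or jobs:

```python
# Step 1 (sketch): pseudonymize a newly arrived partition and split the
# re-identifiable PII off into a separate lookup table.
# All paths and column names here are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("pseudonymize-arrivals").getOrCreate()

# Hypothetical raw batch containing external identifiers (email, name, phone).
raw = spark.read.parquet("/data/raw/events/dt=2017-04-01")

# Derive a stable internal token from the external identifier. A keyed
# (salted) hash keeps the token deterministic across batches; the salt
# must stay secret, or the hash could be reversed by dictionary attack.
SALT = "replace-with-a-secret-salt"  # hypothetical; manage via a secrets store
with_token = raw.withColumn(
    "internal_id", F.sha2(F.concat(F.col("email"), F.lit(SALT)), 256))

# PII lookup table: the only place the re-identifiable fields live.
(with_token
    .select("internal_id", "email", "name", "phone")
    .distinct()
    .write.mode("append").parquet("/data/pii_lookup"))

# Main data files: external identifiers dropped, internal token retained.
(with_token
    .drop("email", "name", "phone")
    .write.mode("overwrite").parquet("/data/events/dt=2017-04-01"))
```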
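The periodic deletion pass in step 4 then only has to rewrite the (small) lookup table, not the bulk data. A sketch, again with hypothetical paths and columns; the key move is the `left_anti` join against the pending-deletions table:

```python
# Step 4 (sketch): rewrite the PII lookup table, dropping any rows whose
# internal_id appears in the "pending deletions" table.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("apply-pending-deletions").getOrCreate()

pii_lookup = spark.read.parquet("/data/pii_lookup")
pending = spark.read.parquet("/data/pending_deletions")  # column: internal_id

# left_anti join keeps only lookup rows with no match in pending deletions.
survivors = pii_lookup.join(pending, on="internal_id", how="left_anti")

# Write to a side location first, then swap directories out of band, so a
# failed job never leaves you without a copy of the lookup table.
survivors.write.mode("overwrite").parquet("/data/pii_lookup_new")
```

Once the swap succeeds you can clear the processed rows out of the pending-deletions table. Because the bulk files only ever held the internal token, deleting the lookup row is what severs the link back to the individual.

-Steve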