Can we use Spark as a mitigation tool (i.e. data masking)?

The mitigation tool will need to mask columns, remove columns, check columns for a severity label, etc.

Consider that there are more than 200 types of feeds.

Re: Can we use Spark as a mitigation tool ( i.e. data masking)

Spark is a general-purpose, in-memory distributed processing engine that can be used for many use cases. You can write Spark applications in Java, Scala, Python, or R. Some Spark use cases include:

1. SQL Analytics (There are APIs for this - DataFrames and Datasets API)

2. Stream processing and streaming analytics (Spark Streaming API and Structured Streaming)

3. Machine Learning (Spark ML API - refers to the MLlib DataFrame-based API)

4. Graph Analytics (GraphX)

5. Data masking, which can easily be done by using Spark to ingest and process your data. You can mix in your own code (I use Python) to perform ETL operations on the data (mask, filter, transform, or join) and integrate it with the Spark APIs (RDDs, DataFrames, Datasets).

For example, let's say I have hundreds of JSON-formatted files on a cluster. I can use Spark to read the JSON files, use DataFrames to give the data a schema, and then use Spark SQL to mask the data, for example by changing each Social Security number so that the first five digits are replaced with '#'. Or, instead of using Spark SQL, I could convert the DataFrame to an RDD and write my own Python function to mask the data. Because Spark is a general-purpose engine, there are many techniques you can use to mask data with it.
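The masking step described above could be sketched roughly as follows. This is only an illustration under assumptions: the column name `ssn`, the input and output paths, and the helper names `mask_ssn` and `mask_json_feed` are hypothetical, and the Python function is applied through a DataFrame UDF rather than by converting to an RDD. The PySpark imports sit inside the function, so the masking logic can be tried without a cluster.

```python
def mask_ssn(ssn, digits_to_mask=5, mask_char="#"):
    """Replace the first `digits_to_mask` digits with `mask_char`, keeping separators."""
    if ssn is None:
        return None
    out = []
    masked = 0
    for ch in ssn:
        if ch.isdigit() and masked < digits_to_mask:
            out.append(mask_char)
            masked += 1
        else:
            out.append(ch)
    return "".join(out)


def mask_json_feed(input_path, output_path, column="ssn"):
    """Read JSON files, mask one column, and write the result (hypothetical paths)."""
    # Imported lazily so mask_ssn can be tested without PySpark installed.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("mask-ssn-sketch").getOrCreate()
    df = spark.read.json(input_path)          # schema is inferred from the JSON
    mask_udf = F.udf(mask_ssn, StringType())  # wrap the plain Python function
    masked = df.withColumn(column, mask_udf(F.col(column)))
    masked.write.mode("overwrite").json(output_path)


print(mask_ssn("123-45-6789"))  # ###-##-6789
```

A built-in expression such as `regexp_replace` would likely be faster than a Python UDF for simple fixed formats; the UDF form is shown because it accepts arbitrary Python logic.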

Re: Can we use Spark as a mitigation tool ( i.e. data masking)

@Binu Mathew

I am looking at a scenario like this:

I have a pipe-delimited file with, let's say, 20 fields, one of which is a severity level. I want to take the file and check that severity value against the data I have in my SQL DB. If they match, I would like to delete a few of the sensitive fields from the text file and save the result to another text file.

Can you help me with the code?
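A minimal sketch of this scenario, under stated assumptions: the severity values from the SQL DB have already been loaded into a Python set (for instance through Spark's JDBC reader), and the field positions, paths, and the names `scrub_line` and `scrub_feed` are hypothetical. It is not a definitive implementation, only one way the per-line check could be wired into an RDD.

```python
def scrub_line(line, severity_idx, sensitive_idxs, flagged_severities, sep="|"):
    """Drop the sensitive fields from a line whose severity value is flagged."""
    fields = line.rstrip("\n").split(sep)
    if fields[severity_idx] in flagged_severities:
        fields = [f for i, f in enumerate(fields) if i not in sensitive_idxs]
    return sep.join(fields)


def scrub_feed(input_path, output_path, flagged_severities,
               severity_idx, sensitive_idxs):
    """Apply scrub_line to every line of a text feed (hypothetical paths)."""
    # Imported lazily so scrub_line can be tested without PySpark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("scrub-feed-sketch").getOrCreate()
    flagged = set(flagged_severities)   # assumed small enough to ship to executors
    sensitive = set(sensitive_idxs)
    lines = spark.sparkContext.textFile(input_path)
    scrubbed = lines.map(
        lambda l: scrub_line(l, severity_idx, sensitive, flagged)
    )
    scrubbed.saveAsTextFile(output_path)  # writes a directory of part files


print(scrub_line("a|HIGH|secret|d", 1, {2}, {"HIGH"}))  # a|HIGH|d
```

Deleting fields leaves flagged rows with fewer columns than the rest; if downstream readers expect a fixed layout, blanking the sensitive fields instead of removing them may be the safer choice.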

Re: Can we use Spark as a mitigation tool ( i.e. data masking)