Support Questions
Find answers, ask questions, and share your expertise

Develop data quality system

New Contributor

I am developing a rule-based data quality engine. Input data is fed in as files, which currently range from a few KB up to 14 GB (dynamic files); in the future, file sizes may vary and grow much larger. I want to apply row-specific and column-specific rules to the incoming feed to catch bad data. I used Pandas, but performance was not good. I tried PySpark, which seems good for larger files but less so for small ones. SparkSQL looks promising for processing pre-defined, file-specific rules. Can somebody please comment on whether I can rely on Spark?

6 REPLIES

Re: Develop data quality system

Contributor

Have you tried using continuous processing in Spark Structured Streaming? It might work better for small files. It is based on SparkSQL.
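For reference, here is a minimal sketch of a file-based Structured Streaming job. The paths, the schema, and the example rule are my own assumptions, not from your setup; note also that the continuous trigger currently only supports Kafka sources, so this sketch uses the default micro-batch trigger:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, IntegerType, StringType

spark = SparkSession.builder.appName("dq-stream").getOrCreate()

# Streaming file sources need the schema declared up front.
schema = (StructType()
          .add("id", IntegerType())
          .add("abc", IntegerType())
          .add("name", StringType()))

# Watch a landing directory; each newly arriving file becomes a micro-batch,
# so the same job handles small and large files alike.
rows = (spark.readStream
        .schema(schema)
        .option("delimiter", ",")
        .csv("/data/landing"))

# Example rule: flag rows where "abc" falls below a threshold of 20.
flagged = rows.withColumn("min_value_check_passed", rows["abc"] >= 20)

query = (flagged.writeStream
         .format("parquet")
         .option("path", "/data/validated")
         .option("checkpointLocation", "/data/checkpoints/dq")
         .start())
query.awaitTermination()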

Re: Develop data quality system

New Contributor

Streaming to process batch files, right? I get at most 1 or 2 files as input at once. Also, can you please provide a sample example?

Re: Develop data quality system

Cloudera Employee

One approach could be to turn the file into a stream, so the file size no longer matters. Then, depending on performance requirements, pre-process each row to work out which DQ rules need to be evaluated, and execute only the required rules.

I would also define a central repository that stores which rules are applied to what data and in which situations, which makes adding new rules simpler.

So basically the flow is: convert to stream -> evaluate which rules are required -> route to the rules Spark engine.
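To illustrate the routing step, here is a rough sketch in plain Python. The rule registry, the record_type field, and the amount column are all made up for the example:

# Hypothetical registry mapping a record type to the DQ rules it needs.
RULES = {
    "orders":    ["row_check", "min_value_check"],
    "customers": ["row_check"],
}

def required_rules(row):
    """Pre-process the row to decide which DQ rules to evaluate."""
    return RULES.get(row.get("record_type"), [])

def validate(row):
    """Run only the required rules; return (rule_name, passed) pairs."""
    results = []
    for rule in required_rules(row):
        if rule == "row_check":
            results.append((rule, bool(row)))
        elif rule == "min_value_check":
            results.append((rule, row.get("amount", 0) >= 20))
    return results

print(validate({"record_type": "orders", "amount": 35}))
# -> [('row_check', True), ('min_value_check', True)]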

Re: Develop data quality system

New Contributor

Thanks @Richard Dobson

I see a socket connection being used for streaming. Is it possible to do this without that? I need to move it to production.

Also, my current design: I have a central repo for the rules to be applied; I evaluate which ones apply and send that info in JSON format to the validation engine.

I have defined it in 2 ways:

1) SQL rules - using SparkSQL

{ "file" : "path to file",

"delimiter" : ",",

"rules" : [

{ "query" : "select count(*) from file",

"threshold" : 20, }, { ..... }

] }

Results: usually a count comparison and its results.
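A simplified sketch of the driver I have in mind for these SQL rules (the header option and the comparison direction, pass when actual <= threshold, are assumptions):

import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dq-sql-rules").getOrCreate()

with open("rules.json") as f:
    config = json.load(f)

df = (spark.read
      .option("delimiter", config["delimiter"])
      .option("header", "true")
      .csv(config["file"]))
df.createOrReplaceTempView("file")  # the config's queries select from "file"

for rule in config["rules"]:
    actual = spark.sql(rule["query"]).collect()[0][0]
    passed = actual <= rule["threshold"]  # assumed comparison direction
    print(rule["query"], "->", actual, "PASS" if passed else "FAIL")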

2) Something like a spot check, where I validate each row against some rules:

{ "file" : "path to file",

"delimiter" : ",",

"rules" : [

{ "name" : "row_check", "compare_value" :20 },

{ "name" : "min_value_check", "column_name" : "abc, "compare_value" :20 },

{ ..... }

] }

Results: validation results for each row, e.g. row_no/row_key | pass/failure | actual value | threshold | rule_name
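And a simplified sketch of how such a spot check could run as DataFrame expressions, producing one result row per input row in the layout above (the file path, the cast, and the surrogate row number are assumptions):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dq-row-rules").getOrCreate()

# Attach a surrogate row number so each result can be traced back.
df = (spark.read
      .option("delimiter", ",")
      .option("header", "true")
      .csv("path/to/file")
      .withColumn("row_no", F.monotonically_increasing_id()))

# min_value_check on column "abc" with compare_value 20.
results = (df.select(
               "row_no",
               F.col("abc").cast("int").alias("actual_value"),
               F.lit(20).alias("threshold"),
               F.lit("min_value_check").alias("rule_name"))
           .withColumn("status",
               F.when(F.col("actual_value") >= F.col("threshold"), "pass")
                .otherwise("failure")))

results.show()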

Can you please shed some light on how I can turn this into a stream and process these kinds of rules? Also, please provide a sample snippet or a link. [I am entirely new to Python and Spark.]

Re: Develop data quality system

Cloudera Employee

NiFi (https://hortonworks.com/apache/nifi/) is able to take a file and split it into individual rows (called flowfiles). This replaces the socket connection.

My idea is that NiFi would turn the file into a stream of rows and drop them onto Kafka. Spark would listen to Kafka, validate/route/process each record, then drop it onto another Kafka topic, which NiFi would pick up and route somewhere else.
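A minimal sketch of the Spark side of that pipeline. The broker address, topic names, and the toy validation rule are assumptions, and the job needs the spark-sql-kafka package on the classpath:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dq-kafka").getOrCreate()

# Read the rows that NiFi dropped onto Kafka.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "dq-input")
       .load())

rows = raw.selectExpr("CAST(value AS STRING) AS line")

# Toy validation: a row passes if its first field is non-empty.
validated = rows.withColumn(
    "value",
    F.concat(F.col("line"), F.lit("|"),
             F.when(F.split("line", ",")[0] != "", "pass")
              .otherwise("failure")))

# Write validated records to the output topic for NiFi to pick up.
query = (validated.select("value")
         .writeStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")
         .option("topic", "dq-output")
         .option("checkpointLocation", "/tmp/dq-ckpt")
         .start())
query.awaitTermination()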

Re: Develop data quality system

New Contributor

Thanks, I will try to apply this.