11-22-2018
10:37 AM
Thanks @Richard Dobson. I see a socket connection is used for the streaming example. Is it possible to do this without that? I need to move this to production. Also, the current design:
I have a central repo of rules to be applied; I evaluate which ones apply and send this info in JSON format to the Validation Engine. I have defined it in two ways: 1) SQL rules - using SparkSQL:
{
  "file" : "path to file",
  "delimiter" : ",",
  "rules" : [
    {
      "query" : "select count(*) from file",
      "threshold" : 20
    },
    {
      .....
    }
  ]
}
Results: usually a count comparison and its result.
2) Something like a spot check, where each row is validated against some rules:
{
  "file" : "path to file",
  "delimiter" : ",",
  "rules" : [
    {
      "name" : "row_check",
      "compare_value" : 20
    },
    {
      "name" : "min_value_check",
      "column_name" : "abc",
      "compare_value" : 20
    },
    {
      .....
    }
  ]
}
Results: validation results for each row, e.g.
row_no/row_key | pass/failure | actual value | threshold | rule_name
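To make the batch side concrete, here is a minimal sketch of how a rules config like the above could drive validation with PySpark. The helper names (load_file, run_sql_rules, run_row_rules), the rules.json path, and the pass/fail convention (actual <= threshold) are my own assumptions for illustration, not part of the engine described here:

# Minimal sketch (batch, not streaming): drive validation from a rules JSON.
# load_file / run_sql_rules / run_row_rules are hypothetical helpers, not an existing API.
import json
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("rule-validation").getOrCreate()

def load_file(conf):
    # Read the CSV described by the rule config into a DataFrame.
    return (spark.read
            .option("delimiter", conf["delimiter"])
            .option("header", "true")
            .option("inferSchema", "true")
            .csv(conf["file"]))

def run_sql_rules(conf):
    # Type 1: each rule is a SQL query whose single result is compared to a threshold.
    df = load_file(conf)
    df.createOrReplaceTempView("file")   # so "select ... from file" works
    results = []
    for rule in conf["rules"]:
        actual = spark.sql(rule["query"]).collect()[0][0]
        passed = actual <= rule["threshold"]   # assumed pass/fail convention
        results.append((rule["query"], actual, rule["threshold"], passed))
    return results

def run_row_rules(conf):
    # Type 2: per-row checks; add one boolean pass/fail column per rule.
    df = load_file(conf)
    for rule in conf["rules"]:
        if rule["name"] == "min_value_check":
            df = df.withColumn(rule["name"] + "_pass",
                               F.col(rule["column_name"]) >= rule["compare_value"])
    return df

with open("rules.json") as f:   # placeholder path for the rules config
    conf = json.load(f)

The same two functions could then be called per incoming file; the per-row variant returns a DataFrame that can be written out as the row_no | pass/failure | ... result table.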
Can you please shed some light on how I can turn this into a stream and process these kinds of rules?
Also, please provide a sample snippet or any link. [I am entirely new to Python and Spark.]
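For reference, Spark Structured Streaming can also watch a directory of files instead of a socket, which may fit a production setup better. A minimal sketch, where the input directory, schema, and checkpoint location are assumptions:

# Minimal sketch: Structured Streaming with a file source (no socket needed).
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("streaming-validation").getOrCreate()

# Streaming file sources require an explicit schema; this one is illustrative.
schema = StructType([
    StructField("row_key", StringType()),
    StructField("abc", DoubleType()),
])

# Each new CSV file dropped into the directory becomes a micro-batch.
stream = (spark.readStream
          .option("delimiter", ",")
          .option("header", "true")
          .schema(schema)
          .csv("/data/incoming/"))

def validate(batch_df, batch_id):
    # Apply the same batch-style rules to each micro-batch of files.
    batch_df.createOrReplaceTempView("file")
    count = spark.sql("select count(*) from file").collect()[0][0]
    print(batch_id, "row count:", count)

query = (stream.writeStream
         .foreachBatch(validate)
         .option("checkpointLocation", "/tmp/checkpoints/validation")
         .start())
query.awaitTermination()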
11-22-2018
07:02 AM
Streaming to process batch files, right? I get at most 1 or 2 files as input at once. Also, can you please provide a sample example?
11-21-2018
03:56 PM
I am developing a rule-based data quality engine, where input data is fed in as files that currently range from several KB up to 14 GB (dynamic files). In the future, file sizes may vary and grow considerably. I want to apply row-specific and column-specific rules to the incoming feed to catch bad data. I used Pandas, but its performance was not good; I tried PySpark, which seems good for larger files but less so for small ones. SparkSQL looks promising for processing pre-defined, file-specific rules. Can somebody please comment on whether I can rely on Spark?
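For context, the kind of row/column-specific check described here can be expressed directly on a Spark DataFrame and works the same way for a few-KB file as for a 14 GB one. A minimal sketch, where the file path, column name "abc", and the non-null/non-negative rule are illustrative assumptions:

# Minimal sketch of a row-level data quality check in PySpark.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dq-check").getOrCreate()

df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/data/feed.csv"))   # placeholder input path

# Flag rows that violate a column-specific rule (null or negative values in "abc").
bad_rows = df.filter(F.col("abc").isNull() | (F.col("abc") < 0))
print("bad rows:", bad_rows.count())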
Labels:
- Apache Spark