Member since: 08-29-2018
Posts: 3
Kudos Received: 0
Solutions: 0
08-31-2018 06:39 PM
@Steven Matison, thanks for sharing your case! We are still in our early days with NiFi: we haven't yet tackled how to properly set up a cluster or how to do sizing and pre-planning, so hearing about your design gives us a good reference to pursue. Could you share the hardware choices for your cluster? We run our workloads on AWS, so any references for sizing a cluster there would be wonderful.
08-30-2018 04:33 PM
Hi Steve, thanks for the recommendation. We are actually using NiFi (and I agree on the "super"), but for other tasks more related to data flow; for some reason we hadn't considered it for this part of the job. One of the issues we saw at the time, if I'm not mistaken, was that we would still need to read the files sequentially, line by line. Given the size and number of the files (we process around 20 GB of data on a daily basis), our first impression was that the process would take longer than it does today, but I will most certainly rethink it. How do you envision a strategy for processing this type of file with NiFi? Cheers!
08-30-2018 12:28 AM
Hi folks, I'm looking for some Hadoop tool recommendations for doing basic ETL on daily snapshots of ERP databases that are sent to me as text files.
Each file usually contains multiple sets of master-detail records whose structure is identified by the first character of each line, with a fixed length per column for each record type. The example below is a simplified illustration:
0.. header
1.. customer1
2.. customer1 address 1
2.. customer1 address 2
2.. customer1 address 3
3.. customer1 bill info
1.. customer2
2.. customer2 address 1
3.. customer2 bill info
1.. customer3
2.. customer3 address 1
2.. customer3 address 2
3.. customer3 bill info
9.. trailer
Some notes:
- All lines have the same length (the final field of each line is usually filler).
- Not all record sets are composed of the same number of records.
- There is no field on the records that links them together as master/detail. For example, all address and bill records belong to the "previous" customer record.
- If a record is malformed, I need to be able to identify it and report the offending line.
Every day I get a set of files to process: one file with customers to add to the database and another with customers that should be removed from it. After importing the files, I execute a set of queries and business logic against the snapshot. As of now I'm reading those files sequentially with a Python script and importing them into a database to produce the daily snapshot, but I would like to harness the power of the Hadoop/Spark ecosystem for this task; due to the sequential nature of the files, though, I'm not sure how to do it. Could someone advise me and suggest some tools to check out (e.g. Pig, Cascading, etc.)? Thanks in advance, Eric.
Labels:
- Apache Hadoop
- Apache Pig
- Apache Spark