Support Questions

Find answers, ask questions, and share your expertise

What would you recommend to process large text file with structured records

New Contributor

Hi Folks,

I'm looking for some hadoop tools recommendations to do basic ETL on daily snapshots of ERP databases sent to me as text files. Each file usually has multiple sets of master-detail records whose strucutre is identfied by the first character and have a fixed length per column for each record type.

The example below is an simplified ilustration:

0.. header
1.. customer1
2.. customer1 address 1
2.. customer1 address 2
2.. customer1 address 3
3.. customer1 bill info
1.. customer2
2.. customer2 address 1
3.. customer2 bill info
1.. customer3
2.. customer3 address 1
2.. customer3 address 2
3.. customer3 bill info
9.. trailer

Some notes:

  • all lines have the same length (final field of each line is usually a filler).
  • not all record set are composed by the same amount of records.
  • there is no information on the records that link them together as master/detail. For example, all address and bill records belongs to "previous" customer record.
  • if a record is malformed I need to be able to identify it and report the offending line.

Every day I got a set of files to process: one file with customers to add into the database, and other with customers that should be removed from the database. After importing the files I execute a set of queries and business logic against the snapshot.

As of now I'm reading those files sequentially with a python script and importing them into a database to produce a daily snapshot but I would like to harness the power of hadoop/spark ecosystem to do this task, however due the sequential nature of the file I'm not sure how to do it.

Could someone advice me and suggest some tools to checkout (ie, pig, cascading, etc)?

Thanks in advance



@Eric Chaves

This sounds like another job for (super) NiFi. Completing the ETL you describe can be done with NiFi without any programming and NiFi can also execute your python too. If you haven't checked out NiFi yet, definitely spend some time to research it.

New Contributor

Hi Steve, thanks for the recommendation. We are actually using (and I agree on the "super") NiFi but for other tasks more related to data flow. For some reason we haven't considered it for this part of the job. One of the issues we saw at the time, if I'm not mistaken, was because we would still need to read the files in sequence line by line. Given the size and number of the files (we process around 20GB of data on a daily basis) our first impression was that the process would take longer than today, but I will most certainly rethink it.

How do you envision a strategy to process such type of file with nifi?


There are several ways to complete that Use Case. Nifi is just one and should well be capable of doing 20gb a day. Keep in mind a big work load will be better suited against a NiFi cluster where there are plenty of cores, lots of ram, and high performance disk arrangements for each node.

I currently have a single Nifi (single disk) instance moving about 70,000,000 rows of data from original source SQL (30 queries) to Nifi to HDFS to Hive and then 3-4 deeper Nifi Process Groups executing logic and giant HQL statements against the tables initially created. Total execution time is 22 minutes for 13.6gb in external ORC and 8.1gb in internal hive hdfs tables.

Either way something has to parse the data into a format that is expected in the final database. I think this could be a Use Case for Kafka and Spark Streaming as well. I have just begun to tackle these two and I am still working with them and NiFi. In this case the original data source is routed to Kafka Topic via NiFi, Spark is used to grab original data source and do the ETL within a cluster of Spark nodes.

New Contributor

@Steven Matison, Thanks for sharing your case! We are still in our early days with NiFi. We haven't yet tackled how to properly setup a cluster nor how to do some sizing/pre-planning, but hearing your design is definitely a good reference for us to pursue. Could you share the hardware choices for your cluster? We are working with AWS workloads and if you have any references with it for sizing a cluster would be wonderfull.

Take a Tour of the Community
Don't have an account?
Your experience may be limited. Sign in to explore more.