Split huge file, one file for each day - based on date column - tab delimited

Hi


I have a huge file, more than 100 GB in size, containing tab-delimited values. Below is some sample data.

Location ID Device ID Timestamp Date Time Day of Week
Germany|3345204 997271322a5f54baa57a29b96d04231b0b069b31 1533473417 2018-08-05 14:50:17 Sun
Germany|3345204 997271322a5f54baa57a29b96d04231b0b069b31 1533473434 2018-08-05 14:50:34 Sun
Germany|3345204 ef7f1af6e29c8ad562e87b785685bfb2f79adb4a 1533427210 2018-08-05 02:00:10 Sun
Germany|3345204 64e1884666d73d30f3c8ed0f5ee9054ea6318121 1533508209 2018-08-06 00:30:09 Mon
Germany|3345204 64e1884666d73d30f3c8ed0f5ee9054ea6318121 1533508272 2018-08-06 00:31:12 Mon
Germany|3345204 64e1884666d73d30f3c8ed0f5ee9054ea6318121 1533508273 2018-08-06 00:31:13 Mon

I am quite new to NiFi and am struggling to understand the expression language, storing values in variables, handling the tab delimiter, and so on.

I want to split the file into multiple files, one file for each day. For example, the data above should produce one file for "2018-08-05" and one for "2018-08-06". Note that I don't know the dates in advance; the date values come from the lines at runtime. So, when file processing starts, we pick up the first date that occurs, keep it in memory, create a file for that date, and add the line to it. Whenever we encounter the same date again, the line should be appended to the respective file. The explanation is long, but I know this is a common need. I am just not able to create a flow for it with my limited knowledge.

Can anybody help me with a sample flow / template? It will help me in getting started. Thanks

Re: Split huge file, one file for each day - based on date column - tab delimited

@Ram G

NiFi has a PartitionRecord processor. Based on the content of the flowfile, the processor creates partitions dynamically and adds the partition field name and value as an attribute on each resulting flowfile.

Using these attributes, we can store the data into HDFS directories dynamically.

To read the content of the flowfile, define the Record Reader controller service as a CSVReader with the value separator set to \t (since your file is tab-delimited), and define the Record Writer controller service as per your requirements (Avro, JSON, etc.).
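
To make the idea concrete, here is a rough plain-Python sketch of what the partitioning step does conceptually. It is not NiFi code and not how PartitionRecord is implemented. It assumes the columns are separated by single tabs and that the header row is the one shown in the question; the function name `partition_by_date` and the `/data/events` directory are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def partition_by_date(tsv_text, date_column="Date"):
    """Group tab-delimited records by the value of one column."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    partitions = defaultdict(list)
    for row in reader:
        # Every record with the same date value lands in the same group.
        partitions[row[date_column]].append(row)
    return dict(partitions)


# Three rows taken from the question, with tabs assumed between columns.
sample = (
    "Location ID\tDevice ID\tTimestamp\tDate\tTime\tDay of Week\n"
    "Germany|3345204\t997271322a5f54baa57a29b96d04231b0b069b31\t1533473417\t2018-08-05\t14:50:17\tSun\n"
    "Germany|3345204\tef7f1af6e29c8ad562e87b785685bfb2f79adb4a\t1533427210\t2018-08-05\t02:00:10\tSun\n"
    "Germany|3345204\t64e1884666d73d30f3c8ed0f5ee9054ea6318121\t1533508209\t2018-08-06\t00:30:09\tMon\n"
)

for date_value, rows in partition_by_date(sample).items():
    # Hypothetical target directory derived from the partition value,
    # similar in spirit to using the partition attribute in an HDFS path.
    print(f"/data/events/date={date_value}", len(rows), "record(s)")
```

In NiFi the same role is played by the attribute that PartitionRecord adds to each flowfile, which you can then reference when building the target directory.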

But keep in mind that, as you mentioned, the file is more than 100 GB. For this case I believe Hive will work much better for creating dynamic partitions: store the file in HDFS, create a Hive external table with a tab delimiter on top of it, create a partitioned table, and then insert into the partitioned table by selecting from the non-partitioned table.

However, if you want to do this in NiFi, make sure your NiFi instance has sufficient memory. Once you pull the file into NiFi, use the SplitRecord processor to split the huge file into reasonably small chunks, then feed the split flowfiles to the PartitionRecord processor. Once the partitions have been created, store the flowfiles in HDFS.
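
If it helps to see the same split-by-day logic outside NiFi, below is a minimal Python sketch that streams the file line by line, so memory use stays small, and appends each line to a per-date output file. It is only an illustration under assumptions: the date is taken to be the fourth tab-separated field (index 3), the first line is taken to be a header, and `split_by_date` and the output file naming are hypothetical.

```python
import os


def split_by_date(src_path, out_dir, date_index=3, has_header=True):
    """Stream a tab-delimited file and write one output file per date value."""
    os.makedirs(out_dir, exist_ok=True)
    handles = {}  # date value -> open output file
    try:
        with open(src_path, "r", encoding="utf-8", newline="") as src:
            header = src.readline() if has_header else ""
            for line in src:
                fields = line.rstrip("\r\n").split("\t")
                if len(fields) <= date_index:
                    continue  # skip blank or malformed lines
                date_value = fields[date_index]
                out = handles.get(date_value)
                if out is None:
                    # First time this date is seen: create its file.
                    out_path = os.path.join(out_dir, f"{date_value}.tsv")
                    out = open(out_path, "w", encoding="utf-8", newline="")
                    out.write(header)
                    handles[date_value] = out
                out.write(line)
    finally:
        for out in handles.values():
            out.close()
    return sorted(handles)
```

A call such as `split_by_date("events.tsv", "by_day")` (both paths are placeholders) would return the list of dates found. One output file stays open per distinct date, which is fine for a modest number of days but can hit the operating system's open-file limit if there are very many.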

Refer to this link for more details regarding PartitionRecord processor usage and configuration.

Refer to this link for JVM OutOfMemory issues in NiFi.
