@Ram G In NiFi we have the PartitionRecord processor; based on the content of the flowfile, the processor creates dynamic partitions and adds the partition field name and value as attributes to the flowfile.
By using these attributes we can store the data into HDFS directories dynamically.
To read the content of the flowfile you need to define a Record Reader controller service as CSVReader with Value Separator set to \t (since you have a tab-delimited file), and define a Record Writer controller service as per your requirements (Avro, JSON, etc.).
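As a rough sketch of that configuration (the attribute name partition_field, the record field date_col and the HDFS path are just placeholders, replace them with your own values):

PartitionRecord:
  Record Reader   -> CSVReader (Value Separator = \t)
  Record Writer   -> e.g. AvroRecordSetWriter
  partition_field -> /date_col    (user-defined property: the name becomes the attribute name,
                                   the value is a RecordPath evaluated against each record group)

PutHDFS:
  Directory -> /user/nifi/output/${partition_field}    (builds the target HDFS directory from the attribute)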
But keep in mind that, as you mentioned, you have a file of more than 100 GB and are thinking of splitting it. For this case I believe Hive will work much better for creating dynamic partitions: store the file into HDFS, create a Hive external table with a tab delimiter, create a partitioned table, and then insert into the partitioned table with a select from the non-partitioned table, as sketched below.
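A minimal HiveQL sketch of that approach (the column names, partition column and the /data/staging location are only examples for illustration; adjust them to your schema):

-- external table pointing at the tab-delimited file already in HDFS
CREATE EXTERNAL TABLE non_partition_table (
  id STRING,
  name STRING,
  event_dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/staging/';

-- partitioned target table
CREATE TABLE partition_table (
  id STRING,
  name STRING)
PARTITIONED BY (event_dt STRING)
STORED AS ORC;

-- enable dynamic partitioning and load the data
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO TABLE partition_table PARTITION (event_dt)
SELECT id, name, event_dt FROM non_partition_table;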
However, if you want to do this in NiFi, make sure you have sufficient memory in your NiFi instance. Once you pull the file into NiFi, use the SplitRecord processor to split the huge file into reasonably smaller chunks, then feed the split flowfiles to the PartitionRecord processor. Once you have created the partitions, store the flowfiles into HDFS (see the outline after this paragraph).
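A rough outline of that flow (the Records Per Split value is only an example; tune it to your heap size and desired chunk size):

GetFile/FetchFile -> SplitRecord -> PartitionRecord -> PutHDFS

SplitRecord:
  Record Reader     -> CSVReader (same \t separator)
  Record Writer     -> your Record Writer controller service
  Records Per Split -> 100000    (smaller values give smaller chunks and less heap pressure)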
Refer to this link for more details regarding PartitionRecord processor usage/configuration.
Refer to this link for JVM OutOfMemory issues in NiFi.
-
If the answer helped to resolve your issue, click on the Accept button below to accept the answer. That would be a great help to community users looking for a quick solution to these kinds of issues.