<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Split huge file, one file for each day - based on date column - tab delimited in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Split-huge-file-one-file-for-each-day-based-on-date-column/m-p/203146#M165149</link>
    <description>&lt;A rel="user" href="https://community.cloudera.com/users/98268/ramgood.html" nodeid="98268"&gt;@Ram G&lt;/A&gt;&lt;P&gt;In NiFi we are having &lt;STRONG&gt;partition record&lt;/STRONG&gt; processor, based on the content of the flowfile processor creates &lt;STRONG&gt;dynamic partitions&lt;/STRONG&gt; and adds the partition_field_name and value as the attribute to the flowfile.&lt;/P&gt;&lt;P&gt;By using these attributes we can store the data into HDFS directories dynamically.&lt;/P&gt;&lt;P&gt;To read the content of the flowfile you need to define &lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;RecordReader Controller service&lt;/STRONG&gt; as &lt;STRONG&gt;CSV Reader &lt;/STRONG&gt;and &lt;STRONG&gt;value seperator &lt;/STRONG&gt;as &lt;B&gt;\t(&lt;/B&gt;as you are having tab delimited file), define RecordWriter controller service as per your requirements(like avro,json..etc)&lt;/P&gt;&lt;P&gt;But keep in mind as you mentioned you are having more than 100 GB file and thinking to split the file, For this case &lt;STRONG&gt;i believe Hive will work much better to create Dynamic partitions.&lt;/STRONG&gt;Store the file into HDFS then create Hive External table with tab delimiter and create partition table and insert into Partition table select from non_partition_table.&lt;/P&gt;&lt;P&gt;How ever if you want to do this in NiFi make sure you are having &lt;A href="https://community.hortonworks.com/articles/7882/hdfnifi-best-practices-for-setting-up-a-high-perfo.html" target="_blank"&gt;sufficient memory &lt;/A&gt;in your NiFi instance once you pull the file into NiFi use &lt;STRONG&gt;SplitRecord&lt;/STRONG&gt; processor to Split the &lt;STRONG&gt;Huge file&lt;/STRONG&gt; into reasonable smaller chunks then feed the &lt;STRONG&gt;splitted&lt;/STRONG&gt; flowfiles to &lt;STRONG&gt;PartitionRecord&lt;/STRONG&gt; processor.Once you have created partitions then store the flowfiles into HDFS.&lt;/P&gt;&lt;P&gt;Refer to &lt;A 
href="https://community.hortonworks.com/articles/191760/create-dynamic-partitions-based-on-flowfile-conten.html" target="_blank"&gt;this&lt;/A&gt; link for more details regards to PartitionRecord processor Usage/Configurations.&lt;/P&gt;&lt;P&gt;Refer to &lt;A href="https://community.hortonworks.com/articles/85234/how-to-address-jvm-outofmemory-errors-in-nifi.html" target="_blank"&gt;this&lt;/A&gt; link for Jvm OutofMemory issues in NiFi.&lt;/P&gt;&lt;P&gt;-&lt;/P&gt;&lt;P&gt;If the Answer helped to resolve your issue, &lt;STRONG&gt;Click on Accept button below to accept the answer,&lt;/STRONG&gt; That would be great help to Community users to find solution quickly for these kind of issues.&lt;/P&gt;</description>
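    <!-- The Hive route described in the answer (external table over the tab-delimited file, then an insert-select into a partitioned table) can be sketched as follows. This is an illustrative sketch only: the table and column names (raw_events, events_partitioned, event_date, col1, col2) are hypothetical placeholders, not from the original question. -->
    <!--
    ```sql
    ‐‐ Hypothetical external table over the 100 GB tab-delimited file already in HDFS
    CREATE EXTERNAL TABLE raw_events (
      event_date STRING,
      col1       STRING,
      col2       STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/data/raw_events';

    ‐‐ Target table partitioned by day, so one HDFS directory per date
    CREATE TABLE events_partitioned (
      col1 STRING,
      col2 STRING
    )
    PARTITIONED BY (event_date STRING);

    ‐‐ Enable dynamic partitioning, then let Hive fan the rows out by date
    SET hive.exec.dynamic.partition = true;
    SET hive.exec.dynamic.partition.mode = nonstrict;

    INSERT INTO TABLE events_partitioned PARTITION (event_date)
    SELECT col1, col2, event_date
    FROM raw_events;
    ```
    Note that in a dynamic-partition insert, the partition column (event_date) must come last in the SELECT list.
    -->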
    <pubDate>Thu, 27 Sep 2018 05:44:44 GMT</pubDate>
    <dc:creator>Shu_ashu</dc:creator>
    <dc:date>2018-09-27T05:44:44Z</dc:date>
  </channel>
</rss>

