<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question How to change flush duration of cloudera hdfs sink connector? in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/How-to-change-flush-duration-of-cloudera-hdfs-sink-connector/m-p/339705#M233159</link>
    <description>&lt;P&gt;&lt;FONT face="arial,helvetica,sans-serif"&gt;I am using the Cloudera HDFS sink connector with parquetWriter.&amp;nbsp;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="arial,helvetica,sans-serif"&gt;"com.cloudera.dim.kafka.connect.hdfs.HdfsSinkConnector"&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="arial,helvetica,sans-serif"&gt;It seems to flush the data from the Kafka topic every minute.&amp;nbsp;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="arial,helvetica,sans-serif"&gt;One of my Kafka topics has 64 partitions.&amp;nbsp;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="arial,helvetica,sans-serif"&gt;64 (Kafka partitions) * 60 (minutes per hour) * 24 (hours per day) = 92160 files are written to one directory every day.&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="arial,helvetica,sans-serif"&gt;So I created a job to delete files older than n days. But this job is too slow because of the number of files in the directory where the HDFS sink connector writes the Parquet files.&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="arial,helvetica,sans-serif"&gt;I have two questions about the Cloudera HDFS sink connector:&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="arial,helvetica,sans-serif"&gt;1. Is it possible to write the files to a daily partition directory? e.g. /blah/{topicName}/{yyyyMMdd}&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="arial,helvetica,sans-serif"&gt;2. Is there a way to change the flush interval from every minute to something longer?&lt;/FONT&gt;&lt;/P&gt;</description>
    <pubDate>Fri, 25 Mar 2022 13:14:42 GMT</pubDate>
    <dc:creator>inyongkim</dc:creator>
    <dc:date>2022-03-25T13:14:42Z</dc:date>
    <item>
      <title>How to change flush duration of cloudera hdfs sink connector?</title>
      <link>https://community.cloudera.com/t5/Support-Questions/How-to-change-flush-duration-of-cloudera-hdfs-sink-connector/m-p/339705#M233159</link>
      <description>&lt;P&gt;&lt;FONT face="arial,helvetica,sans-serif"&gt;I am using the Cloudera HDFS sink connector with parquetWriter.&amp;nbsp;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="arial,helvetica,sans-serif"&gt;"com.cloudera.dim.kafka.connect.hdfs.HdfsSinkConnector"&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="arial,helvetica,sans-serif"&gt;It seems to flush the data from the Kafka topic every minute.&amp;nbsp;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="arial,helvetica,sans-serif"&gt;One of my Kafka topics has 64 partitions.&amp;nbsp;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="arial,helvetica,sans-serif"&gt;64 (Kafka partitions) * 60 (minutes per hour) * 24 (hours per day) = 92160 files are written to one directory every day.&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="arial,helvetica,sans-serif"&gt;So I created a job to delete files older than n days. But this job is too slow because of the number of files in the directory where the HDFS sink connector writes the Parquet files.&lt;/FONT&gt;&lt;/P&gt;
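&lt;P&gt;&lt;FONT face="arial,helvetica,sans-serif"&gt;To put numbers on this, here is a rough Python sketch of the arithmetic above. It assumes one file per Kafka partition per flush, which is what I observe. The function name files_per_day and the parameter flush_interval_minutes are only illustrative names of mine, not connector settings.&lt;/FONT&gt;&lt;/P&gt;
&lt;PRE&gt;
```python
# Sketch only: estimates output file counts, assuming the connector writes
# one file per Kafka partition per flush (the behaviour described above).
# flush_interval_minutes is a hypothetical knob, not a real connector property.
MINUTES_PER_DAY = 60 * 24


def files_per_day(partitions: int, flush_interval_minutes: int) -> int:
    """Return the number of files written per day under the assumption above."""
    flushes_per_day = MINUTES_PER_DAY // flush_interval_minutes
    return partitions * flushes_per_day


# Observed behaviour: 64 partitions, one flush per minute.
print(files_per_day(64, 1))   # 92160
# What a hypothetical 15-minute flush interval would give.
print(files_per_day(64, 15))  # 6144
```
&lt;/PRE&gt;
&lt;P&gt;&lt;FONT face="arial,helvetica,sans-serif"&gt;So with 64 partitions, a 15-minute interval would mean 6144 files per day instead of 92160, if such a setting existed.&lt;/FONT&gt;&lt;/P&gt;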
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="arial,helvetica,sans-serif"&gt;I have two questions about the Cloudera HDFS sink connector:&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="arial,helvetica,sans-serif"&gt;1. Is it possible to write the files to a daily partition directory? e.g. /blah/{topicName}/{yyyyMMdd}&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="arial,helvetica,sans-serif"&gt;2. Is there a way to change the flush interval from every minute to something longer?&lt;/FONT&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 25 Mar 2022 13:14:42 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/How-to-change-flush-duration-of-cloudera-hdfs-sink-connector/m-p/339705#M233159</guid>
      <dc:creator>inyongkim</dc:creator>
      <dc:date>2022-03-25T13:14:42Z</dc:date>
    </item>
    <item>
      <title>Re: How to change flush duration of cloudera hdfs sink connector?</title>
      <link>https://community.cloudera.com/t5/Support-Questions/How-to-change-flush-duration-of-cloudera-hdfs-sink-connector/m-p/339986#M233234</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/95374"&gt;@inyongkim&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;At the moment this connector has no controls to adjust the flushing mechanism. We're aware of this, and Cloudera is working on making it more configurable so that it does not create a small-file problem in your destination cluster.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Cheers,&lt;/P&gt;&lt;P&gt;André&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 30 Mar 2022 01:00:38 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/How-to-change-flush-duration-of-cloudera-hdfs-sink-connector/m-p/339986#M233234</guid>
      <dc:creator>araujo</dc:creator>
      <dc:date>2022-03-30T01:00:38Z</dc:date>
    </item>
  </channel>
</rss>

