How to change flush duration of cloudera hdfs sink connector?

New Contributor

I am using the Cloudera HDFS sink connector with parquetWriter:

"com.cloudera.dim.kafka.connect.hdfs.HdfsSinkConnector"

 

It seems to flush the data from the Kafka topic every minute.

One of my Kafka topics has 64 partitions.

64 (Kafka partitions) * 60 (minutes per hour) * 24 (hours per day) = 92,160 files are written to a single directory every day.
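To make the arithmetic concrete, here is a small Python sketch. It assumes one file per partition per flush, which is what I seem to observe; the function name and the hourly comparison are purely illustrative and do not correspond to any connector setting.

```python
def files_per_day(partitions: int, flushes_per_hour: int) -> int:
    """Estimate files written per day, assuming one file per partition per flush."""
    return partitions * flushes_per_hour * 24


# Observed behaviour: 64 partitions, one flush per minute.
print(files_per_day(64, 60))

# Hypothetical hourly flush, shown only for comparison.
print(files_per_day(64, 1))
```

With a one-minute flush this gives the 92,160 files per day mentioned above; a hypothetical hourly flush would give 1,536.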

So I created a job to delete files older than n days, but the job is too slow because of the number of Parquet files in the directory that the HDFS sink connector writes to.

 

I have two questions about the Cloudera HDFS sink connector:

1. Is it possible to write the files to a daily partition directory, e.g. /blah/{topicName}/{yyyyMMdd}?

2. Is there a way to change the flush interval so that it is not every minute?
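To illustrate the layout I mean in question 1, here is a minimal Python sketch. The helper name `daily_dir` and the example topic name are hypothetical; this only shows the directory naming I have in mind, not anything the connector provides.

```python
from datetime import date


def daily_dir(base: str, topic: str, day: date) -> str:
    """Build a path of the form {base}/{topicName}/{yyyyMMdd}."""
    return f"{base.rstrip('/')}/{topic}/{day.strftime('%Y%m%d')}"


# Example with a made-up topic name and date.
print(daily_dir("/blah", "my_topic", date(2024, 1, 5)))
```

With this layout, the cleanup job could drop whole day directories instead of scanning tens of thousands of files in a single directory.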

1 ACCEPTED SOLUTION

Super Guru

Hi @inyongkim ,

 

At the moment this connector has no controls to adjust the flushing mechanism. We're aware of this, and Cloudera is working on making flushing more configurable so that it does not create a small-file problem in your destination cluster.

 

Cheers,

André

 

