I am using the Cloudera HDFS Sink Connector ("com.cloudera.dim.kafka.connect.hdfs.HdfsSinkConnector") with ParquetWriter.
It seems to flush the data from the Kafka topic every minute.
One of my Kafka topics has 64 partitions.
64 (Kafka partitions) * 60 (minutes per hour) * 24 (hours per day) = 92,160 files are written every day into a single directory.
So I created a job to delete files more than n days old, but the job is too slow because of the sheer number of files in the directory where the Parquet files are written by the HDFS sink connector.
I have two questions about the Cloudera HDFS Sink Connector:
1. Is it possible to write the files into daily partition directories? e.g. /blah/{topicName}/{yyyyMMdd}
2. Is there a way to change the flush interval instead of flushing every minute?
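For context, the retention logic of the cleanup job looks roughly like this. This is a minimal sketch, not the actual job: the function name and the listing format are illustrative, and in practice the (path, modification time) pairs would come from an HDFS directory listing (e.g. parsed `hdfs dfs -ls` output), which with ~92,000 files per day is itself the bottleneck.

```python
from datetime import datetime, timedelta

def files_to_delete(listing, n_days, now=None):
    """Given (path, modification_time) pairs from a directory listing,
    return the paths whose files are more than n_days old."""
    now = now or datetime.now()
    cutoff = now - timedelta(days=n_days)
    return [path for path, mtime in listing if mtime < cutoff]

# Illustrative listing; real entries would be parsed from an HDFS listing.
listing = [
    ("/blah/myTopic/old.parquet", datetime(2022, 3, 1)),
    ("/blah/myTopic/recent.parquet", datetime(2022, 3, 28)),
]
old_files = files_to_delete(listing, 7, now=datetime(2022, 3, 29))
# old_files == ["/blah/myTopic/old.parquet"]
```

With daily partition directories, the same job could instead drop whole dated directories older than the cutoff, avoiding a per-file scan entirely.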
Created 03-29-2022 06:00 PM
Hi @inyongkim ,
At the moment this connector has no controls to adjust the flushing mechanism. We're aware of that, and Cloudera is working on making it more configurable so that it does not create a small-file problem in your destination cluster.
Cheers,
André