How to change the flush duration of the Cloudera HDFS Sink Connector?
Labels: Apache Kafka, HDFS
I am using the Cloudera HDFS Sink Connector ("com.cloudera.dim.kafka.connect.hdfs.HdfsSinkConnector") with a Parquet writer.
It seems to flush the data from the Kafka topic every minute.
One of my Kafka topics has 64 partitions, so 64 (Kafka partitions) * 60 (minutes per hour) * 24 (hours per day) = 92,160 files are written into a single directory every day.
So I created a job to delete files older than n days, but the job is too slow because of the sheer number of files in the directory where the connector writes the Parquet files.
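For reference, here is a minimal sketch of the kind of cleanup job I mean. It assumes pyarrow with HDFS support (libhdfs and the Hadoop client configured); the host, port, path, and retention period below are placeholders:

```python
from datetime import datetime, timedelta, timezone

from pyarrow import fs

# Placeholder values -- adjust to your cluster and retention policy.
HDFS_HOST = "namenode.example.com"
HDFS_PORT = 8020
SINK_DIR = "/blah/myTopic"          # directory the connector writes into
RETENTION = timedelta(days=7)

hdfs = fs.HadoopFileSystem(HDFS_HOST, HDFS_PORT)
cutoff = datetime.now(timezone.utc) - RETENTION

# Listing ~92k files in a single directory is itself slow,
# which is exactly the problem.
for info in hdfs.get_file_info(fs.FileSelector(SINK_DIR)):
    if info.is_file and info.mtime is not None and info.mtime < cutoff:
        hdfs.delete_file(info.path)
```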
I have two questions about the Cloudera HDFS Sink Connector:
1. Is it possible to sink the files into a daily partition directory? e.g. /blah/{topicName}/{yyyyMMdd}
2. Is there a way to change the flush duration to something other than every minute?
Created 03-29-2022 06:00 PM
Hi @inyongkim,
At the moment this connector has no controls to adjust the flushing mechanism. We're aware of this, and Cloudera is working on making it more configurable so that it does not create a small-file problem in your destination cluster.
Cheers,
André
Was your question answered? Please take some time to click on "Accept as Solution" below this post.
If you find a reply useful, say thanks by clicking on the thumbs up button.
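Until that configurability lands, a common mitigation (not a feature of this connector) is a periodic compaction job that merges the small Parquet files into larger ones. Here is a minimal sketch with pyarrow, assuming all files in the directory share one schema and fit in memory for a single compaction window; the paths and connection values are hypothetical:

```python
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs

hdfs = fs.HadoopFileSystem("namenode.example.com", 8020)  # hypothetical
sink_dir = "/blah/myTopic"                  # connector output directory
compacted = "/blah/myTopic-compacted/part-00000.parquet"

# Collect the small Parquet files written by the connector.
files = [f.path
         for f in hdfs.get_file_info(fs.FileSelector(sink_dir))
         if f.is_file and f.path.endswith(".parquet")]

# Read them all and rewrite as one larger file. For very large volumes
# this should be done in batches rather than in a single pass.
table = pa.concat_tables([pq.read_table(p, filesystem=hdfs) for p in files])
pq.write_table(table, compacted, filesystem=hdfs)

# Delete the originals only after the compacted file has been verified.
```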
