
Flume - multiplexing by event date

I'm trying to use Flume to consume a Kafka topic that receives around 40k messages per second. My current problem is that the directory structure required for the HDFS sink creates too many files, because event timestamps can vary by days.

node1.sinks.sink1.type = hdfs
node1.sinks.sink1.channel = channel1
node1.sinks.sink1.hdfs.path = hdfs://datawarehouse/data/flume_%{topic}/batch=%{batch}/year=%Y/month=%m/day=%d/hour=%H

Is it possible to use multiplexing to send events older than 24 hours to a separate channel? That way I could send those events to an HDFS sink without the "hour=%H" directory partition, which would reduce the number of files dramatically.

1 REPLY

Re: Flume - multiplexing by event date

We got this working with a custom interceptor that inserts a new header whose value marks each event as old or not. A multiplexing channel selector can then route events on that header.
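
For anyone landing here later, this is a rough sketch of what such a setup might look like, not our exact configuration. The source name (source1), the second channel and sink (channel2, sink2), the header name (age) with its value (old), and the interceptor class (com.example.flume.EventAgeInterceptor) are all hypothetical placeholders. The interceptor itself is custom code you would have to write; it is assumed here to set age=old on events older than 24 hours.

```
# Component lists must include the extra channel and sink (names assumed).
node1.channels = channel1 channel2
node1.sinks = sink1 sink2

# Hypothetical custom interceptor that adds an "age" header to each event.
node1.sources.source1.channels = channel1 channel2
node1.sources.source1.interceptors = ageTagger
node1.sources.source1.interceptors.ageTagger.type = com.example.flume.EventAgeInterceptor$Builder

# Multiplexing selector: events tagged age=old go to channel2,
# everything else falls through to channel1.
node1.sources.source1.selector.type = multiplexing
node1.sources.source1.selector.header = age
node1.sources.source1.selector.mapping.old = channel2
node1.sources.source1.selector.default = channel1

# Second HDFS sink for old events, without the hour=%H partition.
node1.sinks.sink2.type = hdfs
node1.sinks.sink2.channel = channel2
node1.sinks.sink2.hdfs.path = hdfs://datawarehouse/data/flume_%{topic}/batch=%{batch}/year=%Y/month=%m/day=%d
```

The existing sink1 keeps its hourly path and only sees recent events through channel1. Both channels still need their own type and capacity settings, which are omitted here.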