01-13-2019 06:03 PM - last edited on 01-14-2019 06:03 AM by cjervis
I'm trying to implement Flume to consume a Kafka topic that receives around 40k messages per second. My current problem is that the directory structure required for the HDFS sink creates too many files, since the event timestamps can vary by days.
node1.sinks.sink1.type = hdfs
node1.sinks.sink1.channel = channel1
node1.sinks.sink1.hdfs.path = hdfs://datawarehouse/data/flume_%{topic}/batch=%{batch}/year=%Y/month=%m/day=%d/hour=%H
Is it possible to use multiplexing to send events older than 24 hours to a separate channel? That way I could route those to a second HDFS sink without the "hour=%H" directory partition, which would reduce the number of files dramatically. A sketch of what I have in mind is below.
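Since Flume's multiplexing channel selector routes on an event header, I'm assuming I would first need a custom interceptor that compares each event's timestamp to the current time and stamps a header accordingly. A minimal sketch, where the "age" header name, its "recent"/"old" values, and the channel2/sink2 pair are all hypothetical names of mine:

# Assumes a custom interceptor has already set an "age" header on each
# event: "recent" if the timestamp is within 24 hours, "old" otherwise.
node1.sources.source1.selector.type = multiplexing
node1.sources.source1.selector.header = age
node1.sources.source1.selector.mapping.recent = channel1
node1.sources.source1.selector.mapping.old = channel2
node1.sources.source1.selector.default = channel1

# Second sink for old events, partitioned only down to the day level.
node1.sinks.sink2.type = hdfs
node1.sinks.sink2.channel = channel2
node1.sinks.sink2.hdfs.path = hdfs://datawarehouse/data/flume_%{topic}/batch=%{batch}/year=%Y/month=%m/day=%d

Does this approach make sense, or is there a better way to handle late-arriving events?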