I have Flume set up in Cloudera with both an HDFS sink and an S3 sink, and I am able to get the logs from Kafka into both of them. My issue is that the count of logs in HDFS is higher than in S3 by at least a few million; writing to S3 seems to be slow compared to HDFS. I tried increasing and decreasing the batch size and the sink size, and in both scenarios I get errors about being unable to put the batch into the memory channel. Can you let me know how to increase the throughput to S3?
Below is my configuration
agent.sources = sync_s3
agent.channels = channel_s3
agent.sinks = s3_sync

agent.sources.sync_s3.type = org.apache.flume.source.kafka.KafkaSource
agent.sources.sync_s3.zookeeperConnect = zookeer
agent.sources.sync_s3.topic = kafka_topic
agent.sources.sync_s3.groupId = s3sink_test
agent.sources.sync_s3.consumer.timeout.ms = 10000
agent.sources.sync_s3.auto.commit.enabled = true
agent.sources.sync_s3.batchDurationMillis = 10000
agent.sources.sync_s3.batchSize = 30000
agent.sources.sync_s3.channels = channel_s3

agent.channels.channel_s3.type = memory
agent.channels.channel_s3.capacity = 100000
agent.channels.channel_s3.transactionCapacity = 100000

agent.sinks.s3_sync.channel = channel_s3
agent.sinks.s3_sync.type = hdfs
agent.sinks.s3_sync.hdfs.useLocalTimeStamp = true
agent.sinks.s3_sync.hdfs.path = s3n://key@test/%{topic}/utc=%s
agent.sinks.s3_sync.hdfs.roundUnit = minute
agent.sinks.s3_sync.hdfs.round = true
agent.sinks.s3_sync.hdfs.roundValue = 10
agent.sinks.s3_sync.hdfs.fileSuffix = .avro
agent.sinks.s3_sync.hdfs.fileType = DataStream
agent.sinks.s3_sync.hdfs.maxOpenFiles = 100
agent.sinks.s3_sync.hdfs.appendTimeout = 10000
agent.sinks.s3_sync.hdfs.callTimeout = 180000
agent.sinks.s3_sync.hdfs.rollInterval = 600
agent.sinks.s3_sync.hdfs.rollSize = 0
agent.sinks.s3_sync.hdfs.rollCount = 50000
agent.sinks.s3_sync.hdfs.batchSize = 50000
agent.sinks.s3_sync.hdfs.threadsPoolSize = 100
agent.sinks.s3_sync.hdfs.rollTimerPoolSize = 1
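
For what it's worth, this is my understanding of the sizing rules I have been trying to respect while tuning; the values below are purely illustrative, not my real settings:

# Illustrative values only. The constraints, as I understand them:
#   - source batchSize        <= channel transactionCapacity
#   - sink hdfs.batchSize     <= channel transactionCapacity
#   - transactionCapacity     <= channel capacity
# A put fails on the memory channel either when a single batch is bigger
# than transactionCapacity, or when the channel has filled up because the
# sink drains it more slowly than the source fills it.
agent.sources.sync_s3.batchSize = 10000
agent.sinks.s3_sync.hdfs.batchSize = 10000
agent.channels.channel_s3.transactionCapacity = 20000
agent.channels.channel_s3.capacity = 200000

With my real values (source batch 30000, sink batch 50000, channel capacity and transactionCapacity both 100000) those constraints already seem to be met, so I suspect the channel is simply filling up because the S3 writes cannot keep up.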