Explorer
Posts: 7
Registered: 08-08-2017

flume to s3 sink is missing data compared to hdfs

I have Flume set up in Cloudera with both an HDFS sink and an S3 sink, and I am able to get the logs from Kafka into both of them. My issue is that the count of the logs in HDFS is higher than in S3 by at least a few million; writing to S3 seems to be slow compared to HDFS. I tried increasing and decreasing the batch size and sink size, and in both scenarios I get errors about being unable to put the batch into the memory channel. Can you let me know how to increase the throughput to S3?

Below is my configuration:

agent.sources = sync_s3
agent.channels = channel_s3
agent.sinks = s3_sync

agent.sources.sync_s3.type = org.apache.flume.source.kafka.KafkaSource
agent.sources.sync_s3.zookeeperConnect = zookeer
agent.sources.sync_s3.topic = kafka_topic
agent.sources.sync_s3.groupId = s3sink_test
agent.sources.sync_s3.consumer.timeout.ms = 10000
agent.sources.sync_s3.auto.commit.enabled = true
agent.sources.sync_s3.batchDurationMillis = 10000
agent.sources.sync_s3.batchSize = 30000
agent.sources.sync_s3.channels = channel_s3

agent.channels.channel_s3.type = memory
agent.channels.channel_s3.capacity = 100000
agent.channels.channel_s3.transactionCapacity = 100000

agent.sinks.s3_sync.channel = channel_s3
agent.sinks.s3_sync.type = hdfs
agent.sinks.s3_sync.hdfs.useLocalTimeStamp = true
agent.sinks.s3_sync.hdfs.path = s3n://key@test/%{topic}/utc=%s
agent.sinks.s3_sync.hdfs.roundUnit = minute
agent.sinks.s3_sync.hdfs.round = true
agent.sinks.s3_sync.hdfs.roundValue = 10
agent.sinks.s3_sync.hdfs.fileSuffix = .avro
agent.sinks.s3_sync.hdfs.fileType = DataStream
agent.sinks.s3_sync.hdfs.maxOpenFiles = 100
agent.sinks.s3_sync.hdfs.appendTimeout = 10000
agent.sinks.s3_sync.hdfs.callTimeout = 180000
agent.sinks.s3_sync.hdfs.rollInterval = 600
agent.sinks.s3_sync.hdfs.rollSize = 0
agent.sinks.s3_sync.hdfs.rollCount = 50000
agent.sinks.s3_sync.hdfs.batchSize = 50000
agent.sinks.s3_sync.hdfs.threadsPoolSize = 100
agent.sinks.s3_sync.hdfs.rollTimerPoolSize = 1
Champion
Posts: 595
Registered: 05-16-2016

Re: flume to s3 sink is missing data compared to hdfs

@puth Looks like you might want to tweak the JAVA_OPTS in the flume-env.sh file.

Would you consider increasing your current JAVA_OPTS min and max values? I would guess that will solve the memory issue. In the meantime, you can also use JMX for memory monitoring.
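For example, something along these lines in flume-env.sh (the heap sizes below are placeholder assumptions, size them to your agent's host):

# flume-env.sh -- raise the agent heap and expose JMX for monitoring
export JAVA_OPTS="-Xms1g -Xmx4g -Dcom.sun.management.jmxremote"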

Explorer
Posts: 7
Registered: 08-08-2017

Re: flume to s3 sink is missing data compared to hdfs

Hi,

The memory issue is not related to the Java heap. It is an issue with the channel memory allocated to hold 100k records. I increased the channel capacity to hold 300k records, and it still fills up quickly before the sink drains it. My question is: how do I speed up the sink process to S3?
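For reference, the channel section now looks roughly like this (capacity raised as described; transactionCapacity left as before):

agent.channels.channel_s3.type = memory
agent.channels.channel_s3.capacity = 300000
agent.channels.channel_s3.transactionCapacity = 100000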

Champion
Posts: 595
Registered: ‎05-16-2016

Re: flume to s3 sink is missing data compared to hdfs

Since you stated "i get unable to put the batch in the memory issues", I assumed it had to do with the heap size. My bad.

Well, the only way I can think of to fix this issue is to add more sinks, more of a parallelism approach. A rough sketch is below.
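Assuming two HDFS sinks draining the same channel, it could look roughly like this (the sink names and file prefixes are illustrative; a distinct hdfs.filePrefix per sink avoids filename collisions under the same S3 path):

agent.sinks = s3_sync_1 s3_sync_2

# first sink
agent.sinks.s3_sync_1.channel = channel_s3
agent.sinks.s3_sync_1.type = hdfs
agent.sinks.s3_sync_1.hdfs.filePrefix = sync1
# ...repeat the rest of your hdfs.* settings for this sink

# second sink, same channel -- each sink drains the channel on its own thread
agent.sinks.s3_sync_2.channel = channel_s3
agent.sinks.s3_sync_2.type = hdfs
agent.sinks.s3_sync_2.hdfs.filePrefix = sync2
# ...repeat the rest of your hdfs.* settings for this sink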

 

Side note:

If this were Kafka or NiFi, we could lean on the back-pressure functionality; it is really hard to manage this in Flume.

Explorer
Posts: 7
Registered: 08-08-2017

Re: flume to s3 sink is missing data compared to hdfs

I already have sinks equal to the number of partitions of the Kafka topic. I tried everything I could think of before posting here for expert advice.

Explorer
Posts: 7
Registered: 08-08-2017

Re: flume to s3 sink is missing data compared to hdfs

Any help on this is appreciated. Please advise.
