
How to increase Flume file size in HDFS?


I'm trying to get Flume to write larger files when putting Twitter data into HDFS. Currently there are a lot of 1 MB files, but I want fewer files of about 64 MB each.

This is my configuration:

TwitterAgent.sources = twitter
TwitterAgent.channels = memoryChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sources.twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.twitter.consumerKey = x
TwitterAgent.sources.twitter.consumerSecret = x
TwitterAgent.sources.twitter.accessToken = x-x
TwitterAgent.sources.twitter.accessTokenSecret = x
TwitterAgent.sources.twitter.keywords = wm2014
TwitterAgent.sources.twitter.maxBatchDurationMillis = 200 
TwitterAgent.sources.twitter.channels = memoryChannel
TwitterAgent.channels.memoryChannel.type = memory
TwitterAgent.channels.memoryChannel.capacity = 10000
TwitterAgent.channels.memoryChannel.transactionCapacity = 10000
TwitterAgent.sinks.HDFS.channel = memoryChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:8020/user/flume/tweets/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 10
TwitterAgent.sinks.HDFS.hdfs.rollSize = 66584576
TwitterAgent.sinks.HDFS.hdfs.rollCount = 0
TwitterAgent.sinks.HDFS.hdfs.useLocalTimeStamp = true

And why does the keywords line not work? I'm getting all tweets, not just the ones matching the keyword.

Re: How to increase Flume file size in HDFS?


Try this:


Delete: TwitterAgent.sources.twitter.maxBatchDurationMillis = 200


And set: TwitterAgent.sinks.HDFS.hdfs.batchSize = 64000


The duration is the time allowed for writing to HDFS, so 200 ms is too short; just don't put a restriction on it.
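
If the files still come out small, the roll settings may also be worth checking. The HDFS sink closes a file as soon as any one of its roll triggers fires, and the posted configuration never sets hdfs.rollInterval, which defaults to 30 seconds in the Flume documentation. That alone could explain files of about 1 MB. The fragment below is only a sketch of size-based rolling for the sink named HDFS in the question, not a tested setup. The 67108864 value is simply 64 * 1024 * 1024 bytes (the 66584576 in the question works out to 63.5 MB), and the idleTimeout of 60 seconds is an assumed value picked for illustration.

```properties
# Sketch only: roll files by size alone.
# 0 disables the time-based roll (the default is 30 seconds).
TwitterAgent.sinks.HDFS.hdfs.rollInterval = 0
# 64 MB expressed in bytes: 64 * 1024 * 1024.
TwitterAgent.sinks.HDFS.hdfs.rollSize = 67108864
# 0 disables the event-count roll (already set in the question).
TwitterAgent.sinks.HDFS.hdfs.rollCount = 0
# Assumed value: close a file that has received no events for 60 seconds,
# so a quiet stream does not leave a file open indefinitely.
TwitterAgent.sinks.HDFS.hdfs.idleTimeout = 60
```

With the time-based roll disabled, a file is closed only when it reaches rollSize or goes idle, so how quickly 64 MB files appear depends on how fast tweets arrive.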


Lefevre Kevin