I'm trying to stream my log files from different web folders using Flume, process them, and write them to HDFS. The Flume agent runs fine, but as I read in the Flume User Guide, the exec source will tail a file. When I start the agent, it reads whatever records are already in the log and then goes idle. If new records are written to the log, it does not read them. After we save the log file, the exec source reads the log from the beginning again, so we get duplicates. Is there any way to avoid the duplicates and read only the new log entries?
Here is my config file:
agent.sources = localsource
agent.channels = memoryChannel
agent.sinks = avro_Sink
agent.sources.localsource.shell = /bin/bash -c
agent.sources.localsource.command = tail -F /home/admin1/teja/Flumedata/Behaviourlog
#agent.sources.localsource.fileHeader = true
# The channel can be defined as follows.
agent.sources.localsource.channels = memoryChannel
#Specify the channel the sink should use
agent.sinks.avro_Sink.channel = memoryChannel
# Each channel's type is defined.
agent.channels.memoryChannel.type = memory
# In this case, it specifies the capacity of the memory channel
agent.channels.memoryChannel.capacity = 10000
agent.channels.memoryChannel.transactionCapacity = 1000
# Each sink's type must be defined
agent.sinks.avro_Sink.type = avro
agent.sinks.avro_Sink.avro.batchSize = 100
agent.sinks.avro_Sink.avro.rollCount = 0
agent.sinks.avro_Sink.avro.rollSize = 143060835
agent.sinks.avro_Sink.avro.rollInterval = 0
agent.sources.localsource.interceptors = search-replace regex-filter
agent.sources.localsource.interceptors.search-replace.type = search_replace
# Remove leading alphanumeric characters in an event body.
agent.sources.localsource.interceptors.regex-filter.type = regex_filter
###Remove full event body.
agent.sources.localsource.interceptors.regex-filter.regex = .*(pagenotfound.php).*
agent.sources.localsource.interceptors.regex-filter.excludeEvents = true
I also tried the TAILDIR source in a newer Flume version, and I get the same duplicates problem with it. Does Flume not read new data immediately when it is written to the log?
Here is my TAILDIR source config:
agent.sources.localsource.type = TAILDIR
agent.sources.localsource.positionFile = /home/admin1/teja/flume/taildir_position.json
agent.sources.localsource.filegroups = f1
agent.sources.localsource.filegroups.f1 = /home/admin1/teja/Flumedata/Behaviourlog
agent.sources.localsource.batchSize = 20
agent.sources.localsource.fileHeader = true
You may want to decrease the milliseconds in restartThrottle to see whether it picks up the new log entries.
Also set the restart parameter to true in your Flume config. The user guide describes it as:
restart (default: false): Whether the executed cmd should be restarted if it dies
If this parameter is not set, or is set to false, restartThrottle has no effect.
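As a rough sketch of the suggestion above, the exec source from the question could be configured like this. The 1000 ms throttle is only an illustrative value (the default is 10000 ms), and the `type = exec` line is assumed, since it does not appear in the posted config:

```properties
# Exec source that re-runs the tail command if it exits
# (type = exec is assumed; it is not shown in the posted config)
agent.sources.localsource.type = exec
agent.sources.localsource.shell = /bin/bash -c
agent.sources.localsource.command = tail -F /home/admin1/teja/Flumedata/Behaviourlog
# Restart the command if it dies; without this, restartThrottle is ignored
agent.sources.localsource.restart = true
# Wait this many milliseconds before restarting (illustrative value)
agent.sources.localsource.restartThrottle = 1000
agent.sources.localsource.channels = memoryChannel
```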
The exec source expects the command to continuously produce data and ingests its output.
The exec source is asynchronous in nature, so there is a possibility of data loss if the agent dies.
If data loss is something you want to avoid, you can use the Spooling Directory source.
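A minimal sketch of a Spooling Directory source follows, assuming completed log files are moved into a dedicated directory. The spoolDir path here is hypothetical. Note that this source expects files to be complete and unchanged once they are placed in the directory, so it would not fit a log that is still being appended to:

```properties
# Spooling Directory source: ingests files dropped into spoolDir
agent.sources.localsource.type = spooldir
# Hypothetical directory; rotated or finished log files would be moved here
agent.sources.localsource.spoolDir = /home/admin1/teja/Flumedata/spool
# Add a header with the absolute path of the source file to each event
agent.sources.localsource.fileHeader = true
agent.sources.localsource.channels = memoryChannel
```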