New Contributor
Posts: 3
Registered: ‎11-07-2016

While using Exec source getting duplicates


Hi, Guys

 

I'm trying to stream my log files from different web folders using Flume, to process them and write to HDFS. The Flume agent runs fine, and as I read in the Flume user guide, the exec source will tail the file. When I start the agent it reads whatever records are already in the log, but then it goes idle: if any new records are written to the log, it does not read them. After we save that log, the exec source reads the log from the beginning again, so we get duplicates. Is there any way to avoid those duplicates and read only the new log entries?

 

Here is my config file:

 

agent.sources = localsource
agent.channels = memoryChannel
agent.sinks = avro_Sink


agent.sources.localsource.restartThrottle = 240000
agent.sources.localsource.type = exec
agent.sources.localsource.shell = /bin/bash -c
agent.sources.localsource.command = tail -F /home/admin1/teja/Flumedata/Behaviourlog
agent.sources.localsource.logStdErr = true
agent.sources.localsource.batchSize = 5
#agent.sources.localsource.fileHeader = true


# The channel can be defined as follows.
agent.sources.localsource.channels = memoryChannel

#Specify the channel the sink should use
agent.sinks.avro_Sink.channel = memoryChannel

# Each channel's type is defined.
agent.channels.memoryChannel.type = memory

# In this case, it specifies the capacity of the memory channel
agent.channels.memoryChannel.capacity = 10000
agent.channels.memoryChannel.transactionCapacity = 1000


# Each sink's type must be defined
agent.sinks.avro_Sink.type = avro
agent.sinks.avro_Sink.hostname = 192.168.44.444
agent.sinks.avro_Sink.port = 8021
agent.sinks.avro_Sink.avro.batchSize = 100
agent.sinks.avro_Sink.avro.rollCount = 0
agent.sinks.avro_Sink.avro.rollSize = 143060835
agent.sinks.avro_Sink.avro.rollInterval = 0

 

agent.sources.localsource.interceptors = search-replace regex-filter
agent.sources.localsource.interceptors.search-replace.type = search_replace
# Replace the '#' delimiters in the event body with '|'.
agent.sources.localsource.interceptors.search-replace.searchPattern = ###|##|#
agent.sources.localsource.interceptors.search-replace.replaceString = |

agent.sources.localsource.interceptors.regex-filter.type = regex_filter
# Drop the full event when its body matches the regex below.
agent.sources.localsource.interceptors.regex-filter.regex = .*(pagenotfound.php).*
agent.sources.localsource.interceptors.regex-filter.excludeEvents = true

 

Please help.

 

I also tried the TAILDIR source in the newer Flume version, and I'm getting the same duplicates problem with that source. Won't Flume read new entries as soon as data is written to the log?

 

Here is my TAILDIR source config:

agent.sources.localsource.type = TAILDIR
agent.sources.localsource.positionFile = /home/admin1/teja/flume/taildir_position.json
agent.sources.localsource.filegroups = f1
agent.sources.localsource.filegroups.f1 = /home/admin1/teja/Flumedata/Behaviourlog
agent.sources.localsource.batchSize = 20
agent.sources.localsource.writePosInterval = 2000
agent.sources.localsource.fileHeader = true

 

Please help.

Champion
Posts: 777
Registered: ‎05-16-2016

Re: While using Exec source getting duplicates

You may want to decrease the milliseconds in restartThrottle to see if it picks up the new log entries.

Also add this parameter in your flume.config:

restart (default: false): whether the executed command should be restarted if it dies.

If this parameter is not set, or is set to false, restartThrottle has no effect.
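
For example, something like this in your flume.config (the 10-second throttle below is only an illustration; tune it for your setup):

agent.sources.localsource.restart = true
# Restart the tail command 10 seconds after it dies (illustrative value).
agent.sources.localsource.restartThrottle = 10000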

The source expects the command to continuously produce data, and it ingests the command's output.

Also note that the Exec source is asynchronous in nature, so there is a possibility of data loss if the agent dies.

If data loss is something you want to avoid, you can use the Spooling Directory source instead.
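
A minimal sketch of a Spooling Directory source, assuming you rotate completed log files into a spool directory (the spool path below is only a placeholder):

agent.sources.localsource.type = spooldir
# Directory to watch for new, completed log files (placeholder path).
agent.sources.localsource.spoolDir = /home/admin1/teja/Flumedata/spool
agent.sources.localsource.channels = memoryChannel
agent.sources.localsource.fileHeader = true

Keep in mind that the Spooling Directory source expects each file dropped into the directory to be complete and immutable; it will not tail a file that is still being written, so you would need to rotate your logs into the spool directory.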