
While using Exec source getting duplicates


New Contributor

Hi guys,

I'm trying to stream my log files from different web folders using Flume, to process them and write to HDFS. The Flume agent runs fine, and as I read in the Flume user guide, the Exec source will tail the file. When I start the agent it reads whatever records are already in the log, but then it goes idle: if new records are written to the log, it does not read them. After I save the log file, the Exec source reads the whole log from the beginning again, so I get duplicates. Is there any way to avoid the duplicates and read only new log entries?
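One variant worth noting (a minimal sketch, assuming the same source name and file path as in the config below): starting tail with -n 0 makes it begin at the end of the file, so records already in the log are not replayed when the agent starts. If the file is replaced on save, though, tail -F will still re-read the new file from the beginning.

# Sketch: start tailing at the end of the file so existing records
# are not replayed when the agent starts.
agent.sources.localsource.command = tail -n 0 -F /home/admin1/teja/Flumedata/Behaviourlog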

 

Here is my config file:

 

agent.sources = localsource
agent.channels = memoryChannel
agent.sinks = avro_Sink


agent.sources.localsource.restartThrottle = 240000
agent.sources.localsource.type = exec
agent.sources.localsource.shell = /bin/bash -c
agent.sources.localsource.command = tail -F /home/admin1/teja/Flumedata/Behaviourlog
agent.sources.localsource.logStdErr = true
agent.sources.localsource.batchSize = 5
#agent.sources.localsource.fileHeader = true


# The channel can be defined as follows.
agent.sources.localsource.channels = memoryChannel

#Specify the channel the sink should use
agent.sinks.avro_Sink.channel = memoryChannel

# Each channel's type is defined.
agent.channels.memoryChannel.type = memory

# In this case, it specifies the capacity of the memory channel
agent.channels.memoryChannel.capacity = 10000
agent.channels.memoryChannel.transactionCapacity = 1000


# Each sink's type must be defined
agent.sinks.avro_Sink.type = avro
agent.sinks.avro_Sink.hostname= 192.168.44.444
agent.sinks.avro_Sink.port= 8021
agent.sinks.avro_Sink.avro.batchSize = 100
agent.sinks.avro_Sink.avro.rollCount = 0
agent.sinks.avro_Sink.avro.rollSize = 143060835
agent.sinks.avro_Sink.avro.rollInterval = 0

 

agent.sources.localsource.interceptors = search-replace regex-filter
agent.sources.localsource.interceptors.search-replace.type = search_replace
# Replace the #, ## and ### delimiters in the event body with |.
agent.sources.localsource.interceptors.search-replace.searchPattern = ###|##|#
agent.sources.localsource.interceptors.search-replace.replaceString = |

agent.sources.localsource.interceptors.regex-filter.type = regex_filter
# Drop (exclude) any event whose body matches the regex.
agent.sources.localsource.interceptors.regex-filter.regex = .*(pagenotfound.php).*
agent.sources.localsource.interceptors.regex-filter.excludeEvents = true
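To illustrate what the two interceptors do together (the input lines here are hypothetical, purely for illustration):

# Hypothetical event bodies, before and after the interceptors:
#   "user1###pageA##click"           -> "user1|pageA|click"
#   "user2###pagenotfound.php##GET"  -> event dropped (excludeEvents = true)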

 

Please help.

 

I also tried the TailDir source in the newer Flume version, and I get the same duplicates problem with that source. Won't Flume read new data instantly when it is written to the log?

 

Here is my TailDir source config:

agent.sources.localsource.type = TAILDIR
agent.sources.localsource.positionFile = /home/admin1/teja/flume/taildir_position.json
agent.sources.localsource.filegroups = f1
agent.sources.localsource.filegroups.f1 = /home/admin1/teja/Flumedata/Behaviourlog
agent.sources.localsource.batchSize = 20
agent.sources.localsource.writePosInterval = 2000
agent.sources.localsource.fileHeader = true
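A likely cause, assuming the log is being saved from an editor: TAILDIR tracks each file by inode in the position file, so if saving replaces the file (new inode), the source sees it as a brand-new file and re-reads it from position 0, which produces the duplicates. Data appended in place is picked up almost immediately, which can be checked with a hypothetical test line:

# Appending in place keeps the inode, so Flume should see this right away:
echo "test event" >> /home/admin1/teja/Flumedata/Behaviourlog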

 

Please help.

1 REPLY

Re: While using Exec source getting duplicates

Champion

You may want to decrease the milliseconds in restartThrottle to see if it is picking up the new log entries.

Also set this parameter in your Flume config:

restart    false    Whether the executed command should be restarted if it dies

because if restart is left unset or false, restartThrottle has no effect.
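A minimal sketch of the two properties together, assuming the same source name as in your config (the 10-second throttle is just an example value):

# Restart the tail command if it dies, waiting 10 seconds between attempts.
agent.sources.localsource.restart = true
agent.sources.localsource.restartThrottle = 10000

Keep in mind that each restart of a plain tail -F replays the last 10 lines of the file, so combining restart = true with tail -n 0 -F avoids that extra source of duplicates.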

The source expects the command to continuously produce data, and it ingests the command's output.

The Exec source is asynchronous in nature, so there is a possibility of data loss if the agent dies.

If data loss is something you want to avoid, you can use the Spooling Directory source instead.
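A minimal sketch of a Spooling Directory source, with a hypothetical source name and spool path. Files must be fully written and closed before they are dropped into the spool directory; Flume renames each file with a .COMPLETED suffix once it has been ingested, which is how it avoids re-reading them:

# Spooling Directory source: reliable delivery, but files must be
# immutable once they are placed in the spool directory.
agent.sources.spoolsrc.type = spooldir
agent.sources.spoolsrc.spoolDir = /home/admin1/teja/Flumedata/spool
agent.sources.spoolsrc.fileHeader = true
agent.sources.spoolsrc.channels = memoryChannel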