Created on 02-11-2016 01:57 PM - edited 09-16-2022 03:03 AM
Hi -
I've checked the docs, read the O'Reilly book, Googled, and searched this forum, but did not find much that helps with what looks like it should be a common Flume use case:
I want to ingest the log files of a distributed application that runs on multiple hosts. They behave like typical Unix or web server logs - they live in fixed directories and roll infrequently. I cannot modify the application or the log files themselves - the ingestion has to be completely non-invasive. So far so good:
1. The current Flume documentation recommends the Spooling Directory Source over the Exec Source for tailing logs, yet does not explain how to do that in a streaming fashion without modifying the source file. The Spooling Directory Source requires that a source file be completely written and closed before ingestion - it's batch- rather than stream-oriented - so it can't be used for typical actively-updated log files.
2. Now, using Exec Source should work, except that these log files don't have constant file names - e.g. app-<role>-<rack>.log. The log directory is NFS-mounted and can be shared by multiple hosts, so it can contain:
app-role1-rack1.log
app-role2-rack1.log
app-role1-rack2.log
app-role2-rack2.log
....
Hadoop's own logs are an example of this:
$ ls -l /var/log/hbase/
total 40836
-rw-r--r-- 1 hbase hbase 139032 Nov 20 17:40 hbase-cmf-hbase-HBASERESTSERVER-hou76072.log.out
-rw-r--r-- 1 hbase hbase 27859661 Feb 11 15:20 hbase-cmf-hbase-REGIONSERVER-hou76072.log.out
I would like to configure these values in flume-env.sh and pass them into the agent config file, or have the command itself call a script to derive them dynamically. The Exec Source has a shell option that seems to support this:
The ‘shell’ config is used to invoke the ‘command’ through a command shell (such as Bash or Powershell). The ‘command’ is passed as an argument to ‘shell’ for execution. This allows the ‘command’ to use features from the shell such as wildcards, back ticks, pipes, loops, conditionals etc. In the absence of the ‘shell’ config, the ‘command’ will be invoked directly. Common values for ‘shell’ : ‘/bin/sh -c’, ‘/bin/ksh -c’, ‘cmd /c’, ‘powershell -Command’, etc.
a1.sources.tailsource-1.type = exec
a1.sources.tailsource-1.shell = /bin/bash -c
a1.sources.tailsource-1.command = for i in /path/*.txt; do cat $i; done
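That shell feature also suggests a streaming variant for multiple files - an untested sketch, assuming GNU tail (-q suppresses the '==> file <==' headers that would otherwise leak into the event stream, at the cost of losing the filename; note bash expands the glob only once, at agent startup, so files created later would be missed):
a1.sources.tailsource-1.type = exec
a1.sources.tailsource-1.shell = /bin/bash -c
a1.sources.tailsource-1.command = tail -q -F /var/log/app-*.log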
However, I could not get it to work:
(flume-env.sh)
ROLE=`<some script or command>`
RACK=`<some script or command>`
...
JAVA_OPTS="-Xms60m -Xmx360m -Drole=${ROLE} -Drack=${RACK} -Dhostname=${HOSTNAME} "
(flume_spool_avro_agent.conf)
....
spool_avro_agent.sources.s2.channels = c1
spool_avro_agent.sources.s2.type = exec
spool_avro_agent.sources.s2.shell = /bin/bash -c
spool_avro_agent.sources.s2.command = tail -F /var/log/app-${role}-${rack}.log
....
I verified that $JAVA_OPTS is correct, but the values don't seem to be passed to the command line:
2016-02-11 15:46:14,175 |INFO |org.apache.flume.source.ExecSource$StderrReader | |- StderrLogger[1] = '+ tail -F /var/log/app--.log'
2016-02-11 15:46:14,176 |INFO |org.apache.flume.source.ExecSource$StderrReader | |- StderrLogger[2] = 'tail: cannot open `/var/log/app--.log' for reading: No such file or directory'
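For what it's worth, reproducing the expansion by hand outside the agent gives the same empty result unless the variables are actually present in the child shell's environment:
$ /bin/bash -c 'echo tail -F /var/log/app-${role}-${rack}.log'
tail -F /var/log/app--.log
$ role=role1 rack=rack1 /bin/bash -c 'echo tail -F /var/log/app-${role}-${rack}.log'
tail -F /var/log/app-role1-rack1.log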
So, what exactly is the runtime environment for this Exec Source shell? What kind of constraints does it have (compared to, say, ssh)?
Any insights from the trenches would be appreciated. Thanks!
Created 02-11-2016 02:31 PM
Exporting the variables in flume-env.sh makes them visible to the Exec Source's shell, and solves my immediate problem.
Also, passing the variables as Java system properties (JAVA_OPTS="... -Dvar=x") doesn't seem to make a difference in this case (though it is required if you want to use them in log4j.properties):
export ROLE=`<some script or command>`
export RACK=`<some script or command>`
...
JAVA_OPTS="-Xms60m -Xmx360m -Dhostname=${HOSTNAME} "
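This matches how child processes behave in general: the Exec Source apparently launches the command as a child of the agent JVM, and a child process inherits exported environment variables but not JVM system properties (-D values live inside the JVM, visible via System.getProperty, and never appear in the environment). Plain bash shows the same distinction:
ROLE=role1                        # shell-local variable - not inherited by children
/bin/bash -c 'echo role=${ROLE}'  # prints: role=
export ROLE=role1                 # now part of the environment - inherited
/bin/bash -c 'echo role=${ROLE}'  # prints: role=role1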
My larger questions still stand, though, and I'd welcome any comments.
Created 02-11-2016 04:02 PM