Support Questions

How to pass parameters into Flume Exec Source command

MilesYao · 02-11-2016

Hi -

I've checked the docs, read the O'Reilly book, Googled, and searched this forum, but found little that helps with what looks like it should be a common Flume use case:

I want to ingest the log files of a distributed application that runs on multiple hosts. They behave like typical Unix or web server logs: they sit in fixed directories and roll infrequently. I cannot modify the application or the log files themselves, so the ingestion has to be totally non-invasive. Here is where I stand:

1. The current Flume documentation recommends Spooling Dir Source over Exec Source for tailing logs, yet does not explain how to do that in a streaming fashion without modifying the source file. Spooling Dir Source requires that the source file be complete and closed before it is picked up, so it is batch-oriented rather than stream-oriented, and we can't use it for typical actively updated log files.

2. Exec Source should work, except that these log files don't have fixed names; they follow a pattern such as app-<role>-<rack>.log. The log directory is NFS-mounted and can be shared by multiple hosts, so it can contain:

app-role1-rack1.log
app-role2-rack1.log
app-role1-rack2.log
app-role2-rack2.log
....

Hadoop itself has such examples:

$ ls -l /var/log/hbase/
total 40836
-rw-r--r-- 1 hbase hbase 139032 Nov 20 17:40 hbase-cmf-hbase-HBASERESTSERVER-hou76072.log.out
-rw-r--r-- 1 hbase hbase 27859661 Feb 11 15:20 hbase-cmf-hbase-REGIONSERVER-hou76072.log.out

I would like to configure these values in flume-env.sh and pass them into the agent config file, or have the command itself call some script to derive them dynamically. Exec Source has a shell option that seems to support this:

The ‘shell’ config is used to invoke the ‘command’ through a command shell (such as Bash or Powershell). The ‘command’ is passed as an argument to ‘shell’ for execution. This allows the ‘command’ to use features from the shell such as wildcards, back ticks, pipes, loops, conditionals etc. In the absence of the ‘shell’ config, the ‘command’ will be invoked directly. Common values for ‘shell’ : ‘/bin/sh -c’, ‘/bin/ksh -c’, ‘cmd /c’, ‘powershell -Command’, etc.

a1.sources.tailsource-1.type = exec
a1.sources.tailsource-1.shell = /bin/bash -c
a1.sources.tailsource-1.command = for i in /path/*.txt; do cat $i; done
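
To see what the shell setting buys you, here is a minimal sketch run outside Flume, assuming bash lives at /bin/bash. The scratch directory and file names are made up for illustration. The whole command string is handed to the shell as a single argument after -c, which is why the glob and the loop get interpreted at all:

```shell
#!/bin/bash
# Scratch directory with two sample files (illustrative names only).
tmpdir=$(mktemp -d)
printf 'one\n' > "$tmpdir/a.txt"
printf 'two\n' > "$tmpdir/b.txt"

# The same kind of string you would put in the 'command' property.
command="for i in $tmpdir/*.txt; do cat \"\$i\"; done"

# Roughly what 'shell = /bin/bash -c' amounts to: the string is one argument,
# and the shell (not the caller) expands the wildcard and runs the loop.
with_shell=$(/bin/bash -c "$command")
echo "$with_shell"

# Clean up the scratch directory.
rm -rf "$tmpdir"
```

Without a shell in between, the same string would be run directly as a program with arguments, and nothing would expand the wildcard or the loop.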

However, I could not get it to work:

(flume-env.sh)

ROLE=`<some script or command>`
RACK=`<some script or command>`
... 
JAVA_OPTS="-Xms60m -Xmx360m -Drole=${ROLE} -Drack=${RACK} -Dhostname=${HOSTNAME} "

(flume_spool_avro_agent.conf)

....
spool_avro_agent.sources.s2.channels = c1
spool_avro_agent.sources.s2.type = exec
spool_avro_agent.sources.s2.shell = /bin/bash -c
spool_avro_agent.sources.s2.command = tail -F /var/log/app-${role}-${rack}.log
....

I verified that $JAVA_OPTS is correct, but the values don't seem to be passed to the command line:

2016-02-11 15:46:14,175 |INFO |org.apache.flume.source.ExecSource$StderrReader | |- StderrLogger[1] = '+ tail -F /var/log/app--.log'
2016-02-11 15:46:14,176 |INFO |org.apache.flume.source.ExecSource$StderrReader | |- StderrLogger[2] = 'tail: cannot open `/var/log/app--.log' for reading: No such file or directory'

So, what exactly is the runtime environment for this Exec Source shell? What kind of constraints does it have (compared to, say, ssh)?

Any insights from the trenches would be appreciated. Thanks!

pdvorak · 02-11-2016

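The Exec source launches its command with Java's ProcessBuilder:
https://docs.oracle.com/javase/7/docs/api/java/lang/ProcessBuilder.html

The child process inherits the environment of the running Flume process, so only variables that are actually in that environment (that is, exported) are visible to the command. JVM system properties set with -D are not environment variables.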
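
A minimal sketch of what that means in practice, run in plain bash rather than inside Flume (role1 and rack1 are placeholder values, and ROLE/RACK are assumed to be the variable names used in the command). A variable that is only assigned in the parent shell is not part of its environment, so a child started with bash -c sees it as empty, while an exported one is inherited:

```shell
#!/bin/bash
# Single quotes keep the parent from expanding the variables;
# the child shell expands them from its own environment.
cmd='echo "/var/log/app-${ROLE}-${RACK}.log"'

# Start clean, then assign WITHOUT export: the values stay local to this
# shell and are not in the environment the child inherits.
unset ROLE RACK
ROLE=role1
RACK=rack1
not_exported=$(/bin/bash -c "$cmd")
echo "$not_exported"

# Export: the values are now in the environment, so the child sees them.
export ROLE RACK
exported=$(/bin/bash -c "$cmd")
echo "$exported"
```

The first line printed is the same app--.log path that shows up in the StderrLogger output above; the second has both values filled in. Shell variable names are also case-sensitive, so the name used in the command has to match the exported name exactly.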

MilesYao · 02-11-2016

Exporting the variables in flume-env.sh seems to make them visible, and solves my immediate problem.

Also, passing the variables to the JVM (JAVA_OPTS="... -Dvar=x") doesn't seem to make a difference in this case (though it is required if you want to use them in log4j.properties):

export ROLE=`<some script or command>`
export RACK=`<some script or command>`
... 
JAVA_OPTS="-Xms60m -Xmx360m -Dhostname=${HOSTNAME} "
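
If it helps anyone, a guard along these lines could sit next to the exports so that an empty value fails loudly instead of quietly producing app--.log. This is only a sketch: derive_role and derive_rack are hypothetical stand-ins for whatever script or command actually produces the values.

```shell
#!/bin/bash
# Hypothetical stand-ins for the real commands that compute the values.
derive_role() { echo "role1"; }
derive_rack() { echo "rack1"; }

ROLE=$(derive_role)
RACK=$(derive_rack)

# ${VAR:?message} stops a non-interactive shell with the message
# if VAR is unset or empty, so a bad derivation is caught early.
: "${ROLE:?ROLE is empty, check the derivation command}"
: "${RACK:?RACK is empty, check the derivation command}"

# Export so that processes started by the agent inherit the values.
export ROLE RACK
```

Because the check aborts the shell that runs it, whether you want it in flume-env.sh itself or in a wrapper script is a judgment call.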

My larger questions still stand, though, and I'd welcome any comments.
