I am in a situation where data is continuously loaded into a particular folder, organized date-wise. A filename looks like
21511182016, where 215 is the folder name followed by a timestamp in MM/DD/YYYY order.
I have used the following configuration for the log data.
# Define two sources, one for the access log and one for the error log
agent1.sources = tailAccessSource tailErrorSource

# I define one sink
agent1.sinks = hdfsSink

# I define one channel
agent1.channels = memChannel01

# Bind the sources and sink to the channel
# Both sources will use the memory channel
agent1.sources.tailAccessSource.channels = memChannel01
agent1.sources.tailErrorSource.channels = memChannel01
agent1.sinks.hdfsSink.channel = memChannel01

# Define the type and options for each source
agent1.sources.tailAccessSource.type = exec
agent1.sources.tailAccessSource.command = tail -F /var/log/httpd/access_log
agent1.sources.tailErrorSource.type = exec
agent1.sources.tailErrorSource.command = tail -F /var/log/httpd/error_log

# Define the type and options for the channel
agent1.channels.memChannel01.type = memory
agent1.channels.memChannel01.capacity = 100000
agent1.channels.memChannel01.transactionCapacity = 10000

# Define the type and options for the sink
# Note: namenode is the hostname of the Hadoop NameNode server
# flume/data-example.1/ is the directory where the Apache logs will be stored
agent1.sinks.hdfsSink.type = hdfs
agent1.sinks.hdfsSink.hdfs.path = hdfs://namenode/flume/data-example.1/
agent1.sinks.hdfsSink.hdfs.fileType = DataStream
agent1.sinks.hdfsSink.hdfs.rollCount = 0
agent1.sinks.hdfsSink.hdfs.rollSize = 0
agent1.sinks.hdfsSink.hdfs.rollInterval = 60
Here the file inside the folder is hard-coded. How can Flume pick up filenames automatically from the folder, given the timestamp in the name?
E.g. 21511182016 is the filename.
agent1.sources.tailAccessSource.command = tail -F /var/log/httpd/access_log (here the filename is hard-coded)
I need Flume to pick up the files automatically from a folder, with the source being FTP server logs (text files), instead of
tail -F /var/log/httpd/access_log
Please do the needful.
Thanks in advance.
I'm not sure I understand correctly, but why not use a spooling directory source instead?
If you need a more complex approach, like tailing the latest file in a directory, there is no easy way to do it without developing a custom source.
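For reference, a minimal spooling-directory sketch might look like the following (the spool directory path is an assumption, not from the original setup; agent and channel names reuse those from the question's config):

```properties
# Hypothetical sketch: watch a local directory for completed files
agent1.sources = spoolSource
agent1.sources.spoolSource.type = spooldir
agent1.sources.spoolSource.spoolDir = /var/log/incoming
agent1.sources.spoolSource.channels = memChannel01
# Optionally keep the original filename in an event header for later routing
agent1.sources.spoolSource.fileHeader = true
agent1.sources.spoolSource.fileHeaderKey = file
```

Note that the spooling directory source expects each file to be complete and immutable once it lands in the directory; it does not tail a file that is still growing.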
My requirement is that Flume reads data from a single file and appends it to a single file in HDFS, date-wise. The Spooling Directory source is for the case where the source has many different files; in my case I have a single source file (named with the folder name plus the date), and I need to read from that one file and append the new records to a single file in HDFS.
You can have the command itself locate the particular file; instead of a plain "tail" command, use something like:
for i in `find <your_folder> -xtype f`; do tail -n 10 $i; break; done
You can put a regexp here in order to find the exact file you need.
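Since the filename here encodes the date, one option is to construct today's filename directly rather than searching for it. A sketch, assuming the landing directory is /data/215 and the date part is formatted MMDDYYYY (both assumptions; substitute your real path):

```shell
# Hypothetical: derive today's filename from folder name "215" plus MMDDYYYY
DIR=/data/215                 # assumed landing directory, not from the question
TODAY=$(date +%m%d%Y)         # e.g. 11182016 for Nov 18, 2016
FILE="$DIR/215$TODAY"
echo "$FILE"
```

In the Flume config this could then be used as the exec source command, e.g. tail -F /data/215/215$(date +%m%d%Y), though note the exec source does not re-evaluate the command at midnight, so the agent would need a restart (or a wrapper script) when the day rolls over.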
As for writing everything to a single file, I'm not sure that is possible. You can route an input file to a particular output folder in HDFS, but not to the same file (at least you cannot count on that, even if it may appear to work with a very large rollInterval).
Routing can be done with custom headers/channel selectors and/or interceptors.
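As a sketch of that routing idea (reusing the agent, source, and sink names from the question's config), a timestamp interceptor combined with date escape sequences in the sink path would at least group each day's events into one HDFS directory, even if not one file:

```properties
# Hypothetical sketch: stamp each event, then bucket by date in HDFS
agent1.sources.tailAccessSource.interceptors = ts
agent1.sources.tailAccessSource.interceptors.ts.type = timestamp

agent1.sinks.hdfsSink.type = hdfs
agent1.sinks.hdfsSink.hdfs.path = hdfs://namenode/flume/data-example.1/%Y%m%d
agent1.sinks.hdfsSink.hdfs.fileType = DataStream
# Disable count/size rolls; a large rollInterval yields fewer, larger
# files per day, but still not a guaranteed single file
agent1.sinks.hdfsSink.hdfs.rollCount = 0
agent1.sinks.hdfsSink.hdfs.rollSize = 0
agent1.sinks.hdfsSink.hdfs.rollInterval = 86400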