I have logs that are being created continuously on a remote system connected over the network. I need to load these logs into my HDFS using Flume. What source configuration do I need so that Flume can pull the real-time logs into HDFS?
I tried this link, but I am stuck in the middle. How do I proceed further?
I have not used this source, but what error are you seeing in the Flume logs? There is an example Flume conf here
My source is a text file that receives streaming data every millisecond, and I need to transfer that data from the remote machine to my HDFS using Flume. I have the username and password, but I don't know the exact configuration required to transfer the data to HDFS with Flume.
It shows that Flume has started, but after that it does not proceed further, and I could not locate the logs either.
Three options here:
1) You can mount any remote FTP/share to a local Linux folder, e.g. -
and then use the Flume exec source with a tail command, or the spooling directory source.
2) Install two Flume agents:
- on the remote host, with an exec/spool source and an Avro sink
- on the HDFS/Hadoop host, with an Avro source and an HDFS sink
3) Write a custom source to serve your needs for any custom protocol or non-standard requirements.
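Option 2 could be sketched as two minimal agent configs along these lines. This is only an illustration; the hostnames, port, log path, and HDFS path are assumptions, and the memory channels would need capacity tuning for real traffic:

```properties
# --- Agent on the remote host: tail the log, forward over Avro ---
remote.sources = tail1
remote.channels = c1
remote.sinks = avro1

remote.sources.tail1.type = exec
# tail -F survives log rotation; path is a placeholder
remote.sources.tail1.command = tail -F /var/log/app/app.log
remote.sources.tail1.channels = c1

remote.channels.c1.type = memory

remote.sinks.avro1.type = avro
remote.sinks.avro1.hostname = hadoop-host.example.com
remote.sinks.avro1.port = 4141
remote.sinks.avro1.channel = c1

# --- Agent on the Hadoop host: receive Avro, write to HDFS ---
hdfsagent.sources = avro1
hdfsagent.channels = c1
hdfsagent.sinks = hdfs1

hdfsagent.sources.avro1.type = avro
hdfsagent.sources.avro1.bind = 0.0.0.0
hdfsagent.sources.avro1.port = 4141
hdfsagent.sources.avro1.channels = c1

hdfsagent.channels.c1.type = memory

hdfsagent.sinks.hdfs1.type = hdfs
hdfsagent.sinks.hdfs1.hdfs.path = /flume/logs/%Y-%m-%d
# DataStream writes plain text instead of the default SequenceFile
hdfsagent.sinks.hdfs1.hdfs.fileType = DataStream
# Needed so the %Y-%m-%d escapes resolve without a timestamp interceptor
hdfsagent.sinks.hdfs1.hdfs.useLocalTimeStamp = true
hdfsagent.sinks.hdfs1.channel = c1
```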
Hi @Avijeet Dash, unfortunately I have no experience with rsyslog, but if I understand correctly it is fully compatible with syslog. Flume has some integration with it - https://flume.apache.org/FlumeUserGuide.html#syslog-sources
As for network sockets, I'd say there are two options - use the exec source with some bash command like here -
or write a custom source. The benefit of a custom source here is more control over the process (i.e. you can choose between poll and stream styles of source...)
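As a rough illustration of the exec-source-with-a-bash-command idea, an agent could read a TCP socket through netcat. The host and port here are placeholders, and note that the exec source gives no delivery guarantee if the command dies:

```properties
# Hypothetical agent streaming a remote TCP socket via nc
a1.sources = sock1
a1.channels = c1
a1.sources.sock1.type = exec
# nc connects to the remote host and emits whatever arrives on the socket
a1.sources.sock1.command = nc remote-host.example.com 5140
a1.sources.sock1.channels = c1
a1.channels.c1.type = memory
```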
Now I am trying to connect to the main server abroad over FTP to access the real-time log data from a particular folder.
When I tried this link, I ran into issues with the JUnit tests.
This is the Flume configuration to pull the data:
### www.keedio.com
# Example file: protocol is ftp, process by lines, and sink to file_roll
# for testing purposes.

## Sources definition for agent "agent"
# ACTIVE LIST
agent.sources = ftp1
agent.sinks = k1
agent.channels = ch1

##### SOURCE IS ftp server
# Type of source for ftp sources
agent.sources.ftp1.type = org.keedio.flume.source.ftp.source.Source
agent.sources.ftp1.client.source = ftp

# Connection properties for ftp server
agent.sources.ftp1.name.server = 192.168.2.3
agent.sources.ftp1.port = 21
agent.sources.ftp1.user = admin
agent.sources.ftp1.password = admin321
agent.sources.ftp1.folder = D:\data\<files>
agent.sources.ftp1.file.name = filename

# Discover delay: the directory will be explored every configured milliseconds
agent.sources.ftp1.run.discover.delay = 5000

# Process by lines
agent.sources.ftp1.flushlines = true

agent.sinks.k1.type = file_roll
agent.sinks.k1.sink.directory = /streamingdata/
agent.sinks.k1.sink.rollInterval = 7200

agent.channels.ch1.type = memory
agent.channels.ch1.capacity = 10000
agent.channels.ch1.transactionCapacity = 1000

agent.sources.ftp1.channels = ch1
agent.sinks.k1.channel = ch1
Please check whether this configuration is good enough.
Here the source is a Windows server and the sink is HDFS on Linux.
Please do the needful.
@Magesh Kumar, it is hard to say what's happening without the Flume logs.
However, two comments about your config:
- you're using the file_roll sink, not HDFS
- from what I understand, that source consumes the root folder of the FTP server; the .folder and .name parameters have another purpose.
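To actually land the data in HDFS, the file_roll sink section could be swapped for an HDFS sink along these lines (the namenode address and path are assumptions; the rest are standard Flume HDFS sink properties):

```properties
agent.sinks.k1.type = hdfs
agent.sinks.k1.hdfs.path = hdfs://namenode:8020/streamingdata/%Y-%m-%d
# Write plain text rather than the default SequenceFile
agent.sinks.k1.hdfs.fileType = DataStream
agent.sinks.k1.hdfs.writeFormat = Text
# Roll files every 2 hours, matching the original rollInterval
agent.sinks.k1.hdfs.rollInterval = 7200
# Resolve the %Y-%m-%d escapes without a timestamp interceptor
agent.sinks.k1.hdfs.useLocalTimeStamp = true
agent.sinks.k1.channel = ch1
```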