I have a requirement to transfer logs from an external server to an HDFS cluster. The external server requires a username and password to access the log file path.
Please advise on the following:
1) Do we need to install Flume on the source (i.e., the external server)?
2) How can we pass the credentials in the conf file to get the data from the external server?
Any help will be greatly appreciated!
Thanks. But this link does not detail which source properties we have to use to pass the username credentials. Can you please elaborate on how we can do this?
There is no built-in support for such a feature.
I'd recommend mapping the remote directory to the server running the Flume agent, using something like a Samba share (or a Windows network drive) with your credentials.
If that's not possible and you're using some custom protocol to access the files, then you have to write a custom source to support it.
Here is an example of an FTP source with credentials support: https://github.com/keedio/flume-ftp-source
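With the mount-based approach, the Flume agent on the Hadoop side can simply watch the mounted directory with a spooling-directory source. A minimal sketch, assuming the share has already been mounted; the agent name, mount command, paths, and credentials below are all hypothetical placeholders:

```properties
# Sketch only: mount the remote share first, e.g. (run as root, placeholders):
#   mount -t cifs //external-server/logs /mnt/external-logs \
#     -o username=loguser,password=secret,ro
# Then point a spooling-directory source at the mount point:
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /mnt/external-logs
a1.sources.r1.channels = c1

a1.channels.c1.type = memory

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /data/logs
a1.sinks.k1.channel = c1
```

Note that the spooling-directory source expects files to be complete and immutable once they appear in the directory, so logs that are still being written to should be rotated into it rather than streamed.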
Could you explain the purpose/advantage of using flume-ftp over ftp. Is it not possible to perform a sftp from remote server to hdfs? Thank you.
I mentioned the FTP source just as an example of a custom protocol implementation, for the case when mounting as a Linux folder is not possible.
It's just a matter of environment configuration. Shared folders/custom mount points in Linux are usually managed by admins, and in this case that adds one more thing to keep an eye on: those folders must be correctly mapped/mounted before you start running Flume.
If you can mount the SFTP location on your Hadoop node as a local folder, go with it. There will be no difference in terms of the Flume process.
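As a concrete illustration of mounting SFTP as a local folder, something like sshfs could be used. This is only a sketch: the host, user, and paths are hypothetical, and since the actual mount needs sshfs installed plus network access, the snippet only composes and prints the command:

```shell
# Hypothetical remote log path and local mount point:
REMOTE="loguser@external-server:/var/log/app"
MOUNT_POINT="/mnt/external-logs"

# Compose the sshfs invocation (read-only, auto-reconnect):
CMD="sshfs $REMOTE $MOUNT_POINT -o reconnect,ro"
echo "$CMD"
```

Once mounted, Flume treats `$MOUNT_POINT` like any other local directory.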
Hi @Michael M
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
What do capacity and transactionCapacity mean?
Hi @Amit Dass
As per the Flume docs:

| Property | Description |
|---|---|
| capacity | The maximum number of events stored in the channel |
| transactionCapacity | The maximum number of events the channel will take from a source or give to a sink per transaction |
In your case this means your channel can store up to 1000 events. The source will send events to the channel in batches of up to 100 events per transaction. Likewise, the sink will consume up to 100 events per batch/transaction.
Transaction here means the same as everywhere: if something goes wrong, the whole transaction is rolled back and all 100 events are returned to the channel.
If your sink can't drain the events for some period of time, your channel will overflow and Flume will throw an error.
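Putting those numbers together: a consistent setup keeps the source and sink batch sizes at or below transactionCapacity, which in turn stays at or below capacity. A sketch of such a config; the agent and component names are hypothetical:

```properties
# Channel can buffer up to 1000 events in memory;
# each put/take transaction moves at most 100 of them:
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Batch sizes should not exceed transactionCapacity,
# otherwise a single batch cannot fit in one transaction:
a1.sources.r1.batchSize = 100
a1.sinks.k1.hdfs.batchSize = 100
```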
What do you mean by "data truncated"?
I'd say it's not a problem with the channel configuration, but we can check if you provide more details.
Hi @Michael M
Thank you for such a clear explanation.
When you say "event", does it mean the number of records fetched from the source? For example, if we have a file with 5 records, can we say capacity = 5 and transactionCapacity = 5?