Support Questions


What is the best way to get FTP files into HDFS continuously?

Rising Star

I want to get FTP files into HDFS. On the FTP server, files are created in a date directory for each day. I need to automate this job. What would be the best way to do this?

1 ACCEPTED SOLUTION


Hi @Ravikiran Dasari

If it is for knowledge purposes, then what I'm going to give has no more information than the previous answers. But if you are looking for something related to work, then this answer might help a bit.

Have a file watcher which looks for a file with the particular pattern that has to be ftp'ed to the desired location. Once the file arrives, you can move it to HDFS. This can be accomplished by a simple shell script which requires only basic shell knowledge. It can also be accomplished with either a push or a pull approach. If you have any downstream jobs which have to be executed once the file arrives in HDFS, then I would recommend the pull approach, so that you can execute any other Hadoop/Hive/Pig/Spark jobs on the HDFS side.
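For illustration, a minimal sketch of such a watcher as a shell script (the directory names, the *.csv pattern and the polling interval are assumptions, not something from this thread; the FTP pull itself is left out):

#!/bin/bash
# Hypothetical names -- adjust to your environment.
LANDING_DIR=/data/ftp_landing    # local directory the FTP'ed files land in
PATTERN="*.csv"                  # file pattern to watch for
HDFS_DEST=/destination           # target HDFS directory

# Poll for new files matching the pattern and push each one to HDFS.
while true; do
    for f in "${LANDING_DIR}"/${PATTERN}; do
        [ -e "$f" ] || continue                      # nothing matched the pattern yet
        if hadoop fs -put -f "$f" "${HDFS_DEST}/"; then
            rm -f "$f"                               # drop the local copy after a successful put
        fi
    done
    sleep 60                                         # check once a minute
done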

Hope it helps!!


10 REPLIES

Rising Star
@Ravikiran Dasari

Have you tried looking at NiFi and its capabilities? NiFi provides a lot of processors which can help you automate your tasks and create a flow for performing them. You can create a flow to pick up data from a source and dump it in a different location. You can check the following example, written by one of our NiFi experts @Matt Clarke, on how you can use NiFi to pull data from an FTP server:

How-to: Retrieve files from a SFTP server using NiFi (GetSFTP vs. ListSFTP)

Along with the processors mentioned in the article above, you can use the PutHDFS processor, explained in the docs below, to dump the data into HDFS.

PutHDFS - NiFi docs

Hope this helps

Super Collaborator

Posted my answer in parallel without noticing yours, sorry for the redundant info.

Super Collaborator

Are you looking for the scheduling, or for how to script it? Should the files be copied to HDFS as soon as they arrive, or at a specific frequency, i.e. daily, hourly, etc.?

The best way depends on the tools and knowledge you have. It could be done with a plain shell script, but also with NiFi. Spark also has an FTP connector.

Here is a post on how to solve it with nifi: https://community.hortonworks.com/questions/70261/how-to-read-data-from-a-file-from-remote-ftp-serve...

Rising Star

Hi @Harald Berghoff, thanks for the solution. NiFi is not available in my cluster, so I have to do it using shell only; otherwise I could go with Flume. In the case of shell, how do I do that? I am currently getting the files through manual interaction, but I want to automate this. My manual process is generally as follows:

Step 1:

sftp ayosftpuser@IPaddress

password

Step 2:

cd /sourcedir

Step 3: in the above directory, a new directory is created every day, and some files are dropped into it.

get -Pr 2018-02-26

bye

Step 4:

hadoop fs -put -f 2018-02-26 /destination

I need to automate this


Super Collaborator

@Ravikiran Dasari: do you have any experience with shell scripts? Do you know how to write a bash script, or maybe a Python script?
Do you have a specific scheduler that you use in your environment, or will you use cron? If you don't know, I guess it will be cron. Check whether you are able to edit the crontab by entering this command on the shell:

crontab -e

Either you get a list of cron jobs or an error message like 'You (<<userid>>) are not allowed to use this program (crontab)'

Now, when you want to write a shell script, the starting point is a simple text file containing the commands you would otherwise enter on the shell. The script file should start with a line, aka the 'shebang', specifying the script interpreter to be used, e.g. on RedHat:
  • bash: #!/bin/bash
  • python: #!/usr/bin/python
  • perl: #!/usr/bin/perl

If you decide to go for a bash script, just create a file like this (you can use a different editor if you like):

vi ~/mycopyscript

Enter all your commands in that script:

#!/bin/bash

dir=$(date +%Y-%m-%d)
sftp ayosftpuser@IPaddress << __MY_FTP_COMMANDS__
password
cd /sourcedir
get -Pr ${dir}
bye
__MY_FTP_COMMANDS__

# at this point the files should already be copied locally
hadoop fs -put -f ${dir} /destination

Save the script (by entering <ESC>:wq in vi). Next, make the script executable and only allow access by the owner (you):

chmod 700 ~/mycopyscript

You should be able to execute it now:

~/mycopyscript

This script is just a starting point and kept deliberately simple: there is no error handling, no security (whoever reads the script also has access to the password), and no parameters (you must execute it on the date that the directory name uses).

Still it should provide you with the basic idea of a shell script.
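For illustration, a minimal sketch of how the date could be passed in as a parameter instead of always using today's date (only the top of the script changes; the sftp and hadoop fs -put steps stay as above):

#!/bin/bash
# Take the date directory as an optional first argument; default to today.
# Usage:  ~/mycopyscript                (uses today's date)
#         ~/mycopyscript 2018-02-26     (uses the given date)
dir=${1:-$(date +%Y-%m-%d)}
echo "Transferring directory: ${dir}"
# ... sftp and hadoop fs -put steps as in the script above ...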

Rising Star

Hi @Harald Berghoff,

I'm using crontab for scheduling the jobs. I tried your way as well, but it's prompting for the password. How can I supply the password from a separate script, and add error handling? If you don't mind, could I have a more capable script that handles errors and security?

Super Collaborator

OK, the error handling can be implemented in this way:

...
__MY_FTP_COMMANDS__

ret_ftp=$?
if [ ${ret_ftp} -eq 0 ]
then
    # if you have a logging facility you probably want to use it to log the status
    echo "Files successfully transferred"
else
    echo "Error in file transfer"
    exit ${ret_ftp}
fi

# at this point the files should already be copied locally
hadoop fs -put -f ${dir} /destination
ret_hdfs=$?
# put a similar handling here for the hdfs return code

For the password: SFTP, like ssh, is a little tricky, so to get rid of the password prompt I would recommend exchanging SSH keys.
If this is working, you can add the scheduled execution of the script to your crontab.
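A minimal sketch of that setup, assuming the same ayosftpuser@IPaddress placeholder as above and that the script lives at ~/mycopyscript (the schedule and log file name are just examples):

# One-time key exchange: generate a key pair (leave the passphrase empty for unattended runs)
ssh-keygen -t rsa -b 4096

# Copy the public key to the SFTP server so sftp no longer prompts for a password
ssh-copy-id ayosftpuser@IPaddress

# Then add an entry via 'crontab -e', e.g. run the copy script every day at 01:00
# and append its output to a log file:
0 1 * * * $HOME/mycopyscript >> $HOME/mycopyscript.log 2>&1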


Hi @Ravikiran Dasari

If it is for knowledge purposes, then what I'm going to give has no more information than the previous answers. But if you are looking for something related to work, then this answer might help a bit.

Have a file watcher which looks for a file with the particular pattern that has to be ftp'ed to the desired location. Once the file arrives, you can move it to HDFS. This can be accomplished by a simple shell script which requires only basic shell knowledge. It can also be accomplished with either a push or a pull approach. If you have any downstream jobs which have to be executed once the file arrives in HDFS, then I would recommend the pull approach, so that you can execute any other Hadoop/Hive/Pig/Spark jobs on the HDFS side.

Hope it helps!!

Rising Star

Hi @Bala Vignesh N V

Thanks for the solution.

I need to implement this; in the case of shell, how do I do that? I am currently getting the files through manual interaction, but I want to automate this. My manual process is generally as follows:

Step 1:

sftp ayosftpuser@IPaddress

password

Step 2:

cd /sourcedir

Step 3: in the above directory, a new directory is created every day, and some files are dropped into it.

get -Pr 2018-02-26

bye

Step 4:

hadoop fs -put -f 2018-02-26 /destination

I need to automate this