Created 02-20-2018 11:29 AM
I want to get FTP files into HDFS. On the FTP server, files are created in a dated directory for each day. I need to automate this job. What would be the best way of doing this?
Created 02-20-2018 12:33 PM
If it is just for knowledge purposes, then what I'm going to give has no more information than the previous answers. But if you are looking for something work related, then this answer might help a bit.
Have a file watcher which looks for a file with a particular pattern, which has to be FTP'd to the desired location. Once the file arrives you can move it to the HDFS cluster. This can be accomplished with a simple shell script that requires only basic shell knowledge and nothing more. It can also be done as either push or pull. If you have any other downstream jobs that have to be executed once the file arrives in HDFS, then I would recommend going with the pull approach, so that you can trigger any other Hadoop/Hive/Pig/Spark jobs from the HDFS side. A minimal polling sketch is shown below.
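For illustration only, a minimal polling sketch; the directory names and file pattern below are placeholders I made up, not anything from your environment:

#!/bin/bash
# Poll a local landing directory for files matching a pattern and push them to HDFS.
# SRC_DIR, PATTERN and HDFS_TARGET are assumed placeholders - adjust to your setup.
SRC_DIR=/data/landing
PATTERN="*.csv"
HDFS_TARGET=/destination

for f in "${SRC_DIR}"/${PATTERN}; do
  [ -e "$f" ] || continue                      # nothing matched the pattern yet
  hadoop fs -put -f "$f" "${HDFS_TARGET}/" \
    && mv "$f" "${SRC_DIR}/processed/"         # move aside so the file is not picked up twice (create this directory up front)
done

You would schedule a script like this via cron (or whatever scheduler you use) at the frequency you need.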
Hope it helps!!
Created 02-20-2018 11:50 AM
Have you tried looking at NiFi and its capabilities? NiFi provides a lot of processors which can help you automate your tasks and create a flow for performing them. You can create a flow to pick up data from a source and dump it in a different location. You can check the following example, written by one of our NiFi experts @Matt Clarke, on how you can use NiFi to pull data from an FTP server:
How-to: Retrieve files from a SFTP server using NiFi (GetSFTP vs. ListSFTP)
Along with the processors mentioned in the article above, you can use the PutHDFS processor, explained in the docs below, to dump the data in HDFS.
Hope this helps
Created 02-20-2018 11:53 AM
Posted my answer in parallel, without noticing yours. Sorry for the redundant info.
Created 02-20-2018 11:51 AM
Are you looking for the scheduling or for how to script it? Should the files be copied to HDFS as soon as they arrive, or at a specific frequency, i.e. daily, hourly, etc.?
The best way depends on the tools and knowledge you have. It could be done with a plain shell script, but also with NiFi. Spark also has an FTP connector.
Here is a post on how to solve it with nifi: https://community.hortonworks.com/questions/70261/how-to-read-data-from-a-file-from-remote-ftp-serve...
Created 02-27-2018 07:05 AM
Hi @Harald Berghoff, thanks for the solution. NiFi is not available in my cluster, so I have to do it using shell only; otherwise I could go for Flume. In the case of shell, how do I do that? With manual interaction I am getting the files, but I want to automate this. Generally my manual process is as follows:
step1: connect to the SFTP server and enter the password when prompted
step2:
cd /sourcedir
step3: in the above directory, one new directory is created every day, and some files are dropped into it. I download it with:
get -Pr 2018-02-26
bye
step4:
hadoop fs -put -f 2018-02-26 /destination
I need to automate this
Created 02-27-2018 08:51 AM
@Ravikiran Dasari: do you have any experience with shell scripts? Do you know how to write a bash script, or maybe a Python script?
Do you have a special scheduler that you use in your environment, or will you use cron? If you don't know, I guess it will be cron. Check whether you are able to edit the crontab by entering this command on the shell:
crontab -e
Either your crontab opens for editing, or you get an error message like 'You (<<userid>>) are not allowed to use this program (crontab)'.
Now, when you want to write a shell script, the starting point is a simple text file containing the commands you would otherwise enter on the shell. The script file should start with a line, the so-called 'shebang', naming the script interpreter to be used, e.g. /bin/bash. If you decide to go for a bash script, just create a file like this (you can use a different editor if you like):
vi ~/mycopyscript
Enter all your commands in that script:
#!/bin/bash
dir=$(date +%Y-%m-%d)
sftp ayosftpuser@IPaddress << __MY_FTP_COMMANDS__
password
cd /sourcedir
get -Pr ${dir}
bye
__MY_FTP_COMMANDS__
# at this point the files should already be copied locally
hadoop fs -put -f ${dir} /destination
Save the script (by entering <ESC>:wq in vi). Next, make the script executable and allow access only to the owner (you):
chmod 700 ~/mycopyscript
You should be able to execute it now:
~/mycopyscript
This script is just a starting point, kept plain and simple: there is no error handling, no security (whoever can read the script also has access to the password), and no parameters (you must execute it on the same date that the directory name uses).
Still, it should give you the basic idea of a shell script.
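As a small optional refinement (not part of the script above, just a sketch): you could accept the date as an optional parameter and fall back to today's date, so the script can also be re-run for a past day:

#!/bin/bash
# Use the first argument as the directory date if given, otherwise today's date.
dir=${1:-$(date +%Y-%m-%d)}
echo "Copying directory ${dir}"

Then '~/mycopyscript 2018-02-26' would copy that specific day, while '~/mycopyscript' copies today's directory.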
Created 02-27-2018 11:44 AM
Created 02-27-2018 12:52 PM
OK, the error handling can be implemented in this way:
...
__MY_FTP_COMMANDS__
ret_ftp=$?
if [ ${ret_ftp} -eq 0 ]
then
    # if you have a logging facility you probably want to use it to log the status
    echo "Files successfully transferred"
else
    echo "Error in file transfer"
    exit ${ret_ftp}
fi
# at this point the files should already be copied locally
hadoop fs -put -f ${dir} /destination
ret_hdfs=$?
# put a similar handling here
For the password, SFTP is, like SSH, a little tricky, so to get rid of the password prompt I would recommend exchanging SSH keys.
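A typical sequence looks like this (user and host are the placeholders from the script above; adjust to your environment):

# generate a key pair for the local user, if you don't have one yet
ssh-keygen -t rsa
# copy the public key to the SFTP server so key-based login works
ssh-copy-id ayosftpuser@IPaddress

After that, 'sftp ayosftpuser@IPaddress' should no longer prompt for a password, and the password line can be removed from the heredoc.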
Once this is working, you can add the scheduled execution of the script to your crontab.
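For example, an entry like the following (the time and the path are only assumptions, pick whatever fits your file arrival schedule) would run the script every day at 01:30:

# m  h  dom mon dow  command
30   1  *   *   *    /home/youruser/mycopyscript >> /home/youruser/mycopyscript.log 2>&1

Note that cron does not expand '~', so use the absolute path to the script.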
Created 02-27-2018 07:48 AM
Thanks for the solution. I need to implement this; in the case of shell, how do I do that? With manual interaction I am getting the files, but I want to automate this. Generally my manual process is as follows:
step1: