Support Questions


sftp transfer to hdfs in spark as opposed to using a command in a script

Contributor

I wanted to know how to do an SFTP transfer to HDFS in Spark 1.6. The workflow loads data mainly in CSV format, in mid-size files from a few gigs up to maybe 50 gigs. Is this recommended to do in Spark, or is it better done from a script? I found a library, https://github.com/springml/spark-sftp, and wanted to know if this is a recommended way of doing things. One of my problems as well was, using this library, how would I handle touch files when I need to read data from a specific date to a specific date? Thanks

 

I am using Spark 1.6 and Scala, with Cloudera Manager version around 5.7.2, I think. It is routinely upgraded, so it might be around 5.9.
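For context, here is a minimal sketch of what using that library might look like in Spark 1.6 Scala, based on its README. The host, credentials, and paths are placeholders, and the date-range filter assumes a hypothetical `event_date` column as one way to handle a touch-file-style date window after loading:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("sftp-to-hdfs"))
val sqlContext = new SQLContext(sc)

// Load a CSV file from an SFTP server into a DataFrame (per the spark-sftp README).
val df = sqlContext.read
  .format("com.springml.spark.sftp")
  .option("host", "sftp.example.com")   // placeholder host
  .option("username", "user")           // placeholder credentials
  .option("password", "****")
  .option("fileType", "csv")
  .option("inferSchema", "true")
  .load("/remote/data/file.csv")

// Hypothetical date-range filter, assuming the data has an event_date column.
val windowed = df.filter(df("event_date").between("2017-01-01", "2017-01-31"))

// Write the result to HDFS.
windowed.write.parquet("hdfs:///data/landing/file_2017_01")
```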

3 REPLIES

Champion
Disclosure: Never done this.

I read the README of that project. If that is what you want to do, it would be a way to do it. The note at the bottom spells out the restriction, though, and it matches what I was thinking: a SparkContext is created and used, but the SFTP transfer itself is not executed as a distributed Spark job. This means that while it runs in a driver, it effectively works only in local mode and will only run on the node you launch it from. For me this removes any benefit of using Spark for this piece of the workflow. It would be better to use Flume or some other ingestion tool.

But yes, you could use this project or write your own Java or Scala app to read from SFTP and write to HDFS.
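A rolled-your-own version might look like the sketch below, using JSch (the same SSH library that project uses) together with the Hadoop FileSystem API to stream a remote file straight into HDFS without landing it locally. Host, credentials, and paths are placeholders:

```scala
import com.jcraft.jsch.{ChannelSftp, JSch}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.IOUtils

// Open an SFTP session with JSch (host and credentials are placeholders).
val jsch = new JSch()
val session = jsch.getSession("user", "sftp.example.com", 22)
session.setPassword("****")
session.setConfig("StrictHostKeyChecking", "no") // acceptable for a sketch, not for production
session.connect()

val channel = session.openChannel("sftp").asInstanceOf[ChannelSftp]
channel.connect()

// Stream the remote file directly into HDFS.
val fs = FileSystem.get(new Configuration())
val in = channel.get("/remote/data/file.csv")
val out = fs.create(new Path("hdfs:///data/landing/file.csv"))
try {
  IOUtils.copyBytes(in, out, 4096, false)
} finally {
  in.close(); out.close()
  channel.disconnect(); session.disconnect()
}
```

Because the copy is a plain stream, this avoids staging the file on a local disk; the tradeoff is that you manage connection setup, retries, and credentials yourself.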

The note from the README: "SFTP files are fetched and written using jsch. It is not executed as spark job. It might have issues in cluster"

Contributor
Thanks for the response. I did not see the part about it not running on a cluster, and I will be using a cluster. I had one more question: why would it not work on a cluster? Does it have something to do with it being distributed, like in general?

Champion

I don't know specifically, but yes, it is most likely because the libraries used were not built for a distributed system. For instance, if you had three executors running the code in the library, then all three would be reading from the same SFTP directory, all vying for the same files and copying them to the destination. It would be a mess.
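One common way around that contention, sketched here under the assumption that you can list the remote files up front, is to assign each file to exactly one task, so no two executors ever fetch the same file:

```scala
// Hypothetical file list obtained from the SFTP server up front
// (e.g. with an SFTP client's directory listing).
val files = Seq("/remote/a.csv", "/remote/b.csv", "/remote/c.csv")

// One file per partition: each task owns distinct files, so there is
// no contention on the SFTP side.
sc.parallelize(files, files.size).foreach { remotePath =>
  // Open an SFTP connection inside the task and copy remotePath to HDFS.
  // (Connection setup with an SFTP client is omitted for brevity.)
}
```

This only sidesteps the file-contention problem; the per-task SFTP connections, credentials, and failure handling are still on you, which is why a purpose-built ingestion tool is often the simpler choice.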