I want to know how to do an SFTP transfer to HDFS in Spark 1.6. The workflow loads data mainly in CSV format, mid-size files from a few GB up to maybe 50 GB. Is this recommended to do in Spark, or is it better done from a script? I found the library https://github.com/springml/spark-sftp and wanted to know if this is a recommended way of doing things. One of my problems with this library is how I would handle touch files when I need to read data from a specific date to a specific date. Thanks
I am using Spark 1.6 and Scala with Cloudera Manager version around 5.7.2, I think. It is routinely upgraded, so it might be around 5.9.
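For context, the usage the spark-sftp README describes looks roughly like the sketch below. The host, credentials, and paths are placeholders, not my real values; as I understand it, the library fetches the remote file to local disk first and then reads it as a DataFrame, so the SFTP transfer itself is not distributed.

```scala
// Sketch only: host, credentials, and paths are placeholders.
// Based on the springml/spark-sftp data source, Spark 1.6 / SQLContext API.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SftpToHdfs {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("sftp-to-hdfs")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Read a CSV file from the SFTP server into a DataFrame.
    val df = sqlContext.read
      .format("com.springml.spark.sftp")
      .option("host", "sftp.example.com") // placeholder
      .option("username", "user")         // placeholder
      .option("password", "secret")       // placeholder
      .option("fileType", "csv")
      .option("inferSchema", "true")
      .load("/remote/path/data.csv")

    // Land the data in HDFS (Parquet here, but any Spark sink would do).
    df.write.parquet("hdfs:///landing/data.parquet")
  }
}
```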
I don't know specifically, but yes, it is most likely because the libraries used were not built for a distributed system. For instance, if you had three executors running the code in the library, then all three would be reading from the SFTP server and directory, all vying for the same files and copying them to the destination. It would be a mess.
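If you go the script route instead, a single-process transfer avoids that contention entirely. A minimal sketch, assuming key-based SSH auth and a Hadoop client on the machine running it (host, user, and all paths are placeholders):

```shell
#!/bin/sh
# Sketch only: host, user, and paths are placeholders.
set -e

REMOTE_DIR=/remote/export
LOCAL_DIR=/tmp/sftp_staging
HDFS_DIR=/landing/csv

mkdir -p "$LOCAL_DIR"

# Pull the files down over SFTP in batch mode (one process, no contention).
sftp -b - user@sftp.example.com <<EOF
get $REMOTE_DIR/*.csv $LOCAL_DIR/
EOF

# Push them into HDFS, then clean up the local staging area.
hdfs dfs -mkdir -p "$HDFS_DIR"
hdfs dfs -put -f "$LOCAL_DIR"/*.csv "$HDFS_DIR"/
rm -f "$LOCAL_DIR"/*.csv
```

Once the files are in HDFS you can read them from Spark as usual, and the date-range selection can be done by filtering which remote file names you `get` in the script rather than inside Spark.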