I have been trying to copy some HDFS files to the local file system using the FileSystem API's copyToLocalFile method. But when I run the Spark job in cluster mode, it is unable to write to my local file system, even with an absolute path. It usually fails with a permission-denied error, even though I'm running spark-submit as the same user, and that user obviously has access to their own home folder. I've checked the user with Process("whoami") and it is my user (not yarn or some other user who might lack access to my home folder).
I've tried explicitly setting the working directory to the target folder, but to no avail: it says it cannot create the directory, even though the directory already exists on the local file system. Checking the working directory with Process("pwd") shows that it points to a usercache location specific to the application.
I've alternatively tried giving the absolute path including the mount point, but to no avail. I've also tried executing Process("hdfs dfs -copyToLocal <src> <dest>") as a workaround, and that hasn't worked either.
I have no issues when I run the same commands via spark-shell or with spark-submit in client mode. YARN seems to be affecting the process somehow; am I missing something? What are the alternatives for getting the files onto the local file system via code?
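For reference, this is roughly the pattern I'm using (a minimal sketch; the paths are placeholders, not my actual paths). Note that in cluster mode the driver itself runs inside a YARN container on an arbitrary NodeManager host, so "local" in this code means that container host's disk, not the machine spark-submit was launched from — which would explain the usercache working directory I'm seeing:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object CopyToLocalSketch {
  def main(args: Array[String]): Unit = {
    // Picks up core-site.xml / hdfs-site.xml from the classpath
    val conf = new Configuration()
    val fs   = FileSystem.get(conf)

    // delSrc = false: keep the source file in HDFS.
    // In cluster mode this writes to the local disk of whichever
    // NodeManager hosts the driver container, under that container's
    // permissions — not to the submitting machine's file system.
    fs.copyToLocalFile(
      false,
      new Path("/user/myuser/output/part-00000"), // hypothetical HDFS source
      new Path("file:///home/myuser/output/")     // hypothetical local dest
    )
  }
}
```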
Was there any answer for this?
I ssh to an edge node and then run a shell script to copy the files from HDFS to local. Not sure if there is any other, more correct way of doing it?
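Something along these lines (a sketch only; the paths and host name are hypothetical and would need adjusting for your environment — it assumes the edge node has the Hadoop client configured):

```shell
#!/usr/bin/env bash
# Copy job output from HDFS to the edge node's local file system.
# Run on an edge node with Hadoop client configs in place.
set -euo pipefail

SRC="/user/myuser/output"    # hypothetical HDFS source directory
DEST="/home/myuser/output"   # hypothetical local destination

mkdir -p "$DEST"
# -copyToLocal leaves the HDFS source intact (use -get/-getmerge as needed)
hdfs dfs -copyToLocal "$SRC"/* "$DEST"/
```

Invoked remotely it would look like `ssh myuser@edge-node ./copy_output.sh` (host name hypothetical).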
Hi @prem1301, as this is an older post, you would have a better chance of receiving a resolution by starting a new thread. That would also be an opportunity to provide details specific to your environment, which could help others give you a more accurate answer. You can link this thread as a reference in your new post.