Cannot find a saved DataFrame on disk

Contributor

I want to save DataFrame on disk:

df.write.format("parquet").save("/home/centos/test/df.parquet")

I get the following error, which says that the user "centos" does not have WRITE permission:

18/05/07 09:18:08 ERROR ApplicationMaster: User class threw exception: org.apache.hadoop.security.AccessControlException: Permission denied: user=centos, access=WRITE, inode="/home/centos/test/df.parquet/_temporary/0":hdfs:hdfs:drwxr-xr-x
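For intuition about the error: HDFS applies a POSIX-style owner/group/other check against the inode shown in the message (owner hdfs, group hdfs, mode drwxr-xr-x). This is an illustrative sketch of that check, not actual HDFS code:

```python
# Illustrative sketch (not HDFS source) of the POSIX-style permission
# check behind the AccessControlException above.
def can_write(user, groups, owner, group, mode):
    """mode is a string like 'drwxr-xr-x'."""
    if user == owner:
        bits = mode[1:4]   # owner bits: rwx
    elif group in groups:
        bits = mode[4:7]   # group bits: r-x
    else:
        bits = mode[7:10]  # other bits: r-x
    return "w" in bits

# centos is neither the owner (hdfs) nor in the hdfs group, so the
# "other" bits r-x apply and WRITE is denied.
can_write("centos", {"centos"}, "hdfs", "hdfs", "drwxr-xr-x")  # → False
```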

This is how I run the spark-submit command:

spark-submit  --master yarn  --deploy-mode cluster  --driver-memory 6g  --executor-cores 2  --num-executors 2  --executor-memory 4g  --class org.test.MyProcessor  mytest.jar
1 ACCEPTED SOLUTION

Master Guru

@Liana Napalkova

The .save action in Spark writes the data to HDFS, but you changed the permissions on the local file system.

Please change the ownership of the /home/centos directory in HDFS.

Log in as the hdfs user and run:

hdfs dfs -chown -R centos /home/centos/*


8 REPLIES

Master Mentor

@Liana Napalkova

You are trying to save to the local filesystem /home/centos/---/---/, and from the error stack above the owner and group are hdfs:hdfs. The user centos doesn't have the correct permissions and ownership for this directory. This has nothing to do with your earlier HDFS directory where you set the correct permissions.

Please do the following while logged on to the Linux CLI as centos:

centos@{host}$ id

This will give you the group that centos belongs to, for use in the change-ownership syntax. Then, as the root user or a sudoer, where xxxxx is that group:

# chown -R centos:xxxxx  /home/centos/---/---/

Hope that helps

Contributor

The output of "id":

uid=1000(centos) gid=1000(centos) groups=1000(centos),4(adm),10(wheel),190(systemd-journal)

I executed "chown -R centos:centos /home/centos/test", but I still get the same error:

18/05/07 12:06:28 ERROR ApplicationMaster: User class threw exception: org.apache.hadoop.security.AccessControlException: Permission denied: user=centos, access=WRITE, inode="/home/centos/test/df.parquet/_temporary/0":hdfs:hdfs:drwxr-xr-x

This is the output of "ls -la" executed in "/home/centos":

total 36236
drwx------.  4 centos centos     4096 May  7 12:34 .
drwxr-xr-x. 15 root   root       4096 Apr 16 18:41 ..
-rw-------.  1 centos centos    13781 May  7 11:26 .bash_history
-rw-r--r--.  1 centos centos       18 Mar  5  2015 .bash_logout
-rw-r--r--.  1 centos centos      193 Mar  5  2015 .bash_profile
-rw-r--r--.  1 centos centos      231 Mar  5  2015 .bashrc
-rw-rw-r--   1 centos centos       47 May  7 11:38 .scala_history
drwx------.  2 centos centos       46 May  2 07:57 .ssh
drwxrwxr-x   4 centos centos      144 May  7 11:42 test

Contributor

Maybe the problem is that I run the Spark program in YARN cluster mode? That means the driver can run on any machine in the cluster. So should I run "chown -R centos:centos ..." on each machine, or do ".coalesce(1)"?
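For what it's worth, it is the path's scheme, not the node running the driver, that decides where .save() writes: a path without a scheme (no hdfs:// or file:// prefix) is qualified against fs.defaultFS, which on a cluster is HDFS. A rough sketch of that qualification, with a made-up namenode address:

```python
def qualify(path, default_fs="hdfs://namenode:8020"):
    """Rough sketch of Hadoop path qualification: a scheme-less path is
    resolved against fs.defaultFS, regardless of which node runs the
    driver. (default_fs here is a made-up address, not this cluster's.)"""
    if "://" in path:
        return path  # already qualified: hdfs://, file://, s3a://, ...
    return default_fs + path

qualify("/home/centos/test/df.parquet")
# → "hdfs://namenode:8020/home/centos/test/df.parquet" (HDFS, not local disk)
qualify("file:///home/centos/test/df.parquet")  # unchanged: explicitly local
```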

Master Guru

@Liana Napalkova

The .save action in Spark writes the data to HDFS, but you changed the permissions on the local file system.

Please change the ownership of the /home/centos directory in HDFS.

Log in as the hdfs user and run:

hdfs dfs -chown -R centos /home/centos/*

Contributor

I think that this is the reason. If I log in as the hdfs user and run "hdfs dfs -chown -R centos /home/centos/test", it says that this directory does not exist. I created the directory as the hdfs user and then changed its ownership to centos. Should I write the parquet file to the full path?:

df.coalesce(1).write.format("parquet").save("hdfs://eureambarimaster1.local.eurecat.org:8020/user/hdfs/test")

Master Guru
@Liana Napalkova
Use write.mode to specify overwrite or append, so that Spark will write the file into the test directory:
df.coalesce(1).write.mode("overwrite").format("parquet").save("/user/hdfs/test")

If you don't specify a mode, Spark will fail with a "directory already exists" error, because you have already created the test directory.
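If it helps, the save modes behave roughly like this toy model (plain Python on the local filesystem, not Spark itself; the part-file naming is just for illustration):

```python
import os
import shutil

def toy_save(path, data, mode="errorifexists"):
    """Toy model of DataFrameWriter save-mode semantics; not Spark."""
    exists = os.path.exists(path)
    if mode == "errorifexists" and exists:
        raise FileExistsError(f"path {path} already exists")  # the default
    if mode == "ignore" and exists:
        return                        # silently skip the write
    if mode == "overwrite" and exists:
        shutil.rmtree(path)           # replace the existing output
    os.makedirs(path, exist_ok=True)
    # "append" simply lands another part file next to the existing ones
    part = os.path.join(path, f"part-{len(os.listdir(path)):05d}")
    with open(part, "w") as f:
        f.write(data)
```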

Contributor

Yes, sure. Sorry, I was actually referring to "hdfs://eureambarimaster1.local.eurecat.org:8020/user/hdfs/test/df.parquet"

Let me test it.

Contributor

I have just tested it. It worked fine! Thank you!