Cannot find a saved DataFrame on disk



Contributor

I want to save a DataFrame to disk:

df.write.format("parquet").save("/home/centos/test/df.parquet")

I get the following error, which says that the user "centos" does not have write permissions:

18/05/07 09:18:08 ERROR ApplicationMaster: User class threw exception: org.apache.hadoop.security.AccessControlException: Permission denied: user=centos, access=WRITE, inode="/home/centos/test/df.parquet/_temporary/0":hdfs:hdfs:drwxr-xr-x

This is how I run the spark-submit command:

spark-submit  --master yarn  --deploy-mode cluster  --driver-memory 6g  --executor-cores 2  --num-executors 2  --executor-memory 4g  --class org.test.MyProcessor  mytest.jar
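One detail worth noticing in the error: the inode `/home/centos/test/df.parquet/_temporary/0` is owned by `hdfs:hdfs`, which is an HDFS path, even though it looks like a local home directory. A small diagnostic sketch for comparing the local and HDFS views of the same path (assumes the hdfs client is on PATH when run on a cluster node; the guard only lets the script degrade gracefully elsewhere):

```shell
# Diagnostic sketch: compare the local and HDFS views of the same path.
TARGET=/home/centos/test

# Local filesystem view
ls -ld "$TARGET" 2>/dev/null || echo "no local directory $TARGET"

# HDFS view (only if an hdfs client is installed on this machine)
if command -v hdfs >/dev/null 2>&1; then
  hdfs dfs -ls -d "$TARGET" 2>/dev/null || echo "no HDFS directory $TARGET"
else
  echo "hdfs client not found; run this on a cluster node"
fi
```

If the two views show different owners, the write and the permission change are happening on different filesystems.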

8 Replies

Re: Cannot find a saved DataFrame on disk

Mentor

@Liana Napalkova

You are trying to save to a local filesystem path /home/centos/---/---/, but from the error stack above the owner and group are hdfs:hdfs. The user centos doesn't have the correct permissions and ownership of this directory. This has nothing to do with your earlier HDFS directory where you set the correct permissions.

Please do the following while logged in on the Linux CLI as centos:

centos@{host}$ id

This will give you the group that centos belongs to, for use in the change-of-ownership syntax. Then, as the root user or a sudoer, run (where xxxxx is the group):

# chown -R centos:xxxxx  /home/centos/---/---/

Hope that helps
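As a quick follow-up check, the result of the chown can be verified before rerunning the job. A sketch (the path matches the one in this thread; `stat -c` is the GNU coreutils form available on CentOS):

```shell
# Sketch: confirm local ownership after the chown.
id    # shows the current user's uid/gid and groups

# Print owner:group of the target, or a note if the path is absent here
OWNER=$(stat -c '%U:%G' /home/centos/test 2>/dev/null || echo "path not present locally")
echo "$OWNER"
```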

Re: Cannot find a saved DataFrame on disk

Contributor

The output of "id":

uid=1000(centos) gid=1000(centos) groups=1000(centos),4(adm),10(wheel),190(systemd-journal)

I executed "chown -R centos:centos /home/centos/test", but I still get the same error:

18/05/07 12:06:28 ERROR ApplicationMaster: User class threw exception: org.apache.hadoop.security.AccessControlException: Permission denied: user=centos, access=WRITE, inode="/home/centos/test/df.parquet/_temporary/0":hdfs:hdfs:drwxr-xr-x

This is the output of "ls -la" executed in "/home/centos":

total 36236
drwx------.  4 centos centos     4096 May  7 12:34 .
drwxr-xr-x. 15 root   root       4096 Apr 16 18:41 ..
-rw-------.  1 centos centos    13781 May  7 11:26 .bash_history
-rw-r--r--.  1 centos centos       18 Mar  5  2015 .bash_logout
-rw-r--r--.  1 centos centos      193 Mar  5  2015 .bash_profile
-rw-r--r--.  1 centos centos      231 Mar  5  2015 .bashrc
-rw-rw-r--   1 centos centos       47 May  7 11:38 .scala_history
drwx------.  2 centos centos       46 May  2 07:57 .ssh
drwxrwxr-x   4 centos centos      144 May  7 11:42 test

Re: Cannot find a saved DataFrame on disk

Contributor

Maybe the problem is that I run the Spark program in YARN cluster mode? That means the driver can run on any of the machines of the cluster. So should I run "chown -R centos:centos ..." on each machine, or use ".coalesce(1)"?
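For what it's worth, an unqualified path like /home/centos/test is resolved against fs.defaultFS from the cluster's client configuration, not against the local disk of whichever node happens to host the driver, so a per-machine chown should not be needed. A sketch for checking which filesystem such paths resolve to (the hostname below is a placeholder used only when no HDFS client is available):

```shell
# Sketch: find out which filesystem an unqualified path resolves to.
if command -v hdfs >/dev/null 2>&1; then
  DEFAULT_FS=$(hdfs getconf -confKey fs.defaultFS)
else
  DEFAULT_FS="hdfs://<namenode>:8020"   # placeholder; no HDFS client on this machine
fi
echo "fs.defaultFS is $DEFAULT_FS"
```

If this prints an hdfs:// URI, the write in the original post went to HDFS, which matches the hdfs:hdfs ownership in the error.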

Re: Cannot find a saved DataFrame on disk

Super Guru

@Liana Napalkova

The .save action in Spark writes the data to HDFS, but you changed the permissions on the local file system.

Please change the permissions on the /home/centos directory in HDFS.

Log in as the HDFS user and run:

hdfs dfs -chown -R centos /home/centos/*
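The fix above can be sketched end to end, including creating the directory in HDFS first if it does not yet exist (run on a cluster node as the hdfs superuser; the paths match the ones in this thread, and the guard only lets the script degrade gracefully on machines without a client):

```shell
# Sketch: create the target directory in HDFS and hand it over to centos.
if command -v hdfs >/dev/null 2>&1; then
  hdfs dfs -mkdir -p /home/centos/test
  hdfs dfs -chown -R centos:centos /home/centos/test
  RESULT=$(hdfs dfs -ls -d /home/centos/test)
else
  RESULT="skipped: hdfs client not found on this machine"
fi
echo "$RESULT"
```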

Re: Cannot find a saved DataFrame on disk

Contributor

I think that this is the reason. If I log in as the HDFS user and run "hdfs dfs -chown -R centos /home/centos/test", it says that this directory does not exist. I created this directory as the HDFS user and then changed its ownership to centos. Should I write the parquet file to the full path?:

df.coalesce(1).write.format("parquet").save("hdfs://eureambarimaster1.local.eurecat.org:8020/user/hdfs/test")

Re: Cannot find a saved DataFrame on disk

Super Guru
@Liana Napalkova
Use write.mode to specify whether it is overwrite/append, so that Spark will write the file to the test directory:

df.coalesce(1).write.mode("overwrite").format("parquet").save("/user/hdfs/test")

If we don't mention any mode, Spark will fail with a "directory already exists" error, because you have already created the test directory.


Re: Cannot find a saved DataFrame on disk

Contributor

Yes, sure. Sorry, I was actually referring to "hdfs://eureambarimaster1.local.eurecat.org:8020/user/hdfs/test/df.parquet"

Let me test it.

Re: Cannot find a saved DataFrame on disk

Contributor

I have just tested it. It worked fine! Thank you!
