Created 10-30-2017 10:12 PM
We are just getting underway with Spark and the rest of our HDP 2.6.2 distribution for some machine learning. I got a ticket from our infrastructure guys late last week stating that I was running high on disk usage on one of my nodes. This particular node happens to be a Spark2 History Server. So, I went to check it out. Sure enough, /var/log/spark2/ had one log file that was over 14 GB!!
I removed that file and restarted the service. When I came in this morning after the weekend to check on it, the log was back up to ~12 GB. So I checked the logs and saw entries like this:
17/10/30 15:00:47 INFO FsHistoryProvider: Replaying log path: hdfs://xczzpa0073.apsc.com:8020/spark2-history/local-1505774309971
17/10/30 15:00:47 ERROR FsHistoryProvider: Exception encountered when attempting to load application log hdfs://xczzpa0073.apsc.com:8020/spark2-history/local-1505774309971
org.apache.hadoop.security.AccessControlException: Permission denied: user=spark, access=READ, inode="/spark2-history/local-1505774309971":zx6878a:hadoop:-rwxrwx---
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:219)
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1955)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1939)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPathAccess(FSDirectory.java:1913)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:2001)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1970)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1883)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:700)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:377)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2345)
    at sun.reflect.GeneratedConstructorAccessor5.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
    at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
    at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1240)
    at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1225)
    at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1213)
    at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:309)
    at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:274)
    at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:266)
    at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1538)
    at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:331)
    at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:327)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:327)
    at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:786)
    at org.apache.spark.scheduler.EventLoggingListener$.openEventLog(EventLoggingListener.scala:312)
    at org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$replay(FsHistoryProvider.scala:647)
    at org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$mergeApplicationListing(FsHistoryProvider.scala:464)
    at org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$checkForLogs$3$$anon$4.run(FsHistoryProvider.scala:352)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=spark, access=READ, inode="/spark2-history/local-1505774309971":zx6878a:hadoop:-rwxrwx---
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:219)
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1955)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1939)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPathAccess(FSDirectory.java:1913)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:2001)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1970)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1883)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:700)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:377)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2345)
    at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1554)
    at org.apache.hadoop.ipc.Client.call(Client.java:1498)
    at org.apache.hadoop.ipc.Client.call(Client.java:1398)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
    at com.sun.proxy.$Proxy10.getBlockLocations(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:272)
    at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:291)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:203)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:185)
    at com.sun.proxy.$Proxy11.getBlockLocations(Unknown Source)
    at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1238)
    ... 20 more
Ok, so it appears to be a permissions thing, but I am not sure how to fix it. A little background - I am in an enterprise setting, but have set up a vanilla HDP deployment with Ambari - no AD/Kerberos stuff going on, I am letting local process accounts handle things. When you see a user like zxXXXX, that is a local user. In the example above, it is one of our contractors who is doing some of the heavy lifting on our machine learning algorithms. It looks like maybe he is running Spark or PySpark as his own user rather than the spark user, but I can't really tell.
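If it helps, this is how I checked the ownership (just a sanity check based on the paths in the error above, not a fix):

hdfs dfs -ls -d /spark2-history
hdfs dfs -ls /spark2-history

The first command shows who owns the directory itself; the second lists the individual event log files, which is where the zx6878a:hadoop ownership in the error comes from.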
Any idea what is going on here and how I can fix it to keep these error logs from building up?
Thanks!
Created 10-31-2017 11:47 AM
This is a permissions issue. The /spark2-history directory should belong to the spark user. You can change the ownership like this:
hdfs dfs -chown spark /spark2-history
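If there are already application logs inside the directory, the recursive form may be needed as well (run this as the hdfs superuser):

hdfs dfs -chown -R spark /spark2-history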
Thanks,
Aditya
Created 10-31-2017 05:21 PM
@Aditya Sirna - I checked this prior to posting - the spark user does own that directory, but I don't think the issue is with the spark user. It seems to be with the other user, zx6878a:
org.apache.hadoop.security.AccessControlException: Permission denied: user=spark, access=READ, inode="/spark2-history/local-1505774309971":zx6878a:hadoop:-rwxrwx---
I think what is happening is that the user is running PySpark / spark-submit under his own username, not as the spark user. At least that is my guess. Looking at the inode string in the error, that log file is owned by zx6878a with group hadoop and mode rwxrwx---, so "other" users get no access at all - and spark is apparently being treated as "other". Would doing a chmod on that /spark2-history folder to give everyone read and write access (chmod 777) be appropriate, and would it fix this?
Created 10-31-2017 06:03 PM
So, I think I fixed this. As the 'hdfs' user, I simply ran 'hdfs dfs -chmod -R 777 /spark2-history' and restarted the services. I'm no longer seeing the access / permission errors. Let me know if this was the correct fix or if I maybe did something I shouldn't have.... Thanks!
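For anyone following along, the sequence was roughly this (assuming the hdfs account is the HDFS superuser, as it is in a stock HDP install):

su - hdfs
hdfs dfs -chmod -R 777 /spark2-history

followed by a restart of the Spark2 services from Ambari.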
Created 10-31-2017 06:20 PM
You can do this as well. Alternatively, since the log files' group is hadoop and that group already has read access, you could make sure the "spark" user is in the hadoop group rather than opening the directory to everyone.
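A sketch of that group approach, assuming the default shell-based group mapping (the usermod has to run as root on the NameNode host, since that is where HDFS resolves group membership):

usermod -a -G hadoop spark
hdfs dfsadmin -refreshUserToGroupsMappings

The second command makes the NameNode re-read group memberships without a restart.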
Thanks,
Aditya
Created 10-31-2017 07:36 PM
Perfect! Thanks so much!
Created 11-03-2017 06:20 AM
Glad that the issue is resolved. Can you kindly accept the answer so that community users can find it quickly?