Created 05-30-2017 05:36 PM
On HDP 2.6, when trying to run the following paragraph as user2/user2 from a Zeppelin notebook (this is running in yarn-cluster mode):
%livy2.spark sc.version
It hangs for a while, times out, and gives me the following Java stack trace:
org.apache.zeppelin.livy.LivyException: Session 60 is finished, appId: null, log: [java.lang.Exception: No YARN application is found with tag livy-session-60-zahglq2y in 60 seconds. Please check your cluster status, it is may be very busy.,
  com.cloudera.livy.utils.SparkYarnApp.com$cloudera$livy$utils$SparkYarnApp$$getAppIdFromTag(SparkYarnApp.scala:182)
  com.cloudera.livy.utils.SparkYarnApp$$anonfun$1$$anonfun$4.apply(SparkYarnApp.scala:248)
  com.cloudera.livy.utils.SparkYarnApp$$anonfun$1$$anonfun$4.apply(SparkYarnApp.scala:245)
  scala.Option.getOrElse(Option.scala:120)
  com.cloudera.livy.utils.SparkYarnApp$$anonfun$1.apply$mcV$sp(SparkYarnApp.scala:245)
  com.cloudera.livy.Utils$$anon$1.run(Utils.scala:95)]
    at org.apache.zeppelin.livy.BaseLivyInterprereter.createSession(BaseLivyInterprereter.java:209)
    at org.apache.zeppelin.livy.BaseLivyInterprereter.initLivySession(BaseLivyInterprereter.java:98)
    at org.apache.zeppelin.livy.BaseLivyInterprereter.open(BaseLivyInterprereter.java:80)
    at org.apache.zeppelin.interpreter.LazyOpenInterpreter.open(LazyOpenInterpreter.java:69)
    at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:482)
    at org.apache.zeppelin.scheduler.Job.run(Job.java:175)
    at org.apache.zeppelin.scheduler.FIFOScheduler$1.run(FIFOScheduler.java:139)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
From the YARN logs, the only thing I see logged for these unsuccessful attempts is:
2017-05-30 17:22:44,115 INFO resourcemanager.ClientRMService (ClientRMService.java:getNewApplicationId(291)) - Allocated new applicationId: 32
2017-05-30 17:28:55,804 INFO resourcemanager.ClientRMService (ClientRMService.java:getNewApplicationId(291)) - Allocated new applicationId: 33
The same notebook works perfectly fine as user 'admin'; the issue occurs only when switching users. Any suggestions on what is wrong? And there are plenty of resources available on YARN.
Created 05-30-2017 06:29 PM
@zhoussen, as per the Livy logs, the Spark application did not start correctly. To find the root cause, please check the Spark application logs.
Steps to follow:
1) Check the status of the YARN cluster (list running applications).
2) Run the Livy paragraph as user2.
3) Check whether a new application is launched in YARN. If one is, check its status and application log for further debugging (example commands below).
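For example, with the standard YARN CLI (the application ID below is a placeholder):

# List applications YARN currently knows about, including finished/failed ones
yarn application -list -appStates ALL

# Pull the aggregated logs for a specific application once you have its ID
yarn logs -applicationId <application_id>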
Created 05-30-2017 06:52 PM
That doesn't help. The YARN cluster is healthy and doesn't even show this application in any failed state. The application log doesn't contain any more helpful messages.
Created 05-30-2017 06:59 PM
@zhoussen, if the application with tag "livy-session-60-zahglq2y" is alive and running fine, you need to increase the Livy app lookup timeout beyond 60 seconds. It seems Livy believes the YARN application was not started within 60 seconds.
Set livy.server.yarn.app-lookup-timeout to, say, 300 seconds.
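For example, in the Livy server configuration (livy.conf, or the equivalent livy2-conf section in Ambari on HDP; the exact duration format may vary by Livy version, and the 300-second value is just the suggestion above):

# How long Livy waits to find the YARN application carrying its session tag
livy.server.yarn.app-lookup-timeout = 300s

The Livy server typically needs a restart for the change to take effect.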
Created 05-30-2017 07:17 PM
I found the answer in the actual Livy server log itself (not the Zeppelin Livy interpreter log I had been looking at all this time):
17/05/30 18:53:34 INFO InteractiveSessionManager: Registering new session 67
17/05/30 18:53:35 INFO ContextLauncher: Warning: Master yarn-cluster is deprecated since 2.0. Please use master "yarn" with specified deploy mode instead.
17/05/30 18:53:36 INFO ContextLauncher: 17/05/30 18:53:36 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/05/30 18:53:37 INFO ContextLauncher: 17/05/30 18:53:37 WARN DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
17/05/30 18:53:37 INFO ContextLauncher: 17/05/30 18:53:37 INFO RMProxy: Connecting to ResourceManager at zhoussen-edw1.field.hortonworks.com/172.26.255.217:8050
17/05/30 18:53:37 INFO ContextLauncher: 17/05/30 18:53:37 INFO Client: Requesting a new application from cluster with 4 NodeManagers
17/05/30 18:53:37 INFO ContextLauncher: 17/05/30 18:53:37 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (4096 MB per container)
17/05/30 18:53:37 INFO ContextLauncher: 17/05/30 18:53:37 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
17/05/30 18:53:37 INFO ContextLauncher: 17/05/30 18:53:37 INFO Client: Setting up container launch context for our AM
17/05/30 18:53:37 INFO ContextLauncher: 17/05/30 18:53:37 INFO Client: Setting up the launch environment for our AM container
17/05/30 18:53:37 INFO ContextLauncher: 17/05/30 18:53:37 INFO Client: Preparing resources for our AM container
17/05/30 18:53:39 INFO ContextLauncher: Exception in thread "main" org.apache.hadoop.security.AccessControlException: Permission denied: user=user1, access=WRITE, inode="/user/user1/.sparkStaging/application_1496151555596_0039":hdfs:hdfs:drwxr-xr-x
So it appears Livy was indeed able to connect to the ResourceManager and obtain an application ID (which correlates with the earlier entries in the YARN log), but it could not proceed to allocate an AM because it cannot write to /user/user1 on HDFS due to a permission problem. After creating the /user/user1 directory, it works fine.
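For anyone hitting the same thing, a minimal sketch of the fix (run as the HDFS superuser; user1 and the hdfs group are just the names from this thread, adjust to your environment):

# Create the user's HDFS home directory and hand ownership to that user
hdfs dfs -mkdir -p /user/user1
hdfs dfs -chown user1:hdfs /user/user1

After that, the Spark client should be able to create /user/user1/.sparkStaging when Livy launches the session.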
Created 05-30-2017 07:23 PM
Thanks for your answer. Made me look back at the entire flow.
Created 06-02-2017 11:52 PM
It looks like the owner of /user/user1 is hdfs, but it should be user1. I'm not sure how you created the /user/user1 folder; if you are an admin, please change the owner, or ask your admin to do that.
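If the directory already exists but is owned by hdfs, something along these lines (again run as the HDFS superuser) should hand it back to user1; the group shown is an assumption:

hdfs dfs -chown -R user1:hdfs /user/user1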
Created 01-08-2019 09:00 PM
@bkv
Check the YARN logs. It could be starving for YARN containers, in which case you may need to adjust some YARN container settings. Also, please post yours as a separate new question rather than as a reply to this one.
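If container starvation does turn out to be the problem, these are the usual yarn-site.xml knobs to review; the values are purely illustrative and need to be sized to your nodes:

yarn.nodemanager.resource.memory-mb     (memory each NodeManager offers to containers, e.g. 8192)
yarn.scheduler.maximum-allocation-mb    (largest single container the scheduler will grant, e.g. 4096)
yarn.nodemanager.resource.cpu-vcores    (vcores each NodeManager offers, e.g. 8)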