Created on 01-15-2018 11:07 AM - edited 09-16-2022 05:44 AM
Hello,
I am having a problem for which I can't find any logical solution. Every job that requires YARN shows up in the "YARN Applications" UI in Cloudera Manager. However, although I can see all the running jobs in the YARN Applications UI, the ResourceManager UI, and the Spark UI, I have to widen my time selector to a year or two to see the finished jobs.
I think this has something to do with displayed time. All the running jobs have the static `17540.7d` as their duration:
At the same time these applications on `ResourceManager` are showing up with the right date/time:
As you can see this makes it really hard to monitor and track anything in YARN Applications view in Cloudera Manager.
Cloudera Manager express: 5.13.1
CDH: 5.13.1
Ubuntu Server 16.04
I also checked the date/time on all the machines to see whether they were out of sync, but I couldn't find any such issue in my cluster.
NOTE: there is only one similar issue here, but I guess that user can't see any jobs even after widening the time window. (I can see jobs with a wider time window of 1-2 years.)
Best,
Maziyar
Created 01-18-2018 09:17 AM
I cleared all the logs and previous jobs, but CM still shows all the finished jobs with the wrong date. It also still shows new apps with that weird duration (it looks like the conversion from milliseconds to another time format went wrong).
Does anybody know where this data comes from? I have a MySQL setup for my CM. Can I look there to see whether this is a front-end issue, a back-end issue, or data being inserted into a file/table wrongly from the beginning?
Many thanks.
Created on 01-22-2018 03:11 AM - edited 01-22-2018 03:43 AM
If I export the current job it shows me this:
"applications" : [ {
  "applicationId" : "application_1516618738289_0001",
  "name" : "livy-session-0",
  "startTime" : "1970-01-01T00:00:00.000Z",
  "user" : "maziyar",
  "pool" : "root.users.maziyar",
  "state" : "RUNNING",
  "progress" : 10.0,
  "attributes" : { },
  "mr2AppInformation" : { }
}, {
  "applicationId" : "application_1516618738289_0002",
  "name" : "Main",
  "startTime" : "1970-01-01T00:00:00.000Z",
  "user" : "maziyar",
  "pool" : "root.users.maziyar",
  "state" : "RUNNING",
  "progress" : 10.0,
  "attributes" : { },
  "mr2AppInformation" : { }
} ],
"warnings" : [ ]
}
The startTime is in 1970 for some reason! This date is really famous in Unix:
"January 1, 1970 is the so called Unix epoch. It's the date where they started counting the Unix time. If you get this date as a return value, it usually means that the conversion of your date to the Unix timestamp returned a (near-) zero result. So the date conversion doesn't succeed"
So is it the backend of Cloudera Manager that `returns 0`, or is the MySQL conversion somewhere being passed an unsupported format?
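A quick way to see how a zero start time would produce a duration like `17540.7d` is to measure elapsed time from the Unix epoch. The sketch below is only an illustration: `duration_days` is a made-up helper, and the "now" timestamp is a hypothetical value picked so that the result matches the figure reported above. The time of the original screenshot is not known.

```python
from datetime import datetime, timezone


def duration_days(start_ms: int, now: datetime) -> float:
    """Elapsed days between a start time (milliseconds since epoch) and `now`."""
    start = datetime.fromtimestamp(start_ms / 1000, tz=timezone.utc)
    return (now - start).total_seconds() / 86400


# Hypothetical observation time, chosen only to reproduce the reported value.
now = datetime(2018, 1, 9, 16, 48, tzinfo=timezone.utc)

# A start time of 0 means "1970-01-01T00:00:00Z", so the duration is the
# whole span since the epoch rather than the real runtime of the job.
print(round(duration_days(0, now), 1))  # 17540.7
```

If this reading is right, the duration is not random: it is simply "now minus 1970", which would also explain why the time selector has to be widened so far to find these jobs.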
Created 01-25-2018 08:40 AM
I just upgraded the entire cluster to 5.14 and the issue still remains:
CDH: 5.14
CM: 5.14
Created 01-08-2019 02:13 AM
Just to update this post, I have upgraded to CM/CDH 6.1 and I am still experiencing the same thing! I am out of ideas and don't know how to fix this 🙂
Created 01-08-2019 02:29 PM
Hi @maziyar,
Quick question: does your cluster have YARN ACLs turned on? You can search for yarn.acl.enable in the Cloudera Manager YARN configuration to find out.
This issue you have been experiencing may be caused by "startedTime":0 in RM REST API when ACLs are enabled.
If the YARN ACLs are enabled, then you need to define on the queue whether the user has the privilege to administer apps (e.g. seeing duration information). This is controlled by the aclAdministerApps parameter.
If you can send us the fair-scheduler.xml to examine, that would be helpful.
Thanks,
Li
Li Wang, Technical Solution Manager
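To check the `"startedTime":0` symptom directly, one option is to read the ResourceManager's Cluster Applications REST endpoint (`/ws/v1/cluster/apps`) and list the applications whose `startedTime` comes back as zero. This is a minimal sketch, not a Cloudera tool: the host name is a placeholder, the helper names are invented, and the sample payload is hand-written to mimic the shape of the API response.

```python
import json
from urllib.request import urlopen

# Placeholder address; 8088 is the usual default ResourceManager web port.
RM_URL = "http://resourcemanager.example.com:8088"


def fetch_apps(rm_url: str) -> dict:
    """Fetch the application list from the ResourceManager REST API."""
    with urlopen(f"{rm_url}/ws/v1/cluster/apps") as resp:
        return json.load(resp)


def zero_start_apps(payload: dict) -> list:
    """Return IDs of applications reported with startedTime == 0."""
    # "apps" can be null when the cluster has no applications.
    apps = (payload.get("apps") or {}).get("app") or []
    return [a["id"] for a in apps if a.get("startedTime", 0) == 0]


# Hand-written sample in the shape of the API response (illustrative values).
sample = {
    "apps": {
        "app": [
            {"id": "application_1516618738289_0001", "user": "maziyar",
             "state": "RUNNING", "startedTime": 0},
            {"id": "application_1516618738289_0002", "user": "maziyar",
             "state": "RUNNING", "startedTime": 1516618800000},
        ]
    }
}

print(zero_start_apps(sample))  # ['application_1516618738289_0001']
```

Against a live cluster you would call `zero_start_apps(fetch_apps(RM_URL))`. If every application is listed, the caller's identity is probably not covered by the queue's administer ACL.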
Created 01-09-2019 06:18 AM
Hi @lwang,
Yes! I have yarn.acl.enable turned on so that other users don't have admin-level access to the queues (mostly so they don't kill others' applications by mistake). My username has admin-level access, but I see the same wrong format for my own applications as well as for other users' apps in the YARN UI.
In the section "Administration Access Control" of the queues, there are only two options:
I chose the second one, listing my own username and a few others (system).
I couldn't find any file named fair-scheduler.xml on any of my servers. Is this something I should generate? Also, I couldn't find "startedTime" in aclAdministerApps.
But I have the JSON format of what's inside aclAdministerApps (I think it is generated automatically):
{ "defaultFairSharePreemptionThreshold": null, "defaultFairSharePreemptionTimeout": null, "defaultMinSharePreemptionTimeout": null, "defaultQueueSchedulingPolicy": "drf", "queueMaxAMShareDefault": null, "queueMaxAppsDefault": null, "queuePlacementRules": [{ "create": false, "name": "specified", "queue": null, "rules": null }, { "create": true, "name": "nestedUserQueue", "queue": null, "rules": [{ "create": true, "name": "default", "queue": "users", "rules": null }] }, { "create": null, "name": "default", "queue": null, "rules": null }], "queues": [{ "aclAdministerApps": "maziyar ", "aclSubmitApps": "maziyar,test-user,hdfs,admin hadoop-admin,admin,hdfs,hive", "allowPreemptionFrom": null, "fairSharePreemptionThreshold": null, "fairSharePreemptionTimeout": null, "minSharePreemptionTimeout": null, "name": "root", "queues": [{ "aclAdministerApps": "maziyar,root,spark,hdfs ", "aclSubmitApps": "*", "allowPreemptionFrom": null, "fairSharePreemptionThreshold": null, "fairSharePreemptionTimeout": null, "minSharePreemptionTimeout": null, "name": "users", "queues": [{ "aclAdministerApps": null, "aclSubmitApps": null, "allowPreemptionFrom": null, "fairSharePreemptionThreshold": null, "fairSharePreemptionTimeout": null, "minSharePreemptionTimeout": null, "name": "maziyar", "queues": [], "schedulablePropertiesList": [{ "impalaClampMemLimitQueryOption": null, "impalaDefaultQueryMemLimit": null, "impalaDefaultQueryOptions": null, "impalaMaxMemory": null, "impalaMaxQueryMemLimit": null, "impalaMaxQueuedQueries": null, "impalaMaxRunningQueries": null, "impalaMinQueryMemLimit": null, "impalaQueueTimeout": null, "maxAMShare": null, "maxChildResources": null, "maxResources": { "cpuPercent": 30.0, "memory": null, "memoryPercent": 30.0, "vcores": null }, "maxRunningApps": null, "minResources": null, "scheduleName": "default", "weight": 3.0 }], "schedulingPolicy": "drf", "type": null }], "schedulablePropertiesList": [{ "impalaClampMemLimitQueryOption": null, "impalaDefaultQueryMemLimit": 
null, "impalaDefaultQueryOptions": null, "impalaMaxMemory": null, "impalaMaxQueryMemLimit": null, "impalaMaxQueuedQueries": null, "impalaMaxRunningQueries": null, "impalaMinQueryMemLimit": null, "impalaQueueTimeout": null, "maxAMShare": null, "maxChildResources": { "cpuPercent": 10.0, "memory": null, "memoryPercent": 10.0, "vcores": null }, "maxResources": { "cpuPercent": 60.0, "memory": null, "memoryPercent": 60.0, "vcores": null }, "maxRunningApps": 15, "minResources": null, "scheduleName": "default", "weight": 4.0 }], "schedulingPolicy": "drf", "type": "parent" }, { "aclAdministerApps": "*", "aclSubmitApps": "*", "allowPreemptionFrom": null, "fairSharePreemptionThreshold": null, "fairSharePreemptionTimeout": null, "minSharePreemptionTimeout": null, "name": "default", "queues": [], "schedulablePropertiesList": [{ "impalaClampMemLimitQueryOption": null, "impalaDefaultQueryMemLimit": null, "impalaDefaultQueryOptions": null, "impalaMaxMemory": null, "impalaMaxQueryMemLimit": null, "impalaMaxQueuedQueries": null, "impalaMaxRunningQueries": null, "impalaMinQueryMemLimit": null, "impalaQueueTimeout": null, "maxAMShare": null, "maxChildResources": null, "maxResources": { "cpuPercent": 10.0, "memory": null, "memoryPercent": 10.0, "vcores": null }, "maxRunningApps": null, "minResources": null, "scheduleName": "default", "weight": 1.0 }], "schedulingPolicy": "fifo", "type": null }, { "aclAdministerApps": "maziyar ", "aclSubmitApps": "mziyar,hdfs,hive ", "allowPreemptionFrom": null, "fairSharePreemptionThreshold": null, "fairSharePreemptionTimeout": null, "minSharePreemptionTimeout": null, "name": "multivac", "queues": [], "schedulablePropertiesList": [{ "impalaClampMemLimitQueryOption": null, "impalaDefaultQueryMemLimit": null, "impalaDefaultQueryOptions": null, "impalaMaxMemory": null, "impalaMaxQueryMemLimit": null, "impalaMaxQueuedQueries": null, "impalaMaxRunningQueries": null, "impalaMinQueryMemLimit": null, "impalaQueueTimeout": null, "maxAMShare": null, 
"maxChildResources": null, "maxResources": { "cpuPercent": 80.0, "memory": null, "memoryPercent": 80.0, "vcores": null }, "maxRunningApps": 3, "minResources": null, "scheduleName": "default", "weight": 5.0 }], "schedulingPolicy": "drf", "type": null }], "schedulablePropertiesList": [{ "impalaClampMemLimitQueryOption": null, "impalaDefaultQueryMemLimit": null, "impalaDefaultQueryOptions": null, "impalaMaxMemory": null, "impalaMaxQueryMemLimit": null, "impalaMaxQueuedQueries": null, "impalaMaxRunningQueries": null, "impalaMinQueryMemLimit": null, "impalaQueueTimeout": null, "maxAMShare": null, "maxChildResources": null, "maxResources": null, "maxRunningApps": null, "minResources": null, "scheduleName": "default", "weight": 1.0 }], "schedulingPolicy": "drf", "type": null }], "userMaxAppsDefault": null, "users": [] }
Many thanks, I feel we are close to solving this problem 🙂
Created 01-09-2019 11:44 AM
Hi @maziyar,
Great, we are making progress. BTW, is your cluster kerberized? And is authentication for the web UI turned on (SPNEGO)? You can find out by searching for "Enable Kerberos Authentication for HTTP Web-Consoles" in the CM UI.
Can you attach the fair-scheduler.xml that is deployed on the RM nodes? The JSON is a CM-internal format that gets translated into the XML, and I am not sure how it is converted or what the final XML is.
You can pull the proper xml down from the CM UI -> yarn -> instances -> active RM -> Processes -> fair-scheduler.xml . That file is readable and gives a good view of what is there.
Thanks!
Li
Li Wang, Technical Solution Manager
Created 01-11-2019 02:36 AM
Hi @lwang
Thanks for your reply. My cluster is not Kerberized. Also, SPNEGO is not selected/enabled.
Here is my fair-scheduler.xml file:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<allocations>
  <queue name="root">
    <weight>1.0</weight>
    <schedulingPolicy>drf</schedulingPolicy>
    <aclSubmitApps>maziyar,test-user,hdfs,admin iscpif-hadoop,admin,hdfs,hive</aclSubmitApps>
    <aclAdministerApps>maziyar,admin </aclAdministerApps>
    <queue name="users" type="parent">
      <maxResources>60.0%</maxResources>
      <maxChildResources>10.0%</maxChildResources>
      <maxRunningApps>15</maxRunningApps>
      <weight>4.0</weight>
      <schedulingPolicy>drf</schedulingPolicy>
      <aclSubmitApps>*</aclSubmitApps>
      <aclAdministerApps>maziyar,root,spark,hdfs </aclAdministerApps>
      <queue name="mpanahi">
        <maxResources>30.0%</maxResources>
        <weight>3.0</weight>
        <schedulingPolicy>drf</schedulingPolicy>
      </queue>
    </queue>
    <queue name="default">
      <maxResources>10.0%</maxResources>
      <weight>1.0</weight>
      <schedulingPolicy>fifo</schedulingPolicy>
      <aclSubmitApps>*</aclSubmitApps>
      <aclAdministerApps>*</aclAdministerApps>
    </queue>
    <queue name="multivac">
      <maxResources>80.0%</maxResources>
      <maxRunningApps>3</maxRunningApps>
      <weight>5.0</weight>
      <schedulingPolicy>drf</schedulingPolicy>
      <aclSubmitApps>mziyar,hdfs,hive </aclSubmitApps>
      <aclAdministerApps>maziyar </aclAdministerApps>
    </queue>
  </queue>
  <defaultQueueSchedulingPolicy>drf</defaultQueueSchedulingPolicy>
  <queuePlacementPolicy>
    <rule name="specified" create="false"/>
    <rule name="nestedUserQueue" create="true">
      <rule name="default" create="true" queue="users"/>
    </rule>
    <rule name="default"/>
  </queuePlacementPolicy>
</allocations>
Thanks again for your follow up, I really appreciate it.
Best,
Maziyar
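For anyone who wants to audit a fair-scheduler.xml like the one above without reading it by eye, the following sketch walks the nested `<queue>` elements and collects each queue's `aclAdministerApps` value. The helper name `admin_acls` is made up for this example, and the embedded XML is a trimmed copy of the file posted above rather than the full allocation file.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the allocation file from this thread (ACL elements only).
SAMPLE = """<allocations>
  <queue name="root">
    <aclAdministerApps>maziyar,admin </aclAdministerApps>
    <queue name="users" type="parent">
      <aclAdministerApps>maziyar,root,spark,hdfs </aclAdministerApps>
      <queue name="mpanahi"/>
    </queue>
    <queue name="default">
      <aclAdministerApps>*</aclAdministerApps>
    </queue>
  </queue>
</allocations>"""


def admin_acls(xml_text: str) -> dict:
    """Map each queue path (e.g. 'root.users') to its aclAdministerApps text."""
    found = {}

    def walk(node, prefix):
        for queue in node.findall("queue"):
            name = queue.get("name")
            path = f"{prefix}.{name}" if prefix else name
            # findtext returns None when the queue defines no ACL of its own.
            found[path] = queue.findtext("aclAdministerApps")
            walk(queue, path)

    walk(ET.fromstring(xml_text), "")
    return found


for path, acl in admin_acls(SAMPLE).items():
    print(f"{path}: {acl!r}")
```

Printing with `!r` keeps trailing spaces visible, which matters here because in Hadoop ACL strings a space separates the user list from the group list.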
Created on 01-11-2019 10:51 AM - edited 01-11-2019 10:55 AM
Hi Maziyar and Li,
You are definitely on the right track here. I see that the top-level root queue has "aclAdministerApps=maziyar,admin", which limits YARN API access to some of the time metrics for the application. If you're not one of these users when you make the API call, you won't get the correct values returned for:
You will still get some basic application info, so you'll see the application, but the metrics will be wrong, just as you report.
So, the question is which user is the CM service using to interact with the YARN API. I did some testing and reproduced your issue by limiting the aclAdministerApps property to "yarn". I then found that when I add "dr.who" to aclAdministerApps at the root level, it starts working properly.
So, try modifying your root level ACLs to be "aclAdministerApps=maziyar,admin,dr.who", refresh the Dynamic Resource Pool (DRP) configuration, and see if it resolves the issue for you.
Nick
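To make the effect of the suggested change concrete, here is a simplified model of how a Hadoop-style ACL string ("users, then a space, then groups", with `*` meaning everyone) decides whether a caller is allowed. It is a sketch for reasoning about the config, not Hadoop's actual implementation, and `acl_allows` is an invented name. For context, `dr.who` is the default value of `hadoop.http.staticuser.user`, the identity Hadoop assigns to unauthenticated web requests, which would fit the behavior Nick observed on a cluster without Kerberos or SPNEGO.

```python
def acl_allows(acl: str, user: str, groups=()) -> bool:
    """Simplified check of a 'user1,user2 group1,group2' style ACL string."""
    if acl.strip() == "*":
        return True
    # Everything before the first space is the user list, the rest is groups.
    users_part, _, groups_part = acl.partition(" ")
    acl_users = {u for u in users_part.split(",") if u}
    acl_groups = {g for g in groups_part.strip().split(",") if g}
    return user in acl_users or any(g in acl_groups for g in groups)


before = "maziyar,admin "        # root ACL from the posted fair-scheduler.xml
after = "maziyar,admin,dr.who"   # root ACL suggested in the reply above

print(acl_allows(before, "dr.who"))  # False
print(acl_allows(after, "dr.who"))   # True
```

Under this model the original root ACL excludes `dr.who`, so a caller arriving as that user would see applications but not their time metrics, while the amended ACL lets the same caller through.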