YARN Applications display wrong formatted duration

Expert Contributor

Hello,

 

I am having a problem for which I can't find any logical solution. Every job that requires YARN shows up in the "YARN Applications" UI in Cloudera Manager. I can see all the running jobs in the YARN Applications UI, the ResourceManager UI, and the Spark UI, but I have to widen my time selector to a year or two to see the finished jobs.

 

I think this has something to do with displayed time. All the running jobs have the static `17540.7d` as their duration:

 

[Image: Screenshot 2018-01-09 18.14.26.png, YARN Applications view in Cloudera Manager]

 

At the same time, the `ResourceManager` UI shows these applications with the correct date/time:

 

[Image: Screenshot 2018-01-15 19.58.56.png, ResourceManager UI]

As you can see, this makes it really hard to monitor and track anything in the YARN Applications view in Cloudera Manager.

 

Cloudera Manager Express: 5.13.1

CDH: 5.13.1

Ubuntu Server 16.04

I also checked the date/time on all the machines to see whether they were out of sync, but I couldn't find any issue in my cluster.

 

 

NOTE: there is only one similar issue here, but I guess that poster can't see any jobs even after widening the time window. (I can see jobs with a wider time window of 1-2 years.)

http://community.cloudera.com/t5/Batch-Processing-and-Workflow/Completed-YARN-applications-not-visib...

 

Best,

Maziyar

 

Expert Contributor

I cleared all the logs and previous jobs, but CM still shows all the finished jobs with the wrong date. It also still shows new apps with that weird duration (it looks like the conversion from milliseconds to another time format went wrong).

 

Does anybody know where this data comes from? I have a MySQL setup for my CM. Can I look there to see whether this is a front-end issue, a back-end issue, or data being written to a file/table incorrectly from the beginning?

 

Many thanks.

Expert Contributor

If I export the current job it shows me this:

 

{
  "applications" : [ {
    "applicationId" : "application_1516618738289_0001",
    "name" : "livy-session-0",
    "startTime" : "1970-01-01T00:00:00.000Z",
    "user" : "maziyar",
    "pool" : "root.users.maziyar",
    "state" : "RUNNING",
    "progress" : 10.0,
    "attributes" : { },
    "mr2AppInformation" : { }
  }, {
    "applicationId" : "application_1516618738289_0002",
    "name" : "Main",
    "startTime" : "1970-01-01T00:00:00.000Z",
    "user" : "maziyar",
    "pool" : "root.users.maziyar",
    "state" : "RUNNING",
    "progress" : 10.0,
    "attributes" : { },
    "mr2AppInformation" : { }
  } ],
  "warnings" : [ ]
}

The startTime is in 1970 for some reason! This date is really famous in Unix:

 

"January 1, 1970 is the so called Unix epoch. It's the date where they started counting the Unix time. If you get this date as a return value, it usually means that the conversion of your date to the Unix timestamp returned a (near-) zero result. So the date conversion doesn't succeed"

 

So is it the Cloudera Manager backend that returns 0, or is an unsupported format being passed somewhere in the MySQL conversion?
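
As a rough check of the arithmetic, here is a minimal sketch. It assumes the displayed duration is simply "current time minus start time" with a start time of 0 ms; that is a guess about how the view computes it, not confirmed Cloudera Manager behaviour. The function name `implied_now` is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Unix epoch: the instant that a start time of 0 ms maps to.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def implied_now(duration_days: float, start_ms: int = 0) -> datetime:
    """Return the 'current time' implied by a duration measured from start_ms.

    Assumption: duration = now - start, with start given in epoch milliseconds.
    """
    start = EPOCH + timedelta(milliseconds=start_ms)
    return start + timedelta(days=duration_days)

# 17540.7d is the duration shown for every running job.
when = implied_now(17540.7)
print(when.date())  # 2018-01-09
```

The implied date is the same day as the first screenshot's filename, which is consistent with the duration being counted from the epoch rather than from the real start time.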

Expert Contributor

I just upgraded the entire cluster to 5.14 and the issue still remains:

 

CDH: 5.14

CM: 5.14

Expert Contributor

Just to update this post: I have upgraded to CM/CDH 6.1 and I am still experiencing the same thing! I am out of ideas and don't know how to fix this 🙂 

Guru

Hi @maziyar,

 

Quick question: does your cluster have YARN ACLs turned on? You can search for yarn.acl.enable in the Cloudera Manager YARN configuration to find out.

 

The issue you have been experiencing may be caused by "startedTime":0 being returned by the RM REST API when ACLs are enabled.

 

If YARN ACLs are enabled, you need to define on the queue whether the user has the privilege to administer apps (e.g. to see duration information). This is controlled by the aclAdministerApps parameter.
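
One way to check this from outside Cloudera Manager is to look at what the RM REST API returns. The sketch below assumes the standard `/ws/v1/cluster/apps` endpoint and its `apps.app[]` JSON layout; the host, the port 8088, the helper names and the non-zero timestamp in the sample are placeholders for illustration, not values from this cluster.

```python
import json
from urllib.request import Request, urlopen

def fetch_apps(rm_url: str) -> dict:
    """Fetch the application list from the ResourceManager REST API.

    rm_url is a placeholder such as "http://rm-host:8088" (assumed host/port).
    """
    req = Request(f"{rm_url}/ws/v1/cluster/apps",
                  headers={"Accept": "application/json"})
    with urlopen(req) as resp:
        return json.load(resp)

def apps_with_zero_start(payload: dict) -> list:
    """Return the IDs of applications whose startedTime is 0 or missing."""
    apps = (payload.get("apps") or {}).get("app", [])
    return [a["id"] for a in apps if not a.get("startedTime")]

# Hand-written sample in the shape of the REST response (illustrative values).
sample = {"apps": {"app": [
    {"id": "application_1516618738289_0001", "startedTime": 0},
    {"id": "application_1516618738289_0002", "startedTime": 1516618800000},
]}}
print(apps_with_zero_start(sample))
```

If `apps_with_zero_start(fetch_apps(...))` lists applications that are clearly running, the caller is probably not being treated as an application administrator.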

 

If you can send us the fair-scheduler.xml to examine, it will be helpful.

 

Thanks,

Li

Li Wang, Technical Solution Manager



Expert Contributor

Hi @lwang,

 

Yes! I have yarn.acl.enable turned on so that other users don't have admin-level access to the queues (mostly so that they don't kill others' applications by mistake). My username has admin-level access, but I see the same wrong format for my own applications as well as for other users' apps in the YARN UI.

 

In the section "Administration Access Control" of the queues, there are only two options:

  • Allow anyone to administer this pool
  • Allow these users and groups to administer this pool

 

I chose the second one, listing my own username and a few others (system).

 

I couldn't find any file named fair-scheduler.xml on any of my servers. Is this something I should generate? Also, I couldn't find "startedTime" in aclAdministerApps.

 

But I have the JSON format of what's inside aclAdministerApps (I think it is generated automatically):

 

{
	"defaultFairSharePreemptionThreshold": null,
	"defaultFairSharePreemptionTimeout": null,
	"defaultMinSharePreemptionTimeout": null,
	"defaultQueueSchedulingPolicy": "drf",
	"queueMaxAMShareDefault": null,
	"queueMaxAppsDefault": null,
	"queuePlacementRules": [{
		"create": false,
		"name": "specified",
		"queue": null,
		"rules": null
	}, {
		"create": true,
		"name": "nestedUserQueue",
		"queue": null,
		"rules": [{
			"create": true,
			"name": "default",
			"queue": "users",
			"rules": null
		}]
	}, {
		"create": null,
		"name": "default",
		"queue": null,
		"rules": null
	}],
	"queues": [{
		"aclAdministerApps": "maziyar ",
		"aclSubmitApps": "maziyar,test-user,hdfs,admin hadoop-admin,admin,hdfs,hive",
		"allowPreemptionFrom": null,
		"fairSharePreemptionThreshold": null,
		"fairSharePreemptionTimeout": null,
		"minSharePreemptionTimeout": null,
		"name": "root",
		"queues": [{
			"aclAdministerApps": "maziyar,root,spark,hdfs ",
			"aclSubmitApps": "*",
			"allowPreemptionFrom": null,
			"fairSharePreemptionThreshold": null,
			"fairSharePreemptionTimeout": null,
			"minSharePreemptionTimeout": null,
			"name": "users",
			"queues": [{
				"aclAdministerApps": null,
				"aclSubmitApps": null,
				"allowPreemptionFrom": null,
				"fairSharePreemptionThreshold": null,
				"fairSharePreemptionTimeout": null,
				"minSharePreemptionTimeout": null,
				"name": "maziyar",
				"queues": [],
				"schedulablePropertiesList": [{
					"impalaClampMemLimitQueryOption": null,
					"impalaDefaultQueryMemLimit": null,
					"impalaDefaultQueryOptions": null,
					"impalaMaxMemory": null,
					"impalaMaxQueryMemLimit": null,
					"impalaMaxQueuedQueries": null,
					"impalaMaxRunningQueries": null,
					"impalaMinQueryMemLimit": null,
					"impalaQueueTimeout": null,
					"maxAMShare": null,
					"maxChildResources": null,
					"maxResources": {
						"cpuPercent": 30.0,
						"memory": null,
						"memoryPercent": 30.0,
						"vcores": null
					},
					"maxRunningApps": null,
					"minResources": null,
					"scheduleName": "default",
					"weight": 3.0
				}],
				"schedulingPolicy": "drf",
				"type": null
			}],
			"schedulablePropertiesList": [{
				"impalaClampMemLimitQueryOption": null,
				"impalaDefaultQueryMemLimit": null,
				"impalaDefaultQueryOptions": null,
				"impalaMaxMemory": null,
				"impalaMaxQueryMemLimit": null,
				"impalaMaxQueuedQueries": null,
				"impalaMaxRunningQueries": null,
				"impalaMinQueryMemLimit": null,
				"impalaQueueTimeout": null,
				"maxAMShare": null,
				"maxChildResources": {
					"cpuPercent": 10.0,
					"memory": null,
					"memoryPercent": 10.0,
					"vcores": null
				},
				"maxResources": {
					"cpuPercent": 60.0,
					"memory": null,
					"memoryPercent": 60.0,
					"vcores": null
				},
				"maxRunningApps": 15,
				"minResources": null,
				"scheduleName": "default",
				"weight": 4.0
			}],
			"schedulingPolicy": "drf",
			"type": "parent"
		}, {
			"aclAdministerApps": "*",
			"aclSubmitApps": "*",
			"allowPreemptionFrom": null,
			"fairSharePreemptionThreshold": null,
			"fairSharePreemptionTimeout": null,
			"minSharePreemptionTimeout": null,
			"name": "default",
			"queues": [],
			"schedulablePropertiesList": [{
				"impalaClampMemLimitQueryOption": null,
				"impalaDefaultQueryMemLimit": null,
				"impalaDefaultQueryOptions": null,
				"impalaMaxMemory": null,
				"impalaMaxQueryMemLimit": null,
				"impalaMaxQueuedQueries": null,
				"impalaMaxRunningQueries": null,
				"impalaMinQueryMemLimit": null,
				"impalaQueueTimeout": null,
				"maxAMShare": null,
				"maxChildResources": null,
				"maxResources": {
					"cpuPercent": 10.0,
					"memory": null,
					"memoryPercent": 10.0,
					"vcores": null
				},
				"maxRunningApps": null,
				"minResources": null,
				"scheduleName": "default",
				"weight": 1.0
			}],
			"schedulingPolicy": "fifo",
			"type": null
		}, {
			"aclAdministerApps": "maziyar ",
			"aclSubmitApps": "mziyar,hdfs,hive ",
			"allowPreemptionFrom": null,
			"fairSharePreemptionThreshold": null,
			"fairSharePreemptionTimeout": null,
			"minSharePreemptionTimeout": null,
			"name": "multivac",
			"queues": [],
			"schedulablePropertiesList": [{
				"impalaClampMemLimitQueryOption": null,
				"impalaDefaultQueryMemLimit": null,
				"impalaDefaultQueryOptions": null,
				"impalaMaxMemory": null,
				"impalaMaxQueryMemLimit": null,
				"impalaMaxQueuedQueries": null,
				"impalaMaxRunningQueries": null,
				"impalaMinQueryMemLimit": null,
				"impalaQueueTimeout": null,
				"maxAMShare": null,
				"maxChildResources": null,
				"maxResources": {
					"cpuPercent": 80.0,
					"memory": null,
					"memoryPercent": 80.0,
					"vcores": null
				},
				"maxRunningApps": 3,
				"minResources": null,
				"scheduleName": "default",
				"weight": 5.0
			}],
			"schedulingPolicy": "drf",
			"type": null
		}],
		"schedulablePropertiesList": [{
			"impalaClampMemLimitQueryOption": null,
			"impalaDefaultQueryMemLimit": null,
			"impalaDefaultQueryOptions": null,
			"impalaMaxMemory": null,
			"impalaMaxQueryMemLimit": null,
			"impalaMaxQueuedQueries": null,
			"impalaMaxRunningQueries": null,
			"impalaMinQueryMemLimit": null,
			"impalaQueueTimeout": null,
			"maxAMShare": null,
			"maxChildResources": null,
			"maxResources": null,
			"maxRunningApps": null,
			"minResources": null,
			"scheduleName": "default",
			"weight": 1.0
		}],
		"schedulingPolicy": "drf",
		"type": null
	}],
	"userMaxAppsDefault": null,
	"users": []
}
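
To make the ACLs in that JSON easier to read, here is a small sketch that walks the nested "queues" arrays and prints aclAdministerApps for each queue path. The helper name `list_admin_acls` is made up, and the sample is a trimmed copy of the structure above that keeps only the fields the walk needs.

```python
def list_admin_acls(queues, parent=""):
    """Yield (queue path, aclAdministerApps) for every queue in the tree."""
    for q in queues:
        path = f"{parent}.{q['name']}" if parent else q["name"]
        yield path, q.get("aclAdministerApps")
        # Child queues are nested under the "queues" key.
        yield from list_admin_acls(q.get("queues") or [], path)

# Trimmed copy of the JSON above (only name, ACL and child queues kept).
config = {"queues": [{
    "name": "root",
    "aclAdministerApps": "maziyar ",
    "queues": [
        {"name": "users", "aclAdministerApps": "maziyar,root,spark,hdfs ",
         "queues": [{"name": "maziyar", "aclAdministerApps": None, "queues": []}]},
        {"name": "default", "aclAdministerApps": "*", "queues": []},
        {"name": "multivac", "aclAdministerApps": "maziyar ", "queues": []},
    ],
}]}

for path, acl in list_admin_acls(config["queues"]):
    print(f"{path}: {acl!r}")
```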

 

Many thanks, I feel we are close to solving this problem 🙂

Guru

Hi @maziyar,

 

Great, we are making progress. BTW, is your cluster Kerberized? And is authentication for the web UI turned on (SPNEGO)? You can find out by searching for "Enable Kerberos Authentication for HTTP Web-Consoles" in the CM UI.

 

Can you attach the fair-scheduler.xml that is deployed on the RM nodes? The JSON is a CM-internal format that gets translated into the XML, and I am not sure how it is converted or what the final XML is.

 

You can pull the proper XML down from the CM UI -> yarn -> instances -> active RM -> Processes -> fair-scheduler.xml. That file is readable and gives a good view of what is there.

 

Thanks!

Li

Li Wang, Technical Solution Manager



Expert Contributor

Hi @lwang

 

Thanks for your reply. My cluster is not Kerberized. Also, SPNEGO is not selected/enabled.

 

Here is my fair-scheduler.xml file:

 

 

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<allocations>
    <queue name="root">
        <weight>1.0</weight>
        <schedulingPolicy>drf</schedulingPolicy>
        <aclSubmitApps>maziyar,test-user,hdfs,admin iscpif-hadoop,admin,hdfs,hive</aclSubmitApps>
        <aclAdministerApps>maziyar,admin </aclAdministerApps>
        <queue name="users" type="parent">
            <maxResources>60.0%</maxResources>
            <maxChildResources>10.0%</maxChildResources>
            <maxRunningApps>15</maxRunningApps>
            <weight>4.0</weight>
            <schedulingPolicy>drf</schedulingPolicy>
            <aclSubmitApps>*</aclSubmitApps>
            <aclAdministerApps>maziyar,root,spark,hdfs </aclAdministerApps>
            <queue name="mpanahi">
                <maxResources>30.0%</maxResources>
                <weight>3.0</weight>
                <schedulingPolicy>drf</schedulingPolicy>
            </queue>
        </queue>
        <queue name="default">
            <maxResources>10.0%</maxResources>
            <weight>1.0</weight>
            <schedulingPolicy>fifo</schedulingPolicy>
            <aclSubmitApps>*</aclSubmitApps>
            <aclAdministerApps>*</aclAdministerApps>
        </queue>
        <queue name="multivac">
            <maxResources>80.0%</maxResources>
            <maxRunningApps>3</maxRunningApps>
            <weight>5.0</weight>
            <schedulingPolicy>drf</schedulingPolicy>
            <aclSubmitApps>mziyar,hdfs,hive </aclSubmitApps>
            <aclAdministerApps>maziyar </aclAdministerApps>
        </queue>
    </queue>
    <defaultQueueSchedulingPolicy>drf</defaultQueueSchedulingPolicy>
    <queuePlacementPolicy>
        <rule name="specified" create="false"/>
        <rule name="nestedUserQueue" create="true">
            <rule name="default" create="true" queue="users"/>
        </rule>
        <rule name="default"/>
    </queuePlacementPolicy>
</allocations>
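
For reference, here is a hedged sketch of how one might check a given user against the aclAdministerApps entries in a file like this. It assumes the rule described in the Fair Scheduler documentation: an ACL is "users groups" (comma-separated lists split by one space, `*` meaning everyone), a user may act on a queue if listed on that queue or any ancestor, and a missing ACL defaults to `*` on root and to nobody elsewhere. Group membership is ignored, the helper names are made up, and the sample is a trimmed copy of the file above.

```python
import xml.etree.ElementTree as ET

def acl_users(acl: str):
    """Return (everyone, users) for an ACL string 'user1,user2 group1,group2'."""
    if acl.strip() == "*":
        return True, set()
    users_part = acl.partition(" ")[0]  # text before the first space
    return False, {u for u in users_part.split(",") if u}

def user_can_administer(allocations, queue_path: str, user: str) -> bool:
    """True if user is in aclAdministerApps of the queue or any ancestor."""
    node = allocations
    for depth, name in enumerate(queue_path.split(".")):
        node = next((q for q in node.findall("queue")
                     if q.get("name") == name), None)
        if node is None:
            break  # queue not declared in the file; ancestors were checked
        # Assumed defaults: '*' for root, ' ' (nobody) for other queues.
        default = "*" if depth == 0 else " "
        everyone, users = acl_users(node.findtext("aclAdministerApps", default))
        if everyone or user in users:
            return True
    return False

SAMPLE = """<allocations>
  <queue name="root">
    <aclAdministerApps>maziyar,admin </aclAdministerApps>
    <queue name="users" type="parent">
      <aclAdministerApps>maziyar,root,spark,hdfs </aclAdministerApps>
      <queue name="mpanahi"/>
    </queue>
    <queue name="default">
      <aclAdministerApps>*</aclAdministerApps>
    </queue>
  </queue>
</allocations>"""

root = ET.fromstring(SAMPLE)
for who in ("maziyar", "dr.who"):
    print(who, user_can_administer(root, "root.users.mpanahi", who))
```

To run it against the real file, `ET.parse("fair-scheduler.xml").getroot()` can replace `ET.fromstring(SAMPLE)`.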

 

Thanks again for your follow up, I really appreciate it.

 

Best,

Maziyar

 

Expert Contributor

Hi Maziyar and Li,

 

You are definitely on the right track here.  I see that the top-level root queue has "aclAdministerApps=maziyar,admin", which limits YARN API access to some of the time metrics for the application.  If you're not one of these users when you make the API call, you won't get the correct values returned for:

 

  1. startedTime
  2. finishedTime
  3. elapsedTime
  4. logAggregationStatus
  5. amHostHttpAddress
  6. usedResources
  7. allocatedMB
  8. allocatedVCores
  9. runningContainers

You will still get some basic application info, so you'll see the application, but the metrics will be wrong, just as you report.

 

So, the question is which user the CM service uses to interact with the YARN API.  I did some testing and reproduced your issue by limiting the aclAdministerApps property to "yarn".  I then found that when I add "dr.who" to aclAdministerApps at the root level, it starts working properly.

 

So, try modifying your root level ACLs to be "aclAdministerApps=maziyar,admin,dr.who", refresh the Dynamic Resource Pool (DRP) configuration, and see if it resolves the issue for you.
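
As an illustration of that change, the sketch below appends a user to an ACL string while leaving any group list (the part after the first space) untouched. The helper `add_user_to_acl` is made up for this example. For context, dr.who is the default value of Hadoop's hadoop.http.staticuser.user setting, the identity given to unauthenticated web requests, which may be why it is the name that matters here; that link is an inference, not something confirmed in this thread.

```python
def add_user_to_acl(acl: str, user: str) -> str:
    """Return acl with user appended to the user list, keeping the group list.

    Assumes the 'user1,user2 group1,group2' layout used in fair-scheduler.xml.
    """
    if acl.strip() == "*":
        return acl  # everyone already has access
    users_part, _, groups_part = acl.partition(" ")
    users = [u for u in users_part.split(",") if u]
    if user not in users:
        users.append(user)
    # Keep the separating space even when there are no groups.
    return ",".join(users) + " " + groups_part.strip()

# Root-level value from the posted fair-scheduler.xml.
print(repr(add_user_to_acl("maziyar,admin ", "dr.who")))
```

The resulting string is what would go into the root queue's aclAdministerApps before refreshing the Dynamic Resource Pool configuration.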

 

Nick