Support Questions


YARN Applications display wrong formatted duration

Expert Contributor

Hello,

 

I am having a problem that I can't find any logical solution for. Every job that uses YARN shows up in the "YARN Applications" UI in Cloudera Manager. But even though I can see all the running jobs in the YARN Applications UI, the ResourceManager UI, and the Spark UI, I have to widen my time selector to a year or two to see the finished jobs.

 

I think this has something to do with the displayed time. All the running jobs show the same static `17540.7d` as their duration:

 

Screenshot 2018-01-09 18.14.26.png

 

At the same time, these applications in the `ResourceManager` UI are showing up with the right date/time:

 

Screenshot 2018-01-15 19.58.56.png

As you can see, this makes it really hard to monitor and track anything in the YARN Applications view in Cloudera Manager.

 

Cloudera Manager Express: 5.13.1

CDH: 5.13.1

Ubuntu Server 16.04

I also checked the date/time on all the machines to see if they were out of sync, but unfortunately I can't find any issue in my cluster.

 

 

NOTE: there is only one similar issue here, but I believe that poster can't see any jobs even after widening the time window. (I can see jobs with a wider time window of 1-2 years.)

http://community.cloudera.com/t5/Batch-Processing-and-Workflow/Completed-YARN-applications-not-visib...

 

Best,

Maziyar

 
1 ACCEPTED SOLUTION

Cloudera Employee

Hi Maziyar and Li,

 

You are definitely on the right track here.  I see that the top-level root queue has "aclAdministerApps=maziyar,admin", which limits YARN API access to some of the time metrics for the application.  If you're not one of these users when you make the API call, you won't get correct values returned for:

 

  1. startedTime
  2. finishedTime
  3. elapsedTime
  4. logAggregationStatus
  5. amHostHttpAddress
  6. usedResources
  7. allocatedMB
  8. allocatedVCores
  9. runningContainers

You will still get some basic application info, so you'll see the application, but the metrics will be wrong, just as you report.

 

So the question is which user the CM service is using to interact with the YARN API.  I did some testing and reproduced your issue by limiting the aclAdministerApps property to "yarn".  I then found that when I add "dr.who" to aclAdministerApps at the root level, everything starts working properly.

 

So, try modifying your root level ACLs to be "aclAdministerApps=maziyar,admin,dr.who", refresh the Dynamic Resource Pool (DRP) configuration, and see if it resolves the issue for you.

 

Nick

View solution in original post

17 REPLIES

Expert Contributor

I cleared all the logs and previous jobs, but CM still shows all the finished jobs with the wrong date. It also still shows new apps with that weird duration (it looks like a conversion from milliseconds to another time format went wrong).

 

Does anybody know where this data comes from? I have a MySQL setup for my CM. Can I look there to see whether this is a front-end or back-end issue, or whether the data is being inserted into the file/table wrongly from the beginning?

 

Many thanks.

Expert Contributor

If I export the current jobs, it shows me this:

 

  "applications" : [ {
    "applicationId" : "application_1516618738289_0001",
    "name" : "livy-session-0",
    "startTime" : "1970-01-01T00:00:00.000Z",
    "user" : "maziyar",
    "pool" : "root.users.maziyar",
    "state" : "RUNNING",
    "progress" : 10.0,
    "attributes" : { },
    "mr2AppInformation" : { }
  }, {
    "applicationId" : "application_1516618738289_0002",
    "name" : "Main",
    "startTime" : "1970-01-01T00:00:00.000Z",
    "user" : "maziyar",
    "pool" : "root.users.maziyar",
    "state" : "RUNNING",
    "progress" : 10.0,
    "attributes" : { },
    "mr2AppInformation" : { }
  } ],
  "warnings" : [ ]
}

The startTime is in 1970 for some reason! This date is really famous in Unix:

 

"January 1, 1970 is the so called Unix epoch. It's the date where they started counting the Unix time. If you get this date as a return value, it usually means that the conversion of your date to the Unix timestamp returned a (near-) zero result. So the date conversion doesn't succeed"

 

So is it the backend of Cloudera Manager that returns 0, or does the MySQL conversion somewhere receive an unsupported format?
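A quick check confirms that both symptoms are consistent with a start time of zero: 0 milliseconds renders as exactly the epoch date seen in the export, and a duration computed against it is the time elapsed since 1970-01-01, which matches the static `17540.7d` in the YARN Applications view (a small Python illustration; the screenshot date used below is an assumption):

```python
from datetime import datetime, timezone

# A startTime of 0 ms renders as the Unix epoch, as seen in the export...
start_ms = 0
start = datetime.fromtimestamp(start_ms / 1000, tz=timezone.utc)
print(start.isoformat(timespec="milliseconds"))  # 1970-01-01T00:00:00.000+00:00

# ...and a "duration" computed against it is the time since 1970-01-01,
# which matches the static value shown in the YARN Applications view.
now = datetime(2018, 1, 9, tzinfo=timezone.utc)  # screenshot date (assumption)
print(f"{(now - start).total_seconds() / 86400:.1f}d")  # 17540.0d
```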

Expert Contributor

I just upgraded the entire cluster to 5.14 and the issue still remains:

 

CDH: 5.14

CM: 5.14

Expert Contributor

Just to update this post: I have upgraded to CM/CDH 6.1 and I am still experiencing the same thing! I am out of ideas and don't know how to fix this 🙂 

Super Collaborator

Hi @maziyar,

 

Quick question: does your cluster have YARN ACLs turned on? You can search for yarn.acl.enable in the Cloudera Manager YARN configuration to find out.

 

The issue you have been experiencing may be caused by "startedTime":0 being returned by the RM REST API when ACLs are enabled.

 

If the YARN ACLs are enabled, then you need to define on the queue whether the user has the privilege to administer apps (e.g., seeing duration information). This is controlled by the aclAdministerApps parameter.

 

If you can send us the fair-scheduler.xml to examine, that would be helpful.

 

Thanks,

Li

Li Wang, Technical Solution Manager


Was your question answered? Make sure to mark the answer as the accepted solution.
If you find a reply useful, say thanks by clicking on the thumbs up button.

Learn more about the Cloudera Community:

Terms of Service

Community Guidelines

How to use the forum

Expert Contributor

Hi @lwang,

 

Yes! I have yarn.acl.enable turned on so that other users don't have admin-level access to the queues (mostly so they can't kill others' applications by mistake). My username has admin-level access, but I see the same wrong format for my own applications as well as for other users' apps in the YARN UI.

 

In the section "Administration Access Control" of the queues, there are only two options:

  • Allow anyone to administer this pool
  • Allow these users and groups to administer this pool

 

I chose the second one, listing my own username and a few others (system users).

 

I couldn't find any file named fair-scheduler.xml on any of my servers; is this something I should generate? Also, I couldn't find "startedTime" in aclAdministerApps.

 

But I have the JSON version of the configuration that contains aclAdministerApps (I think it is generated automatically):

 

{
	"defaultFairSharePreemptionThreshold": null,
	"defaultFairSharePreemptionTimeout": null,
	"defaultMinSharePreemptionTimeout": null,
	"defaultQueueSchedulingPolicy": "drf",
	"queueMaxAMShareDefault": null,
	"queueMaxAppsDefault": null,
	"queuePlacementRules": [{
		"create": false,
		"name": "specified",
		"queue": null,
		"rules": null
	}, {
		"create": true,
		"name": "nestedUserQueue",
		"queue": null,
		"rules": [{
			"create": true,
			"name": "default",
			"queue": "users",
			"rules": null
		}]
	}, {
		"create": null,
		"name": "default",
		"queue": null,
		"rules": null
	}],
	"queues": [{
		"aclAdministerApps": "maziyar ",
		"aclSubmitApps": "maziyar,test-user,hdfs,admin hadoop-admin,admin,hdfs,hive",
		"allowPreemptionFrom": null,
		"fairSharePreemptionThreshold": null,
		"fairSharePreemptionTimeout": null,
		"minSharePreemptionTimeout": null,
		"name": "root",
		"queues": [{
			"aclAdministerApps": "maziyar,root,spark,hdfs ",
			"aclSubmitApps": "*",
			"allowPreemptionFrom": null,
			"fairSharePreemptionThreshold": null,
			"fairSharePreemptionTimeout": null,
			"minSharePreemptionTimeout": null,
			"name": "users",
			"queues": [{
				"aclAdministerApps": null,
				"aclSubmitApps": null,
				"allowPreemptionFrom": null,
				"fairSharePreemptionThreshold": null,
				"fairSharePreemptionTimeout": null,
				"minSharePreemptionTimeout": null,
				"name": "maziyar",
				"queues": [],
				"schedulablePropertiesList": [{
					"impalaClampMemLimitQueryOption": null,
					"impalaDefaultQueryMemLimit": null,
					"impalaDefaultQueryOptions": null,
					"impalaMaxMemory": null,
					"impalaMaxQueryMemLimit": null,
					"impalaMaxQueuedQueries": null,
					"impalaMaxRunningQueries": null,
					"impalaMinQueryMemLimit": null,
					"impalaQueueTimeout": null,
					"maxAMShare": null,
					"maxChildResources": null,
					"maxResources": {
						"cpuPercent": 30.0,
						"memory": null,
						"memoryPercent": 30.0,
						"vcores": null
					},
					"maxRunningApps": null,
					"minResources": null,
					"scheduleName": "default",
					"weight": 3.0
				}],
				"schedulingPolicy": "drf",
				"type": null
			}],
			"schedulablePropertiesList": [{
				"impalaClampMemLimitQueryOption": null,
				"impalaDefaultQueryMemLimit": null,
				"impalaDefaultQueryOptions": null,
				"impalaMaxMemory": null,
				"impalaMaxQueryMemLimit": null,
				"impalaMaxQueuedQueries": null,
				"impalaMaxRunningQueries": null,
				"impalaMinQueryMemLimit": null,
				"impalaQueueTimeout": null,
				"maxAMShare": null,
				"maxChildResources": {
					"cpuPercent": 10.0,
					"memory": null,
					"memoryPercent": 10.0,
					"vcores": null
				},
				"maxResources": {
					"cpuPercent": 60.0,
					"memory": null,
					"memoryPercent": 60.0,
					"vcores": null
				},
				"maxRunningApps": 15,
				"minResources": null,
				"scheduleName": "default",
				"weight": 4.0
			}],
			"schedulingPolicy": "drf",
			"type": "parent"
		}, {
			"aclAdministerApps": "*",
			"aclSubmitApps": "*",
			"allowPreemptionFrom": null,
			"fairSharePreemptionThreshold": null,
			"fairSharePreemptionTimeout": null,
			"minSharePreemptionTimeout": null,
			"name": "default",
			"queues": [],
			"schedulablePropertiesList": [{
				"impalaClampMemLimitQueryOption": null,
				"impalaDefaultQueryMemLimit": null,
				"impalaDefaultQueryOptions": null,
				"impalaMaxMemory": null,
				"impalaMaxQueryMemLimit": null,
				"impalaMaxQueuedQueries": null,
				"impalaMaxRunningQueries": null,
				"impalaMinQueryMemLimit": null,
				"impalaQueueTimeout": null,
				"maxAMShare": null,
				"maxChildResources": null,
				"maxResources": {
					"cpuPercent": 10.0,
					"memory": null,
					"memoryPercent": 10.0,
					"vcores": null
				},
				"maxRunningApps": null,
				"minResources": null,
				"scheduleName": "default",
				"weight": 1.0
			}],
			"schedulingPolicy": "fifo",
			"type": null
		}, {
			"aclAdministerApps": "maziyar ",
			"aclSubmitApps": "mziyar,hdfs,hive ",
			"allowPreemptionFrom": null,
			"fairSharePreemptionThreshold": null,
			"fairSharePreemptionTimeout": null,
			"minSharePreemptionTimeout": null,
			"name": "multivac",
			"queues": [],
			"schedulablePropertiesList": [{
				"impalaClampMemLimitQueryOption": null,
				"impalaDefaultQueryMemLimit": null,
				"impalaDefaultQueryOptions": null,
				"impalaMaxMemory": null,
				"impalaMaxQueryMemLimit": null,
				"impalaMaxQueuedQueries": null,
				"impalaMaxRunningQueries": null,
				"impalaMinQueryMemLimit": null,
				"impalaQueueTimeout": null,
				"maxAMShare": null,
				"maxChildResources": null,
				"maxResources": {
					"cpuPercent": 80.0,
					"memory": null,
					"memoryPercent": 80.0,
					"vcores": null
				},
				"maxRunningApps": 3,
				"minResources": null,
				"scheduleName": "default",
				"weight": 5.0
			}],
			"schedulingPolicy": "drf",
			"type": null
		}],
		"schedulablePropertiesList": [{
			"impalaClampMemLimitQueryOption": null,
			"impalaDefaultQueryMemLimit": null,
			"impalaDefaultQueryOptions": null,
			"impalaMaxMemory": null,
			"impalaMaxQueryMemLimit": null,
			"impalaMaxQueuedQueries": null,
			"impalaMaxRunningQueries": null,
			"impalaMinQueryMemLimit": null,
			"impalaQueueTimeout": null,
			"maxAMShare": null,
			"maxChildResources": null,
			"maxResources": null,
			"maxRunningApps": null,
			"minResources": null,
			"scheduleName": "default",
			"weight": 1.0
		}],
		"schedulingPolicy": "drf",
		"type": null
	}],
	"userMaxAppsDefault": null,
	"users": []
}

 

Many thanks, I feel we are close to solving this problem 🙂

Super Collaborator

Hi @maziyar,

 

Great, we are making progress. BTW, is your cluster kerberized? And is authentication for the web UI turned on (SPNEGO)? You can find out by searching for "Enable Kerberos Authentication for HTTP Web-Consoles" in the CM UI.

 

Can you attach the fair-scheduler.xml that is deployed on the RM nodes? The JSON is a CM-internal format that gets translated into the XML, and I am not sure how it is converted or what the final XML looks like.

 

You can pull the actual XML down from the CM UI -> YARN -> Instances -> active RM -> Processes -> fair-scheduler.xml. That file is readable and gives a good view of what is there.

 

Thanks!

Li

Li Wang, Technical Solution Manager



Expert Contributor

Hi @lwang

 

Thanks for your reply. My cluster is not Kerberized. Also, SPNEGO is not selected/enabled.

 

Here is my fair-scheduler.xml file:

 

 

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<allocations>
    <queue name="root">
        <weight>1.0</weight>
        <schedulingPolicy>drf</schedulingPolicy>
        <aclSubmitApps>maziyar,test-user,hdfs,admin iscpif-hadoop,admin,hdfs,hive</aclSubmitApps>
        <aclAdministerApps>maziyar,admin </aclAdministerApps>
        <queue name="users" type="parent">
            <maxResources>60.0%</maxResources>
            <maxChildResources>10.0%</maxChildResources>
            <maxRunningApps>15</maxRunningApps>
            <weight>4.0</weight>
            <schedulingPolicy>drf</schedulingPolicy>
            <aclSubmitApps>*</aclSubmitApps>
            <aclAdministerApps>maziyar,root,spark,hdfs </aclAdministerApps>
            <queue name="mpanahi">
                <maxResources>30.0%</maxResources>
                <weight>3.0</weight>
                <schedulingPolicy>drf</schedulingPolicy>
            </queue>
        </queue>
        <queue name="default">
            <maxResources>10.0%</maxResources>
            <weight>1.0</weight>
            <schedulingPolicy>fifo</schedulingPolicy>
            <aclSubmitApps>*</aclSubmitApps>
            <aclAdministerApps>*</aclAdministerApps>
        </queue>
        <queue name="multivac">
            <maxResources>80.0%</maxResources>
            <maxRunningApps>3</maxRunningApps>
            <weight>5.0</weight>
            <schedulingPolicy>drf</schedulingPolicy>
            <aclSubmitApps>mziyar,hdfs,hive </aclSubmitApps>
            <aclAdministerApps>maziyar </aclAdministerApps>
        </queue>
    </queue>
    <defaultQueueSchedulingPolicy>drf</defaultQueueSchedulingPolicy>
    <queuePlacementPolicy>
        <rule name="specified" create="false"/>
        <rule name="nestedUserQueue" create="true">
            <rule name="default" create="true" queue="users"/>
        </rule>
        <rule name="default"/>
    </queuePlacementPolicy>
</allocations>

 

Thanks again for your follow up, I really appreciate it.

 

Best,

Maziyar

 

Cloudera Employee

Hi Maziyar and Li,

 

You are definitely on the right track here.  I see that the top-level root queue has "aclAdministerApps=maziyar,admin", which limits YARN API access to some of the time metrics for the application.  If you're not one of these users when you make the API call, you won't get correct values returned for:

 

  1. startedTime
  2. finishedTime
  3. elapsedTime
  4. logAggregationStatus
  5. amHostHttpAddress
  6. usedResources
  7. allocatedMB
  8. allocatedVCores
  9. runningContainers

You will still get some basic application info, so you'll see the application, but the metrics will be wrong, just as you report.

 

So the question is which user the CM service is using to interact with the YARN API.  I did some testing and reproduced your issue by limiting the aclAdministerApps property to "yarn".  I then found that when I add "dr.who" to aclAdministerApps at the root level, everything starts working properly.
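The ACL gating can be sketched with a toy version of the check (a deliberate simplification, not Hadoop's actual implementation, which also consults group lists and parent queues): "*" admits everyone, otherwise the caller must appear in the user list.

```python
# Simplified illustration (not Hadoop's actual implementation) of how an
# aclAdministerApps entry gates access to per-application metrics.
def can_administer(caller: str, acl_administer_apps: str) -> bool:
    acl = acl_administer_apps.strip()
    if acl == "*":
        return True
    # Treat the ACL as a comma/space-separated user list (groups ignored here).
    users = [u for u in acl.replace(",", " ").split() if u]
    return caller in users

# CM's unauthenticated REST calls reach the RM as the static user "dr.who",
# so with the original root ACL the time metrics are withheld from CM:
print(can_administer("maziyar", "maziyar,admin"))        # True
print(can_administer("dr.who", "maziyar,admin"))         # False
print(can_administer("dr.who", "maziyar,admin,dr.who"))  # True
```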

 

So, try modifying your root level ACLs to be "aclAdministerApps=maziyar,admin,dr.who", refresh the Dynamic Resource Pool (DRP) configuration, and see if it resolves the issue for you.
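In the fair-scheduler.xml posted earlier in this thread, that change would amount to editing the root-level entry (illustrative fragment; the rest of the file stays as-is):

```xml
<!-- root queue: add dr.who so CM's unauthenticated REST calls pass the ACL -->
<aclAdministerApps>maziyar,admin,dr.who</aclAdministerApps>
```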

 

Nick

Expert Contributor
Hi Nick,

Thanks for the advice. I have added "dr.who" to the list and now everything is back to normal! Many thanks mate 🙂

Cloudera Employee

Maziyar,

 

I was discussing this issue internally and adding "dr.who" to the adminACL has the side effect of allowing all users to have access, so we don't want that.  I know we're on the right track here, we just need to get the correct user or group added to the adminACL for CM.  I'm researching and will update as soon as I have the answer!

 

Nick

Expert Contributor
Fantastic! Many thanks Nick and looking forward to the right solution 🙂

Cloudera Employee

Hi Maziyar,

 

I've found that CM uses the "hue" user to interact with the YARN API, so try changing the root level ACL to be "aclAdministerApps=maziyar,admin,hue", refresh the Dynamic Resource Pool (DRP) configuration, and test if it still resolves the issue for you.  This will be much more restricted than using "dr.who" but allow the CM Web UI to function properly.

 

Nick

Expert Contributor
Hi Nick,

Unfortunately, removing dr.who and adding hue resulted in the same problem I had initially. I agree that adding hue would be much safer and more restrictive than dr.who, but it didn't work.

I am looking forward to something similar to hue to solve this issue 🙂

Many thanks,
Maziyar

Cloudera Employee

Hi Maziyar,

 

I'm digging into this again.  I clearly see messages in the RM log showing that "dr.who" is the user accessing the YARN API.  I'm researching further so that I can hopefully provide the correct answer!

 

Nick

Cloudera Employee

Hi Maziyar,

 

The information I had about the "hue" user being used by CM to access the YARN API is correct for kerberized clusters, but in your case we know that the cluster is not kerberized and we see that "dr.who" is used by CM.  Consequently, I think that adding "dr.who" to the aclAdministerApps property is the only solution for now.

 

I am creating an internal improvement request for Cloudera Manager (CM) to also use the user "hue" when ACLs are turned on in a non-kerberized cluster.  That way the behavior will be consistent and will provide some level of restriction on who can administer queues in a non-kerberized environment.

 

EDIT: Internal Improvement JIRA created for CM - look for this change in a future release of CDH (no guarantees, but I hope we implement this change)

 

Nick

Expert Contributor
Hi Nick,

That would be great! Thank you so much for your time and helping me in this matter, I really appreciate it 🙂