Support Questions


After updating config properties in Cloudera Manager, how do I get the correct Hadoop config dir?

New Contributor

Using Cloudera Manager I can set property "yarn.log-aggregation-enable" to "true".

I can then run "Deploy Client Configuration" from Cloudera Manager.

 

However, if I then run "hadoop classpath" or "yarn classpath", the Hadoop configuration directory, which is typically the first entry in the classpath, does not include an updated yarn-site.xml with "yarn.log-aggregation-enable" set to "true". Instead, it has the original yarn-site.xml which has no "yarn.log-aggregation-enable" property in it.

 

Typically the first entry in the classpath is "/etc/hadoop/conf" from "hadoop classpath" or "yarn classpath".

 

In contrast, if I run a YARN application that starts a Java task, I can print the system property "java.class.path", and the first entry is a directory that does contain an updated yarn-site.xml with the property set to "true". For example, instead of "/etc/hadoop/conf", in one task the first directory is "/var/run/cloudera-scm-agent/process/840-yarn-NODEMANAGER". In fact, there is an environment variable, HADOOP_CONF_DIR, which points to the correct Hadoop config dir.

 

But this directory, the one in /var/run/cloudera-scm-agent, is not included in "hadoop classpath" or "yarn classpath".

 

In our application, we need to get the correct Hadoop config dir without running a YARN task. Even if I create a small Java program that prints the environment variable HADOOP_CONF_DIR, or the system property "java.class.path", and run it with "hadoop jar", I do not get the correct results.

 

How do I get the correct Hadoop configuration directory without running a YARN job?

 

Thanks

 

 


4 Replies

Mentor
CM separates server-side configs from client-side configs. The property "yarn.log-aggregation-enable" is used only by NodeManagers, as a toggle.

Clients do not use the above property, so it is not found in the regular /etc/hadoop/conf/*.xml configs (these are gateway configs, or client configs).

When your Java class runs from within a YARN container, it inherits the parent environment (the parent here being the NodeManager). As a result, it sees the same paths the NM service does, which explains why you see "true" in that case.

YARN, in its role as an application framework, provides no notion of 'application configuration' and expects custom applications to roll their own solution. For example, the MR2 application uses a 'job.xml' written and sent by the client, which then serves as the single source of configuration truth (rather than sourcing configuration from the environment of whichever NM the AM/tasks happen to run on).
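As an illustration (not Cloudera-specific; the XML content and the lookup helper below are made up for this sketch), this is roughly how a client resolves a property from a deployed *-site.xml, and why a server-side-only property simply comes back empty on the client:

```python
# Sketch: how a client-side lookup of a Hadoop *-site.xml property behaves.
# CLIENT_CONF stands in for a deployed /etc/hadoop/conf/yarn-site.xml that,
# being a client config, omits yarn.log-aggregation-enable entirely.
import xml.etree.ElementTree as ET

CLIENT_CONF = """<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://nameservice1</value>
  </property>
</configuration>"""

def get_property(xml_text, name):
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None  # property not present in this config file

print(get_property(CLIENT_CONF, "fs.defaultFS"))                 # hdfs://nameservice1
print(get_property(CLIENT_CONF, "yarn.log-aggregation-enable"))  # None
```

A client reading such a file falls back to the property's default, which is why checking it client-side tells you nothing about what the NodeManagers are actually doing.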

Does this help resolve the confusion?

New Contributor

Thank you for the excellent clarification.

 

In our most typical use case, we submit a YARN application from a machine outside the cluster, so the Client process runs on an external machine. Once the submitted YARN app completes, our Client attempts to fetch the aggregated logs. Before fetching them, I had included a check for whether log aggregation was enabled, simply to save time. I now believe I should remove that check, since the property is not a client-side property.

 

So the explanation answers my initial question. A related question: Is there a way for a Client process running on an external machine to check if log aggregation has been completed?

 

Thanks

Super Collaborator

No, there is nothing you can run to check whether log aggregation has finished. It is distributed state known only inside the NodeManagers.

 

The only thing you can do is retry the log retrieval. Log aggregation is performed by the NodeManager(s) when the containers finish.

There is no way to tell how long that will take, since one node could be running more than one container that finishes at almost the same time. The load on HDFS is also a factor: copying to HDFS will only be as fast as HDFS can handle at that point.

Wilfred
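The retry approach described above can be sketched as follows; `fetch` is a hypothetical placeholder for whatever log-retrieval call the client makes (it should return the logs, or None while they are not yet available):

```python
# Sketch: retry aggregated-log retrieval with a growing delay, since there
# is no API to ask whether aggregation has finished.
import time

def fetch_with_retry(fetch, attempts=5, base_delay=2.0):
    for i in range(attempts):
        logs = fetch()
        if logs is not None:
            return logs
        time.sleep(base_delay * (i + 1))  # back off a little more each try
    return None  # gave up; caller may report the logs as unavailable

# Example: a fake fetch that only succeeds on the third attempt.
calls = {"n": 0}
def fake_fetch():
    calls["n"] += 1
    return "aggregated logs" if calls["n"] >= 3 else None

print(fetch_with_retry(fake_fetch, attempts=5, base_delay=0.0))  # aggregated logs
```

Capping the number of attempts keeps the client from polling forever if aggregation fails on some node.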

 

 

New Contributor

Thanks for the explanation, and thanks for bearing with my extending the original question.

This issue can be closed.