Member since 01-09-2019
      
- Posts: 401
- Kudos Received: 163
- Solutions: 80
        My Accepted Solutions
| Title | Views | Posted | 
|---|---|---|
| | 2595 | 06-21-2017 03:53 PM |
| | 4290 | 03-14-2017 01:24 PM |
| | 2388 | 01-25-2017 03:36 PM |
| | 3838 | 12-20-2016 06:19 PM |
| | 2101 | 12-14-2016 05:24 PM |
			
    
	
		
		
12-03-2015 06:56 PM

The default for this is 10. I have seen it set to 128 in a large cluster (over 1,000 nodes), and I think this is causing load issues. What is the recommended value, and when should it be increased from the default of 10?

- Labels:
  - Apache Hadoop
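The post does not name the property, but a default of 10 matches `dfs.namenode.handler.count`; treat that identification, and the sizing heuristic below, as assumptions. A rule of thumb that appears in vendor tuning guides is roughly 20 × ln(cluster size), so very large values like 128 are not unreasonable for 1,000+ nodes, though oversized handler pools can add NameNode lock contention:

```xml
<!-- hdfs-site.xml: sketch, assuming the property in question is
     dfs.namenode.handler.count (default 10). A common heuristic is
     ~20 * ln(number of DataNodes), e.g. 20 * ln(1000) ~= 138. -->
<property>
  <name>dfs.namenode.handler.count</name>
  <value>100</value>
</property>
```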
			
    
	
		
		
12-02-2015 07:19 PM

When I look at the HDFS audit logs, I see the hbase user from the HBase Master node accessing HDFS files, and the entries appear with 'cmd=listStatus'. We regularly see about 3 million of them per hour, and we have seen spikes of 6 million per hour, which may have crashed the NameNode. Any idea what the HBase Master is doing here, or whether we can reduce any of this load on the NameNode?

- Labels:
  - Apache HBase
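To quantify this kind of load before tuning anything, the audit log can be bucketed by hour, user, and operation. A minimal sketch, assuming the standard HDFS audit-log line layout (`timestamp ... ugi=<user> ... cmd=<op> ...`); the sample lines and paths below are made up for illustration:

```python
import re
from collections import Counter

# Hypothetical sample lines in the usual HDFS audit-log layout;
# real logs will differ in paths and IPs.
sample = """\
2015-12-02 19:01:23,111 INFO FSNamesystem.audit: allowed=true ugi=hbase (auth:SIMPLE) ip=/10.0.0.5 cmd=listStatus src=/apps/hbase/data/oldWALs dst=null perm=null
2015-12-02 19:01:24,222 INFO FSNamesystem.audit: allowed=true ugi=hbase (auth:SIMPLE) ip=/10.0.0.5 cmd=listStatus src=/apps/hbase/data/archive dst=null perm=null
2015-12-02 19:02:01,333 INFO FSNamesystem.audit: allowed=true ugi=smith (auth:SIMPLE) ip=/10.0.0.9 cmd=open src=/user/smith/file dst=null perm=null
"""

# Capture the hour prefix of the timestamp, the user, and the command.
LINE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}).*ugi=(\S+).*cmd=(\w+)")

def count_by_hour(lines):
    """Count audit entries per (hour, user, cmd) triple."""
    counts = Counter()
    for line in lines:
        m = LINE.match(line)
        if m:
            counts[m.groups()] += 1
    return counts

counts = count_by_hour(sample.splitlines())
print(counts[("2015-12-02 19", "hbase", "listStatus")])  # → 2
```

Sorting the resulting counter by count usually makes it obvious which caller and operation dominate; for the HBase Master, periodic chores that scan directories (e.g. old-WAL and archive cleaners) are a plausible source of bulk listStatus traffic.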
			
    
	
		
		
11-30-2015 07:56 PM

Thanks Steve. In our case, we are looking to set this at the RM level, not necessarily at the app/AM level: if an AM fails for any reason, just don't retry the AM on the same host; pick something else. Depending on the error, it might also be a good option to blacklist the node at the RM level so no further AMs are sent there.
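Later Hadoop releases added exactly this kind of RM-side AM blacklisting. A sketch of the relevant settings, assuming a release that ships them (they are not available in the 2.4-era clusters discussed in this thread, so version availability should be verified against the release's yarn-default.xml):

```xml
<!-- yarn-site.xml: sketch. When enabled, the RM avoids scheduling
     AM retry attempts on nodes where a previous attempt failed. -->
<property>
  <name>yarn.am.blacklisting.enabled</name>
  <value>true</value>
</property>
<property>
  <!-- Stop blacklisting once this fraction of the cluster
       has been blacklisted, to avoid starving the app. -->
  <name>yarn.am.blacklisting.disable-failure-threshold</name>
  <value>0.8</value>
</property>
```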
    
	
		
		
11-18-2015 12:44 AM

Thanks. Will this change help with all TCP/IP communication, or only with certain traffic such as the MapReduce shuffle?
    
	
		
		
11-17-2015 11:37 PM

1 Kudo

ipc.server.tcpnodelay was changed to true by default in Hadoop 2.6. We are on Hadoop 2.4 and would like to change it to true. Which services, if any, require a restart for this change? Can it be set at the job level for all jobs without restarting services? On a big cluster where a NameNode restart takes more than 60 minutes, we would like to avoid all possible restarts.

- Labels:
  - Apache Hadoop
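A sketch of the settings involved. Note that `ipc.server.tcpnodelay` is read by the IPC *server* when the daemon starts, so changing it only takes effect after restarting the daemon whose RPC latency you want to improve; it cannot be applied per job. The client-side counterpart, `ipc.client.tcpnodelay`, can be set per job (e.g. `-Dipc.client.tcpnodelay=true`) without any service restart:

```xml
<!-- core-site.xml: both default to true from Hadoop 2.6 onward;
     on 2.4 they must be set explicitly. -->
<property>
  <name>ipc.server.tcpnodelay</name>
  <value>true</value>
</property>
<property>
  <name>ipc.client.tcpnodelay</name>
  <value>true</value>
</property>
```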
    
	
		
		
11-10-2015 06:10 PM

The question is more about how the MapReduce AM can have a policy of not rerunning the AM on the node where it failed on the first try. This is not a custom YARN app where we can decide where the AM should go. If the MapReduce AM can't do this now, it might be worth driving a support ticket for an enhancement, since with the current approach a problem on a single NodeManager can cause MapReduce jobs to fail.
    
	
		
		
11-09-2015 09:39 PM

It has nothing to do with labels; it would be the same issue with Node Labels. Whenever an AM fails for any reason, I see that the retry happens on the same node. If the first AM failed for a node-related issue, the second one will fail for the same reason. What we are looking for is whether any configuration change can ensure the AM retry does not happen on the same node.
    
	
		
		
11-09-2015 07:35 PM

I have seen an AM retried on the same node where the first attempt failed, causing the job to fail. There are situations where something is wrong with the node (disk space or other issues), so any number of retries there will fail. Is there any way to ensure that AM retries always go to a different NodeManager? Is the current policy to always retry on the same NodeManager?

- Labels:
  - Apache YARN
    
	
		
		
11-04-2015 11:05 PM

2 Kudos

```
REGISTER /tmp/tez-tfile-parser-0.8.2-SNAPSHOT.jar;
yarnlogs = LOAD '/app-logs/hdfs/logs/**/*' USING org.apache.tez.tools.TFileLoader();
lines_with_fetchertime = FILTER yarnlogs BY $2 matches '.*freed by fetcher.*';
```

This is the Pig code I used to extract specific text from the logs. However, TFileLoader in tez-tools does not seem to scale well when we pass a folder with a ton of logs. I believe tez-tools is also not part of HDP; you need to build it separately. It worked well on smaller datasets but ran into issues on bigger ones. Thanks.
    
	
		
		
11-04-2015 09:30 PM

The Hortonworks documentation (http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.0/bk_yarn_resource_mgt/content/enabling_cgroups.html) says that using CGroups requires HDP to be running in secure mode with Kerberos. However, there is no such requirement in the Apache documentation: https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/NodeManagerCgroups.html. The Apache documentation describes the settings required to run CGroups without Kerberos. Which documentation is incorrect? Do we support running the LinuxContainerExecutor and CGroups without Kerberos?

- Labels:
  - Apache YARN
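For reference, the Apache NodeManagerCgroups page describes CGroups via the LinuxContainerExecutor on a non-Kerberized cluster. A sketch of the settings it covers; the group name and value choices below are illustrative, not prescriptive:

```xml
<!-- yarn-site.xml: sketch of LCE + CGroups on a non-secure cluster,
     per the Apache NodeManagerCgroups documentation. -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <!-- Illustrative group; must match the container-executor.cfg setup. -->
  <name>yarn.nodemanager.linux-container-executor.group</name>
  <value>hadoop</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.resources-handler.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler</value>
</property>
<property>
  <!-- In non-secure mode, containers run as this single local user. -->
  <name>yarn.nodemanager.linux-container-executor.nonsecure-mode.local-user</name>
  <value>nobody</value>
</property>
```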