Member since: 10-04-2016
Posts: 243
Kudos Received: 281
Solutions: 43
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 1171 | 01-16-2018 03:38 PM |
 | 6139 | 11-13-2017 05:45 PM |
 | 3032 | 11-13-2017 12:30 AM |
 | 1518 | 10-27-2017 03:58 AM |
 | 28426 | 10-19-2017 03:17 AM |
10-18-2021
04:41 AM
@rajatsachan, to help others who may face similar issues, it would be great if you could mark the response that helped you resolve your issue as the solution. To mark it as the solution, click the corresponding button on that response.
If you resolved the issue in some other way, please post your solution in this thread so that it can be marked as the solution instead.
10-13-2021
05:25 AM
For your information, the credentialbuilder*.jar file is not missing. The problem is that the variable RANGER_OZONE_PLUGIN_INSTALL_LIB points to the wrong directory:
/opt/cloudera/parcels/CDH-7.1.6-1.cdh7.1.6.p0.10506313:GPLEXTRAS-7.1.6-1.gplextras7.1.6.p0.10506313/lib/ranger-ozone-plugin/install/lib
The correct directory is:
/opt/cloudera/parcels/CDH-7.1.6-1.cdh7.1.6.p0.10506313/lib/ranger-ozone-plugin/install/lib
Before the GPL Extras parcel was installed, this variable pointed to the correct directory. I don't know why, after the parcel was installed, Cloudera Manager inserted an extra string (:GPLEXTRAS-7.1.6-1.gplextras7.1.6.p0.10506313) into the variable. I think that if I could edit the variable from Cloudera Manager, it would resolve the issue. Any advice on how to edit this variable using Cloudera Manager?
10-12-2021
08:47 PM
3 Kudos
Introduction
This article is the final part of the series Scaling the Namenode (see part 1, part 2, part 3 and part 4).
In part 4 we discussed monitoring the Namenode logs for excessive skews.
In this part, we will look at a few optimizations around logging, access checks, and block reports.
Audience
This article is for Hadoop administrators who are familiar with HDFS and its components.
Audit log specific operations only when debug is enabled.
By default, the following property is blank, so every Namenode operation writes an entry to the audit log.
Operations like getfileinfo fetch the metadata associated with a file, and in a large or read-heavy cluster they can generate an excessive volume of audit log entries. It is therefore recommended to audit log getfileinfo only when audit log debug is enabled.
Change in hdfs-site.xml
<property>
<name>dfs.namenode.audit.log.debug.cmdlist</name>
<value>getfileinfo</value>
<description>A comma separated list of NameNode commands that are written to the HDFS
namenode audit log only if the audit log level is debug.
</description>
</property>
In Cloudera Manager you can add the property under "NameNode Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml".
Further, the BlockStateChange and StateChange loggers are really only useful when those operations have failed, i.e. when the log entry is at the ERROR level. At the default INFO level, these two loggers generate a large number of entries in the Namenode logs. You can reduce this logging by adding the following lines to the log4j.properties file in your Hadoop configuration. In Cloudera Manager, these properties can be added under "NameNode Logging Advanced Configuration Snippet (Safety Valve)".
log4j.logger.BlockStateChange=ERROR
log4j.logger.org.apache.hadoop.hdfs.StateChange=ERROR
Avoid recursive call to external authorizer for getContentSummary
getContentSummary is an expensive operation in general. It becomes even more expensive in a secured environment where security is managed by an external component like Apache Ranger, because the permission check is performed via a recursive call that checks all descendants in a path. HDFS-14112 introduced an improvement that makes just one call with subAccess, because the external authorizer often does not need to evaluate each and every component of the path.
Change in hdfs-site.xml
<property>
<name>dfs.permissions.ContentSummary.subAccess</name>
<value>true</value>
<description>
If "true", the ContentSummary permission checking will use subAccess.
If "false", the ContentSummary permission checking will NOT use subAccess.
subAccess means using recursion to check the access of all descendants.
</description>
</property>
Again, in Cloudera Manager, place the property under "NameNode Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml".
It is recommended to set this property to true so as to use subAccess.
Note: This improvement is only available in CDP releases. The older CDH/HDP releases do not have it, so adding this configuration on CDH/HDP releases is not recommended.
Optimizing Block Reports
In busy and large clusters (say 200 Datanodes), it is very important not to overwhelm the NameNode with overly frequent full block reports from the Datanodes. If the NameNodes are already degraded, the block reports add further stress on them. The NameNodes might be so slow to process the block reports that you would eventually see messages like 'Block report queue is full' in the NameNode logs.
It is interesting to note that, although the default block report queue size is 1024, this 'Block report queue is full' message can appear even during a NameNode startup, in what we call a block report flood event. It can also appear when the NameNode's RPC processing time is high enough to indicate a severely degraded NameNode, which leaves a backlog of reports to process and eventually overflows the queue.
While the block report queue size is configurable and you could simply increase it, a better approach is to optimize the way the Datanodes send block reports.
We recommend a three-pronged approach, changing the following in hdfs-site.xml:
Split block report by volume (default value 1000000):
<property>
<name>dfs.blockreport.split.threshold</name>
<value>0</value>
<description>
If the number of blocks on the DataNode is below this threshold then it will send block reports for all Storage Directories in a single message. If the number of blocks exceeds this threshold then the DataNode will send block reports for each Storage Directory in separate messages. Set to zero to always split.
</description>
</property>
Reduce full block report frequency from the default 6 hours to 12 hours:
<property>
<name>dfs.blockreport.intervalMsec</name>
<value>43200000</value>
<description>
Determines block reporting interval in milliseconds.
</description>
</property>
Batch incremental reports (default value 0 disables batching):
<property>
<name>dfs.blockreport.incremental.intervalMsec</name>
<value>100</value>
<description>
If set to a positive integer, the value in ms to wait between sending incremental block reports from the Datanode to the Namenode.
</description>
</property>
All 3 belong under "NameNode Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml" in Cloudera Manager.
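As a rough illustration of why the longer interval helps, the sketch below estimates how many full block reports per hour a NameNode receives from a 200-Datanode cluster at the default 6-hour interval and at the recommended 12-hour interval. The function name is hypothetical, and the calculation assumes each Datanode sends exactly one full report per interval, evenly spread over time. Splitting by volume changes the number and size of messages per report, which this sketch does not model.

```python
MS_PER_HOUR = 60 * 60 * 1000


def full_reports_per_hour(datanodes: int, interval_ms: int) -> float:
    """Average full block reports per hour across the cluster.

    Assumes one full report per Datanode per interval (a simplification).
    """
    return datanodes * MS_PER_HOUR / interval_ms


datanodes = 200                # cluster size used as the example in the article
default_interval = 21_600_000  # 6 hours in milliseconds
tuned_interval = 43_200_000    # 12 hours, the value set in dfs.blockreport.intervalMsec

default_rate = full_reports_per_hour(datanodes, default_interval)
tuned_rate = full_reports_per_hour(datanodes, tuned_interval)

print(round(default_rate, 1))  # 33.3
print(round(tuned_rate, 1))    # 16.7
```

Doubling the interval halves the average rate of full block reports the NameNode has to process.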
Conclusion
This wraps up the series on getting the best performance possible out of your NameNode. We hope these tips will keep your cluster running at its best and your users happy.
08-06-2021
01:25 AM
Had the same issue on CDP 7.1.6, which comes with Tez 0.9.1. It looks like this bug: https://issues.apache.org/jira/browse/TEZ-4057
One workaround (probably not 100% secure) is to add the yarn user to the hive group:
usermod -a -G hive yarn
This needs to be done on all nodes and requires a restart of the YARN services. After that the issue was gone, with no more random errors for Hive on Tez.
04-17-2020
02:24 PM
They are actually not the same. SORT BY sorts data within each partition, while ORDER BY is a global sort. SORT BY calls the sortWithinPartitions() function, while ORDER BY calls sort(). Both of these functions call sortInternal(), but with a different global flag:
def sortWithinPartitions ... sortInternal(global = false, sortExprs)
def sort ... sortInternal(global = true, sortExprs)
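To make the difference concrete, here is a minimal sketch in plain Python rather than Spark. The two function names are hypothetical stand-ins, and partitions are modelled as simple lists, so this only mimics the behaviour described above and is not Spark's implementation.

```python
from typing import List


def sort_within_partitions(partitions: List[List[int]]) -> List[List[int]]:
    """SORT BY analogue: each partition is sorted independently."""
    return [sorted(p) for p in partitions]


def global_sort(partitions: List[List[int]]) -> List[int]:
    """ORDER BY analogue: one total ordering across all partitions."""
    return sorted(x for p in partitions for x in p)


partitions = [[5, 1, 9], [4, 8, 2]]

within = sort_within_partitions(partitions)
total = global_sort(partitions)

print(within)  # [[1, 5, 9], [2, 4, 8]]
print(total)   # [1, 2, 4, 5, 8, 9]
```

Reading the partitions of the first result one after another gives 1, 5, 9, 2, 4, 8, which is not globally ordered. That is why the two clauses can return rows in a different order.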
03-27-2020
10:12 AM
@dineshc Please specify where the mentioned workaround property needs to be added. Ams-site.xml or Ams-Hbase-site.xml?
09-24-2019
12:02 AM
Thank you very much!
08-21-2019
07:34 PM
1 Kudo
In HDP-2.6/Ambari-2.6, it was not mandatory to enable HS2 metrics explicitly, so all metrics were emitted without defining any configs. In HDP-3/Ambari-2.7, we see errors similar to the following in the AMS Collector log:
Error : 2019-06-10 02:42:59,215 INFO timeline timeline.HadoopTimelineMetricsSink: No live collector to send metrics to. Metrics to be sent will be discarded. This message will be skipped for the next 20
Debug logging shows this:
2019-06-14 20:35:29,538 DEBUG main timeline.HadoopTimelineMetricsSink: Trying to find live collector host from : exp5.lab.com,exp4.lab.com
2019-06-14 20:35:29,538 DEBUG main timeline.HadoopTimelineMetricsSink: Requesting live collector nodes : http://exp5.lab.com,exp4.lab.com:6188/ws/v1/timeline/metrics/livenodes
2019-06-14 20:35:29,557 DEBUG main timeline.HadoopTimelineMetricsSink: Unable to connect to collector, http://exp5.lab.com,exp4.lab.com:6188/ws/v1/timeline/metrics/livenodes
2019-06-14 20:35:29,557 DEBUG main timeline.HadoopTimelineMetricsSink: java.net.UnknownHostException: exp5.lab.com,exp4.lab.com
2019-06-14 20:35:29,558 DEBUG main timeline.HadoopTimelineMetricsSink: Collector exp5.lab.com,exp4.lab.com is not longer live. Removing it from list of know live collector hosts : []
2019-06-14 20:35:29,558 DEBUG main timeline.HadoopTimelineMetricsSink: No live collectors from configuration.
You need to ensure the following properties exist. If not, first add them in the respective custom section via Ambari > Hive > Configs.
Next, if you are using Ambari Metrics with more than one collector, you need to make one more change due to a bug, which will likely be fixed after Ambari-2.7.4. Add
*.sink.timeline.zookeeper.quorum=<ZK_QUORUM_ADDRESS>
Example:
*.sink.timeline.zookeeper.quorum=zk_host1:2181,zk_host2:2181,zk_host3:2181
to all 4 files under /var/lib/ambari-server/resources/stacks/HDP/3.0/services/HIVE/package/templates/ on the Ambari Server host.
Restart Ambari Server and Hive for the changes to take effect. The metrics will now be emitted, and you should be able to see data on your Grafana dashboard.
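If you need to generate the quorum line for several clusters, a small helper like the one below can build it from a host list. This is only a sketch: the function name is hypothetical, and it assumes every ZooKeeper host listens on the same client port (2181 in the example above).

```python
from typing import List


def quorum_property(hosts: List[str], port: int = 2181) -> str:
    """Build the metrics sink property line from a list of ZooKeeper hosts.

    Assumes all hosts use the same client port (hypothetical helper).
    """
    quorum = ",".join(f"{host}:{port}" for host in hosts)
    return f"*.sink.timeline.zookeeper.quorum={quorum}"


line = quorum_property(["zk_host1", "zk_host2", "zk_host3"])
print(line)  # *.sink.timeline.zookeeper.quorum=zk_host1:2181,zk_host2:2181,zk_host3:2181
```

The resulting line still has to be added manually to each of the template files mentioned above.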
02-15-2019
08:52 PM
@Mahesh Balakrishnan Since there can be only one accepted answer 😞 , I am sharing 25 bounty points with you. Thanks for the guidance.