Member since 10-04-2016 · 243 Posts · 281 Kudos Received · 43 Solutions
10-18-2021
04:41 AM
@rajatsachan, to help others who may face similar issues, it would be great if you could mark the response that helped you resolve your issue as a solution. To do so, click the Accept as Solution button on that reply.
If you resolved the issue in another way, please share the solution in this thread and mark that as the solution.
10-12-2021
08:47 PM
3 Kudos
Introduction
This article is the final part in the series Scaling the Namenode (see part 1, part 2, part 3, and part 4).
In part 4, we discussed monitoring Namenode logs for excessive skews.
In this part, we will look at a few optimizations around logging, access checks, and block reports.
Audience
This article is for Hadoop administrators who are familiar with HDFS and its components.
Audit log specific operations only when debug is enabled.
By default, the following property is blank, so none of the Namenode operations are excluded from the audit log.
Operations like getfileinfo fetch the metadata associated with a file, and in a large or read-heavy cluster they can generate an excessive volume of audit log entries. So, it is recommended to audit-log getfileinfo only when audit log debug is enabled.
Change in hdfs-site.xml
<property>
  <name>dfs.namenode.audit.log.debug.cmdlist</name>
  <value>getfileinfo</value>
  <description>A comma-separated list of NameNode commands that are written to the HDFS
  namenode audit log only if the audit log level is debug.
  </description>
</property>
In Cloudera Manager you can add the property under "NameNode Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml".
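With the command list in place, getfileinfo entries are suppressed until the audit logger is raised to DEBUG. In stock upstream Hadoop log4j.properties the audit logger is named as shown below; verify the exact logger name and appender wiring against your distribution's log4j.properties before applying:

```properties
# Raise the HDFS audit logger from the default INFO to DEBUG so that
# commands listed in dfs.namenode.audit.log.debug.cmdlist (e.g. getfileinfo)
# are written to the audit log again while troubleshooting.
log4j.logger.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=DEBUG
```

Remember to revert this to INFO once troubleshooting is complete, as DEBUG restores the full audit volume.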
Further, the BlockStateChange and StateChange logging is really only useful when those operations have failed, i.e. when the entries are at the ERROR level. At the default INFO level, these two loggers generate a large number of entries in the Namenode logs. You can reduce the logging by adding the following lines to the log4j.properties file in your Hadoop configuration. Under Cloudera Manager, these properties can be added under "NameNode Logging Advanced Configuration Snippet (Safety Valve)".
log4j.logger.BlockStateChange=ERROR
log4j.logger.org.apache.hadoop.hdfs.StateChange=ERROR
Avoid recursive call to external authorizer for getContentSummary
getContentSummary is an expensive operation in general. It becomes even more expensive in a secured environment where authorization is managed by an external component like Apache Ranger, because the permission check is performed via a recursive call covering every descendant in the path. HDFS-14112 introduced an improvement to make just one call with subAccess, since the external authorizer often does not need to evaluate each and every component of the path.
Change in hdfs-site.xml
<property>
<name>dfs.permissions.ContentSummary.subAccess</name>
<value>true</value>
<description>
If "true", the ContentSummary permission checking will use subAccess.
If "false", the ContentSummary permission checking will NOT use subAccess.
subAccess means using recursion to check the access of all descendants.
</description>
</property>
Again, in Cloudera Manager, place the property under "NameNode Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml".
It is recommended to set this property to true so that subAccess is used.
Note: This improvement is only available in CDP releases; the older CDH/HDP releases do not have it, so adding this configuration on CDH/HDP releases is not recommended.
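To illustrate why subAccess matters, here is a toy cost model (illustrative only, not HDFS code): with recursive checking the external authorizer is invoked once per descendant, while with subAccess it is invoked once for the whole subtree.

```python
# Toy cost model (not HDFS code): external authorizer calls needed for
# a getContentSummary permission check on a directory subtree.

def authorizer_calls_recursive(num_descendants: int) -> int:
    """Pre-HDFS-14112 behavior: one external check per descendant."""
    return num_descendants

def authorizer_calls_subaccess(num_descendants: int) -> int:
    """With subAccess: a single check covering the whole subtree."""
    return 1

# A directory with one million descendants: 1,000,000 calls vs. 1.
print(authorizer_calls_recursive(1_000_000))  # 1000000
print(authorizer_calls_subaccess(1_000_000))  # 1
```

The gap grows linearly with the size of the subtree, which is why deep, wide directory trees are the worst case for the recursive check.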
Optimizing Block Reports
In busy and large clusters (say, 200 Datanodes), it is very important not to overwhelm the NameNode with too-frequent full block reports from the Datanodes. If the NameNodes are already degraded, the block reports add further stress. The NameNodes might be so slow to process the block reports that you eventually see messages like 'Block report queue is full' in the NameNode logs.
It is interesting to note that while the default block report queue size is 1024, you can see the 'Block report queue is full' message during a NameNode startup, in what we call a block report flood event, and also when the NameNode's RPC processing time is too high, indicating a severely degraded NameNode with a backlog of reports that eventually overflows the queue.
While the block report queue size is configurable and you could simply increase it, a better approach is to optimize the way the Datanodes send block reports.
We recommend a three-pronged approach: change the following in hdfs-site.xml.
Split block reports by volume (default value 1000000):
<property>
<name>dfs.blockreport.split.threshold</name>
<value>0</value>
<description>
If the number of blocks on the DataNode is below this threshold then it will send block reports for all Storage Directories in a single message. If the number of blocks exceeds this threshold then the DataNode will send block reports for each Storage Directory in separate messages. Set to zero to always split.
</description>
</property>
Reduce full block report frequency from the default 6 hours to 12 hours:
<property>
<name>dfs.blockreport.intervalMsec</name>
<value>43200000</value>
<description>
Determines block reporting interval in milliseconds.
</description>
</property>
Batch incremental block reports (the default value 0 disables batching):
<property>
<name>dfs.blockreport.incremental.intervalMsec</name>
<value>100</value>
<description>
If set to a positive integer, the value in ms to wait between sending incremental block reports from the Datanode to the Namenode.
</description>
</property>
All three properties go under "NameNode Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml" in Cloudera Manager.
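As a quick sanity check on the millisecond values above (simple arithmetic, not Hadoop code):

```python
# dfs.blockreport.intervalMsec: 12 hours expressed in milliseconds.
full_report_interval_ms = 12 * 60 * 60 * 1000
print(full_report_interval_ms)  # 43200000

# dfs.blockreport.incremental.intervalMsec: batching incremental reports
# every 100 ms caps each Datanode at 10 batched sends per second.
incremental_interval_ms = 100
sends_per_second = 1000 // incremental_interval_ms
print(sends_per_second)  # 10
```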
Conclusion
This wraps up the series on getting the best performance possible out of your NameNode. We hope these tips will keep your cluster running at its best and your users happy.
08-06-2021
01:25 AM
Had the same issue on CDP 7.1.6, which comes with Tez 0.9.1. Looks like this: https://issues.apache.org/jira/browse/TEZ-4057. One workaround (probably not 100% secure) is to add the yarn user to the hive group: usermod -a -G hive yarn. This needs to be done on all nodes and requires a YARN services restart. After that, the issue was gone; no more random errors for Hive on Tez.
03-27-2020
10:12 AM
@dineshc Please specify where the mentioned workaround property needs to be added: ams-site.xml or ams-hbase-site.xml?
08-21-2019
07:34 PM
1 Kudo
In HDP-2.6/Ambari-2.6, it was not mandatory to enable HS2 metrics explicitly; all metrics were emitted without defining any configs. In HDP-3/Ambari-2.7, we see errors like the following in the AMS Collector log:
2019-06-10 02:42:59,215 INFO timeline timeline.HadoopTimelineMetricsSink: No live collector to send metrics to. Metrics to be sent will be discarded. This message will be skipped for the next 20
Debug logging shows this:
2019-06-14 20:35:29,538 DEBUG main timeline.HadoopTimelineMetricsSink: Trying to find live collector host from : exp5.lab.com,exp4.lab.com
2019-06-14 20:35:29,538 DEBUG main timeline.HadoopTimelineMetricsSink: Requesting live collector nodes : http://exp5.lab.com,exp4.lab.com:6188/ws/v1/timeline/metrics/livenodes
2019-06-14 20:35:29,557 DEBUG main timeline.HadoopTimelineMetricsSink: Unable to connect to collector, http://exp5.lab.com,exp4.lab.com:6188/ws/v1/timeline/metrics/livenodes
2019-06-14 20:35:29,557 DEBUG main timeline.HadoopTimelineMetricsSink: java.net.UnknownHostException: exp5.lab.com,exp4.lab.com
2019-06-14 20:35:29,558 DEBUG main timeline.HadoopTimelineMetricsSink: Collector exp5.lab.com,exp4.lab.com is not longer live. Removing it from list of know live collector hosts : []
2019-06-14 20:35:29,558 DEBUG main timeline.HadoopTimelineMetricsSink: No live collectors from configuration.
You need to ensure the following properties exist. If not, first add them in the respective custom section via Ambari > Hive > Configs. Next, if you are using Ambari Metrics with more than one collector, you need to make one more change due to a bug, which will likely be fixed after Ambari-2.7.4. Add
*.sink.timeline.zookeeper.quorum=<ZK_QUORUM_ADDRESS>
Example: *.sink.timeline.zookeeper.quorum=zk_host1:2181,zk_host2:2181,zk_host3:2181
to all 4 files under /var/lib/ambari-server/resources/stacks/HDP/3.0/services/HIVE/package/templates/ on the Ambari Server host. Restart Ambari Server and Hive for the changes to take effect. Now the metrics will be emitted and you should be able to see data on your Grafana dashboard.
02-15-2019
08:52 PM
@Mahesh Balakrishnan Since there can be only one accepted answer 😞 , I am sharing 25 bounty points with you. Thanks for the guidance.
09-05-2018
09:05 PM
2 Kudos
If you have started using Hive LLAP, you would have noticed that by default it is configured to use log4j2. The default configuration makes use of advanced log4j2 features like rolling over logs based on time interval and size. Over time, a lot of old log files accumulate; with log4j1 you would typically compress those files manually, or add extra jars and configuration changes to achieve the same. With log4j2, a simple configuration change can ensure that every time a log file is rolled over, it gets compressed for optimal use of storage space.
To automatically compress the rolled-over log files, update the appender file pattern to:
appender.DRFA.filePattern = ${sys:hive.log.dir}/${sys:hive.log.file}.%d{yyyy-MM-dd}-%i.gz
%i ensures that in the rare scenario where logging spikes and the size threshold is reached more than once in the specified interval, the previously rolled-over file is not overwritten. .gz ensures that the files are compressed using gzip.
To understand the finer details of log4j2 appenders, you may check out the official documentation. You can also make similar changes to the llap-cli log settings.
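For context, a rolling-file appender section in a Hive log4j2 properties file typically looks like the sketch below. Property names follow stock Hive log4j2 configurations, but the exact values (sizes, intervals, layout pattern) will differ per distribution, so treat this as an assumed example rather than your file's literal contents:

```properties
# DRFA: rolling file appender for Hive/LLAP logs (values are examples)
appender.DRFA.type = RollingRandomAccessFile
appender.DRFA.name = DRFA
appender.DRFA.fileName = ${sys:hive.log.dir}/${sys:hive.log.file}
# The "-%i.gz" suffix makes log4j2 gzip each rolled file
appender.DRFA.filePattern = ${sys:hive.log.dir}/${sys:hive.log.file}.%d{yyyy-MM-dd}-%i.gz
appender.DRFA.layout.type = PatternLayout
appender.DRFA.layout.pattern = %d{ISO8601} %5p [%t] %c{2}: %m%n
appender.DRFA.policies.type = Policies
appender.DRFA.policies.time.type = TimeBasedTriggeringPolicy
appender.DRFA.policies.time.interval = 1
appender.DRFA.policies.size.type = SizeBasedTriggeringPolicy
appender.DRFA.policies.size.size = 256MB
```

Only the filePattern line needs to change to enable compression; the triggering policies control when the rollover itself happens.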
12-01-2017
06:06 PM
2 Kudos
When running a custom Java application that connects to Hive via JDBC, after migration to HDP-2.6.x the application fails to start with a NoClassDefFoundError or ClassNotFoundException for a Hive class, like: Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hive/service/cli/thrift/TCLIService$Iface
at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105)
at java.sql.DriverManager.getConnection(DriverManager.java:664)
at java.sql.DriverManager.getConnection(DriverManager.java:270)
Root Cause
Prior to HDP-2.6.x, hive-jdbc.jar was a symlink pointing to the "standalone" JDBC jar (the one intended for non-Hadoop apps, like a generic app that uses a JDBC driver for DB access), for example in HDP-2.5.0:
/usr/hdp/current/hive-client/lib/hive-jdbc.jar -> hive-jdbc-1.2.1000.2.5.0.0-1245-standalone.jar
From HDP-2.6.x onwards, hive-jdbc.jar points to the "hadoop env" JDBC driver, which has dependencies on many other Hadoop jars, for example in HDP-2.6.2:
/usr/hdp/current/hive-client/lib/hive-jdbc.jar -> hive-jdbc-1.2.1000.2.6.2.0-205.jar
or in HDP-2.6.3:
/usr/hdp/current/hive-client/lib/hive-jdbc.jar -> hive-jdbc-1.2.1000.2.6.3.0-235.jar
Does this mean the HDP stack no longer includes a standalone jar? No. The standalone jar has been moved to this path:
/usr/hdp/current/hive-client/jdbc
Two ways to solve this:
1. Change the custom Java application's classpath to use the hive-jdbc-*-standalone.jar explicitly. As noted above, the standalone jar is now available in a different path. For example, in HDP-2.6.2:
/usr/hdp/current/hive-client/jdbc/hive-jdbc-1.2.1000.2.6.2.0-205-standalone.jar
In HDP-2.6.3:
/usr/hdp/current/hive-client/jdbc/hive-jdbc-1.2.1000.2.6.3.0-235-standalone.jar
2. Add the following to the HADOOP_CLASSPATH of the custom Java application if it uses other Hadoop components/jars:
/usr/hdp/current/hive-client/lib/hive-metastore-*.jar:/usr/hdp/current/hive-client/lib/hive-common-*.jar:/usr/hdp/current/hive-client/lib/hive-cli-*.jar:/usr/hdp/current/hive-client/lib/hive-exec-*.jar:/usr/hdp/current/hive-client/lib/hive-service.jar:/usr/hdp/current/hive-client/lib/libfb303-*.jar:/usr/hdp/current/hive-client/lib/libthrift-*.jar:/usr/hdp/current/hadoop-client/lib/log4j*.jar:/usr/hdp/current/hadoop-client/lib/slf4j-api-*.jar:/usr/hdp/current/hadoop-client/lib/slf4j-log4j12-*.jar:/usr/hdp/current/hadoop-client/lib/commons-logging-*.jar
11-16-2017
03:58 PM
2 Kudos
Description
During HDP upgrade, the Hive Metastore restart step fails with the message "ValueError: time data '2017-05-10 19:08:30' does not match format '%Y-%m-%d %H:%M:%S.%f'". The stack trace follows:
Traceback (most recent call last):
  File "/var/lib/ambari-agent/cache/common-services/HIVE/0.12.0.2.0/package/scripts/hive_metastore.py", line 211, in <module>
    HiveMetastore().execute()
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 329, in execute
    method(env)
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 841, in restart
    self.pre_upgrade_restart(env, upgrade_type=upgrade_type)
  File "/var/lib/ambari-agent/cache/common-services/HIVE/0.12.0.2.0/package/scripts/hive_metastore.py", line 118, in pre_upgrade_restart
    self.upgrade_schema(env)
  File "/var/lib/ambari-agent/cache/common-services/HIVE/0.12.0.2.0/package/scripts/hive_metastore.py", line 150, in upgrade_schema
    status_params.tmp_dir)
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/security_commons.py", line 242, in cached_kinit_executor
    if (now - datetime.strptime(last_run_time, "%Y-%m-%d %H:%M:%S.%f") > timedelta(minutes=expiration_time)):
  File "/usr/lib64/python2.6/_strptime.py", line 325, in _strptime
    (data_string, format))
ValueError: time data '2017-05-10 19:08:30' does not match format '%Y-%m-%d %H:%M:%S.%f'
Root cause
During the upgrade, data is read from a file such as *_tmp.txt under the /var/lib/ambari-agent/tmp/kinit_executor_cache directory. This issue occurs when that file has not been updated and points to an older date.
Solution
1. Log in to the Hive Metastore host.
2. Move the *_tmp.txt files:
mv /var/lib/ambari-agent/tmp/kinit_executor_cache/*_tmp.txt /tmp
3. Retry the Restart Hive Metastore step from the Ambari Upgrade screen.
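The root cause can be reproduced outside Ambari: the %f directive requires a fractional-seconds component, so a timestamp written without one raises exactly this ValueError. A standalone Python sketch:

```python
from datetime import datetime

# A timestamp written without fractional seconds...
last_run_time = "2017-05-10 19:08:30"

# ...does not match a format string that demands them via "%f".
try:
    datetime.strptime(last_run_time, "%Y-%m-%d %H:%M:%S.%f")
except ValueError as e:
    print(e)  # prints the "does not match format" message

# The same value parses fine once the fractional part is present.
ok = datetime.strptime("2017-05-10 19:08:30.123456", "%Y-%m-%d %H:%M:%S.%f")
print(ok.microsecond)  # 123456
```

This is why a stale cache file containing a timestamp in the old format trips up cached_kinit_executor until the file is moved aside.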