Member since: 03-22-2019
Posts: 46
Kudos Received: 8
Solutions: 3
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 5091 | 07-20-2016 07:28 PM
 | 1034 | 07-16-2016 07:19 PM
 | 1001 | 06-30-2016 04:54 AM
12-06-2018 09:39 PM
Introduction

The performance of a Hadoop cluster can be impacted by the OS partitioning. This document describes the best practices for setting up the “/var” folder/partition with an optimum size. Let's approach this problem by asking some important questions:

1. What is “/var” used for?
2. How can the “/var” folder run out of disk space?
3. What common issues can be expected on a Hadoop cluster if “/var” is out of disk space?
4. How is “/var” currently set up in my cluster?

Question 1 - What is “/var” used for?

From an OS perspective, “/var” is commonly used for constantly changing, i.e. variable, files; “var” is short for “variable”. Examples of such files include log files, mail, transient files, printer spools, temporary files, and cached data. For example, “/var/tmp” holds temporary files between system reboots.

On any node (Hadoop or non-Hadoop), the /var directory holds content for a number of applications. It is also used to store downloaded update packages on a temporary basis: the PackageKit update software downloads updated packages to /var/cache/yum/ by default, so the /var partition should be large enough to hold package updates. Another example of an application that uses /var is MySQL, which by default uses “/var/lib/mysql” as its data directory location.

Question 2 - How can the /var folder run out of disk space?

/var is much more susceptible to filling up, whether by accident or by attack. Some of the directories that can be affected by this are “/var/log”, “/var/tmp”, and “/var/crash”. If there is a serious OS issue, logging can increase tremendously. If the disk space is set too low, e.g. 10 GB, this excessive logging can fill up the disk space for /var.
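As an illustration, a quick way to see what is consuming space under /var (standard Linux commands; the output will vary per system):

# Overall usage of the filesystem holding /var
df -h /var

# Largest direct subdirectories of /var, staying on one filesystem (-x)
du -xh --max-depth=1 /var | sort -rh | head -10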
Question 3 - What common issues can be expected on a Hadoop cluster if “/var” is out of disk space?

/var has been seen to be easily filled by a (possibly misbehaving) application, and if it is not separate from /, filling / can cause a kernel panic. The “/var” folder contains some very important file/folder locations which are used by default by many kernel and OS applications. For example:

- “/var/run” is used by all running processes to keep their PIDs and system information. If “/var” is full due to a low disk space configuration, applications will fail to run.
- “/var/lock” contains the locks that running applications hold on the files/devices they have locked. If the disk space runs out, locking is not possible and existing/new applications will fail.
- “/var/lib” holds all the dynamic data, libraries, and files for applications. If there is no device space left, applications will fail to work.

“/var” is very important from a Hadoop perspective for keeping all the services running. Running out of disk space on “/var” can cause Hadoop and dependent services to fail to run on that node.
Question 4 - How is “/var” currently set up in my cluster?

Check the following (a quick audit sketch follows below):

- Are the Hadoop logs separated from the “/var” folder location?
- Are huge logs, or a huge number of OS logs, still located under “/var”, for example “/var/log/messages” and “/var/crash”?
- If kdump is configured to capture crashdump logs, the risk increases, since these logs usually have huge file sizes, sometimes 100 GB or more. The default kdump configuration uses the directory location “/var/crash”.
- These days, the size of physical memory can easily be 500 GB or 1 TB, which can produce kdump logs of a correspondingly huge size (note: kdump logs can be compressed).

The size of “/var” therefore plays an important role: a “/var” of only 10 GB or 50 GB leaves /var/crash too small for saving the crashdump. If there is an OS crash (kernel panic etc.), the crashdump will never be captured completely, and without the complete crashdump logs there can never be a complete analysis of the cause of the kernel crash.
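As an illustration, a minimal audit of the current setup could look like this (standard Linux commands; /etc/kdump.conf is the usual RHEL/CentOS location, adjust for your distribution):

# Is /var a separate mount point, and how large is it?
df -h / /var

# Where is kdump configured to write its dumps? (default: path /var/crash)
grep -E '^(path|nfs|ssh)' /etc/kdump.conf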
Answer - Recommendations on the optimum setup of “/var”

- Increase the size of “/var” to at least 50 GB on all the nodes, and keep the size uniform across the cluster.
- Change the log location for kdump. The existing location is “/var/crash”. kdump can be configured to put the logs on another local disk with a size of around 300 - 500 GB or, as a best measure, the dump can be written over the network to a remote disk.
- /var should by default be separated from the root partition. Depending on requirements, “/var/log” and “/var/log/audit” can also be created as separate partitions.
- /var should be mounted on an LVM disk to allow the size to be increased with ease if required.
- All the Hadoop service logs should be separated from /var. The Hadoop logs ideally should be placed on a separate disk, used only for logs (from Hadoop and dependent applications like MySQL) and nothing else. This log location should never be shared with the core Hadoop service directory locations (HDFS, YARN, ZooKeeper). One way to achieve this is to create a symlink from "/var/<hadoop_logs>" to separate LVM disks, as sketched below.
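As an illustration, the kdump relocation and the log-disk symlink could be done as follows. This is a sketch, not a definitive procedure: the volume group “vg_data”, the logical volume “lv_hadooplogs”, the mount point “/hadoop/logs”, and the “hadoop” log directory name are assumptions to adapt to your environment.

# --- Relocate kdump dumps: edit /etc/kdump.conf ---
# path /var/crash              <- default; point it at a bigger local disk instead:
# path /data/crash
# ...or dump over the network to a remote NFS export:
# nfs nfs-server:/export/crash

# --- Carve out an LVM-backed log disk and symlink the Hadoop logs onto it ---
lvcreate -n lv_hadooplogs -L 300G vg_data        # assumes an existing volume group vg_data
mkfs.xfs /dev/vg_data/lv_hadooplogs
mkdir -p /hadoop/logs
mount /dev/vg_data/lv_hadooplogs /hadoop/logs    # add an /etc/fstab entry for persistence

# Move the existing logs and replace the directory with a symlink
# ("hadoop" is a placeholder for the actual service log directory)
mv /var/log/hadoop /hadoop/logs/hadoop
ln -s /hadoop/logs/hadoop /var/log/hadoop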
09-21-2017 03:36 PM
HiveServer2 and the Hive Metastore can be configured to capture GC logs with a timestamp in the file name. This is useful in a production cluster, where a timestamp on the log file adds clarity and also avoids overwriting. Navigate in Ambari as follows: Ambari UI > Hive > Configs > Advanced hive-env > hive-env template. Add the following:

if [ "$SERVICE" = "metastore" ]; then
export HADOOP_HEAPSIZE={{hive_metastore_heapsize}} # Setting for HiveMetastore
else
export HADOOP_HEAPSIZE={{hive_heapsize}} # Setting for HiveServer2 and Client
fi
export HADOOP_CLIENT_OPTS="-Xmx${HADOOP_HEAPSIZE}m -Xloggc:/var/log/hive/gc.log-$SERVICE-`date +'%Y%m%d%H%M'`
-XX:ErrorFile=/var/log/hive/hive-metastore-error.log-`date +'%Y%m%d%H%M'` -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps $HADOOP_CLIENT_OPTS"
if [ "$SERVICE" = "hiveserver2" ]; then
export HADOOP_CLIENT_OPTS="-Xmx${HADOOP_HEAPSIZE}m -Xloggc:/var/log/hive/gc.log-$SERVICE-`date +'%Y%m%d%H%M'`
-XX:ErrorFile=/var/log/hive/hive-server2-error.log-`date +'%Y%m%d%H%M'` -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps $HADOOP_CLIENT_OPTS"
fi
11-19-2018 01:28 AM
The script seems wrong. With this configuration, the lines for "hiveserver2" will never take effect: the earlier unconditional "export HADOOP_CLIENT_OPTS=" overwrites them all via the "$HADOOP_CLIENT_OPTS" appended at the end. The ErrorFile name will become "hive-metastore-error.log-`date +'%Y%m%d%H%M'`", for example. The script should be structured as: if [ "$SERVICE" = "metastore" ]; then .... elif [ "$SERVICE" = "hiveserver2" ]; then .... fi
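For reference, a corrected version of the template along those lines might look like this (a sketch based on the settings in the post above; the heap variables and log paths are unchanged):

if [ "$SERVICE" = "metastore" ]; then
  export HADOOP_HEAPSIZE={{hive_metastore_heapsize}} # Setting for HiveMetastore
  export HADOOP_CLIENT_OPTS="-Xmx${HADOOP_HEAPSIZE}m -Xloggc:/var/log/hive/gc.log-$SERVICE-`date +'%Y%m%d%H%M'` -XX:ErrorFile=/var/log/hive/hive-metastore-error.log-`date +'%Y%m%d%H%M'` -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps $HADOOP_CLIENT_OPTS"
elif [ "$SERVICE" = "hiveserver2" ]; then
  export HADOOP_HEAPSIZE={{hive_heapsize}} # Setting for HiveServer2
  export HADOOP_CLIENT_OPTS="-Xmx${HADOOP_HEAPSIZE}m -Xloggc:/var/log/hive/gc.log-$SERVICE-`date +'%Y%m%d%H%M'` -XX:ErrorFile=/var/log/hive/hive-server2-error.log-`date +'%Y%m%d%H%M'` -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps $HADOOP_CLIENT_OPTS"
else
  export HADOOP_HEAPSIZE={{hive_heapsize}} # Setting for the Hive client
fi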
07-06-2017 01:18 PM
Great question, and unfortunately I don't think there is a well-agreed-upon formula/calculator out there, as "it depends" is so often the rule. Some considerations: the DataNode doesn't really know about the directory structure; it just stores (and copies, deletes, etc.) blocks as directed by the NameNode (often indirectly, since clients write the actual blocks). Additionally, the checksums at the block level are stored on disk alongside the files for the data contained in a given block. There is some good info in the following HCC questions that might be of help to you: https://community.hortonworks.com/questions/64677/datanode-heapsize-computation.html https://community.hortonworks.com/questions/45381/do-i-need-to-tune-java-heap-size.html https://community.hortonworks.com/questions/78981/data-node-heap-size-warning.html Good luck and happy Hadooping!
04-20-2017 08:57 PM
Note that the <strong> and </strong> strings in the code block above should be removed, since they are HTML formatting commands that somehow became visible in the formatted text of the code block.
06-29-2016 07:34 PM
@Aman Mundra There are not many details about why the NameNode is failing to start. Can you share the NameNode log from an attempt to start the NameNode service? It will help to identify what is causing the NameNode to fail to start. Meanwhile, you can test whether the NameNode can be started manually from the command line. Run the following:

/usr/hdp/current/hadoop-client/sbin/hadoop-daemon.sh --config /usr/hdp/current/hadoop-client/conf start namenode

or

/var/lib/ambari-agent/ambari-sudo.sh su hdfs -l -s /bin/bash -c 'ulimit -c unlimited ; /usr/hdp/current/hadoop-client/sbin/hadoop-daemon.sh --config /usr/hdp/current/hadoop-client/conf start namenode'
08-11-2016 12:52 PM
Hi @Arpit Agarwal,
That is my understanding as well. Thanks for a short and to-the-point answer.
11-03-2016 04:48 AM
@Saurabh Try: set hive.exec.scratchdir=/new_dir