04-07-2016
08:50 PM
9 Kudos
As Hadoop admins, it's our responsibility to perform Hadoop cluster maintenance frequently. Let's see what we can do to keep our big elephant happy! 😉

1. FileSystem Checks

We should check the health of HDFS periodically by running the fsck command:

sudo -u hdfs hadoop fsck /

This command contacts the NameNode and recursively checks each file under the provided path. Below is sample output of the fsck command:

sudo -u hdfs hadoop fsck /
FSCK started by hdfs (auth:SIMPLE) from /10.0.2.15 for path / at Wed Apr 06 18:47:37 UTC 2016
Total size: 1842803118 B
Total dirs: 4612
Total files: 11123
Total symlinks: 0 (Files currently being written: 4)
Total blocks (validated): 11109 (avg. block size 165883 B) (Total open file blocks (not validated): 1)
Minimally replicated blocks: 11109 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 11109 (100.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 1.0
Corrupt blocks: 0
Missing replicas: 22232 (66.680664 %)
Number of data-nodes: 1
Number of racks: 1
FSCK ended at Wed Apr 06 18:46:54 UTC 2016 in 1126 milliseconds
The filesystem under path '/' is HEALTHY

We can schedule a weekly cron job on an edge node that runs fsck and emails the output to the Hadoop admin, as in the sketch below.
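A minimal sketch of such a job (the script path, report location, and admin address are assumptions to adapt for your cluster):

#!/bin/bash
# weekly_fsck.sh - run fsck and mail the report to the Hadoop admin
REPORT=/tmp/fsck_report_$(date +%F).txt
sudo -u hdfs hadoop fsck / > "$REPORT" 2>&1
mail -s "Weekly HDFS fsck report" admin@example.com < "$REPORT"

A crontab entry such as 0 1 * * 0 /opt/scripts/weekly_fsck.sh then runs it every Sunday at 01:00.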
2. HDFS Balancer utility

Over time, data becomes unbalanced across the DataNodes in the cluster. This can happen because of maintenance activity on specific DataNodes, power failures, hardware failures, kernel panics, unexpected reboots, etc. Because of data locality, the DataNodes holding more data get churned the most, and an unbalanced cluster can directly affect your MapReduce job performance. You can use the command below to run the HDFS balancer:

sudo -u hdfs hdfs balancer -threshold <threshold-value>

The default threshold value is 10 (percent); it can be reduced to as low as 1, and a lower threshold yields a more evenly balanced cluster. Sample output:

[root@sandbox ~]# sudo -u hdfs hdfs balancer -threshold 1
16/04/06 18:57:16 INFO balancer.Balancer: Using a threshold of 1.0
16/04/06 18:57:16 INFO balancer.Balancer: namenodes = [hdfs://sandbox.hortonworks.com:8020]
16/04/06 18:57:16 INFO balancer.Balancer: parameters = Balancer.Parameters [BalancingPolicy.Node, threshold = 1.0, max idle iteration = 5, #excluded nodes = 0, #included nodes = 0, #source nodes = 0, run during upgrade = false]
16/04/06 18:57:16 INFO balancer.Balancer: included nodes = []
16/04/06 18:57:16 INFO balancer.Balancer: excluded nodes = []
16/04/06 18:57:16 INFO balancer.Balancer: source nodes = []
Time Stamp Iteration# Bytes Already Moved Bytes Left To Move Bytes Being Moved
16/04/06 18:57:17 INFO balancer.KeyManager: Block token params received from NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec
16/04/06 18:57:17 INFO block.BlockTokenSecretManager: Setting block keys
16/04/06 18:57:17 INFO balancer.KeyManager: Update block keys every 2hrs, 30mins, 0sec
16/04/06 18:57:17 INFO balancer.Balancer: dfs.balancer.movedWinWidth = 5400000 (default=5400000)
16/04/06 18:57:17 INFO balancer.Balancer: dfs.balancer.moverThreads = 1000 (default=1000)
16/04/06 18:57:17 INFO balancer.Balancer: dfs.balancer.dispatcherThreads = 200 (default=200)
16/04/06 18:57:17 INFO balancer.Balancer: dfs.datanode.balance.max.concurrent.moves = 5 (default=5)
16/04/06 18:57:17 INFO balancer.Balancer: dfs.balancer.getBlocks.size = 2147483648 (default=2147483648)
16/04/06 18:57:17 INFO balancer.Balancer: dfs.balancer.getBlocks.min-block-size = 10485760 (default=10485760)
16/04/06 18:57:17 INFO block.BlockTokenSecretManager: Setting block keys
16/04/06 18:57:17 INFO balancer.Balancer: dfs.balancer.max-size-to-move = 10737418240 (default=10737418240)
16/04/06 18:57:17 INFO balancer.Balancer: dfs.blocksize = 134217728 (default=134217728)
16/04/06 18:57:17 INFO net.NetworkTopology: Adding a new node: /default-rack/10.0.2.15:50010
16/04/06 18:57:17 INFO balancer.Balancer: 0 over-utilized: []
16/04/06 18:57:17 INFO balancer.Balancer: 0 underutilized: []
The cluster is balanced. Exiting...
Apr 6, 2016 6:57:17 PM 0 0 B 0 B -1 B
Apr 6, 2016 6:57:17 PM Balancing took 1.383 seconds

We can schedule a weekly cron job on an edge node that runs the balancer and emails the results to the Hadoop admin, along the same lines as the fsck job above.

3. Adding new nodes to the cluster

We should always maintain a list of the DataNodes that are authorized to communicate with the NameNode; this can be achieved by setting the dfs.hosts property in hdfs-site.xml:

<property>
<name>dfs.hosts</name>
<value>/etc/hadoop/conf/allowed-datanodes.txt</value>
</property>

Here /etc/hadoop/conf/allowed-datanodes.txt simply lists one authorized hostname or IP address per line. If we don't set this property, then any machine that has a DataNode installed and a copy of hdfs-site.xml can contact the NameNode and become part of the Hadoop cluster.

3.1 For NodeManagers

We can add the property below in yarn-site.xml:

<property>
<name>yarn.resourcemanager.nodes.include-path</name>
<value>/etc/hadoop/conf/allowed-nodemanagers.txt</value>
</property>

4. Decommissioning a node from the cluster

Even though HDFS is fault tolerant, it's a bad idea to stop one or more DataNode daemons, even gracefully. The better approach is to add the IP address of the DataNode machine that we need to remove from the cluster to the exclude file maintained by the dfs.hosts.exclude property, and then run the command below:

sudo -u hdfs hdfs dfsadmin -refreshNodes

After this, the NameNode starts replicating all of that node's blocks to the other DataNodes in the cluster; once the decommission process is complete, it's safe to shut down the DataNode daemon. You can track the progress of the decommission process in the NameNode web UI.

4.1 For YARN

Add the IP address of the NodeManager machine to the file maintained by the yarn.resourcemanager.nodes.exclude-path property and run the command below:

sudo -u yarn yarn rmadmin -refreshNodes

5. DataNode volume failures

The NameNode web UI shows information about DataNode volume failures. We should check this information periodically, or set up automated monitoring using Nagios, Ambari Metrics (if you are using the Hortonworks Hadoop distribution), JMX monitoring (http://<namenode-host>:50070/jmx), etc. Multiple disk failures on a single DataNode can cause the DataNode daemon to shut down (please check the dfs.datanode.failed.volumes.tolerated property and set it accordingly in hdfs-site.xml).

6. Database backups

If you have multiple Hadoop ecosystem components installed, you should schedule a backup script to take database dumps, e.g. of the:

1. Hive metastore database
2. Oozie DB
3. Ambari DB
4. Ranger DB

Create a simple shell script with the backup commands and schedule it on a weekend, with logic to send an email once the backups are done.

7. HDFS metadata backup

The fsimage holds the metadata of your Hadoop filesystem; if it gets corrupted for some reason, your cluster is unusable, so it's very important to keep periodic backups of it. You can schedule a shell script with the command below to back up the fsimage:

hdfs dfsadmin -fetchImage fsimage.backup.ddmmyyyy

8. Purging older log files

On production clusters, if we don't clean up older Hadoop log files, they can eat the entire disk and daemons can crash with a "no space left on device" error. Always purge older log files via a cleanup script! A sketch that combines sections 6-8 into one weekly job follows.
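A rough sketch of such a job, assuming a MySQL-backed Hive metastore and Oozie; every path, database name, credential, and retention window below is an assumption to adapt:

#!/bin/bash
# weekly_maintenance.sh - illustrative sketch combining DB dumps,
# fsimage backup, and log purging; all values below are assumptions.
# DB_PASS is expected to be exported in the environment.
BACKUP_DIR=/backup/hadoop/$(date +%d%m%Y)
mkdir -p "$BACKUP_DIR"

# 6. Database dumps (example: MySQL-backed Hive metastore and Oozie DB)
mysqldump -u backup_user -p"$DB_PASS" hive > "$BACKUP_DIR/hive_metastore.sql"
mysqldump -u backup_user -p"$DB_PASS" oozie > "$BACKUP_DIR/oozie.sql"

# 7. HDFS metadata backup: fetch the latest fsimage from the NameNode
sudo -u hdfs hdfs dfsadmin -fetchImage "$BACKUP_DIR/fsimage.backup.$(date +%d%m%Y)"

# 8. Purge Hadoop logs older than 30 days (adjust the path and retention)
find /var/log/hadoop/ -name "*.log*" -mtime +30 -delete

# Email the admin once everything is done
echo "Maintenance backups written to $BACKUP_DIR" | mail -s "Hadoop weekly maintenance" admin@example.com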
Please comment if you have any feedback/questions/suggestions. Happy Hadooping!! 🙂
04-01-2016
04:23 PM
11 Kudos
This article explains how to re-run only the failed actions of an Oozie workflow. Below are the steps:

1. Find the workflow ID of the failed/killed job.

2. Prepare a job configuration file to pass to the rerun command, as follows:

2.1 First fetch the Oozie job configuration XML. The easiest way is to use the -configcontent option of the oozie job command, e.g. on the command line:

export OOZIE_URL="http://<oozie-host>:11000/oozie"
oozie job -configcontent <workflow-id> > job_conf.xml

2.2 Delete the oozie.coord.application.path property from job_conf.xml. This avoids the "E0301: Invalid resource" rerun error.

2.3 Now add the property below to job_conf.xml. It determines which actions are re-run in the workflow: any action nodes listed here are skipped, and if nothing is specified, all actions of the workflow are re-run.
To run all actions of a workflow:

<property>
<name>oozie.wf.rerun.skip.nodes</name>
<value>,</value>
</property>
To skip a few actions of a workflow (all the action nodes specified here will be skipped and the rest will be run):

<property>
<name>oozie.wf.rerun.skip.nodes</name>
<value>action-name1,action-name2,etc.</value>
</property>

3. Re-run the workflow with the command below:

oozie job -config "job_conf.xml" -rerun <wf-id>
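End to end, a rerun looks like the transcript below (the workflow ID 0000123-160401123456789-oozie-oozi-W is a made-up placeholder):

export OOZIE_URL="http://<oozie-host>:11000/oozie"

# 1. fetch the original configuration of the failed workflow
oozie job -configcontent 0000123-160401123456789-oozie-oozi-W > job_conf.xml

# 2. edit job_conf.xml: remove oozie.coord.application.path and add
#    oozie.wf.rerun.skip.nodes as described above

# 3. rerun the workflow
oozie job -config job_conf.xml -rerun 0000123-160401123456789-oozie-oozi-W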
02-15-2016
12:51 PM
11 Kudos
Step by step guide - Shell action in an Oozie workflow via Hue

Step 1: Create a sample shell script and upload it to HDFS:
[root@sandbox shell]# cat ~/sample.sh
#!/bin/bash
echo "`date` hi" > /tmp/output
hadoop fs -put sample.sh /user/hue/oozie/workspaces/
[root@sandbox shell]# hadoop fs -ls /user/hue/oozie/workspaces/
-rw-r--r-- 3 root hdfs 44 2016-02-15 10:26 /user/hue/oozie/workspaces/sample.sh

Step 2: Log in to the Hue web UI and select Oozie Editor/Dashboard
Step 3: Go to the "Workflows" tab and click the "Create" button
Step 4: Fill in the required details and click the Save button
Step 5: Drag a shell action between the start and end nodes
Step 6: Fill in the required details about the shell action and click the "Done" button
Step 7: Submit your workflow

You will see that your job is in progress. (Screenshots: Output 1, Output 2.)
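For reference, the workflow that Hue builds behind the scenes looks roughly like the sketch below (a hand-written approximation, not Hue's exact output; the workflow name and node names are placeholders):

<workflow-app name="ShellWorkflow" xmlns="uri:oozie:workflow:0.4">
    <start to="shellAction"/>
    <action name="shellAction">
        <shell xmlns="uri:oozie:shell-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <exec>sample.sh</exec>
            <file>/user/hue/oozie/workspaces/sample.sh#sample.sh</file>
        </shell>
        <ok to="end"/>
        <error to="killAction"/>
    </action>
    <kill name="killAction">
        <message>Shell action failed</message>
    </kill>
    <end name="end"/>
</workflow-app>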
01-23-2016
11:30 AM
6 Kudos
I noticed that using Ambari's experimental functionality we can do a lot of interesting stuff.
For example, suppose a rolling/express upgrade is in progress and a service check fails; to recover, we need to update some configuration parameters and restart that particular service. During an upgrade, Ambari will not allow us to modify any configuration property, so we can either use the configs.sh script, or enable the "opsDuringRollingUpgrade" option on the Ambari experimental page (typically reachable at http://<ambari-host>:8080/#/experimental) 🙂 Please feel free to add more information about Ambari experimental functionality here!
12-21-2015
06:03 AM
8 Kudos
This article is an update of http://hadooped.blogspot.com/2013/10/apache-oozie-part-13-oozie-ssh-action_30.html, authored by @Anagha Khanolkar. Below are the steps to set up an Oozie workflow using the ssh-action:

Step 1. Create job.properties. Example:

#*************************************************
# job.properties
#*************************************************
nameNode=hdfs://<namenode-machine-fqdn>:8020
jobTracker=<resource-manager-fqdn>:8050
queueName=default
oozie.libpath=${nameNode}/user/oozie/share/lib
oozie.use.system.libpath=true
oozie.wf.rerun.failnodes=true
oozieProjectRoot=${nameNode}/user/${user.name}
appPath=${oozieProjectRoot}
oozie.wf.application.path=${appPath}
inputDir=${oozieProjectRoot}
focusNodeLogin=<username>@<remote-host-where-you-have-your-shell-script(s)>
shellScriptPath=~/uploadFile.sh
emailToAddress=<email-id>
Step 2. Write workflow.xml. Example:

<!--******************************************-->
<!--workflow.xml -->
<!--******************************************-->
<workflow-app name="WorkFlowForSshAction" xmlns="uri:oozie:workflow:0.1">
<start to="sshAction"/>
<action name="sshAction">
<ssh xmlns="uri:oozie:ssh-action:0.1">
<host>${focusNodeLogin}</host>
<command>${shellScriptPath}</command>
<capture-output/>
</ssh>
<ok to="sendEmail"/>
<error to="killAction"/>
</action>
<action name="sendEmail">
<email xmlns="uri:oozie:email-action:0.1">
<to>${emailToAddress}</to>
<subject>Output of workflow ${wf:id()}</subject>
<body>Status of the file move: ${wf:actionData('sshAction')['STATUS']}</body>
</email>
<ok to="end"/>
<error to="end"/>
</action>
<kill name="killAction">
<message>"Killed job due to error"</message>
</kill>
<end name="end"/>
</workflow-app>
Step 3. Write a sample uploadFile.sh script. Example:

#!/bin/bash
hadoop fs -put ~/test /user/<username>/uploadedbyoozie
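Note that the email body in workflow.xml reads ${wf:actionData('sshAction')['STATUS']}, which only resolves if the script prints a STATUS=... line to stdout for <capture-output/> to pick up. A variant of uploadFile.sh that does so (the exact status values are assumptions):

#!/bin/bash
hadoop fs -put ~/test /user/<username>/uploadedbyoozie
# emit a key=value pair for <capture-output/>; the email action reads STATUS
if [ $? -eq 0 ]; then echo "STATUS=success"; else echo "STATUS=failed"; fi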
Step 4. Upload workflow.xml to the ${appPath} defined in job.properties.
Step 5. Log in to the Oozie host as the "oozie" user.
Step 6. Generate a key pair (if you don't have one already) using the ssh-keygen command.
Step 7. On the Oozie host, copy ~/.ssh/id_rsa.pub and paste it into the ~/.ssh/authorized_keys file on <remote-host> (the focus node).
Step 8. Test password-less ssh from oozie@<oozie-host> to <username>@<remote-host>.
Step 9. If Step 8 succeeds, go ahead and run the Oozie job; it should complete without error. (A condensed command sketch follows the note below.)

Note - In order to get password-less ssh working, please make sure that:
1. You have 700 permissions on the ~/.ssh directory
2. 600 permissions on the ~/.ssh/authorized_keys file on the remote host
3. 600 on ~/.ssh/id_rsa
4. 644 on ~/.ssh/id_rsa.pub
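Steps 5-9 condensed into commands (a sketch; ssh-copy-id is a convenience that automates the manual copy/paste in Step 7, and the host and user names are placeholders):

# On the Oozie host, as the oozie user:
ssh-keygen -t rsa                         # Step 6: accept the defaults
ssh-copy-id <username>@<remote-host>      # Step 7: append the public key to remote authorized_keys
ssh <username>@<remote-host> hostname     # Step 8: must succeed without a password prompt

# Step 9: submit the workflow
oozie job -oozie http://<oozie-host>:11000/oozie -config job.properties -run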
11-27-2015
05:03 PM
Also, if we are using MapReduce then we might need to revisit the mapper/reducer containers' heap sizes accordingly.
11-24-2015
04:09 AM
Great! I believe a restart of the Ambari and HDFS services is needed if we delete configs using the configs.sh script.