09-07-2017
01:58 PM
Edit the following core Hadoop configuration files to set up the cluster:
• hadoop-env.sh
• core-site.xml
• hdfs-site.xml
• mapred-site.xml
• yarn-site.xml
• masters
• slaves
The HADOOP_HOME directory (the extracted directory, e.g. hadoop-2.6.0-cdh5.5.1, is referred to as HADOOP_HOME) contains all the libraries, scripts, configuration files, etc.
hadoop-env.sh
1. This file specifies the environment variables that affect the JDK used by the Hadoop daemons (bin/hadoop). As the Hadoop framework is written in Java and uses the Java Runtime Environment, one of the important environment variables for the Hadoop daemons is $JAVA_HOME in hadoop-env.sh.
2. This variable points the Hadoop daemons to the Java installation on the system.
Actual: export JAVA_HOME=<path-to-the-root-of-your-Java-installation>
Change: export JAVA_HOME=/usr/lib/jvm/java-8-oracle
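Before making that change, it is worth confirming where Java actually lives on your machine; /usr/lib/jvm/java-8-oracle is only an example (an Oracle JDK 8 install on Ubuntu) and may differ on your system. A quick check:
Hdata@ubuntu:~$ readlink -f $(which java)
Hdata@ubuntu:~$ ls /usr/lib/jvm/
readlink resolves the java binary to its real location; JAVA_HOME should point to the installation root, i.e. the directory above bin/java (or jre/bin/java on older JDK layouts).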
core-site.xml
3. This file informs the Hadoop daemons where the NameNode runs in the cluster. It contains the configuration settings for Hadoop core, such as I/O settings that are common to HDFS and MapReduce.
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/dataflair/hadmin</value>
</property>
The location of the NameNode is specified by the fs.defaultFS property; here the NameNode runs on localhost at port 9000.
The hadoop.tmp.dir property specifies the location where temporary as well as permanent Hadoop data will be stored. "/home/dataflair/hadmin" is my location; you need to specify a location where you have read/write privileges (see the sketch below).
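A minimal sketch of preparing that directory (the path follows the example above; if it is not under your own home directory you may additionally need sudo and a chown to your user):
Hdata@ubuntu:~$ mkdir -p /home/dataflair/hadmin
Hdata@ubuntu:~$ chmod 750 /home/dataflair/hadmin
mkdir -p creates the directory and any missing parents, and chmod ensures the owner has read/write access to it.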
hdfs-site.xml
Next we need to make changes in the Hadoop configuration file hdfs-site.xml (which is located in HADOOP_HOME/etc/hadoop) by executing the command below:
Hdata@ubuntu:~/hadoop-2.6.0-cdh5.5.1/etc/hadoop$ nano hdfs-site.xml
Replication factor
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
The replication factor is specified by the dfs.replication property; as this is a single-node cluster, we set the replication to 1.
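To confirm the value is picked up from the file, the hdfs getconf utility can read a key straight from the configuration (a quick sanity check, assuming the Hadoop bin directory is on your PATH):
Hdata@ubuntu:~$ hdfs getconf -confKey dfs.replication
This should print 1 for the setting above.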
mapred-site.xml
Next we need to make changes in the Hadoop configuration file mapred-site.xml (which is also located in HADOOP_HOME/etc/hadoop).
Note: In order to edit the mapred-site.xml file we first need to create a copy of the file mapred-site.xml.template. A copy of this file can be created using the following command:
Hdata@ubuntu:~/hadoop-2.6.0-cdh5.5.1/etc/hadoop$ cp mapred-site.xml.template mapred-site.xml
We will now edit the mapred-site.xml file using the following command:
Hdata@ubuntu:~/hadoop-2.6.0-cdh5.5.1/etc/hadoop$ nano mapred-site.xml
Changes
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
The mapreduce.framework.name property specifies which framework should be used for MapReduce; yarn is used here.
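Note that all of the property snippets in this post must sit inside the <configuration> root element of their respective files; as a sketch, a complete mapred-site.xml for this setup (keeping only the single property discussed above) would look like:
<?xml version="1.0"?>
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>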
yarn-site.xml
Changes
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
The yarn.nodemanager.aux-services property specifies the auxiliary service that needs to run alongside the NodeManager; here shuffling is used as the auxiliary service. The yarn.nodemanager.aux-services.mapreduce_shuffle.class property specifies the class that should be used for shuffling (the key name must match the service name given in yarn.nodemanager.aux-services).
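Once the daemons have been started (starting the cluster itself is outside the scope of this post), a quick sanity check that the NodeManager registered with this configuration is (again assuming the Hadoop bin directory is on your PATH):
Hdata@ubuntu:~$ yarn node -list
The running NodeManager should appear in the list; if the aux-service is misconfigured, MapReduce jobs typically fail later complaining that the mapreduce_shuffle auxiliary service does not exist.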
09-06-2017
11:16 AM
2 Kudos
Disaster Recovery in a Hadoop cluster refers to recovering all or most of your important data stored on the cluster in case of disasters like hardware failures, data loss, or application errors, with minimal or no downtime for the cluster. Disasters can be handled through various techniques:
1) Data loss can be prevented by writing the metadata stored on the NameNode to a different NFS mount. The High Availability feature introduced in recent versions of Hadoop is also a disaster management technique.
2) HDFS snapshots can also be used for recovery.
3) You can enable the Trash feature to guard against accidental deletion, because a deleted file first goes to the trash folder in HDFS.
4) The Hadoop distcp tool can also be used to copy cluster data, building a mirror cluster to fall back on in case of hardware failure. (Command sketches for these follow below.)
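A minimal sketch of what items 2) to 4) look like in practice (the paths, hostnames and ports below are made-up examples):
# allow and take an HDFS snapshot of a directory
hdfs dfsadmin -allowSnapshot /data/important
hdfs dfs -createSnapshot /data/important backup1
# Trash is enabled by setting fs.trash.interval (in minutes) in core-site.xml
# copy data to a mirror cluster with distcp
hadoop distcp hdfs://nn1:8020/data/important hdfs://nn2:8020/data/important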
09-01-2017
06:22 PM
Partitioning of the keys of the intermediate map output is controlled by the Partitioner. By default, a hash function over the key (or a subset of the key) is used to derive the partition. Each mapper's output is partitioned according to the key value, records having the same key go into the same partition (within each mapper), and each partition is then sent to a reducer. The Partitioner class determines which partition a given (key, value) pair will go to. The partition phase takes place after the map phase and before the reduce phase. A MapReduce job takes an input data set and produces a list of key/value pairs as the result of the map phase, in which the input data is split and each map task processes its split and outputs a list of key/value pairs. The output from the map phase is then sent to the reduce tasks, which apply the user-defined reduce function to the map outputs. But before the reduce phase, the map output is partitioned on the basis of the key and sorted (a sketch of the default partitioner follows below). To know more detail about partitioning: Partition in MapReduce
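By default Hadoop uses a hash partitioner; its core logic is essentially the following (a sketch along the lines of org.apache.hadoop.mapreduce.lib.partition.HashPartitioner, shown here for illustration):
import org.apache.hadoop.mapreduce.Partitioner;

public class HashPartitioner<K, V> extends Partitioner<K, V> {
  // Mask off the sign bit so the result is non-negative, then map the
  // key's hash code onto one of the numReduceTasks partitions.
  public int getPartition(K key, V value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
A custom partitioner can be plugged in with job.setPartitionerClass(...) when records with related keys need to end up on the same reducer.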