Support Questions


What are configuration files in Apache Hadoop?

2 Replies

Master Mentor

@Shreya Gupta

core-site.xml & hdfs-site.xml are the important ones.

Hadoop’s Java configuration is driven by two types of important configuration files:

  • Read-only default configuration - core-default.xml, hdfs-default.xml, yarn-default.xml and mapred-default.xml.
  • Site-specific configuration - core-site.xml, hdfs-site.xml, yarn-site.xml and mapred-site.xml.

https://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-common/ClusterSetup.html#Configurin...
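
To see which value a daemon actually ends up with once the site files override the defaults, you can query the effective configuration; a quick sketch using the stock hdfs getconf tool:

hdfs getconf -confKey fs.defaultFS
hdfs getconf -confKey dfs.replication

Each command prints the effective value of the key, whether it comes from the read-only *-default.xml bundled in the jars or from your *-site.xml override.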


Edit the following Core Hadoop Configuration files to set up the cluster:

• hadoop-env.sh
• core-site.xml
• hdfs-site.xml
• mapred-site.xml
• masters
• slaves

The HADOOP_HOME directory (the extracted Hadoop directory, e.g. hadoop-2.6.0-cdh5.5.1, is called HADOOP_HOME) contains all the libraries, scripts, configuration files, etc.
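
You can confirm what lives in the configuration directory by listing it; for example, with the extracted directory from above:

Hdata@ubuntu:~$ ls ~/hadoop-2.6.0-cdh5.5.1/etc/hadoop

This should show hadoop-env.sh, core-site.xml, hdfs-site.xml, mapred-site.xml.template and the rest of the files edited below.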

hadoop-env.sh

1. This file specifies environment variables that affect the JDK used by the Hadoop daemons (bin/hadoop).
As the Hadoop framework is written in Java and uses the Java Runtime Environment, one of the important environment variables for the Hadoop daemons is $JAVA_HOME in hadoop-env.sh.

2. This variable points the Hadoop daemons to the Java installation on the system.
Default: export JAVA_HOME=<path-to-the-root-of-your-Java-installation>
Change to: export JAVA_HOME=/usr/lib/jvm/java-8-oracle
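
If you are not sure of the exact path on your machine, you can derive it from the java binary itself; a small sketch, assuming java is on the PATH:

export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:/bin/java::")

readlink -f follows the /usr/bin/java symlink down to the real JDK directory, and the sed expression strips the trailing /bin/java so only the installation root remains.
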
core-site.xml

3. This file informs the Hadoop daemons where the NameNode runs in the cluster. It contains the configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and MapReduce.

<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/dataflair/Hadmin</value>
</property>

• The location of the NameNode is specified by the fs.defaultFS property; here the NameNode runs on localhost at port 9000.
• The hadoop.tmp.dir property specifies the location where temporary as well as permanent Hadoop data will be stored.
• "/home/dataflair/Hadmin" is my location; here you need to specify a location where you have read/write privileges.
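
Note that these <property> blocks must sit inside the single top-level <configuration> element of core-site.xml, and the data directory should exist before the daemons start; a minimal sketch:

<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/dataflair/Hadmin</value>
</property>
</configuration>

Hdata@ubuntu:~$ mkdir -p /home/dataflair/Hadmin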

hdfs-site.xml

We need to make changes in the Hadoop configuration file hdfs-site.xml (located in HADOOP_HOME/etc/hadoop) by executing the command below:
Hdata@ubuntu:~/hadoop-2.6.0-cdh5.5.1/etc/hadoop$ nano hdfs-site.xml

Replication factor

<property>
<name>dfs.replication</name>
<value>1</value>
</property>

• The replication factor is specified by the dfs.replication property.
• As this is a single-node cluster, we set the replication factor to 1.
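
Once the file is saved you can double-check the value the HDFS tools will use; a quick sketch with the stock getconf command, which should print 1 after the change above:

Hdata@ubuntu:~$ hdfs getconf -confKey dfs.replication

On a multi-node cluster you would typically leave this at the default of 3 from hdfs-default.xml.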

mapred-site.xml
We need to make changes in the Hadoop configuration file mapred-site.xml (located in HADOOP_HOME/etc/hadoop).
Note: In order to edit the mapred-site.xml file we first need to create a copy of the file mapred-site.xml.template. A copy of this file can be created using the following command:
Hdata@ubuntu:~/hadoop-2.6.0-cdh5.5.1/etc/hadoop$ cp mapred-site.xml.template mapred-site.xml
We will now edit the mapred-site.xml file using the following command:
Hdata@ubuntu:~/hadoop-2.6.0-cdh5.5.1/etc/hadoop$ nano mapred-site.xml
Changes

<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>

In order to specify which framework should be used for MapReduce, we use the mapreduce.framework.name property; here it is set to yarn.
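
Once the daemons are running, you can confirm that MapReduce jobs really go through YARN by submitting one of the bundled examples; a sketch (the examples jar path below is the usual Apache layout and may differ in a CDH tarball):

Hdata@ubuntu:~$ yarn jar ~/hadoop-2.6.0-cdh5.5.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar pi 2 4

With mapreduce.framework.name set to yarn, the job appears in the ResourceManager web UI (port 8088 by default).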

yarn-site.xml
Changes

<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>

• The yarn.nodemanager.aux-services property specifies the auxiliary service that needs to run alongside the NodeManager; here shuffling is used as the auxiliary service.
• The yarn.nodemanager.aux-services.mapreduce_shuffle.class property tells the NodeManager which class to use for shuffling.
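
With all four files edited, the usual final steps are to format the NameNode once and bring the daemons up; a sketch assuming HADOOP_HOME's bin and sbin directories are on the PATH:

Hdata@ubuntu:~$ hdfs namenode -format
Hdata@ubuntu:~$ start-dfs.sh
Hdata@ubuntu:~$ start-yarn.sh
Hdata@ubuntu:~$ jps

jps should list NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager; if NodeManager is missing, recheck the yarn-site.xml properties above.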