Member since: 07-22-2017
Posts: 15
Kudos Received: 0
Solutions: 0
04-24-2021 08:58 AM
Thank you, clear and solid explanation.
09-27-2017 05:40 AM
HDFS clusters do not benefit from using RAID for data storage: the redundancy that RAID provides is not required, because HDFS handles it by replicating data across different DataNodes. RAID striping, which is used to increase performance, actually turns out to be slower than the JBOD (Just a Bunch Of Disks) layout used by HDFS, which round-robins writes across all disks. This is because in RAID the read/write operations are limited by the slowest disk in the array, whereas in JBOD the disk operations are independent, so the average speed of operations is greater than that of the slowest disk. If a disk fails in a JBOD setup, HDFS can continue to operate without it, but if a disk fails in RAID the whole array becomes unavailable. RAID is, however, recommended for the NameNode to protect its metadata against corruption.
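For illustration, the JBOD layout on a DataNode is just a list of independent data directories in hdfs-site.xml; a minimal sketch, assuming mount points such as /disk1, /disk2 and /disk3 (placeholders for your own disks):
<property>
  <name>dfs.datanode.data.dir</name>
  <!-- one entry per physical disk; HDFS round-robins new block writes across them -->
  <value>/disk1/hdfs/data,/disk2/hdfs/data,/disk3/hdfs/data</value>
</property>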
09-21-2017 01:35 PM
2 Kudos
@Riddhi Sam
First of all, Spark is not faster than Hadoop. Hadoop is a distributed file system (HDFS) while Spark is a compute engine running on top of Hadoop or your local file system. Spark, however, is faster than MapReduce, which was the first compute engine created when HDFS was created. So when Hadoop was created, there were only two things: HDFS, where the data is stored, and MapReduce, which was the only compute engine on HDFS.

To understand how Spark is faster than MapReduce, you need to understand how both of them work. When a MapReduce job starts, the first step is to read data from disk and run the mappers. The output of the mappers is stored back on disk. Then the shuffle-and-sort step starts, reads the mapper output from disk and, after it completes, stores the result back on disk (there is also some network traffic when the keys for the reduce step are gathered on the same node, but that is true for Spark as well, so let's focus on the disk steps only). Finally the reduce step starts, reads the output of shuffle-and-sort, and stores the result back in HDFS. That is six disk accesses to complete the job, and most Hadoop clusters have 7200 RPM disks, which are quite slow.

Now, here is how Spark works. Just as a MapReduce job needs mappers and reducers, Spark has two types of operations: transformations and actions. When you write a Spark job, it consists of a number of transformations and a few actions. When a Spark job starts, it creates a DAG (directed acyclic graph) of the job, i.e. the steps it is supposed to run. Say the first five steps are transformations: Spark remembers them in the DAG but does not actually go to disk to perform them. Then it encounters an action. At that point the job goes to disk, performs the first transformation, keeps the result in memory, performs the second transformation, keeps the result in memory, and so on until all the steps complete. The only time it goes back to disk is to write the output of the job. That is two disk accesses, which is what makes Spark faster.

There are other things in Spark which make it faster than MapReduce. For example, its rich set of APIs lets you accomplish in one Spark job what might require two or more MapReduce jobs running one after the other; imagine how slow that would be. There are cases where Spark will spill to disk because of the amount of data and will be slow, but it may still not be as slow as MapReduce, again thanks to the richer API.
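To make the transformation/action distinction concrete, here is a minimal spark-shell sketch (the HDFS path and the word-length threshold are made-up placeholders): the first three lines only build the DAG, and nothing is read from disk until the count action runs.
val lines = sc.textFile("hdfs://localhost:9000/data/input.txt") // transformation: recorded in the DAG, no disk read yet
val words = lines.flatMap(line => line.split(" ")) // transformation: still lazy
val longWords = words.filter(word => word.length > 3) // transformation: still lazy
val count = longWords.count() // action: Spark now reads the file once and computes in memory
println(count)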
09-07-2017 01:58 PM
Edit the following core Hadoop configuration files to set up the cluster.
• hadoop-env.sh
• core-site.xml
• hdfs-site.xml
• mapred-site.xml
• yarn-site.xml
• masters
• slaves
The HADOOP_HOME directory (the extracted directory, e.g. hadoop-2.6.0-cdh5.5.1, is referred to as HADOOP_HOME) contains all the libraries, scripts, configuration files, etc.

hadoop-env.sh
1. This file specifies the environment variables that affect the JDK used by the Hadoop daemons (bin/hadoop).
2. As the Hadoop framework is written in Java and uses the Java Runtime Environment, one of the important environment variables for the Hadoop daemons is JAVA_HOME in hadoop-env.sh. It points the daemons to the Java installation on the system.
Actual: export JAVA_HOME=<path-to-the-root-of-your-Java-installation>
Change: export JAVA_HOME=/usr/lib/jvm/java-8-oracle
core-site.xml
3. This file informs the Hadoop daemons where the NameNode runs in the cluster. It contains the configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and MapReduce.

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9000</value>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/dataflair/hadmin</value>
</property>

The location of the NameNode is specified by the fs.defaultFS property; here the NameNode runs on localhost at port 9000. The hadoop.tmp.dir property specifies the location where temporary as well as permanent Hadoop data will be stored. "/home/dataflair/hadmin" is my location; you need to specify a location where you have read/write privileges.

hdfs-site.xml
We need to make changes in the Hadoop configuration file hdfs-site.xml (which is located in HADOOP_HOME/etc/hadoop) by executing the command below:

Hdata@ubuntu:~/hadoop-2.6.0-cdh5.5.1/etc/hadoop$ nano hdfs-site.xml

Replication factor:
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>

The replication factor is specified by the dfs.replication property; as this is a single-node cluster, we set the replication to 1.

mapred-site.xml
We need to make changes in the Hadoop configuration file mapred-site.xml (which is located in HADOOP_HOME/etc/hadoop).
Note: In order to edit the mapred-site.xml file we first need to create a copy of mapred-site.xml.template. The copy can be created using the following command:

Hdata@ubuntu:~/hadoop-2.6.0-cdh5.5.1/etc/hadoop$ cp mapred-site.xml.template mapred-site.xml

We will now edit mapred-site.xml using the following command:

Hdata@ubuntu:~/hadoop-2.6.0-cdh5.5.1/etc/hadoop$ nano mapred-site.xml

Changes:
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>

The mapreduce.framework.name property specifies which framework should be used for MapReduce; here it is set to yarn.

yarn-site.xml
Changes:
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>

The yarn.nodemanager.aux-services property specifies the auxiliary service that needs to run alongside the NodeManager; here shuffling is used as the auxiliary service. The yarn.nodemanager.aux-services.mapreduce_shuffle.class property tells YARN which class should be used for shuffling.
08-21-2017 05:54 AM
The foreach() operation is an action.
> It does not return any value.
> It executes the input function on each element of an RDD.
From: http://data-flair.training/blogs/rdd-transformations-actions-apis-apache-spark/#39_Foreach
It executes the function on each item in the RDD. It is good for writing to a database or publishing to web services, i.e. running a function for its side effects on each data item.
Example:
val mydata = Array(1,2,3,4,5,6,7,8,9,10)
val rdd1 = sc.parallelize(mydata)
rdd1.foreach{x => println(x)}
OR
rdd1.foreach{println}
Output: 1 2 3 4 5 6 7 8 9 10
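As a rough sketch of the database/web-service use mentioned above, such writes are usually batched per partition with foreachPartition rather than issued once per element; the println below is only a stand-in for the real insert or HTTP call (an assumption for illustration, not part of the original example):
val records = sc.parallelize(Seq("rec1", "rec2", "rec3", "rec4"))
records.foreachPartition { partition =>
  val batch = partition.toSeq
  // placeholder for opening one connection per partition and sending the batch
  println(s"sending batch of ${batch.size} records: ${batch.mkString(",")}")
}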