Support Questions

YARN vs. MapReduce?

Hi,

What are the advantages of YARN over MapReduce, and why was YARN required instead of MapReduce?

1 ACCEPTED SOLUTION

Master Mentor
@Rushikesh Deshmukh

YARN provides true multi-tenancy: it lets multiple jobs run at the same time. YARN is the data operating system.

The overall architecture is different.

YARN

2239-yarn-architecture.gif

MapReduce

2240-p4.png

Another link for you

Source 1 2

"You ask about "differences between MapReduce and YARN". MapReduce and YARN are definitely different. MapReduce is a programming model; YARN is an architecture for a distributed cluster. Hadoop 2 uses YARN for resource management. Besides that, Hadoop supports a programming model for parallel processing, which we know as MapReduce. Hadoop already supported MapReduce before Hadoop 2. In short, MapReduce runs on top of the YARN architecture. Sorry, I didn't cover the straggler problem.

"When does the MRmaster ask the resource manager for resources?" When the user submits a MapReduce job. After the MapReduce job is done, the resources are freed again.

"Will the resource manager give the MRmaster all the resources it needs, or does it depend on the cluster's computing capabilities?" I don't get the point of this question. Obviously, the resource manager will give it all the resources it needs, no matter what the cluster's computing capabilities are. The cluster's computing capabilities will influence processing time."

and

MRv1 uses the JobTracker to create tasks and assign them to the TaskTrackers on the worker nodes, which can become a resource bottleneck when the cluster scales out far enough (usually around 4,000 nodes).

MRv2 runs on YARN ("Yet Another Resource Negotiator"), which has one Resource Manager per cluster, and each worker node runs a Node Manager. For each job, a container on one of the worker nodes runs the Application Master, which monitors resources, tasks, etc.
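To make the division of roles a bit more concrete, here is a toy Python sketch. It is not Hadoop code: the class names mirror the YARN components, but the methods, the memory sizes, and the "most free memory" placement rule are assumptions made up for illustration. A real Resource Manager uses a pluggable scheduler and queues requests it cannot satisfy yet, instead of returning None.

```python
class NodeManager:
    """Toy stand-in for a per-node Node Manager (hypothetical API)."""

    def __init__(self, name, memory_mb):
        self.name = name
        self.free_mb = memory_mb
        self.containers = []

    def launch(self, label, memory_mb):
        # Reserve memory on this node and record the container.
        self.free_mb -= memory_mb
        self.containers.append(label)


class ResourceManager:
    """Toy stand-in for the cluster-wide Resource Manager (hypothetical API)."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, label, memory_mb):
        # Assumed placement rule: pick the node with the most free memory.
        node = max(self.nodes, key=lambda n: n.free_mb)
        if node.free_mb < memory_mb:
            return None  # simplification: a real scheduler would queue the request
        node.launch(label, memory_mb)
        return node.name


def run_job(rm, job, num_tasks, am_mb=512, task_mb=1024):
    """Start one Application Master container, then one container per task."""
    am_node = rm.allocate(f"{job}-AM", am_mb)
    task_nodes = [rm.allocate(f"{job}-task{i}", task_mb) for i in range(num_tasks)]
    return am_node, task_nodes


nodes = [NodeManager("node1", 4096), NodeManager("node2", 4096)]
rm = ResourceManager(nodes)
print(run_job(rm, "wordcount", 3))
# ('node1', ['node2', 'node1', 'node2'])
```

The point of the sketch is only that the Application Master is itself just another container on some worker node, and that the Resource Manager hands out capacity without knowing anything about map or reduce.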

9 REPLIES

@Artem Ervits, thanks for the suggestion and quick reply.

@Neeraj Sabharwal Can reducers communicate with each other?

Expert Contributor

Nope, reducers don't communicate with each other, and neither do mappers. Each of them runs in a separate JVM container and has no information about the others. The AppMaster is the process that takes care of and manages these JVM-based containers (mappers/reducers).
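A minimal pure-Python sketch of that isolation, assuming a word-count job. The function names and the byte-sum partitioner are made up for illustration (Hadoop's default partitioner hashes the key instead). Each mapper sees only its own split, each reducer sees only the keys routed to it, and the only data movement is the framework's shuffle step in between.

```python
from collections import defaultdict


def map_task(split):
    """One mapper: reads only its own input split and emits (word, 1) pairs."""
    return [(word.lower(), 1) for line in split for word in line.split()]


def shuffle(mapper_outputs, num_reducers):
    """Framework step: route every key to exactly one reducer partition."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for output in mapper_outputs:
        for key, value in output:
            # Assumed partitioner: a stable byte sum, so runs are reproducible.
            index = sum(key.encode()) % num_reducers
            partitions[index][key].append(value)
    return partitions


def reduce_task(partition):
    """One reducer: sums the values for its own keys, nothing else."""
    return {key: sum(values) for key, values in partition.items()}


def word_count(splits, num_reducers=2):
    mapper_outputs = [map_task(split) for split in splits]
    result = {}
    for partition in shuffle(mapper_outputs, num_reducers):
        result.update(reduce_task(partition))
    return result


counts = word_count([["yarn runs mapreduce"], ["yarn runs spark"]])
print(sorted(counts.items()))
# [('mapreduce', 1), ('runs', 2), ('spark', 1), ('yarn', 2)]
```

Because a key lands in exactly one partition, no reducer ever needs anything from another reducer.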

Master Guru

YARN is a work scheduler that can run different types of workloads:

- Spark

- MapReduce2

- Storm

- Tez

...

While MapReduce is a core feature and most likely still accounts for the majority of workloads, it's not the only one anymore. Hive and Pig use Tez, and Spark and Storm are big as well. This is the biggest advantage.

Other advantages include better scalability (local NodeManagers instead of a single bottleneck), lots of convenience features, etc.

@Benjamin Leonhardi, thanks for sharing this useful information.

Expert Contributor

YARN has many advantages over MapReduce (MRv1).

1) Scalability - The load on the Resource Manager (RM) is reduced by delegating the handling of tasks running on the worker nodes to the Application Master. The RM can therefore handle more requests than the JobTracker could, which makes it easier to add more nodes.

2) Unlike MRv1, which is tightly coupled to MapReduce, YARN supports many kinds of workloads, such as MR2, Tez, Storm, Spark, etc.

3) Optimized resource allocation - YARN has no fixed number of slots allocated separately to mappers and reducers, as MRv1 does. The available capacity of a node can therefore be used by any task that needs resources.

4) When the Resource Manager fails, the jobs running on the cluster need not be restarted after the Resource Manager recovers (provided Resource Manager recovery is enabled).

5) The failover mechanism is implemented with ZooKeeper (ZK) support that is already part of the Resource Manager, so we don't need to run another daemon.
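Point 3 can be illustrated with a small back-of-the-envelope model in Python. Everything here is an assumption for illustration (one node with 8 GB, 1 GB per task, MRv1 configured with 4 map slots and 4 reduce slots), and memory is the only resource considered.

```python
def mrv1_running(pending_maps, pending_reduces, map_slots, reduce_slots):
    """MRv1 model: map and reduce slots are separate pools.

    Idle reduce slots cannot be used for map tasks, and vice versa.
    """
    return min(pending_maps, map_slots) + min(pending_reduces, reduce_slots)


def yarn_running(task_memory_mb, node_memory_mb):
    """YARN model: grant containers in request order while memory remains."""
    running = 0
    free_mb = node_memory_mb
    for need_mb in task_memory_mb:
        if need_mb <= free_mb:
            free_mb -= need_mb
            running += 1
    return running


# Map phase of a job: 8 map tasks waiting, no reducers started yet.
print(mrv1_running(pending_maps=8, pending_reduces=0, map_slots=4, reduce_slots=4))
# 4
print(yarn_running([1024] * 8, node_memory_mb=8192))
# 8
```

In this toy setup, half of the MRv1 node sits idle during the map phase, while the YARN node can give all of its memory to whatever is waiting.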

New Contributor

The YARN framework is responsible for cluster resource management.

Cluster resource management means managing the resources of the Hadoop cluster, and by resources we mean memory, CPU, etc. YARN took over this task from MapReduce, and MapReduce is now streamlined to perform only data processing, which is what it does best.

YARN has a central Resource Manager component, which manages resources and allocates them to applications. Multiple applications can run on Hadoop via YARN, and all of them share common resource management.

Advantages of YARN:

  1. YARN uses resources efficiently: There are no more fixed map and reduce slots. YARN provides a central Resource Manager. With YARN, you can now run multiple applications in Hadoop, all sharing a common pool of resources.
  2. YARN can even run applications that do not follow the MapReduce model: YARN decouples MapReduce's resource management and scheduling capabilities from the data processing component, enabling Hadoop to support more varied processing approaches and a broader array of applications. For example, Hadoop clusters can now run interactive querying and streaming data applications simultaneously with MapReduce batch jobs. This also streamlines MapReduce to do what it does best - process data.

A few important notes about YARN:

  1. YARN is backward compatible: This means that existing MapReduce jobs can run on Hadoop 2.0 without any change.
  2. No more JobTracker and TaskTracker needed in Hadoop 2.0: The JobTracker and TaskTracker have disappeared entirely. YARN splits the two major functions of the JobTracker, i.e. resource management and job scheduling/monitoring, into separate components:
    • Resource Manager (global)
    • Application Master (per application)

    The central Resource Manager and the node-specific Node Manager, which takes the place of the TaskTracker, together constitute YARN. YARN in Hadoop