Member since: 08-09-2017 | Posts: 9 | Kudos Received: 3 | Solutions: 0
09-15-2017
06:40 AM
1 Kudo
First, let's understand why we need partitioning in the MapReduce framework: as we know, a map task takes an InputSplit as input and produces key-value pairs as output. These key-value pairs are then fed to the reduce tasks. But before the reduce phase, one more phase, known as the partitioning phase, runs. This phase partitions the map output based on the key and sends all records with the same key to the same partition.

Let's take an example of employee analysis: we want to find the highest paid female and male employee in the data set.

Data set (Name, Age, Dept, Gender, Salary):
A, 23, IT, Male, 35
B, 35, Finance, Female, 50
C, 29, IT, Male, 40

Considering two map tasks, they give the following <k,v> pairs as output (key = Gender, value = the full record):

Map1 output:
Male -> A 23 IT Male 35
Female -> B 35 Finance Female 50

Map2 output:
Male -> C 29 IT Male 40

So, let's understand how to implement a custom partitioner. Our custom partitioner will send all <K,V> pairs with gender Male to one partition and all <K,V> pairs with gender Female to the other partition. Here is the code:

public static class MyPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numReduceTasks) {
        // With no reducers everything goes to partition 0.
        if (numReduceTasks == 0)
            return 0;
        if (key.equals(new Text("Male")))
            return 0;
        else
            return 1;   // Female records go to the second partition
    }
}

Here, getPartition() returns 0 if the key is Male and 1 if the key is Female. Register the partitioner in the driver with job.setPartitionerClass(MyPartitioner.class) and set job.setNumReduceTasks(2); we can then check our output in two files: part-r-00000 and part-r-00001.
08-30-2017
12:28 PM
Data locality means moving the computation to the data rather than moving the data to the computation, which saves bandwidth. This minimizes network congestion and increases the overall throughput of the system.
08-29-2017
12:11 PM
There are many features of Hadoop. Some of the most important ones are:

Open source - Its source code is open; you can change the code according to your requirements.

Flexibility - It can store any type of data: structured, semi-structured and unstructured.

High availability - Data is highly available and accessible despite hardware failure, because multiple copies of the data are kept. If a machine or a few pieces of hardware crash, the data is served from another path.

Data locality - Hadoop works on the data locality principle, which states: move the computation to the data instead of the data to the computation. When a client submits a MapReduce job, the algorithm is moved to the data in the cluster rather than bringing the data to the location where the job was submitted and then processing it.

Other features include fault tolerance, economy, reliability and ease of use. To get complete details of all the features of Hadoop, refer to the link below: Hadoop Features
08-26-2017
12:31 PM
1 Kudo
Speed
Apache Spark - Spark is a lightning-fast cluster computing tool. It runs applications up to 100x faster in memory and 10x faster on disk than Hadoop. Spark makes this possible by reducing the number of read/write cycles to disk and storing intermediate data in memory.
Hadoop MapReduce - MapReduce reads from and writes to disk, which slows down the processing speed.

Difficulty
Apache Spark - Spark is easy to program, as it has tons of high-level operators on the RDD (Resilient Distributed Dataset) abstraction (see the word-count sketch below).
Hadoop MapReduce - In MapReduce, developers need to hand-code each and every operation, which makes it much harder to work with.

Easy to manage
Apache Spark - Spark can perform batch, interactive, machine learning and streaming workloads in the same cluster, which makes it a complete data analytics engine. There is no need to manage a different component for each need; installing Spark on a cluster is enough to handle all these requirements.
Hadoop MapReduce - MapReduce only provides a batch engine, so we depend on different engines (for example Storm, Giraph, Impala, etc.) for the other requirements, and it is very difficult to manage many components.

For more, refer to the link below: Spark vs Hadoop
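To illustrate the "high-level operators" point, here is a minimal word-count sketch in Scala (a hypothetical example, assuming an existing SparkContext named sc and a made-up input path); the same logic in plain MapReduce would need separate mapper, reducer and driver classes:

// Word count with RDD operators; "sc" and the input path are assumed for illustration.
val lines = sc.textFile("hdfs:///tmp/input.txt")
val counts = lines.flatMap(line => line.split(" "))   // split each line into words
                  .map(word => (word, 1))             // pair each word with a count of 1
                  .reduceByKey(_ + _)                 // sum the counts per word
counts.take(10).foreach(println)                      // print a small sample on the driver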
08-21-2017
05:54 AM
foreach() is an action.
> It does not return any value.
> It executes the input function on each element of an RDD.

From: http://data-flair.training/blogs/rdd-transformations-actions-apis-apache-spark/#39_Foreach
It executes the function on each item in the RDD. It is good for writing to a database or publishing to web services. It executes the given function for each data item.

Example:

val mydata = Array(1,2,3,4,5,6,7,8,9,10)
val rdd1 = sc.parallelize(mydata)
rdd1.foreach{x => println(x)}
OR
rdd1.foreach{println}

Output: 1 2 3 4 5 6 7 8 9 10

Note that foreach runs on the executors, so in a real cluster the println output shows up in the executor logs; the example prints to the console here because it runs in local mode.
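Since the post mentions writing to a database, here is a rough sketch (not from the original post) of how foreachPartition is often preferred for that use case, so one connection can be reused per partition; the "connection" below is only a stub so the example stays self-contained:

// Hypothetical sketch: batch work per partition (e.g. one DB connection per partition).
def openConnection(): StringBuilder = new StringBuilder      // stand-in for a real JDBC connection
rdd1.foreachPartition { partition =>
  val conn = openConnection()                                // open one (stub) connection per partition
  partition.foreach(x => conn.append(x).append(' '))         // stand-in for an INSERT per element
  println(s"wrote: ${conn.toString.trim}")                   // executor-side log of what was "written"
}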
08-19-2017
12:59 PM
Spark is easy to program and does not require much hand coding, whereas MapReduce is not that easy in terms of programming and requires lots of hand coding. Spark has an interactive mode (the shell), whereas MapReduce has no built-in interactive mode; MapReduce was developed for batch processing. For data processing Spark can handle streaming, machine learning and batch workloads, whereas Hadoop MapReduce only provides a batch engine; Spark is a general-purpose cluster computing engine. Spark executes batch processing jobs about 10 to 100 times faster than Hadoop MapReduce. Spark uses an abstraction called the RDD, which makes Spark feature-rich, whereas MapReduce doesn't have any such abstraction. Spark achieves lower latency by caching partial/complete results across distributed nodes, whereas MapReduce is completely disk-based (see the caching sketch below). For a detailed comparison between Spark & Hadoop MapReduce, please refer to: Spark vs Hadoop MapReduce
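To make the caching point concrete, here is a minimal, hypothetical sketch (assuming an existing SparkContext sc and a made-up log path); the cached RDD is read from disk once and then served from memory on later actions:

// Hypothetical sketch of caching an intermediate result; "sc" is assumed to exist.
val logs = sc.textFile("hdfs:///tmp/app.log")           // made-up input path
val errors = logs.filter(_.contains("ERROR")).cache()   // keep the filtered RDD in memory
println(errors.count())                                  // first action: reads from disk, then caches
println(errors.filter(_.contains("timeout")).count())    // reuses the in-memory data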
08-18-2017
10:01 AM
Apache Spark runs faster when the data fits into memory. Spark runs applications up to 100x faster in memory and 10x faster on disk than Hadoop. Spark makes this possible by reducing the number of read/write cycles to disk and storing intermediate data in memory.
08-17-2017
07:12 AM
1) Management of DAGs - People often make mistakes in controlling the DAG. Always try to use reduceByKey instead of groupByKey: the two can perform almost the same job, but groupByKey shuffles far more data across the network, so prefer reduceByKey wherever possible (see the sketch below). Also try to keep the map-side output as small as possible, avoid spending unnecessary time in partitioning, shuffle as little as possible, and watch out for data skew and badly sized partitions.
2) Maintain the required size of the shuffle blocks.
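As a rough illustration of the reduceByKey vs groupByKey point (a hypothetical example, assuming an existing SparkContext sc): both produce the same per-key sums, but reduceByKey combines values on each map-side partition before the shuffle, so far less data crosses the network:

// Hypothetical key-value pairs; "sc" is assumed to exist.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1), ("a", 1), ("b", 1)))

// Preferred: values are pre-combined on each partition before shuffling.
val summed = pairs.reduceByKey(_ + _)

// Works, but ships every single (key, value) pair across the network before summing.
val grouped = pairs.groupByKey().mapValues(_.sum)

summed.collect().foreach(println)    // e.g. (a,3), (b,2)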
08-16-2017
12:44 PM
1 Kudo
Spark can run either in local mode or in a distributed manner in the cluster.

1. Local mode - There is no resource manager in local mode. This mode is used to test a Spark application in a test environment, where we do not want to eat up cluster resources and want the application to run quickly. Here everything runs in a single JVM.

2. Distributed / cluster modes - We can run Spark in a distributed manner with a master-slave architecture. There will be multiple worker nodes in each cluster, and the cluster manager allocates resources to each worker node. Spark can be deployed in a distributed cluster in 3 ways:

1. Standalone: In standalone mode Spark itself handles the resource allocation; there won't be any separate cluster manager. Spark allocates CPU and memory to worker nodes based on resource availability.

2. YARN: Here YARN is used as the cluster manager. The YARN deployment is mainly used when Spark runs alongside other Hadoop components such as MapReduce, for example in a Cloudera or Hortonworks distribution. YARN is a combination of a Resource Manager and Node Managers. The Resource Manager has a Scheduler and an Application Manager. Scheduler: allocates resources to the various running applications. Application Manager: manages applications across all nodes. A Node Manager hosts the Application Master and containers. The container is where the actual work happens, and the Application Master negotiates resources from the Resource Manager.

3. Mesos: Mesos is used in large-scale production deployments. In a Mesos deployment, all the resources available across all nodes of the cluster are pooled together and shared dynamically. Master, slave and framework are the three components of Mesos: the master provides fault tolerance, the slaves actually do the resource allocation, and the framework helps the application request resources.

For more information on Spark cluster managers read: Cluster Manager of Spark
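As a rough illustration (not from the original post), the deployment mode is usually selected through the master URL. Below is a minimal Scala sketch using SparkConf, with the alternative master URLs shown as comments; the host names and ports are placeholders:

import org.apache.spark.{SparkConf, SparkContext}

// Local mode: everything runs in a single JVM, using all available cores.
val conf = new SparkConf().setAppName("demo").setMaster("local[*]")
// Standalone cluster:  .setMaster("spark://master-host:7077")   // placeholder host
// YARN:                .setMaster("yarn")
// Mesos:               .setMaster("mesos://mesos-master:5050")  // placeholder host

val sc = new SparkContext(conf)
println(sc.parallelize(1 to 10).sum())   // quick sanity check
sc.stop()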