Member since: 02-25-2016
Posts: 72
Kudos Received: 34
Solutions: 5
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 3653 | 07-28-2017 10:51 AM
 | 2831 | 05-08-2017 03:11 PM
 | 1191 | 04-03-2017 07:38 PM
 | 2896 | 03-21-2017 06:56 PM
 | 1185 | 02-09-2017 08:28 PM
07-14-2017
06:56 PM
1 Kudo
@Viswa According to the official Apache documentation, the number of reducers defaults to 1. You can override this with the following properties:

- For MR1, set mapred.reduce.tasks=N
- For MR2, set mapreduce.job.reduces=N

The right number of reduces seems to be 0.95 or 1.75 multiplied by (<no. of nodes> * <no. of maximum containers per node>). With 0.95, all of the reduces can launch immediately and start transferring map outputs as the maps finish. With 1.75, the faster nodes will finish their first round of reduces and launch a second wave, doing a much better job of load balancing. Increasing the number of reduces increases framework overhead, but it improves load balancing and lowers the cost of failures. The scaling factors above are slightly less than whole numbers so that a few reduce slots are reserved in the framework for speculative and failed tasks.

To understand the number of tasks spawned per node, I would point you to this blog. In MR1, the number of tasks launched per node was specified via the settings mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum. In MR2, you can determine how many concurrent tasks are launched per node by dividing the resources allocated to YARN by the resources allocated to each MapReduce task, taking the minimum over the two resource types (memory and CPU). Specifically, take the minimum of yarn.nodemanager.resource.memory-mb divided by mapreduce.[map|reduce].memory.mb, and yarn.nodemanager.resource.cpu-vcores divided by mapreduce.[map|reduce].cpu.vcores. This gives you the number of tasks that will be spawned per node.
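The arithmetic above can be sketched in shell. The cluster values below are hypothetical placeholders; substitute the settings from your own YARN and MapReduce configs:

```shell
# Hypothetical example values; read the real ones from your cluster configs.
nodes=10                       # <no. of nodes>
containers_per_node=8          # <no. of maximum containers per node>

# Reducer-count heuristic: 0.95 * nodes * containers (awk for the float math)
reducers=$(awk -v n="$nodes" -v c="$containers_per_node" 'BEGIN { printf "%d", 0.95 * n * c }')
echo "mapreduce.job.reduces=$reducers"

# Concurrent MR2 tasks per node = min(memory limit, vcore limit)
yarn_nm_memory_mb=24576        # yarn.nodemanager.resource.memory-mb
yarn_nm_vcores=8               # yarn.nodemanager.resource.cpu-vcores
map_memory_mb=2048             # mapreduce.map.memory.mb
map_vcores=1                   # mapreduce.map.cpu.vcores

by_memory=$(( yarn_nm_memory_mb / map_memory_mb ))
by_vcores=$(( yarn_nm_vcores / map_vcores ))
tasks_per_node=$(( by_memory < by_vcores ? by_memory : by_vcores ))
echo "concurrent map tasks per node: $tasks_per_node"
```

With these example numbers, memory allows 12 concurrent map tasks but vcores only 8, so CPU is the binding limit on this hypothetical node.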
06-30-2017
08:20 PM
2 Kudos
@Viswa In Tez, the following types of data movement take place between two vertices, each represented by an Edge in the DAG:

- BROADCAST: Output on this edge produced by any source task is available to all destination tasks.
- CUSTOM: Custom routing defined by the user.
- ONE_TO_ONE: Output on this edge produced by the i-th source task is available to the i-th destination task.
- SCATTER_GATHER: The i-th output on this edge produced by all source tasks is available to the same destination task.

To answer your question:

- SIMPLE_EDGE refers to the data movement type SCATTER_GATHER (example: SHUFFLE JOIN).
- BROADCAST_EDGE refers to the data movement type BROADCAST (example: MAP JOIN).

I drew the above inference from createEdgeProperty() in the source code. Hope this helps.
06-05-2017
06:06 PM
Tried creating an RDD, collecting it with collect(), and printing it out with a for loop. It was working fine. I was trying it out in pyspark, though. Thank you.
03-28-2017
06:19 PM
I created a new set of users with the correct hostname and privileges and it worked, thank you.
03-14-2017
10:00 AM
2 Kudos
@Viswa
To check the Namenode safe mode status, log in to the Namenode host and issue the command below:
[user@NNhost1 ~]$ hdfs dfsadmin -safemode get
Safe mode is OFF in NNhost1/10.X.X.X:8020
Safe mode is OFF in NNhost2/10.X.X.X:8020
If safe mode is turned ON, issue the command below to leave safe mode:
[user@NNhost1 ~]$ hdfs dfsadmin -safemode leave
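The check-then-leave sequence above can also be scripted. This is a minimal sketch in which safemode_get is a hypothetical stand-in for `hdfs dfsadmin -safemode get`, so the logic can be tried without a cluster:

```shell
# Hypothetical stand-in for `hdfs dfsadmin -safemode get`; on a real
# cluster, call the actual command instead of this function.
safemode_get() { echo "Safe mode is ON in NNhost1/10.X.X.X:8020"; }

status=$(safemode_get)
case "$status" in
  *"Safe mode is ON"*) action="hdfs dfsadmin -safemode leave" ;;
  *)                   action="" ;;
esac
echo "${action:-safe mode already off, nothing to do}"
```

Because the stub reports safe mode ON, the script selects the leave command; with an OFF status it would do nothing.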
03-14-2017
12:05 AM
1 Kudo
@Viswa - Kindly accept the answer if it has helped you.
03-09-2017
11:43 AM
Tried the same command again later and it worked; I haven't changed anything. Thank you Jay SenSharma.
03-10-2017
09:30 PM
1 Kudo
@Viswa - Can we close this now?