Member since: 09-24-2015
Posts: 38
Kudos Received: 41
Solutions: 6

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 4428 | 10-18-2017 01:27 PM |
| | 28819 | 04-22-2016 09:23 PM |
| | 1623 | 12-22-2015 02:41 PM |
| | 2255 | 12-22-2015 12:54 PM |
| | 4295 | 12-08-2015 03:44 PM |
01-27-2016
01:16 AM
Also, both of these log entries are shuffle related for the most part: it's the Reducers fetching the Map outputs from nodes in the cluster. That leads me to question things like the network transfer rates the NICs were seeing when this occurred, since only 2 nodes demonstrate the slow shuffle times. I am not sure whether you have those metrics available or not.
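If those metrics are not already being collected, here is a rough sketch of how you could spot-check NIC throughput on the two slow nodes while the job runs (assumes the sysstat package is installed; the interface name eth0 is only an example):

```
# Sample network throughput every 5 seconds, 12 times
# rxkB/s and txkB/s columns show per-interface receive/transmit rates
sar -n DEV 5 12

# Or watch a single interface (adjust the interface name to your hosts)
sar -n DEV 5 12 | grep eth0
```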
01-27-2016
01:08 AM
Because you're running the Hive query on the MR engine, the MR properties will be respected. You can adjust your slow-start, heap sizes, and compression just by setting MR properties in the Hive session/job like below; don't bother setting the Hive properties you have above if we explicitly set the MR ones. Also, can we get a screenshot of your counters page? You can get to it from the overview page on the left; I am most interested in the 'MapReduce Framework Counters'.
Setting MR Props in Hive:
set mapreduce.map.output.compress=true;
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
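For completeness, a minimal sketch of a Hive-on-MR session that covers the compression, slow-start, and heap-size properties mentioned above; the memory values are illustrative assumptions, not recommendations for your cluster:

```
-- Compress map output for the shuffle
set mapreduce.map.output.compress=true;
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- Delay reducer launch until half the maps have finished
set mapreduce.job.reduce.slowstart.completedmaps=0.5;

-- Example container/heap sizes (illustrative values only)
set mapreduce.map.memory.mb=2048;
set mapreduce.map.java.opts=-Xmx1638m;
set mapreduce.reduce.memory.mb=4096;
set mapreduce.reduce.java.opts=-Xmx3276m;

-- Then run the query in the same session
```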
01-26-2016
08:23 PM
Have you enabled map-side compression to reduce the amount of data moved across the cluster when shuffling to the reducers? How long did your longest map task take to run (start time and end time)?
set mapreduce.map.output.compress=true;
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
01-26-2016
02:43 PM
Your problem is not in the Reduce stage; the problem shown here is network shuffle time, and this can be 2 things: 1) it could mean a heavy amount of key skew (I doubt it does in this case, given the reduce stage time); 2) more likely, these reducers have started before all the mappers have completed, so you have reducers stuck in the network shuffle phase for a long time. A reducer cannot complete until ALL mappers are done, so it can fetch all the keys it needs for processing. Try changing your slow-start so reducers start later than maps.
set mapreduce.job.reduce.slowstart.completedmaps=0.5;
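As a rough illustration of what the slow-start value controls (0.05 is the stock MapReduce default; the other values are just examples):

```
-- Default: reducers may launch once ~5% of maps have completed,
-- which can leave them idle in the shuffle phase for a long time
set mapreduce.job.reduce.slowstart.completedmaps=0.05;

-- Launch reducers only after half of the maps have finished
set mapreduce.job.reduce.slowstart.completedmaps=0.5;

-- Launch reducers only after every map has finished
set mapreduce.job.reduce.slowstart.completedmaps=1.0;
```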
01-26-2016
12:57 PM
1 Kudo
Please go to the RM UI, and for the job go to the 'Counters' page. Under the MapReduce counters you need to check both how many Reduce_Input_Groups there are and how many of these groups were assigned to each Reducer. Reduce_Input_Groups represents the number of distinct keys your Mappers produce; each distinct key belongs to a Reducer, where a Reducer may take care of 1+N keys. Therefore you can have as many reducers as you have reduce input groups, but not more than that. Also, you should check the number of Reduce_Output_Records and compare that to the Spilled_Records value for the reducers. If your spilled records to reducer output records are not 1:1, then you may want to consider making your Reducers a little larger to prevent the extra IO spill from taking place.
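A hedged sketch of how you might act on those counters from the Hive session (the numbers are placeholders for whatever your counters page actually shows):

```
-- If the counters show, say, 200 distinct reduce input groups (placeholder value),
-- requesting more than 200 reducers buys you nothing
set mapreduce.job.reduces=200;

-- If spilled records far exceed reduce output records, give each reducer more
-- memory so it spills less (illustrative sizes, not recommendations)
set mapreduce.reduce.memory.mb=4096;
set mapreduce.reduce.java.opts=-Xmx3276m;
```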
01-11-2016
08:29 PM
Can you please provide the beginning of the output of 'hive --orcfiledump $pathtoOrcFile'?
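For example, something along these lines would capture just the start of the dump (the file path is a placeholder; use the actual ORC file's path):

```
# Dump ORC file metadata and keep only the first lines of output
hive --orcfiledump /path/to/orc/file | head -n 40
```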
12-22-2015
02:41 PM
2 Kudos
You need to update the pig.properties value as below to force it to use a different Python alias when using the new Pig 0.13+ Python streaming UDFs:
pig.streaming.udf.python.command=python3
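For context, a minimal sketch of the streaming Python UDF style this setting applies to; the script and function names are made up for illustration, and my_udfs.py would be a streaming Python UDF file:

```
-- Pig 0.13+ streaming Python UDF registration (names are illustrative)
REGISTER 'my_udfs.py' USING streaming_python AS my_udfs;

data  = LOAD 'input.txt' AS (line:chararray);
upper = FOREACH data GENERATE my_udfs.to_upper(line);
DUMP upper;
```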
12-22-2015
02:31 PM
As for multiple networks, you can multi-home the nodes so you have a public network and a cluster-traffic network. Hardware vendor designs like the Cisco reference architecture expect multi-homing to be configured. https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsMultihoming.html
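The linked HDFS multihoming guide largely comes down to binding the NameNode endpoints to all interfaces; a hedged hdfs-site.xml sketch of the relevant properties (whether you want all of them, and whether 0.0.0.0 is appropriate, depends on your network design):

```
<!-- Bind NameNode RPC/HTTP endpoints to all interfaces on multi-homed hosts -->
<property>
  <name>dfs.namenode.rpc-bind-host</name>
  <value>0.0.0.0</value>
</property>
<property>
  <name>dfs.namenode.servicerpc-bind-host</name>
  <value>0.0.0.0</value>
</property>
<property>
  <name>dfs.namenode.http-bind-host</name>
  <value>0.0.0.0</value>
</property>
<property>
  <name>dfs.namenode.https-bind-host</name>
  <value>0.0.0.0</value>
</property>
```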
12-22-2015
12:54 PM
1 Kudo
I would switch the engine back to MapReduce to make your life easier and then control the number of Mappers spawned by controlling the input split values, where N is the minimum number of bytes you want a Mapper to process and X is the maximum number of bytes you want a Mapper to process. This is easier than trying to understand how the Tez waves setting works, because that involves the current capacity of a queue and not just the size of the underlying data.
#start pig with
pig -x mr
#in your pig script
set mapreduce.input.fileinputformat.split.minsize=N
set mapreduce.input.fileinputformat.split.maxsize=X
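A worked example with illustrative numbers (128 MB minimum and 256 MB maximum splits; the input path is a placeholder, and you should pick sizes that match your data and block size):

```
-- started with: pig -x mr
-- 128 MB = 134217728 bytes, 256 MB = 268435456 bytes (example values)
SET mapreduce.input.fileinputformat.split.minsize '134217728';
SET mapreduce.input.fileinputformat.split.maxsize '268435456';

data = LOAD '/path/to/input' USING PigStorage('\t');
-- ... rest of the script
```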