Member since: 09-24-2015
Posts: 38
Kudos Received: 41
Solutions: 6

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 4428 | 10-18-2017 01:27 PM |
| | 28819 | 04-22-2016 09:23 PM |
| | 1623 | 12-22-2015 02:41 PM |
| | 2255 | 12-22-2015 12:54 PM |
| | 4295 | 12-08-2015 03:44 PM |
01-27-2016
01:16 AM
Also, both of these log entries are shuffle related for the most part: it's the Reducers fetching the Map outputs from nodes in the cluster. That leads me to question things like the network transfer rates the NICs were seeing when this occurred, since only 2 nodes demonstrate the slow shuffle times. I am not sure whether you have those metrics available or not.
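If those metrics are not already being collected, here is a rough sketch of how you could spot-check NIC throughput on the two slow nodes while the job runs (assumes the sysstat package is installed; the interface name eth0 is only an example):

```
# Sample network throughput every 5 seconds, 12 times
# rxkB/s and txkB/s columns show per-interface receive/transmit rates
sar -n DEV 5 12

# Or watch a single interface (adjust the interface name to your hosts)
sar -n DEV 5 12 | grep eth0
```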
01-27-2016
01:08 AM
Because you're running the Hive query on the MR engine, the MR properties will be respected. You can adjust your slow-start, heap sizes, and compression just by setting MR properties in the Hive session/job like below; don't bother setting the Hive properties you have above if we explicitly set the MR ones. Also, can we get a screenshot of your counters page? You can get to it from the overview page on the left; I am most interested in the 'MapReduce Framework Counters'.
Setting MR Props in Hive:
set mapreduce.map.output.compress=true;
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
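For completeness, a minimal sketch of a Hive-on-MR session that covers the compression, slow-start, and heap-size properties mentioned above; the memory values are illustrative assumptions, not recommendations for your cluster:

```
-- Compress map output for the shuffle
set mapreduce.map.output.compress=true;
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- Delay reducer launch until half the maps have finished
set mapreduce.job.reduce.slowstart.completedmaps=0.5;

-- Example container/heap sizes (illustrative values only)
set mapreduce.map.memory.mb=2048;
set mapreduce.map.java.opts=-Xmx1638m;
set mapreduce.reduce.memory.mb=4096;
set mapreduce.reduce.java.opts=-Xmx3276m;

-- Then run the query in the same session
```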
01-26-2016
08:23 PM
Have you enabled map-side compression to reduce the amount of data moved across the cluster when shuffling to the reducers? How long did your longest map task take to run (start time and end time)?
set mapreduce.map.output.compress=true;
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
01-26-2016
02:43 PM
Your problem is not in the Reduce stage; the problem shown here is network shuffle time, and this can be 2 things: 1) it could mean a heavy amount of key skew (I doubt it does in this case, given the reduce stage time); 2) more likely, these reducers have started before all the mappers have completed, so you have reducers stuck in the network shuffle phase for a long time. A reducer cannot complete until ALL mappers are done, so it can fetch all the keys it needs for processing. Try changing your slow-start so reducers start later than maps.
set mapreduce.job.reduce.slowstart.completedmaps=0.5;
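As a rough illustration of what the slow-start value controls (0.05 is the stock MapReduce default; the other values are just examples):

```
-- Default: reducers may launch once ~5% of maps have completed,
-- which can leave them idle in the shuffle phase for a long time
set mapreduce.job.reduce.slowstart.completedmaps=0.05;

-- Launch reducers only after half of the maps have finished
set mapreduce.job.reduce.slowstart.completedmaps=0.5;

-- Launch reducers only after every map has finished
set mapreduce.job.reduce.slowstart.completedmaps=1.0;
```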
01-26-2016
12:57 PM
1 Kudo
Please go to the RM UI, and for the job go to the 'Counters' page. Under the MapReduce counters you need to check both how many Reduce_Input_Groups there are and how many of these groups were assigned to each Reducer. Reduce_Input_Groups represents the number of distinct keys your Mappers produce; each distinct key belongs to a Reducer, where a Reducer may take care of 1+N keys. Therefore you can have as many reducers as you have reduce input groups, but not more than that. Also, you should check the number of Reduce_Output_Records and compare that to the Spilled_Records value for the reducers. If your spilled records to reducer output records are not 1:1, then you may want to consider making your Reducers a little larger to prevent the extra IO spill from taking place.
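A hedged sketch of how you might act on those counters from the Hive session (the numbers are placeholders for whatever your counters page actually shows):

```
-- If the counters show, say, 200 distinct reduce input groups (placeholder value),
-- requesting more than 200 reducers buys you nothing
set mapreduce.job.reduces=200;

-- If spilled records far exceed reduce output records, give each reducer more
-- memory so it spills less (illustrative sizes, not recommendations)
set mapreduce.reduce.memory.mb=4096;
set mapreduce.reduce.java.opts=-Xmx3276m;
```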
01-11-2016
08:29 PM
Can you please provide the beginning of the output of 'hive --orcfiledump $pathtoOrcFile'?
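For example, something along these lines would capture just the start of the dump (the file path is a placeholder; use the actual ORC file's path):

```
# Dump ORC file metadata and keep only the first lines of output
hive --orcfiledump /path/to/orc/file | head -n 40
```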
12-22-2015
02:41 PM
2 Kudos
You need to update the pig.properties value as below to force it to use a different Python alias when using the new Pig 0.13+ Python streaming UDFs:
pig.streaming.udf.python.command=python3
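For context, a minimal sketch of the streaming Python UDF style this setting applies to; the script and function names are made up for illustration, and my_udfs.py would be a streaming Python UDF file:

```
-- Pig 0.13+ streaming Python UDF registration (names are illustrative)
REGISTER 'my_udfs.py' USING streaming_python AS my_udfs;

data  = LOAD 'input.txt' AS (line:chararray);
upper = FOREACH data GENERATE my_udfs.to_upper(line);
DUMP upper;
```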
12-22-2015
02:31 PM
As for multiple networks, you can multi-home the nodes so you have a public network and a cluster-traffic network. Hardware vendor designs like the Cisco reference architecture expect multi-homing to be configured. https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsMultihoming.html
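The linked HDFS multihoming guide largely comes down to binding the NameNode endpoints to all interfaces; a hedged hdfs-site.xml sketch of the relevant properties (whether you want all of them, and whether 0.0.0.0 is appropriate, depends on your network design):

```
<!-- Bind NameNode RPC/HTTP endpoints to all interfaces on multi-homed hosts -->
<property>
  <name>dfs.namenode.rpc-bind-host</name>
  <value>0.0.0.0</value>
</property>
<property>
  <name>dfs.namenode.servicerpc-bind-host</name>
  <value>0.0.0.0</value>
</property>
<property>
  <name>dfs.namenode.http-bind-host</name>
  <value>0.0.0.0</value>
</property>
<property>
  <name>dfs.namenode.https-bind-host</name>
  <value>0.0.0.0</value>
</property>
```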
12-22-2015
12:54 PM
1 Kudo
I would switch the engine back to MapReduce to make your life easier and then control the number of Mappers spawned by controlling the input split values, where N is the minimum number of bytes you want a Mapper to process and X is the maximum number of bytes you want a Mapper to process. This is easier than trying to understand how the Tez waves setting works, because that involves the current capacity of a queue and not just the size of the underlying data.
#start pig with
pig -x mr
#in your pig script
set mapreduce.input.fileinputformat.split.minsize=N
set mapreduce.input.fileinputformat.split.maxsize=X
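A worked example with illustrative numbers (128 MB minimum and 256 MB maximum splits; the input path is a placeholder, and you should pick sizes that match your data and block size):

```
-- started with: pig -x mr
-- 128 MB = 134217728 bytes, 256 MB = 268435456 bytes (example values)
SET mapreduce.input.fileinputformat.split.minsize '134217728';
SET mapreduce.input.fileinputformat.split.maxsize '268435456';

data = LOAD '/path/to/input' USING PigStorage('\t');
-- ... rest of the script
```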