Member since: 08-16-2016
Posts: 642
Kudos Received: 131
Solutions: 68
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 3978 | 10-13-2017 09:42 PM |
| | 7477 | 09-14-2017 11:15 AM |
| | 3799 | 09-13-2017 10:35 PM |
| | 6041 | 09-13-2017 10:25 PM |
| | 6604 | 09-13-2017 10:05 PM |
03-02-2017
07:49 AM
The only suggestion I have is to try running some tests to see if you can weed out any bad disks. DFSIO and TeraSort may surface it, but they may not. You can use 'dd' or other software to test the raw disks. Beyond that, you may be chasing ghosts (spending more time than it is worth on an ephemeral problem).
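As a rough sketch, a sequential write/read check with dd could look like the following; the mount point, file name, and sizes are placeholders, and the direct I/O flags behave differently across platforms, so treat this as illustrative only:

```sh
# Hypothetical example: sequential write test against one data-disk mount point.
# Replace /data/1 with the actual mount; bs and count set the test size (~8 GB here).
dd if=/dev/zero of=/data/1/ddtest.tmp bs=1M count=8192 oflag=direct
# Read the same file back, bypassing the page cache where direct I/O is supported.
dd if=/data/1/ddtest.tmp of=/dev/null bs=1M iflag=direct
# Clean up the test file.
rm -f /data/1/ddtest.tmp
```

Running the same test on each disk and comparing the reported throughput can help single out a disk that is noticeably slower than its peers.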
03-02-2017
07:44 AM
Yes. That setting only puts a cap on how large a container can be. It does not mean that your containers will be this size. yarn.scheduler.minimum-allocation-mb sets the container size if one is not provided by the user.
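For illustration, the two properties might look like this in yarn-site.xml; the values below are placeholders, not recommendations:

```xml
<!-- Placeholder values for illustration only; tune for your hosts. -->
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>1024</value>  <!-- size used when a request does not specify one -->
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>8192</value>  <!-- cap on how large a single container can be -->
</property>
```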
03-01-2017
10:09 PM
1 Kudo
yarn.scheduler.maximum-allocation-mb - This is the maximum memory a single container can get.
yarn.nodemanager.resource.memory-mb - This is how much memory per NodeManager is allocated for containers.

I always set yarn.scheduler.maximum-allocation-mb equal to yarn.nodemanager.resource.memory-mb, since the single largest container I could run on a host is the amount of memory on that host allocated for YARN. You can set yarn.scheduler.maximum-allocation-mb to any value; as mentioned, it should not exceed what you set for yarn.nodemanager.resource.memory-mb. If it does, it won't harm anything until someone tries to request a container larger than yarn.nodemanager.resource.memory-mb.

You might be able to set mapreduce.task.io.sort.mb = 4 GB. The issue is that the sort buffer is part of the mapper's heap. For instance, a mapper of 3 GB with a heap of 2.5 GB would mean that the sort buffer could quickly fill up the 2.5 GB of heap available. You may not always hit OOM, but it is likely with such a poor configuration. In summary: yarn.nodemanager.resource.memory-mb > mapreduce.map.memory.mb > mapreduce.task.io.sort.mb.

It is not mandatory to increase mapreduce.map.memory.mb if you increase the sort buffer. Let's use another example: say you are using a 4 GB container with a 3.2 GB heap, and you are spilling a lot of records because you are still using the default sort buffer size, so you increase it to 1 GB. You have just shrunk the memory left for your heap from roughly 3.1 GB (3.2 GB minus the ~100 MB default buffer) to 2.2 GB (3.2 GB minus 1 GB). To compensate you could increase your heap, and along with that your mapper memory. In this example it would then look like a 5 GB container, a 4.2 GB heap, and a 1 GB sort buffer.
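A minimal sketch of that last example expressed as mapred-site.xml properties; the numbers come straight from the paragraph above, and the exact heap flag in mapreduce.map.java.opts is an assumption:

```xml
<!-- Sketch of the 5 GB container / 4.2 GB heap / 1 GB sort buffer example above. -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>5120</value>        <!-- 5 GB map container -->
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx4300m</value>   <!-- ~4.2 GB mapper heap inside that container -->
</property>
<property>
  <name>mapreduce.task.io.sort.mb</name>
  <value>1024</value>        <!-- 1 GB sort buffer, carved out of the heap -->
</property>
```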
03-01-2017
01:59 PM
The YARN service is unable to get the HDFS delegation token on behalf of the users. What values do you have for the settings below?

hadoop.proxyuser.yarn.hosts
hadoop.proxyuser.yarn.groups
hadoop.proxyuser.mapred.hosts
hadoop.proxyuser.mapred.groups
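For reference, a common (purely illustrative) core-site.xml proxyuser configuration looks like this; the wildcard values are one frequent choice, not a recommendation for every cluster:

```xml
<!-- Illustrative only: allow the yarn user to impersonate users from any host and group. -->
<property>
  <name>hadoop.proxyuser.yarn.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.yarn.groups</name>
  <value>*</value>
</property>
<!-- Same pattern for the mapred user. -->
<property>
  <name>hadoop.proxyuser.mapred.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.mapred.groups</name>
  <value>*</value>
</property>
```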
03-01-2017
12:33 PM
The best indicators are the job counters. Take a look at "FILE: Number of bytes written" and "Spilled Records", especially in relation to "Map output records". If the spilled records are a large portion of the map output records, you are spilling a lot. The first setting below determines how much memory to use for the map-side sort, and the spill percentage is the point at which it starts spilling to disk, as a fraction of the first setting. You can tweak both to reduce the amount spilled. The io.sort.mb is a part of the map heap, so there isn't a clear-cut "it should be X". You can play around and test it for your job to see how much you can give it without slowing your mappers down from processing data. You could also increase the mapper memory as you increase io.sort.mb.

mapreduce.task.io.sort.mb
mapreduce.map.sort.spill.percent
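As a hedged illustration, the two knobs could be set like this in mapred-site.xml; the values are placeholders to show the shape, not tuned recommendations:

```xml
<!-- Placeholder values; test against your own job counters before adopting. -->
<property>
  <name>mapreduce.task.io.sort.mb</name>
  <value>512</value>   <!-- memory for the map-side sort buffer, taken from the mapper heap -->
</property>
<property>
  <name>mapreduce.map.sort.spill.percent</name>
  <value>0.90</value>  <!-- fraction of the sort buffer at which spilling to disk begins -->
</property>
```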
03-01-2017
11:34 AM
Check out this thread. There seems to be an issue with Python getting the wrong number of columns from the alternatives command. The likely culprit is OpenJDK, and you should uninstall it if you can, but it could be something else. The thread contains a command to hunt down the exact issue. https://community.cloudera.com/t5/Cloudera-Manager-Installation/Problem-with-cloudera-agent/td-p/47698
03-01-2017
11:29 AM
I have not seen or heard of a spark-sql binary with which to launch Spark jobs. My best guess is that it is used in conjunction with the Spark Thrift server. This feature of Spark is not included or supported in CDH (that is not to say you can't use it, but the spark-sql binary will not exist by default). If you have already installed the Spark Thrift server, then you need to add the Spark SQL CLI as well (and add it to your $PATH if you want to use it without typing the full path). https://www.cloudera.com/documentation/enterprise/release-notes/topics/cdh_rn_spark_ki.html
03-01-2017
10:04 AM
When you added it through CM, did it show up in spark-defaults.conf?
03-01-2017
10:02 AM
It is just this one job, right? Can you provide the full job logs?