Member since: 08-16-2016
Posts: 642
Kudos Received: 131
Solutions: 68
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 3978 | 10-13-2017 09:42 PM |
| | 7477 | 09-14-2017 11:15 AM |
| | 3799 | 09-13-2017 10:35 PM |
| | 6041 | 09-13-2017 10:25 PM |
| | 6604 | 09-13-2017 10:05 PM |
03-02-2017
07:49 AM
The only suggestion I have is to try running some tests to see if you can weed out any bad disks. DFSIO and TeraSort may surface it, but they may not. You can use 'dd' or other software to test the raw disks. Beyond that, you may be chasing ghosts (spending more time than it is worth on an ephemeral problem).
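As a rough sketch, a sequential write/read check with dd could look like the following; the mount point, file name, and sizes are placeholders, and the direct I/O flags behave differently across platforms, so treat this as illustrative only:

```sh
# Hypothetical example: sequential write test against one data-disk mount point.
# Replace /data/1 with the actual mount; bs and count set the test size (~8 GB here).
dd if=/dev/zero of=/data/1/ddtest.tmp bs=1M count=8192 oflag=direct
# Read the same file back, bypassing the page cache where direct I/O is supported.
dd if=/data/1/ddtest.tmp of=/dev/null bs=1M iflag=direct
# Clean up the test file.
rm -f /data/1/ddtest.tmp
```

Running the same test on each disk and comparing the reported throughput can help single out a disk that is noticeably slower than its peers.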
03-02-2017
07:44 AM
Yes. That setting only puts a cap on how large a container can be. It does not mean that your containers will be this size. yarn.scheduler.minimum-allocation-mb sets the container size if one is not provided by the user.
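For illustration, the two properties might look like this in yarn-site.xml; the values below are placeholders, not recommendations:

```xml
<!-- Placeholder values for illustration only; tune for your hosts. -->
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>1024</value>  <!-- size used when a request does not specify one -->
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>8192</value>  <!-- cap on how large a single container can be -->
</property>
```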
03-01-2017
10:09 PM
1 Kudo
yarn.scheduler.maximum-allocation-mb - This is the maximum memory a single container can get.
yarn.nodemanager.resource.memory-mb - This is how much memory per NodeManager is allocated for containers.

I always set yarn.scheduler.maximum-allocation-mb equal to yarn.nodemanager.resource.memory-mb, since the single largest container I could run on a host is the amount of memory on that host allocated for YARN. You can set yarn.scheduler.maximum-allocation-mb to any value; as mentioned, it should not exceed what you set for yarn.nodemanager.resource.memory-mb. If it does, it won't harm anything until someone tries to request a container larger than yarn.nodemanager.resource.memory-mb.

You might be able to set mapreduce.task.io.sort.mb = 4 GB. The issue is that the sort buffer is part of the mapper's heap. For instance, a mapper of 3 GB with a heap of 2.5 GB would mean that the sort buffer could quickly fill up the 2.5 GB of heap available. You may not always hit OOM, but it is likely with such a poor configuration. In summary: yarn.nodemanager.resource.memory-mb > mapreduce.map.memory.mb > mapreduce.task.io.sort.mb.

It is not mandatory to increase mapreduce.map.memory.mb if you increase the sort buffer. Let's use another example: say you are using a 4 GB container with a 3.2 GB heap, and you are spilling a lot of records because you are still using the default sort buffer size, so you increase it to 1 GB. You have just shrunk the memory left for your heap from roughly 3.1 GB (3.2 GB minus the ~100 MB default buffer) to 2.2 GB (3.2 GB minus 1 GB). To compensate you could increase your heap, and along with that your mapper memory. In this example it would then look like a 5 GB container, a 4.2 GB heap, and a 1 GB sort buffer.
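A minimal sketch of that last example expressed as mapred-site.xml properties; the numbers come straight from the paragraph above, and the exact heap flag in mapreduce.map.java.opts is an assumption:

```xml
<!-- Sketch of the 5 GB container / 4.2 GB heap / 1 GB sort buffer example above. -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>5120</value>        <!-- 5 GB map container -->
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx4300m</value>   <!-- ~4.2 GB mapper heap inside that container -->
</property>
<property>
  <name>mapreduce.task.io.sort.mb</name>
  <value>1024</value>        <!-- 1 GB sort buffer, carved out of the heap -->
</property>
```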
03-01-2017
01:59 PM
The YARN service is unable to get the HDFS delegation token on behalf of the users. What values do you have for the settings below?

hadoop.proxyuser.yarn.hosts
hadoop.proxyuser.yarn.groups
hadoop.proxyuser.mapred.hosts
hadoop.proxyuser.mapred.groups
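For reference, a common (purely illustrative) core-site.xml proxyuser configuration looks like this; the wildcard values are one frequent choice, not a recommendation for every cluster:

```xml
<!-- Illustrative only: allow the yarn user to impersonate users from any host and group. -->
<property>
  <name>hadoop.proxyuser.yarn.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.yarn.groups</name>
  <value>*</value>
</property>
<!-- Same pattern for the mapred user. -->
<property>
  <name>hadoop.proxyuser.mapred.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.mapred.groups</name>
  <value>*</value>
</property>
```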
03-01-2017
12:33 PM
The best indicators are the job counters. Take a look at "FILE: Number of bytes written" and "Spilled Records", especially in relation to "Map output records". If the spilled records are a large portion of the map output records, you are spilling a lot. The first setting below determines how much memory to use for the map-side sort, and the spill percentage is the point at which it starts spilling to disk, as a fraction of the first setting. You can tweak both to reduce the amount spilled. The io.sort.mb is a part of the map heap, so there isn't a clear-cut "it should be X". You can play around and test it for your job to see how much you can give it without slowing your mappers down from processing data. You could also increase the mapper memory as you increase io.sort.mb.

mapreduce.task.io.sort.mb
mapreduce.map.sort.spill.percent
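As a hedged illustration, the two knobs could be set like this in mapred-site.xml; the values are placeholders to show the shape, not tuned recommendations:

```xml
<!-- Placeholder values; test against your own job counters before adopting. -->
<property>
  <name>mapreduce.task.io.sort.mb</name>
  <value>512</value>   <!-- memory for the map-side sort buffer, taken from the mapper heap -->
</property>
<property>
  <name>mapreduce.map.sort.spill.percent</name>
  <value>0.90</value>  <!-- fraction of the sort buffer at which spilling to disk begins -->
</property>
```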
03-01-2017
11:34 AM
Check out this thread. There seems to be an issue with Python getting the wrong number of columns from the alternatives command. The likely culprit is OpenJDK, and you should uninstall it if you can, but it could be something else. The thread contains a command to hunt down the exact issue. https://community.cloudera.com/t5/Cloudera-Manager-Installation/Problem-with-cloudera-agent/td-p/47698
03-01-2017
11:29 AM
I have not seen or heard of a spark-sql binary with which to launch Spark jobs. My best guess is that it is used in conjunction with the Spark Thrift server. This feature of Spark is not included or supported in CDH (that is not to say you can't use it, but the spark-sql binary will not exist by default). If you have already installed the Spark Thrift server, then you need to add the Spark SQL CLI as well (and add it to your $PATH if you want to use it without typing the full path). https://www.cloudera.com/documentation/enterprise/release-notes/topics/cdh_rn_spark_ki.html
03-01-2017
10:04 AM
When you added it through CM, did it show up in spark-defaults.conf?
03-01-2017
10:02 AM
It is just this one job, right? Can you provide the full job logs?