To provide some more background, we are on a multi-tenant, high-performance "community" cluster at our university. I sat down with the HPCC administrator yesterday and he observed the Hadoop activity on our 4 nodes (1 NameNode, 3 DataNodes). This was his comment:
There's no apparent swapping or memory problem. The load goes up to 30 and all the CPUs indicate over 90%, but the processes using them are in the "sleep" state rather than "running". It looks like they're waiting for I/O, but it's not clear yet whether that means HDFS, Lustre, or the RAID array, or if it's related to the number and size of files in the data set.
FYI, our storage is running on a Lustre appliance. Ubuntu Linux is installed on each node, and they all mount the Lustre filesystem. HDFS resides on top of the Lustre appliance, which may in and of itself be the cause of certain problems, but I don't think that relates to my OP regarding 100% CPU utilization. Does this info from the HPCC administrator give anyone any other ideas about why we are experiencing this?
We have not been able to test the THP setting yet.
A basic Pig script job processing ~100 files takes up 100% of the CPU resources on our cluster (1 NameNode, 3 DataNodes). We want to be able to run jobs in parallel. What configuration variable or environment variable controls how much CPU a particular job is allowed to utilize?
You didn't mention what OS you are running, but on many platforms we've found the Linux setting referred to as "Transparent Hugepage" or THP to be problematic in exactly this way. I also don't know the specifics of your MR job, but something in the userlogs might indicate what the tasks are doing when the CPUs max out. You should definitely check into disabling the "defrag" component of THP, though:
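For reference, a minimal sketch of checking and disabling the THP defrag component on each node. Note this is an assumption about your setup: the sysfs path below is the one used by recent Ubuntu kernels, while some RHEL/CentOS kernels use /sys/kernel/mm/redhat_transparent_hugepage instead, so verify the path on your own nodes first:

```shell
# Show the current defrag setting; the active value appears in [brackets],
# e.g. "[always] madvise never"
cat /sys/kernel/mm/transparent_hugepage/defrag

# Disable defrag (requires root); repeat on every node in the cluster
echo never > /sys/kernel/mm/transparent_hugepage/defrag
```

The echo only lasts until reboot, so to make it persistent you would typically add that line to a boot script such as /etc/rc.local.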
I have moved this thread to the Mapreduce board in hopes that someone in here might have some ideas. Thank you for your additional feedback.
We discovered a likely culprit after discussing the issue with some colleagues at another institution. Our cluster was configured with:
mapred.tasktracker.map.tasks.maximum = 64
mapred.tasktracker.reduce.tasks.maximum = 32
MapReduce Child Java Maximum Heap Size = 1 GB
Modifying these to allow fewer mappers/reducers and a larger heap size will probably help anyone in a similar situation. I will post back with results once we make the changes and re-run the job.
We will initially modify to the following values:
mapred.tasktracker.map.tasks.maximum = 16
mapred.tasktracker.reduce.tasks.maximum = 8
MapReduce Child Java Maximum Heap Size = 4 GB
Each of our DataNode servers has 128GB RAM, FYI.
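To spell out the arithmetic behind the new values: 16 map slots + 8 reduce slots = 24 concurrent tasks per node, and 24 tasks × 4 GB heap = 96 GB, which fits within the 128 GB per DataNode while cutting the maximum task count per node from 96 down to 24. A sketch of how these limits would look in mapred-site.xml, assuming MRv1 property names (the heap size maps to the child JVM's -Xmx option; adjust names for your Hadoop distribution):

```xml
<!-- mapred-site.xml (MRv1): revised per-TaskTracker concurrency limits -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>16</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>8</value>
</property>
<property>
  <!-- passed to each child task JVM; 4 GB max heap per task -->
  <name>mapred.child.java.opts</name>
  <value>-Xmx4096m</value>
</property>
```

These settings take effect on TaskTracker restart, so they need to be pushed to all three DataNodes.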