Support Questions


YARN CDH5 -> Stuck running an example job: "Starting scan to move intermediate done files"

Explorer

Hello dear reader,

 

I am configuring a CDH5 Hadoop cluster on 3 nodes:

1 NameNode + ResourceManager
1 DataNode + NodeManager + JobHistory Server
1 DataNode + NodeManager

HDFS works.

I configured YARN following advice from Cloudera and other websites.

I tried to run the Pi-calculation example with this command on my NameNode:

asimon@namenode:/etc/hadoop/conf$ sudo hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples-2.5.0-cdh5.2.0.jar pi 2 2

which gives me:

 

Number of Maps  = 2
Samples per Map = 2
Wrote input for Map #0
Wrote input for Map #1
Starting Job
14/11/18 17:03:04 INFO client.RMProxy: Connecting to ResourceManager at namenode.beijaflore.com/192.168.54.111:8032
14/11/18 17:03:04 INFO input.FileInputFormat: Total input paths to process : 2
14/11/18 17:03:04 INFO mapreduce.JobSubmitter: number of splits:2
14/11/18 17:03:05 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1416326563510_0001
14/11/18 17:03:05 INFO impl.YarnClientImpl: Submitted application application_1416326563510_0001
14/11/18 17:03:05 INFO mapreduce.Job: The url to track the job: http://namenode.beijaflore.com:8088/proxy/application_1416326563510_0001/
14/11/18 17:03:05 INFO mapreduce.Job: Running job: job_1416326563510_0001

But the job is stuck here and never shows any actual progress... My 3 VMs are headless servers, so I don't have a UI to check whether the job is really pending.

As for the actual logs, this is what I get in mapred-mapred-historyserver-datanode1.out:

14/11/18 17:32:34 INFO hs.JobHistory: History Cleaner started
14/11/18 17:32:34 INFO hs.JobHistory: History Cleaner complete
14/11/18 17:35:04 INFO hs.JobHistory: Starting scan to move intermediate done files
14/11/18 17:38:04 INFO hs.JobHistory: Starting scan to move intermediate done files
14/11/18 17:41:04 INFO hs.JobHistory: Starting scan to move intermediate done files
14/11/18 17:44:04 INFO hs.JobHistory: Starting scan to move intermediate done files

(I interrupted the first job and ran it a second time, which is why the timestamps don't match.)

It logs this every 3 minutes... (I left it running for a whole night and it just kept going.)

 

I don't know what is wrong. I thought I had a hint with the minimum and maximum memory settings (my total capacity is only 6 CPUs and 8 GB of RAM), but changing them actually made no difference.

 

If you have any idea what is wrong, or where I could look to find the problem, I would be grateful for any pointers.

 

Cheers

1 ACCEPTED SOLUTION

Explorer

Thanks for your replies.

 

I did not touch the cluster for a few days. Now I have fixed the problem. Maybe the memory settings helped a little: I also installed a NodeManager on my NameNode, set the key yarn.nodemanager.resource.memory-mb to 3 GB, and set the maximum container size to 2 GB.

 

It appears from the NodeManager logs that I had trouble with this property:

 

  <property>

    <description>Where to aggregate logs to.</description>

    <name>yarn.nodemanager.remote-app-log-dir</name>

    <value>hdfs://namenode.*****.com:8020</value>

  </property>

 

So I just reverted it to the default by deleting this property. Now it works!
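If you do want to keep aggregated logs on HDFS explicitly instead of deleting the key, I believe the value needs to be a full URI including a path, not just the scheme and authority as I had it. Something like this should be well-formed (untested on my side, since I simply removed the key):

  <property>
    <description>Where to aggregate logs to.</description>
    <name>yarn.nodemanager.remote-app-log-dir</name>
    <value>hdfs://namenode.*****.com:8020/var/log/hadoop-yarn/apps</value>
  </property>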

 

See the beautiful result of my Pi example:

 

asimon@namenode:/etc/hadoop/conf$ sudo -u mapred  hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples-2.5.0-cdh5.2.0.jar pi 2 2
Number of Maps  = 2
Samples per Map = 2
Wrote input for Map #0
Wrote input for Map #1
Starting Job
14/11/28 15:27:31 INFO client.RMProxy: Connecting to ResourceManager at namenode.beijaflore.com/192.168.54.111:8032
14/11/28 15:27:31 INFO input.FileInputFormat: Total input paths to process : 2
14/11/28 15:27:31 INFO mapreduce.JobSubmitter: number of splits:2
14/11/28 15:27:31 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1417184716260_0001
14/11/28 15:27:32 INFO impl.YarnClientImpl: Submitted application application_1417184716260_0001
14/11/28 15:27:32 INFO mapreduce.Job: The url to track the job: http://namenode.beijaflore.com:8088/proxy/application_1417184716260_0001/
14/11/28 15:27:32 INFO mapreduce.Job: Running job: job_1417184716260_0001
14/11/28 15:27:41 INFO mapreduce.Job: Job job_1417184716260_0001 running in uber mode : false
14/11/28 15:27:41 INFO mapreduce.Job:  map 0% reduce 0%
14/11/28 15:27:46 INFO mapreduce.Job:  map 50% reduce 0%
14/11/28 15:27:53 INFO mapreduce.Job:  map 100% reduce 0%
14/11/28 15:27:54 INFO mapreduce.Job:  map 100% reduce 100%
14/11/28 15:27:55 INFO mapreduce.Job: Job job_1417184716260_0001 completed successfully
14/11/28 15:27:55 INFO mapreduce.Job: Counters: 49
        File System Counters
                FILE: Number of bytes read=50
                FILE: Number of bytes written=306673
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=560
                HDFS: Number of bytes written=215
                HDFS: Number of read operations=11
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=3
        Job Counters
                Launched map tasks=2
                Launched reduce tasks=1
                Data-local map tasks=2
                Total time spent by all maps in occupied slots (ms)=108368
                Total time spent by all reduces in occupied slots (ms)=44920
                Total time spent by all map tasks (ms)=13546
                Total time spent by all reduce tasks (ms)=5615
                Total vcore-seconds taken by all map tasks=13546
                Total vcore-seconds taken by all reduce tasks=5615
                Total megabyte-seconds taken by all map tasks=13871104
                Total megabyte-seconds taken by all reduce tasks=5749760
        Map-Reduce Framework
                Map input records=2
                Map output records=4
                Map output bytes=36
                Map output materialized bytes=56
                Input split bytes=324
                Combine input records=0
                Combine output records=0
                Reduce input groups=2
                Reduce shuffle bytes=56
                Reduce input records=4
                Reduce output records=0
                Spilled Records=8
                Shuffled Maps =2
                Failed Shuffles=0
                Merged Map outputs=2
                GC time elapsed (ms)=142
                CPU time spent (ms)=1800
                Physical memory (bytes) snapshot=678866944
                Virtual memory (bytes) snapshot=3176591360
                Total committed heap usage (bytes)=454430720
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=236
        File Output Format Counters
                Bytes Written=97
Job Finished in 24.739 seconds
Estimated value of Pi is 4.00000000000000000000

 

 

Thank you again for your replies. I'll keep hanging around here, especially when I get in trouble!

 

 

Miraculous

 

 


6 REPLIES

Super Collaborator

Miraculous,

 

How have you set up your NMs: how much memory and how many cores are available on the nodes?

This sounds like you do not have enough resources to start the containers.

 

Give the map and reduce containers 1 GB and 1 vcore to start with. Check the NM config values: the yarn.nodemanager.resource.memory-mb and yarn.nodemanager.resource.cpu-vcores settings.

 

Retry it after that.
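For reference, the per-container map and reduce requests go in mapred-site.xml; a sketch with those starting values, using the standard MRv2 property names (adjust to your cluster):

  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>1024</value>
  </property>
  <property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>1024</value>
  </property>
  <property>
    <name>mapreduce.map.cpu.vcores</name>
    <value>1</value>
  </property>
  <property>
    <name>mapreduce.reduce.cpu.vcores</name>
    <value>1</value>
  </property>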

Wilfred

Explorer

Thank you, WilfriedS, for your answer.

 

I have the following resources:

2 vCPUs + 2 GB of vRAM on each of my 2 DataNodes, and

2 vCPUs + 4 GB of vRAM on the NameNode

 

My yarn-site.xml config is:

 

<configuration>
    <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>1024</value>
    </property>

    <property>
        <name>yarn.nodemanager.resource.cpu-vcores</name>
        <value>1</value>
        <description>Number of CPU cores that can be allocated for containers.</description>
    </property>

    <property>
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>128</value>
        <description>Minimum limit of memory to allocate to each container request at the Resource Manager.</description>
    </property>

    <property>
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <value>2048</value>
        <description>Maximum limit of memory to allocate to each container request at the Resource Manager.</description>
    </property>

    <property>
        <name>yarn.scheduler.minimum-allocation-vcores</name>
        <value>1</value>
        <description>The minimum allocation for every container request at the RM, in terms of virtual CPU cores. Requests lower than this won't take effect, and the specified value will get allocated the minimum.</description>
    </property>

    <property>
        <name>yarn.scheduler.maximum-allocation-vcores</name>
        <value>2</value>
        <description>The maximum allocation for every container request at the RM, in terms of virtual CPU cores. Requests higher than this won't take effect, and will get capped to this value.</description>
    </property>


  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>namenode.beijaflore.com</value>
  </property>

  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>

  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>

  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>

  <property>
    <description>List of directories to store localized files in.</description>
    <name>yarn.nodemanager.local-dirs</name>
    <value>file:///var/lib/hadoop-yarn/cache/${user.name}/nm-local-dir</value>
  </property>

  <property>
    <description>Where to store container logs.</description>
    <name>yarn.nodemanager.log-dirs</name>
    <value>file:///var/log/hadoop-yarn/containers</value>
  </property>

  <property>
    <description>Where to aggregate logs to.</description>
    <name>yarn.nodemanager.remote-app-log-dir</name>
    <value>hdfs://var/log/hadoop-yarn/apps</value>
  </property>

  <property>
    <description>Classpath for typical applications.</description>
     <name>yarn.application.classpath</name>
     <value>
        $HADOOP_CONF_DIR,
        $HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,
        $HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,
        $HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*,
        $HADOOP_YARN_HOME/*,$HADOOP_YARN_HOME/lib/*
     </value>
  </property>
</configuration>

 

Then I restarted the YARN and JobHistory services on the 3 nodes and tried the job again... same problem.

 

I would like to add that once I have started the job, I can list and kill it with hadoop job -list and hadoop job -kill.

But the loop in the JobHistory log continues every 3 minutes -> INFO hs.JobHistory: Starting scan to move intermediate done files.

Even when I restart the services, the JobHistory log just starts repeating again... How do I make it stop?

 

/var/log/hadoop-yarn/containers is an empty directory.

 

Do I need more memory on the node manager?

 

Thanks

 

Mentor
> Do I need more memory on the node manager?

Yes, you could do with more memory in the NodeManager configs (key: yarn.nodemanager.resource.memory-mb). Try setting it to 2 or 3 GB across all 3 NodeManagers, restart them, and retry your example job.
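Concretely, that means this property in yarn-site.xml on each NodeManager host (3072 MB shown purely as an example value; size it to what each VM can spare):

  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>3072</value>
  </property>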

Mentor
Glad to know it worked; thank you for following up. Please mark the topic as resolved so that others who hit the same or similar issues have an easier time finding this.

New Contributor
In my case, neither the ResourceManager nor the NodeManager was up and running.

Phew!