Created 06-12-2016 10:11 AM
Hi,
We have written a mapreduce job to process log files. As of now we have around 52GB of input files but it is taking around an hour to process the data.It creates only one reducer job by default.Often we get to see a timeout error in the reduce task and then it restarts and gets completed. Below is the stats for the successful completion of the job. Kindly let us know how the performance can be improved.
 File System Counters
                FILE: Number of bytes read=876100387
                FILE: Number of bytes written=1767603407
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=52222279591
                HDFS: Number of bytes written=707429882
                HDFS: Number of read operations=351
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=2
        Job Counters 
                Failed reduce tasks=1
                Launched map tasks=116
                Launched reduce tasks=2
                Other local map tasks=116
                Total time spent by all maps in occupied slots (ms)=9118125
                Total time spent by all reduces in occupied slots (ms)=7083783
                Total time spent by all map tasks (ms)=3039375
                Total time spent by all reduce tasks (ms)=2361261
                Total vcore-seconds taken by all map tasks=3039375
                Total vcore-seconds taken by all reduce tasks=2361261
                Total megabyte-seconds taken by all map tasks=25676640000
                Total megabyte-seconds taken by all reduce tasks=20552415744
        Map-Reduce Framework
                Map input records=49452982
                Map output records=5730971
                Map output bytes=864140911
                Map output materialized bytes=876101077
                Input split bytes=13922
                Combine input records=0
                Combine output records=0
                Reduce input groups=1082133
                Reduce shuffle bytes=876101077
                Reduce input records=5730971
                Reduce output records=5730971
                Spilled Records=11461942
                Shuffled Maps =116
                Failed Shuffles=0
                Merged Map outputs=116
                GC time elapsed (ms)=190633
                CPU time spent (ms)=4536110
                Physical memory (bytes) snapshot=340458307584
                Virtual memory (bytes) snapshot=1082745069568
                Total committed heap usage (bytes)=378565820416
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters 
                Bytes Read=52222265669
        File Output Format Counters 
                Bytes Written=707429882
					
				
			
			
				
			
			
			
			
			
			
			
		Created 06-12-2016 10:27 AM
Hello Venkadesh
It would be worth investigating why your reducer gets a time out error and then gets completed. Do you have a slow node, is a code related error, are your reducers sitting around too long. According to those questions many options are available - you could increase the task timeout, (mapred.task.timeout)
- you could force increase the number of reducers to get better distribution (mapred.reduce.tasks=# of reducers)
-you could configure when reducers start get to start clsoer to the end of the map phase (mapred.reduce.slowstart.completed.maps)
- you could have speculative execution to see if some nodes are faster then others:
These are some ideas that come to mind, depending on a closer analysis their might also be other ways.
hope any of this helps
Created 06-12-2016 10:44 AM
Thanks @nmaillard. When the job starts it prints the below warnings and it takes around ten minutes before the job gets started. Is this something normal or need to be looked into ?
[root@emrbldbgdapd1 exchange]# yarn jar ExchangeLogsMR.jar /ExchangeLogs/OnPremise/2016 /ExchangeLogs/output1 16/06/12 13:40:33 INFO impl.TimelineClientImpl: Timeline service address: http://emrbldbgdapd2.emaar.ae:8188/ws/v1/timeline/ 16/06/12 13:40:33 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2 16/06/12 13:40:34 INFO input.FileInputFormat: Total input paths to process : 8 16/06/12 13:40:35 WARN hdfs.DFSClient: DFSInputStream has been closed already 16/06/12 13:42:23 WARN hdfs.DFSClient: DFSInputStream has been closed already 16/06/12 13:44:27 WARN hdfs.DFSClient: DFSInputStream has been closed already 16/06/12 13:46:37 WARN hdfs.DFSClient: DFSInputStream has been closed already 16/06/12 13:48:42 WARN hdfs.DFSClient: DFSInputStream has been closed already 16/06/12 13:50:28 WARN hdfs.DFSClient: DFSInputStream has been closed already 16/06/12 13:50:28 WARN hdfs.DFSClient: DFSInputStream has been closed already 16/06/12 13:50:28 WARN hdfs.DFSClient: DFSInputStream has been closed already 16/06/12 13:50:28 INFO mapreduce.JobSubmitter: number of splits:116 16/06/12 13:50:28 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1465711050466_0004 16/06/12 13:50:28 INFO impl.YarnClientImpl: Submitted application application_1465711050466_0004
Created 06-12-2016 11:56 AM
The warning DFSInputStream has been closed already is a warning and is fixed in hadoop 2.7.2 : https://issues.apache.org/jira/browse/HDFS-8099 taking 10 minutes to submit the job seems to be another problem from your reduce issue. Do check what your available ressources look like in yarn and how long it takes to get an Application master. i t would interesting in the logs to see if he is waiting or what else is happening. You could also check it your timeline server is responding and not underwater as it could have impacts
Created 06-13-2016 07:03 AM
@nmaillardWhen the process starts I see the YARN memory as 100% on the Ambari homepage.How do we find the time taken to get an Application master.
 
					
				
				
			
		
