Created 05-26-2016 02:04 PM
Stack: HDP-2.3.2.0-2950, installed using Ambari 2.1
Nodes: 1 NN (8 x 1 TB HDD, 16 cores @ 2.53 GHz, 48 GB RAM, RHEL 6.5) + 8 DN (8 x 600 GB HDD, 16 cores @ 2.53 GHz, 75 GB RAM, RHEL 6.5). Nodes are connected by a 10-gigabit network.
I have a plain (non-ORC) staging Hive table with 24 billion records.
I created an empty ORC table as follows:
CREATE EXTERNAL TABLE IF NOT EXISTS FactSampleValue (
  `Snapshot_Id` int /* OTHER COLUMNS */
)
PARTITIONED BY (`SmapiName_ver` varchar(30))
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS ORC
LOCATION '/datastore/';
Some session settings:
Logging initialized using configuration in file:/etc/hive/2.3.2.0-2950/0/hive-log4j.properties
hive> set optimize.sort.dynamic.partitioning=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> set hive.exec.max.dynamic.partitions.pernode=3000;
hive> set hive.enforce.sorting=true;
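Two of those keys deserve a second look (my observation, not something confirmed in the thread): the sorted-dynamic-partition optimization is actually named hive.optimize.sort.dynamic.partitioning (the CLI silently accepts the unprefixed optimize.sort.dynamic.partitioning shown above, but Hive never reads it), and hive.enforce.sorting only governs inserts into bucketed tables. A corrected session would presumably look like this:

set hive.optimize.sort.dynamic.partitioning=true;
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=3000;
set hive.exec.max.dynamic.partitions=10000; -- assumption: the global cap (default 1000) must also cover all partitions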
I executed an insert:
INSERT INTO odp_dw_may2016_orc.FactSampleValue PARTITION (SmapiName_ver)
SELECT * FROM odp_dw_may2016.FactSampleValue
DISTRIBUTE BY SmapiName_ver SORT BY SmapiName_ver;

Query ID = hive_20160526125733_8834c7bc-b4f3-4539-8d48-fa46bba92a33
Total jobs = 1
Launching Job 1 out of 1

******REDUCERS NOT STARTING

Status: Running (Executing on YARN cluster with App id application_1446726117927_0092)
--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1                RUNNING   3098          0      110     2988       0       0
Reducer 2             INITED   1009          0        0     1009       0       0
--------------------------------------------------------------------------------
VERTICES: 00/02  [>>--------------------------] 0%    ELAPSED TIME: 12.70 s
--------------------------------------------------------------------------------
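One more thing worth verifying here (my note, not raised in the thread): with dynamic partitioning, Hive binds the partition column to the last expression in the SELECT list, so SELECT * is only correct if SmapiName_ver happens to be the final column of the staging table. Spelling the columns out makes that explicit (column names other than the two shown in the DDL are placeholders):

INSERT INTO odp_dw_may2016_orc.FactSampleValue PARTITION (SmapiName_ver)
SELECT Snapshot_Id, /* ...other columns..., */ SmapiName_ver
FROM odp_dw_may2016.FactSampleValue
DISTRIBUTE BY SmapiName_ver SORT BY SmapiName_ver;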
After a long time, the mappers completed but the reducers started failing:
Status: Running (Executing on YARN cluster with App id application_1446726117927_0092)
--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ........      RUNNING   3098       2655       94      349       0       0
Reducer 2           RUNNING   1009         45      110      854      49      91
--------------------------------------------------------------------------------
VERTICES: 01/02  [=================>>---------] 65%   ELAPSED TIME: 8804.16 s
--------------------------------------------------------------------------------
As seen above, A FEW mappers started again; I assume these are retries. There were further failures. The latest status:
--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ........      RUNNING   3098       2773      113      212       0       8
Reducer 2           RUNNING   1009         45      110      854      57     119
--------------------------------------------------------------------------------
VERTICES: 01/02  [=================>>---------] 68%   ELAPSED TIME: 10879.73 s
--------------------------------------------------------------------------------
I suspect a memory-related issue, but I am not sure which logs I should check. For example, under log/application_1446726117927_0092 I found several containers, and many of them had the following error in syslog_attempt_1446726117927_0092_1_01_000041_1:
2016-05-26 15:45:11,932 [WARN] [TezTaskEventRouter{attempt_1446726117927_0092_1_01_000041_1}] |orderedgrouped.ShuffleScheduler|: Map_1: Duplicate fetch of input no longer needs to be fetched: InputAttemptIdentifier [inputIdentifier=InputIdentifier [inputIndex=709], attemptNumber=1, pathComponent=attempt_1446726117927_0092_1_00_000709_1_10012, spillType=0, spillId=-1]
2016-05-26 15:45:24,251 [INFO] [TaskHeartbeatThread] |task.TaskReporter|: Received should die response from AM
2016-05-26 15:45:24,251 [INFO] [TaskHeartbeatThread] |task.TaskReporter|: Asked to die via task heartbeat
2016-05-26 15:45:24,253 [INFO] [main] |task.TezTaskRunner|: Interrupted while waiting for task to complete. Interrupting task
2016-05-26 15:45:24,254 [INFO] [main] |task.TezTaskRunner|: Shutdown requested... returning
2016-05-26 15:45:24,254 [INFO] [main] |task.TezChild|: Got a shouldDie notification via heartbeats for container container_1446726117927_0092_01_000187. Shutting down
2016-05-26 15:45:24,254 [INFO] [main] |task.TezChild|: Shutdown invoked for container container_1446726117927_0092_01_000187
2016-05-26 15:45:24,254 [INFO] [main] |task.TezChild|: Shutting down container container_1446726117927_0092_01_000187
2016-05-26 15:45:24,255 [ERROR] [TezChild] |tez.ReduceRecordProcessor|: Hit error while closing operators - failing tree
2016-05-26 15:45:24,256 [ERROR] [TezChild] |tez.TezProcessor|: java.lang.InterruptedException
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2052)
    at org.apache.tez.runtime.InputReadyTracker$InputReadyMonitor.awaitCondition(InputReadyTracker.java:120)
    at org.apache.tez.runtime.InputReadyTracker.waitForAllInputsReady(InputReadyTracker.java:90)
    at org.apache.tez.runtime.api.impl.TezProcessorContextImpl.waitForAllInputsReady(TezProcessorContextImpl.java:116)
    at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.init(ReduceRecordProcessor.java:117)
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:147)
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:137)
    at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:344)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:179)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:171)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:167)
    at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
2016-05-26 15:45:24,257 [INFO] [TezChild] |task.TezTaskRunner|: Encounted an error while executing task: attempt_1446726117927_0092_1_01_000041_1
java.lang.RuntimeException: java.lang.InterruptedException
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:171)
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:137)
    at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:344)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:179)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:171)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:167)
    at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.InterruptedException
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2052)
    at org.apache.tez.runtime.InputReadyTracker$InputReadyMonitor.awaitCondition(InputReadyTracker.java:120)
    at org.apache.tez.runtime.InputReadyTracker.waitForAllInputsReady(InputReadyTracker.java:90)
    at org.apache.tez.runtime.api.impl.TezProcessorContextImpl.waitForAllInputsReady(TezProcessorContextImpl.java:116)
    at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.init(ReduceRecordProcessor.java:117)
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:147)
    ... 14 more
2016-05-26 15:45:24,260 [INFO] [TezChild] |task.TezTaskRunner|: Ignoring the following exception since a previous exception is already registered
2016-05-26 15:45:24,275 [INFO] [TezChild] |runtime.LogicalIOProcessorRuntimeTask|: Final Counters for attempt_1446726117927_0092_1_01_000041_1: Counters: 71
  [[File System Counters FILE_BYTES_READ=290688, FILE_BYTES_WRITTEN=227062571, FILE_READ_OPS=0, FILE_LARGE_READ_OPS=0, FILE_WRITE_OPS=0, HDFS_BYTES_READ=0, HDFS_BYTES_WRITTEN=0, HDFS_READ_OPS=0, HDFS_LARGE_READ_OPS=0, HDFS_WRITE_OPS=0]
  [org.apache.tez.common.counters.TaskCounter REDUCE_INPUT_GROUPS=0, REDUCE_INPUT_RECORDS=0, COMBINE_INPUT_RECORDS=0, SPILLED_RECORDS=0, NUM_SHUFFLED_INPUTS=73, NUM_SKIPPED_INPUTS=3022, NUM_FAILED_SHUFFLE_INPUTS=0, MERGED_MAP_OUTPUTS=53, GC_TIME_MILLIS=11162, CPU_MILLISECONDS=62510, PHYSICAL_MEMORY_BYTES=664797184, VIRTUAL_MEMORY_BYTES=2432901120, COMMITTED_HEAP_BYTES=664797184, OUTPUT_RECORDS=0, ADDITIONAL_SPILLS_BYTES_WRITTEN=227062571, ADDITIONAL_SPILLS_BYTES_READ=0, SHUFFLE_BYTES=326055728, SHUFFLE_BYTES_DECOMPRESSED=2123360327, SHUFFLE_BYTES_TO_MEM=286174772, SHUFFLE_BYTES_TO_DISK=0, SHUFFLE_BYTES_DISK_DIRECT=39880956, NUM_MEM_TO_DISK_MERGES=2, NUM_DISK_TO_DISK_MERGES=0, SHUFFLE_PHASE_TIME=0, MERGE_PHASE_TIME=0, FIRST_EVENT_RECEIVED=264, LAST_EVENT_RECEIVED=49569]
  [Shuffle Errors BAD_ID=0, CONNECTION=0, IO_ERROR=0, WRONG_LENGTH=0, WRONG_MAP=0, WRONG_REDUCE=0]
  [Shuffle Errors_Reducer_2_INPUT_Map_1 BAD_ID=0, CONNECTION=0, IO_ERROR=0, WRONG_LENGTH=0, WRONG_MAP=0, WRONG_REDUCE=0]
  [TaskCounter_Reducer_2_INPUT_Map_1 ADDITIONAL_SPILLS_BYTES_READ=0, ADDITIONAL_SPILLS_BYTES_WRITTEN=227062571, COMBINE_INPUT_RECORDS=0, FIRST_EVENT_RECEIVED=264, LAST_EVENT_RECEIVED=49569, MERGED_MAP_OUTPUTS=53, MERGE_PHASE_TIME=0, NUM_DISK_TO_DISK_MERGES=0, NUM_FAILED_SHUFFLE_INPUTS=0, NUM_MEM_TO_DISK_MERGES=2, NUM_SHUFFLED_INPUTS=73, NUM_SKIPPED_INPUTS=3022, REDUCE_INPUT_GROUPS=0, REDUCE_INPUT_RECORDS=0, SHUFFLE_BYTES=326055728, SHUFFLE_BYTES_DECOMPRESSED=2123360327, SHUFFLE_BYTES_DISK_DIRECT=39880956, SHUFFLE_BYTES_TO_DISK=0, SHUFFLE_BYTES_TO_MEM=286174772, SHUFFLE_PHASE_TIME=0, SPILLED_RECORDS=0]
  [TaskCounter_Reducer_2_OUTPUT_out_Reducer_2 OUTPUT_RECORDS=0]]
2016-05-26 15:45:24,275 [INFO] [TezChild] |runtime.LogicalIOProcessorRuntimeTask|: Joining on EventRouter
2016-05-26 15:45:24,276 [INFO] [TezChild] |runtime.LogicalIOProcessorRuntimeTask|: Closed processor for vertex=Reducer 2, index=1
2016-05-26 15:45:24,276 [INFO] [TezChild] |orderedgrouped.Shuffle|: Shutting down Shuffle for source: Map_1
2016-05-26 15:45:24,276 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Already shutdown. Ignoring error
2016-05-26 15:45:24,276 [INFO] [TezChild] |orderedgrouped.ShuffleInputEventHandlerOrderedGrouped|: Map 1: numDmeEventsSeen=3480, numDmeEventsSeenWithNoData=3395, numObsoletionEventsSeen=443, updateOnClose
2016-05-26 15:45:24,277 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #1, status:false, isInterrupted:false
2016-05-26 15:45:24,277 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #2, status:false, isInterrupted:false
2016-05-26 15:45:24,277 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #3, status:false, isInterrupted:false
2016-05-26 15:45:24,278 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #4, status:false, isInterrupted:false
2016-05-26 15:45:24,278 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #5, status:false, isInterrupted:false
2016-05-26 15:45:24,278 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #6, status:false, isInterrupted:false
2016-05-26 15:45:24,278 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #7, status:false, isInterrupted:false
2016-05-26 15:45:24,279 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #8, status:false, isInterrupted:false
2016-05-26 15:45:24,279 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #9, status:false, isInterrupted:false
2016-05-26 15:45:24,279 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #10, status:false, isInterrupted:false
2016-05-26 15:45:24,280 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #11, status:false, isInterrupted:false
2016-05-26 15:45:24,280 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #12, status:false, isInterrupted:false
2016-05-26 15:45:24,280 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #13, status:false, isInterrupted:false
2016-05-26 15:45:24,280 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #14, status:false, isInterrupted:false
2016-05-26 15:45:24,280 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #15, status:false, isInterrupted:false
2016-05-26 15:45:24,280 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #16, status:false, isInterrupted:false
2016-05-26 15:45:24,291 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #17, status:false, isInterrupted:false
2016-05-26 15:45:24,302 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #18, status:false, isInterrupted:false
2016-05-26 15:45:24,314 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #19, status:false, isInterrupted:false
2016-05-26 15:45:24,314 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #20, status:false, isInterrupted:false
2016-05-26 15:45:24,314 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #21, status:false, isInterrupted:false
2016-05-26 15:45:24,318 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #22, status:false, isInterrupted:false
2016-05-26 15:45:24,319 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #23, status:false, isInterrupted:false
2016-05-26 15:45:24,319 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #24, status:false, isInterrupted:false
2016-05-26 15:45:24,319 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #25, status:false, isInterrupted:false
2016-05-26 15:45:24,319 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #26, status:false, isInterrupted:false
2016-05-26 15:45:24,319 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #27, status:false, isInterrupted:false
2016-05-26 15:45:24,319 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #28, status:false, isInterrupted:false
2016-05-26 15:45:24,320 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #29, status:false, isInterrupted:false
2016-05-26 15:45:24,320 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #30, status:false, isInterrupted:false
2016-05-26 15:45:24,320 [INFO] [TezChild] |orderedgrouped.MergeManager|: finalMerge called with 8 in-memory map-outputs and 14 on-disk map-outputs
2016-05-26 15:45:24,321 [INFO] [TezChild] |impl.TezMerger|: Merging 8 sorted segments
2016-05-26 15:45:24,321 [INFO] [TezChild] |impl.TezMerger|: Down to the last merge-pass, with 8 segments left of total size: 376486161 bytes
Which logs, and which errors, should I be looking for? How can Ambari help?
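(For completeness: assuming YARN log aggregation is enabled, which is the default in an Ambari-managed HDP install, the per-container logs of a finished application can be pulled into one file with the YARN CLI instead of walking the local log/ directories.)

yarn logs -applicationId application_1446726117927_0092 > app_0092.log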
**********EDIT-1**********
The Hive query finally failed with the following error:
Status: Failed
Vertex re-running, vertexName=Map 1, vertexId=vertex_1446726117927_0092_1_00
Vertex failed, vertexName=Reducer 2, vertexId=vertex_1446726117927_0092_1_01, diagnostics=[
Task failed, taskId=task_1446726117927_0092_1_01_000066, diagnostics=[
TaskAttempt 0 failed, info=[Container container_1446726117927_0092_01_000036 finished with diagnostics set to [Container failed, exitCode=1. Exception from container-launch.
Container id: container_1446726117927_0092_01_000036
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:576)
    at org.apache.hadoop.util.Shell.run(Shell.java:487)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:753)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 1 ]],
TaskAttempt 1 failed, info=[Container container_1446726117927_0092_01_000199 finished with diagnostics set to [Container failed, exitCode=1. Exception from container-launch.
Container id: container_1446726117927_0092_01_000199
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:576)
    at org.apache.hadoop.util.Shell.run(Shell.java:487)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:753)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 1 ]],
TaskAttempt 2 failed, info=[Error: Fatal Error cause TezChild exit.:java.lang.OutOfMemoryError: Java heap space
    at sun.nio.cs.StreamEncoder.write(StreamEncoder.java:133)
    at java.io.OutputStreamWriter.write(OutputStreamWriter.java:220)
    at java.io.Writer.write(Writer.java:157)
    at org.apache.log4j.helpers.QuietWriter.write(QuietWriter.java:48)
    at org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:310)
    at org.apache.log4j.WriterAppender.append(WriterAppender.java:162)
    at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
    at org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
    at org.apache.log4j.Category.callAppenders(Category.java:206)
    at org.apache.log4j.Category.forcedLog(Category.java:391)
    at org.apache.log4j.Category.log(Category.java:856)
    at org.apache.commons.logging.impl.Log4JLogger.error(Log4JLogger.java:218)
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:156)
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:137)
    at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:344)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:179)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:171)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:167)
    at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745) ],
TaskAttempt 3 failed, info=[Container container_1446726117927_0092_01_000299 finished with diagnostics set to [Container failed, exitCode=1. Exception from container-launch.
Container id: container_1446726117927_0092_01_000299
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:576)
    at org.apache.hadoop.util.Shell.run(Shell.java:487)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:753)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 1 ]]],
Vertex did not succeed due to OWN_TASK_FAILURE, failedTasks:1 killedTasks:944, Vertex vertex_1446726117927_0092_1_01 [Reducer 2] killed/failed due to:OWN_TASK_FAILURE]
Vertex killed, vertexName=Map 1, vertexId=vertex_1446726117927_0092_1_00, diagnostics=[Vertex received Kill while in RUNNING state., Vertex did not succeed due to OTHER_VERTEX_FAILURE, failedTasks:0 killedTasks:23, Vertex vertex_1446726117927_0092_1_00 [Map 1] killed/failed due to:OTHER_VERTEX_FAILURE]
DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:1
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask.
As I suspected, there is a memory issue, BUT IT SURFACES ONLY ON A LATER ATTEMPT; the EARLIER attempts failed for an unknown reason (containers exiting with code 1):
Fatal Error cause TezChild exit.: java.lang.OutOfMemoryError: Java heap space

(Note that the OutOfMemoryError is thrown while log4j is writing an error message, since the stack goes through WriterAppender, so it appears the heap was already exhausted before the task could even log the underlying failure.)
The question is: which parameters need to be changed?
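For reference, these are the memory knobs I would start with (the values below are illustrative guesses for this cluster, not tested recommendations):

-- Bigger Tez containers; keep the JVM heap at roughly 80% of the container size
set hive.tez.container.size=4096;
set hive.tez.java.opts=-Xmx3276m;
-- More room for the reduce-side sort/merge
set tez.runtime.io.sort.mb=1024;
-- Fewer, larger reducers so each partition's data lands in fewer tasks
set hive.exec.reducers.bytes.per.reducer=1073741824;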
Created 05-26-2016 07:16 PM
One thing I observed: the DDL specifies ROW FORMAT DELIMITED even though the table is STORED AS ORC. ORC supplies its own serialization, so the delimiter clause should not be there. Can you please remove it and try again?
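For example (a sketch of the same DDL with the delimiter clause dropped; the column list is abbreviated exactly as in the question):

CREATE EXTERNAL TABLE IF NOT EXISTS FactSampleValue (
  `Snapshot_Id` int /* OTHER COLUMNS */
)
PARTITIONED BY (`SmapiName_ver` varchar(30))
STORED AS ORC
LOCATION '/datastore/';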
Created 05-27-2016 12:38 AM
I see you are using a constant value for the partition column. You might be hitting this issue:
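(To illustrate the distinction, my example rather than part of the original reply: a static-partition insert pins the partition column to a constant in the PARTITION clause, while the query in the question uses dynamic partitioning, where Hive derives the partition value from the data. Table and value names below are placeholders.)

-- Static partitioning: the partition value is a constant
INSERT INTO FactSampleValue PARTITION (SmapiName_ver = 'v1')
SELECT Snapshot_Id /* , other columns */ FROM staging_table;

-- Dynamic partitioning: the value comes from the last SELECT column
INSERT INTO FactSampleValue PARTITION (SmapiName_ver)
SELECT Snapshot_Id /* , other columns */, SmapiName_ver FROM staging_table;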