Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

Hive INSERT failing for a large table

avatar
Super Collaborator
Stack : Installed HDP-2.3.2.0-2950 using Ambari 2.1

Nodes : 1 NN(8 X 1TB hdd, 16 X 2.53 GHz core processor,48GB RAM, RHEL 6.5) + 8 DN(8 X 600GB hdd, 16 X 2.53 GHz core processor, 75GB RAM, RHEL 6.5). Nodes are connected by a 10-gig network

I have a staging/vanilla/simple Hive table with 24 billion records.

I created an empty ORC table as follows :

CREATE EXTERNAL TABLE IF NOT EXISTS FactSampleValue (
`Snapshot_Id` int 

/*OTHER COLUMNS*/
)
PARTITIONED BY (`SmapiName_ver` varchar(30))
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS ORC LOCATION '/datastore/';

Some settings :

Logging initialized using configuration in file:/etc/hive/2.3.2.0-2950/0/hive-log4j.properties
hive>
    >
    > set optimize.sort.dynamic.partitioning=true;
hive>
    > set hive.exec.dynamic.partition.mode=nonstrict;
hive>
    > set hive.exec.max.dynamic.partitions.pernode=3000;
hive>
    >
    > set hive.enforce.sorting=true;
hive>
    >

Executed an insert :

INSERT INTO odp_dw_may2016_orc.FactSampleValue PARTITION (SmapiName_ver) SELECT * FROM odp_dw_may2016.FactSampleValue DISTRIBUTE BY SmapiName_ver SORT BY SmapiName_ver;
Query ID = hive_20160526125733_8834c7bc-b4f3-4539-8d48-fa46bba92a33
Total jobs = 1
Launching Job 1 out of 1
******REDUCERS NOT STARTING
Status: Running (Executing on YARN cluster with App id application_1446726117927_0092)
--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1                RUNNING   3098          0      110     2988       0       0
Reducer 2             INITED   1009          0        0     1009       0       0
--------------------------------------------------------------------------------
VERTICES: 00/02  [>>--------------------------] 0%    ELAPSED TIME: 12.70 s
--------------------------------------------------------------------------------
Status: Running (Executing on YARN cluster with App id application_1446726117927_0092)

After a long time, the mappers completed but the reducers failed :

Status: Running (Executing on YARN cluster with App id application_1446726117927_0092)
--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ........       RUNNING   3098       2655       94      349       0       0
Reducer 2            RUNNING   1009         45      110      854      49      91
--------------------------------------------------------------------------------
VERTICES: 01/02  [=================>>---------] 65%   ELAPSED TIME: 8804.16 s
--------------------------------------------------------------------------------

As seen above, A FEW mappers started again, I guess it's a reattempt. Again some failures, latest :

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ........       RUNNING   3098       2773      113      212       0       8
Reducer 2            RUNNING   1009         45      110      854      57     119
--------------------------------------------------------------------------------
VERTICES: 01/02  [=================>>---------] 68%   ELAPSED TIME: 10879.73 s
--------------------------------------------------------------------------------

I suspect some memory or relevant issues but I dunno which all logs should I check. For example, under log/application_1446726117927_0092, I found several containers and many of them had the following error in the syslog_attempt_1446726117927_0092_1_01_000041_1 :

2016-05-26 15:45:11,932 [WARN] [TezTaskEventRouter{attempt_1446726117927_0092_1_01_000041_1}] |orderedgrouped.ShuffleScheduler|: Map_1: Duplicate fetch of input no longer needs to be fetched: InputAttemptIdentifier [inputIdentifier=InputIdentifier [inputIndex=709], attemptNumber=1, pathComponent=attempt_1446726117927_0092_1_00_000709_1_10012, spillType=0, spillId=-1]
2016-05-26 15:45:24,251 [INFO] [TaskHeartbeatThread] |task.TaskReporter|: Received should die response from AM
2016-05-26 15:45:24,251 [INFO] [TaskHeartbeatThread] |task.TaskReporter|: Asked to die via task heartbeat
2016-05-26 15:45:24,253 [INFO] [main] |task.TezTaskRunner|: Interrupted while waiting for task to complete. Interrupting task
2016-05-26 15:45:24,254 [INFO] [main] |task.TezTaskRunner|: Shutdown requested... returning
2016-05-26 15:45:24,254 [INFO] [main] |task.TezChild|: Got a shouldDie notification via heartbeats for container container_1446726117927_0092_01_000187. Shutting down
2016-05-26 15:45:24,254 [INFO] [main] |task.TezChild|: Shutdown invoked for container container_1446726117927_0092_01_000187
2016-05-26 15:45:24,254 [INFO] [main] |task.TezChild|: Shutting down container container_1446726117927_0092_01_000187
2016-05-26 15:45:24,255 [ERROR] [TezChild] |tez.ReduceRecordProcessor|: Hit error while closing operators - failing tree
2016-05-26 15:45:24,256 [ERROR] [TezChild] |tez.TezProcessor|: java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2052)
at org.apache.tez.runtime.InputReadyTracker$InputReadyMonitor.awaitCondition(InputReadyTracker.java:120)
at org.apache.tez.runtime.InputReadyTracker.waitForAllInputsReady(InputReadyTracker.java:90)
at org.apache.tez.runtime.api.impl.TezProcessorContextImpl.waitForAllInputsReady(TezProcessorContextImpl.java:116)
at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.init(ReduceRecordProcessor.java:117)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:147)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:137)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:344)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:179)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:171)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:167)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
2016-05-26 15:45:24,257 [INFO] [TezChild] |task.TezTaskRunner|: Encounted an error while executing task: attempt_1446726117927_0092_1_01_000041_1
java.lang.RuntimeException: java.lang.InterruptedException
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:171)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:137)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:344)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:179)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:171)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:167)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2052)
at org.apache.tez.runtime.InputReadyTracker$InputReadyMonitor.awaitCondition(InputReadyTracker.java:120)
at org.apache.tez.runtime.InputReadyTracker.waitForAllInputsReady(InputReadyTracker.java:90)
at org.apache.tez.runtime.api.impl.TezProcessorContextImpl.waitForAllInputsReady(TezProcessorContextImpl.java:116)
at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.init(ReduceRecordProcessor.java:117)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:147)
... 14 more
2016-05-26 15:45:24,260 [INFO] [TezChild] |task.TezTaskRunner|: Ignoring the following exception since a previous exception is already registered
2016-05-26 15:45:24,275 [INFO] [TezChild] |runtime.LogicalIOProcessorRuntimeTask|: Final Counters for attempt_1446726117927_0092_1_01_000041_1: Counters: 71 [[File System Counters FILE_BYTES_READ=290688, FILE_BYTES_WRITTEN=227062571, FILE_READ_OPS=0, FILE_LARGE_READ_OPS=0, FILE_WRITE_OPS=0, HDFS_BYTES_READ=0, HDFS_BYTES_WRITTEN=0, HDFS_READ_OPS=0, HDFS_LARGE_READ_OPS=0, HDFS_WRITE_OPS=0][org.apache.tez.common.counters.TaskCounter REDUCE_INPUT_GROUPS=0, REDUCE_INPUT_RECORDS=0, COMBINE_INPUT_RECORDS=0, SPILLED_RECORDS=0, NUM_SHUFFLED_INPUTS=73, NUM_SKIPPED_INPUTS=3022, NUM_FAILED_SHUFFLE_INPUTS=0, MERGED_MAP_OUTPUTS=53, GC_TIME_MILLIS=11162, CPU_MILLISECONDS=62510, PHYSICAL_MEMORY_BYTES=664797184, VIRTUAL_MEMORY_BYTES=2432901120, COMMITTED_HEAP_BYTES=664797184, OUTPUT_RECORDS=0, ADDITIONAL_SPILLS_BYTES_WRITTEN=227062571, ADDITIONAL_SPILLS_BYTES_READ=0, SHUFFLE_BYTES=326055728, SHUFFLE_BYTES_DECOMPRESSED=2123360327, SHUFFLE_BYTES_TO_MEM=286174772, SHUFFLE_BYTES_TO_DISK=0, SHUFFLE_BYTES_DISK_DIRECT=39880956, NUM_MEM_TO_DISK_MERGES=2, NUM_DISK_TO_DISK_MERGES=0, SHUFFLE_PHASE_TIME=0, MERGE_PHASE_TIME=0, FIRST_EVENT_RECEIVED=264, LAST_EVENT_RECEIVED=49569][Shuffle Errors BAD_ID=0, CONNECTION=0, IO_ERROR=0, WRONG_LENGTH=0, WRONG_MAP=0, WRONG_REDUCE=0][Shuffle Errors_Reducer_2_INPUT_Map_1 BAD_ID=0, CONNECTION=0, IO_ERROR=0, WRONG_LENGTH=0, WRONG_MAP=0, WRONG_REDUCE=0][TaskCounter_Reducer_2_INPUT_Map_1 ADDITIONAL_SPILLS_BYTES_READ=0, ADDITIONAL_SPILLS_BYTES_WRITTEN=227062571, COMBINE_INPUT_RECORDS=0, FIRST_EVENT_RECEIVED=264, LAST_EVENT_RECEIVED=49569, MERGED_MAP_OUTPUTS=53, MERGE_PHASE_TIME=0, NUM_DISK_TO_DISK_MERGES=0, NUM_FAILED_SHUFFLE_INPUTS=0, NUM_MEM_TO_DISK_MERGES=2, NUM_SHUFFLED_INPUTS=73, NUM_SKIPPED_INPUTS=3022, REDUCE_INPUT_GROUPS=0, REDUCE_INPUT_RECORDS=0, SHUFFLE_BYTES=326055728, SHUFFLE_BYTES_DECOMPRESSED=2123360327, SHUFFLE_BYTES_DISK_DIRECT=39880956, SHUFFLE_BYTES_TO_DISK=0, SHUFFLE_BYTES_TO_MEM=286174772, SHUFFLE_PHASE_TIME=0, SPILLED_RECORDS=0][TaskCounter_Reducer_2_OUTPUT_out_Reducer_2 OUTPUT_RECORDS=0]]
2016-05-26 15:45:24,275 [INFO] [TezChild] |runtime.LogicalIOProcessorRuntimeTask|: Joining on EventRouter
2016-05-26 15:45:24,276 [INFO] [TezChild] |runtime.LogicalIOProcessorRuntimeTask|: Closed processor for vertex=Reducer 2, index=1
2016-05-26 15:45:24,276 [INFO] [TezChild] |orderedgrouped.Shuffle|: Shutting down Shuffle for source: Map_1
2016-05-26 15:45:24,276 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Already shutdown. Ignoring error
2016-05-26 15:45:24,276 [INFO] [TezChild] |orderedgrouped.ShuffleInputEventHandlerOrderedGrouped|: Map 1: numDmeEventsSeen=3480, numDmeEventsSeenWithNoData=3395, numObsoletionEventsSeen=443, updateOnClose
2016-05-26 15:45:24,277 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #1, status:false, isInterrupted:false
2016-05-26 15:45:24,277 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #2, status:false, isInterrupted:false
2016-05-26 15:45:24,277 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #3, status:false, isInterrupted:false
2016-05-26 15:45:24,278 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #4, status:false, isInterrupted:false
2016-05-26 15:45:24,278 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #5, status:false, isInterrupted:false
2016-05-26 15:45:24,278 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #6, status:false, isInterrupted:false
2016-05-26 15:45:24,278 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #7, status:false, isInterrupted:false
2016-05-26 15:45:24,279 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #8, status:false, isInterrupted:false
2016-05-26 15:45:24,279 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #9, status:false, isInterrupted:false
2016-05-26 15:45:24,279 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #10, status:false, isInterrupted:false
2016-05-26 15:45:24,280 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #11, status:false, isInterrupted:false
2016-05-26 15:45:24,280 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #12, status:false, isInterrupted:false
2016-05-26 15:45:24,280 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #13, status:false, isInterrupted:false
2016-05-26 15:45:24,280 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #14, status:false, isInterrupted:false
2016-05-26 15:45:24,280 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #15, status:false, isInterrupted:false
2016-05-26 15:45:24,280 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #16, status:false, isInterrupted:false
2016-05-26 15:45:24,291 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #17, status:false, isInterrupted:false
2016-05-26 15:45:24,302 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #18, status:false, isInterrupted:false
2016-05-26 15:45:24,314 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #19, status:false, isInterrupted:false
2016-05-26 15:45:24,314 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #20, status:false, isInterrupted:false
2016-05-26 15:45:24,314 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #21, status:false, isInterrupted:false
2016-05-26 15:45:24,318 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #22, status:false, isInterrupted:false
2016-05-26 15:45:24,319 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #23, status:false, isInterrupted:false
2016-05-26 15:45:24,319 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #24, status:false, isInterrupted:false
2016-05-26 15:45:24,319 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #25, status:false, isInterrupted:false
2016-05-26 15:45:24,319 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #26, status:false, isInterrupted:false
2016-05-26 15:45:24,319 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #27, status:false, isInterrupted:false
2016-05-26 15:45:24,319 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #28, status:false, isInterrupted:false
2016-05-26 15:45:24,320 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #29, status:false, isInterrupted:false
2016-05-26 15:45:24,320 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #30, status:false, isInterrupted:false
2016-05-26 15:45:24,320 [INFO] [TezChild] |orderedgrouped.MergeManager|: finalMerge called with 8 in-memory map-outputs and 14 on-disk map-outputs
2016-05-26 15:45:24,321 [INFO] [TezChild] |impl.TezMerger|: Merging 8 sorted segments
2016-05-26 15:45:24,321 [INFO] [TezChild] |impl.TezMerger|: Down to the last merge-pass, with 8 segments left of total size: 376486161 bytes

Which all logs and what errors shall I look for ? How can Ambari help ?

**********EDIT-1**********

The Hive query finally failed with the following error :

Status: Failed
Vertex re-running, vertexName=Map 1, vertexId=vertex_1446726117927_0092_1_00
Vertex failed, vertexName=Reducer 2, vertexId=vertex_1446726117927_0092_1_01, diagnostics=[Task failed, taskId=task_1446726117927_0092_1_01_000066, diagnostics=[TaskAttempt 0 failed, info=[Container container_1446726117927_0092_01_000036 finished with diagnostics set to [Container failed, exitCode=1. Exception from container-launch.
Container id: container_1446726117927_0092_01_000036
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:576)
        at org.apache.hadoop.util.Shell.run(Shell.java:487)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:753)
        at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 1
]], TaskAttempt 1 failed, info=[Container container_1446726117927_0092_01_000199 finished with diagnostics set to [Container failed, exitCode=1. Exception from container-launch.
Container id: container_1446726117927_0092_01_000199
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:576)
        at org.apache.hadoop.util.Shell.run(Shell.java:487)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:753)
        at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 1
]], TaskAttempt 2 failed, info=[Error: Fatal Error cause TezChild exit.:java.lang.OutOfMemoryError: Java heap space
        at sun.nio.cs.StreamEncoder.write(StreamEncoder.java:133)
        at java.io.OutputStreamWriter.write(OutputStreamWriter.java:220)
        at java.io.Writer.write(Writer.java:157)
        at org.apache.log4j.helpers.QuietWriter.write(QuietWriter.java:48)
        at org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:310)
        at org.apache.log4j.WriterAppender.append(WriterAppender.java:162)
        at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
        at org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
        at org.apache.log4j.Category.callAppenders(Category.java:206)
        at org.apache.log4j.Category.forcedLog(Category.java:391)
        at org.apache.log4j.Category.log(Category.java:856)
        at org.apache.commons.logging.impl.Log4JLogger.error(Log4JLogger.java:218)
        at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:156)
        at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:137)
        at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:344)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:179)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:171)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:167)
        at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
], TaskAttempt 3 failed, info=[Container container_1446726117927_0092_01_000299 finished with diagnostics set to [Container failed, exitCode=1. Exception from container-launch.
Container id: container_1446726117927_0092_01_000299
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:576)
        at org.apache.hadoop.util.Shell.run(Shell.java:487)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:753)
        at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 1
]]], Vertex did not succeed due to OWN_TASK_FAILURE, failedTasks:1 killedTasks:944, Vertex vertex_1446726117927_0092_1_01 [Reducer 2] killed/failed due to:OWN_TASK_FAILURE]
Vertex killed, vertexName=Map 1, vertexId=vertex_1446726117927_0092_1_00, diagnostics=[Vertex received Kill while in RUNNING state., Vertex did not succeed due to OTHER_VERTEX_FAILURE, failedTasks:0 killedTasks:23, Vertex vertex_1446726117927_0092_1_00 [Map 1] killed/failed due to:OTHER_VERTEX_FAILURE]
DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:1
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex re-running, vertexName=Map 1, vertexId=vertex_1446726117927_0092_1_00Vertex failed, vertexName=Reducer 2, vertexId=vertex_1446726117927_0092_1_01, diagnostics=[Task failed, taskId=task_1446726117927_0092_1_01_000066, diagnostics=[TaskAttempt 0 failed, info=[Container container_1446726117927_0092_01_000036 finished with diagnostics set to [Container failed, exitCode=1. Exception from container-launch.
Container id: container_1446726117927_0092_01_000036
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:576)
        at org.apache.hadoop.util.Shell.run(Shell.java:487)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:753)
        at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 1
]], TaskAttempt 1 failed, info=[Container container_1446726117927_0092_01_000199 finished with diagnostics set to [Container failed, exitCode=1. Exception from container-launch.
Container id: container_1446726117927_0092_01_000199
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:576)
        at org.apache.hadoop.util.Shell.run(Shell.java:487)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:753)
        at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 1
]], TaskAttempt 2 failed, info=[Error: Fatal Error cause TezChild exit.:java.lang.OutOfMemoryError: Java heap space
        at sun.nio.cs.StreamEncoder.write(StreamEncoder.java:133)
        at java.io.OutputStreamWriter.write(OutputStreamWriter.java:220)
        at java.io.Writer.write(Writer.java:157)
        at org.apache.log4j.helpers.QuietWriter.write(QuietWriter.java:48)
        at org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:310)
        at org.apache.log4j.WriterAppender.append(WriterAppender.java:162)
        at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
        at org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
        at org.apache.log4j.Category.callAppenders(Category.java:206)
        at org.apache.log4j.Category.forcedLog(Category.java:391)
        at org.apache.log4j.Category.log(Category.java:856)
        at org.apache.commons.logging.impl.Log4JLogger.error(Log4JLogger.java:218)
        at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:156)
        at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:137)
        at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:344)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:179)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:171)
        at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:167)
        at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
], TaskAttempt 3 failed, info=[Container container_1446726117927_0092_01_000299 finished with diagnostics set to [Container failed, exitCode=1. Exception from container-launch.
Container id: container_1446726117927_0092_01_000299
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:576)
        at org.apache.hadoop.util.Shell.run(Shell.java:487)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:753)
        at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 1
]]], Vertex did not succeed due to OWN_TASK_FAILURE, failedTasks:1 killedTasks:944, Vertex vertex_1446726117927_0092_1_01 [Reducer 2] killed/failed due to:OWN_TASK_FAILURE]Vertex killed, vertexName=Map 1, vertexId=vertex_1446726117927_0092_1_00, diagnostics=[Vertex received Kill while in RUNNING state., Vertex did not succeed due to OTHER_VERTEX_FAILURE, failedTasks:0 killedTasks:23, Vertex vertex_1446726117927_0092_1_00 [Map 1] killed/failed due to:OTHER_VERTEX_FAILURE]DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:1

As I suspected, there is a memory issue BUT THIS IS OCCURING ON THE SECOND ATTEMPT, the FIRST ONE FAILED for unknown reason :

FatalError cause TezChildexit.:java.lang.OutOfMemoryError:Java heap space

The question is which parameters need to be changed ?

2 REPLIES 2

avatar
Explorer

One thing I observed is , we do not mention row format when we tell to store it as ORC format. Can you please change and try it again.

avatar
Expert Contributor

I see you are using constant value for partition column. You might be hitting this issue:

https://issues.apache.org/jira/browse/HIVE-12893