Created 07-24-2016 05:43 PM
when I am running the following query, I am getting the error below. It is launching around 2 mapper tasks and 240 reducer tasks and the job is going smooth and fast upto 239 tasks, it is taking 3 hrs to do do 4 task attempts and then the job is failing.
insert overwrite table mybug select row_number() over (order by clickstream_key) as key, clickstream_key as data from mytest;
Table mytest contains 2 billion records with one column "clickstream_key".
2016-07-24 09:54:37,918 [ERROR] [TezChild] |tez.TezProcessor|: java.lang.OutOfMemoryError: Java heap space at java.util.Arrays.copyOf(Arrays.java:3181) at java.util.ArrayList.grow(ArrayList.java:261) at java.util.ArrayList.ensureExplicitCapacity(ArrayList.java:235) at java.util.ArrayList.ensureCapacityInternal(ArrayList.java:227) at java.util.ArrayList.add(ArrayList.java:458) at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRowNumber$RowNumberBuffer.incr(GenericUDAFRowNumber.java:73) at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRowNumber$GenericUDAFRowNumberEvaluator.iterate(GenericUDAFRowNumber.java:102) at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator.aggregate(GenericUDAFEvaluator.java:185) at org.apache.hadoop.hive.ql.udf.ptf.WindowingTableFunction.evaluateWindowFunction(WindowingTableFunction.java:164) at org.apache.hadoop.hive.ql.udf.ptf.WindowingTableFunction.iterator(WindowingTableFunction.java:580) at org.apache.hadoop.hive.ql.exec.PTFOperator$PTFInvocation.finishPartition(PTFOperator.java:340) at org.apache.hadoop.hive.ql.exec.PTFOperator.closeOp(PTFOperator.java:95) at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:617) at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:631) at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.close(ReduceRecordProcessor.java:290) at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:164) at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:139) at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:344) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:181) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:172) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:172) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:168) at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 2016-07-24 09:54:37,920 [INFO] [TezChild] |task.TezTaskRunner|: Encounted an error while executing task: attempt_1469365079244_0001_1_01_000000_0 java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:173) at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:139) at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:344) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:181) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:172) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:172) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:168) at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.OutOfMemoryError: Java heap space at java.util.Arrays.copyOf(Arrays.java:3181) at java.util.ArrayList.grow(ArrayList.java:261) at java.util.ArrayList.ensureExplicitCapacity(ArrayList.java:235) at java.util.ArrayList.ensureCapacityInternal(ArrayList.java:227) at java.util.ArrayList.add(ArrayList.java:458) at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRowNumber$RowNumberBuffer.incr(GenericUDAFRowNumber.java:73) at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRowNumber$GenericUDAFRowNumberEvaluator.iterate(GenericUDAFRowNumber.java:102) at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator.aggregate(GenericUDAFEvaluator.java:185) at org.apache.hadoop.hive.ql.udf.ptf.WindowingTableFunction.evaluateWindowFunction(WindowingTableFunction.java:164) at org.apache.hadoop.hive.ql.udf.ptf.WindowingTableFunction.iterator(WindowingTableFunction.java:580) at org.apache.hadoop.hive.ql.exec.PTFOperator$PTFInvocation.finishPartition(PTFOperator.java:340) at org.apache.hadoop.hive.ql.exec.PTFOperator.closeOp(PTFOperator.java:95) at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:617) at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:631) at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.close(ReduceRecordProcessor.java:290) at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:164) ... 14 more 2016-07-24 09:54:37,947 [INFO] [TezChild] |runtime.LogicalIOProcessorRuntimeTask|: Final Counters for attempt_1469365079244_0001_1_01_000000_0: Counters: 71 [[File System Counters FILE_BYTES_READ=38467196060, FILE_BYTES_WRITTEN=36640725116, FILE_READ_OPS=0, FILE_LARGE_READ_OPS=0, FILE_WRITE_OPS=0, HDFS_BYTES_READ=0, HDFS_BYTES_WRITTEN=0, HDFS_READ_OPS=0, HDFS_LARGE_READ_OPS=0, HDFS_WRITE_OPS=0][org.apache.tez.common.counters.TaskCounter REDUCE_INPUT_GROUPS=1, REDUCE_INPUT_RECORDS=2000000000, COMBINE_INPUT_RECORDS=0, SPILLED_RECORDS=2000000000, NUM_SHUFFLED_INPUTS=1, NUM_SKIPPED_INPUTS=0, NUM_FAILED_SHUFFLE_INPUTS=0, MERGED_MAP_OUTPUTS=1, GC_TIME_MILLIS=155973, CPU_MILLISECONDS=4116570, PHYSICAL_MEMORY_BYTES=51539607552, VIRTUAL_MEMORY_BYTES=55293632512, COMMITTED_HEAP_BYTES=51539607552, OUTPUT_RECORDS=0, ADDITIONAL_SPILLS_BYTES_WRITTEN=0, ADDITIONAL_SPILLS_BYTES_READ=93928296, SHUFFLE_BYTES=93928296, SHUFFLE_BYTES_DECOMPRESSED=2000000020, SHUFFLE_BYTES_TO_MEM=0, SHUFFLE_BYTES_TO_DISK=0, SHUFFLE_BYTES_DISK_DIRECT=93928296, NUM_MEM_TO_DISK_MERGES=0, NUM_DISK_TO_DISK_MERGES=0, SHUFFLE_PHASE_TIME=41, MERGE_PHASE_TIME=46, FIRST_EVENT_RECEIVED=19, LAST_EVENT_RECEIVED=19][Shuffle Errors BAD_ID=0, CONNECTION=0, IO_ERROR=0, WRONG_LENGTH=0, WRONG_MAP=0, WRONG_REDUCE=0][Shuffle Errors_Reducer_2_INPUT_Map_1 BAD_ID=0, CONNECTION=0, IO_ERROR=0, WRONG_LENGTH=0, WRONG_MAP=0, WRONG_REDUCE=0][TaskCounter_Reducer_2_INPUT_Map_1 ADDITIONAL_SPILLS_BYTES_READ=93928296, ADDITIONAL_SPILLS_BYTES_WRITTEN=0, COMBINE_INPUT_RECORDS=0, FIRST_EVENT_RECEIVED=19, LAST_EVENT_RECEIVED=19, MERGED_MAP_OUTPUTS=1, MERGE_PHASE_TIME=46, NUM_DISK_TO_DISK_MERGES=0, NUM_FAILED_SHUFFLE_INPUTS=0, NUM_MEM_TO_DISK_MERGES=0, NUM_SHUFFLED_INPUTS=1, NUM_SKIPPED_INPUTS=0, REDUCE_INPUT_GROUPS=1, REDUCE_INPUT_RECORDS=2000000000, SHUFFLE_BYTES=93928296, SHUFFLE_BYTES_DECOMPRESSED=2000000020, SHUFFLE_BYTES_DISK_DIRECT=93928296, SHUFFLE_BYTES_TO_DISK=0, SHUFFLE_BYTES_TO_MEM=0, SHUFFLE_PHASE_TIME=41, SPILLED_RECORDS=2000000000][TaskCounter_Reducer_2_OUTPUT_out_Reducer_2 OUTPUT_RECORDS=0]] 2016-07-24 09:54:37,947 [INFO] [TezChild] |runtime.LogicalIOProcessorRuntimeTask|: Joining on EventRouter 2016-07-24 09:54:37,948 [INFO] [TezChild] |runtime.LogicalIOProcessorRuntimeTask|: Closed processor for vertex=Reducer 2, index=1 2016-07-24 09:54:37,948 [INFO] [TezChild] |orderedgrouped.Shuffle|: Shutting down Shuffle for source: Map_1 2016-07-24 09:54:37,949 [INFO] [TezChild] |orderedgrouped.ShuffleInputEventHandlerOrderedGrouped|: Map 1: numDmeEventsSeen=1, numDmeEventsSeenWithNoData=0, numObsoletionEventsSeen=0, updateOnClose 2016-07-24 09:54:37,949 [INFO] [TezChild] |orderedgrouped.MergeManager|: finalMerge called with 0 in-memory map-outputs and 0 on-disk map-outputs 2016-07-24 09:54:37,949 [INFO] [TezChild] |orderedgrouped.MergeManager|: Merging 0 files, 0 bytes from disk 2016-07-24 09:54:37,949 [INFO] [TezChild] |orderedgrouped.MergeManager|: Merging 0 segments, 0 bytes from memory into reduce 2016-07-24 09:54:37,949 [INFO] [TezChild] |impl.TezMerger|: Merging 0 sorted segments 2016-07-24 09:54:37,949 [INFO] [TezChild] |impl.TezMerger|: Nothing to merge. Returning an empty iterator 2016-07-24 09:54:37,950 [INFO] [TezChild] |runtime.LogicalIOProcessorRuntimeTask|: Closed input for vertex=Reducer 2, sourceVertex=Map 1 2016-07-24 09:54:37,950 [INFO] [TezChild] |output.MROutput|: out_Reducer 2 closed 2016-07-24 09:54:37,950 [INFO] [TezChild] |runtime.LogicalIOProcessorRuntimeTask|: Closed input for vertex=Reducer 2, sourceVertex=out_Reducer 2 2016-07-24 09:54:37,951 [INFO] [main] |task.TezChild|: Shutdown invoked for container container_e42_1469365079244_0001_01_000002 2016-07-24 09:54:37,951 [INFO] [main] |task.TezChild|: Shutting down container container_e42_1469365079244_0001_01_000002
I tried changing data node heap, reducer heap, mapper heap, hive.tez.java.opts but none of them worked. Any leads on this would be highly appreciated. The error says java heap space, which java heap space does it mean ?
Created 07-24-2016 06:39 PM
set hive.auto.convert.join=false and give a try
Created 07-25-2016 02:20 AM
This requires some detailed analysis.
Any ways, you can try to increase the following:
tez.am.resource.memory.mb
tez.am.java.opts
These are for the "tez" client config properties.
Created 07-25-2016 04:48 AM
Have you ever seen this case where the last reducer consume a lot of time to complete the task ?
Created 07-25-2016 05:12 AM
@Neeraj Sabharwal Could you help me with this ?
Created 07-25-2016 06:33 PM
Yes. There could be many reasons for it. It is possible that the data is skewed and the query is taking longer time there.
You can try to check the query "explain" and "explain formatted" for further understanding.
Created 07-25-2016 10:22 PM
Please try by setting below properties in your hive/beeline session and run the query which might help you.
set hive.execution.engine=tez; set hive.auto.convert.join=true; set hive.auto.convert.join.noconditionaltask=true; set hive.auto.convert.join.noconditionaltask.size=405306368; set hive.vectorized.execution.enabled=true; set hive.vectorized.execution.reduce.enabled = true; set hive.cbo.enable=true; set hive.compute.query.using.stats=true; set hive.stats.fetch.column.stats=true; set hive.stats.fetch.partition.stats=true; set hive.merge.mapfiles = true; set hive.merge.mapredfiles=true; set hive.merge.size.per.task=134217728; set hive.merge.smallfiles.avgsize= 44739242; set mapreduce.job.reduce.slowstart.completedmaps=0.7 set hive.exec.dynamic.partition=true; set hive.exec.dynamic.partition.mode=nonstrict; set hive.enforce.bucketing=true; set hive.exec.max.dynamic.partitions.pernode=100000; set hive.exec.max.created.files = 1000000; set hive.exec.parallel = true;
Link http://hortonworks.com/blog/5-ways-make-hive-queries-run-faster