<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Hive: Union all and aggregation are failing with large parquet tables (150 col, 5 mil rows) in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-Union-all-and-aggregation-are-failing-with-large/m-p/54076#M60006</link>
    <description>&lt;P&gt;&lt;BR /&gt;I have the following query with 2 parquet tables (t_par_string, t_par_datatype).&lt;/P&gt;&lt;P&gt;select count(*)&lt;BR /&gt;from (&lt;BR /&gt;select max(source) source,&lt;BR /&gt;col1, col2, col3&lt;BR /&gt;.&lt;BR /&gt;.&lt;BR /&gt;.&lt;BR /&gt;col149,col150 , count(*)&lt;BR /&gt;from (&lt;BR /&gt;select 1 source,&lt;BR /&gt;col1, col2, col3&lt;BR /&gt;.&lt;BR /&gt;.&lt;BR /&gt;.&lt;BR /&gt;col149,col150&lt;BR /&gt;from t_par_string&lt;BR /&gt;union all&lt;BR /&gt;select 1 source,&lt;BR /&gt;col1, col2, col3&lt;BR /&gt;.&lt;BR /&gt;.&lt;BR /&gt;.&lt;BR /&gt;col149,col150&lt;BR /&gt;from t_par_datatype&lt;BR /&gt;) merged_data&lt;BR /&gt;group by&lt;BR /&gt;col1, col2, col3&lt;BR /&gt;.&lt;BR /&gt;.&lt;BR /&gt;.&lt;BR /&gt;col149,col150&lt;BR /&gt;having count(*) = 1&lt;BR /&gt;) minus_data&lt;BR /&gt;where source = 1&lt;/P&gt;&lt;P&gt;It fails with the following error:&lt;/P&gt;&lt;P&gt;Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row&lt;BR /&gt;at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:507)&lt;BR /&gt;at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:170)&lt;BR /&gt;... 
8 more&lt;BR /&gt;Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to execute method public org.apache.hadoop.io.Text org.apache.hadoop.hive.ql.udf.UDFToString.evaluate(org.apache.hadoop.hive.serde2.io.TimestampWritable) on object org.apache.hadoop.hive.ql.udf.UDFToString@134ff8f8 of class org.apache.hadoop.hive.ql.udf.UDFToString with arguments {2015-10-17 00:00:00:org.apache.hadoop.hive.serde2.io.TimestampWritable} of size 1&lt;BR /&gt;at org.apache.hadoop.hive.ql.exec.FunctionRegistry.invoke(FunctionRegistry.java:989)&lt;BR /&gt;at org.apache.hadoop.hive.ql.udf.generic.GenericUDFBridge.evaluate(GenericUDFBridge.java:182)&lt;BR /&gt;at org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator._evaluate(ExprNodeGenericFuncEvaluator.java:186)&lt;BR /&gt;at org.apache.hadoop.hive.ql.exec.ExprNodeEvaluator.evaluate(ExprNodeEvaluator.java:77)&lt;BR /&gt;at org.apache.hadoop.hive.ql.exec.ExprNodeEvaluator.evaluate(ExprNodeEvaluator.java:65)&lt;BR /&gt;at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:77)&lt;BR /&gt;at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)&lt;BR /&gt;at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:97)&lt;BR /&gt;at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:157)&lt;BR /&gt;at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:497)&lt;BR /&gt;... 9 more&lt;BR /&gt;Caused by: java.lang.reflect.InvocationTargetException&lt;BR /&gt;at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)&lt;BR /&gt;at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)&lt;BR /&gt;at java.lang.reflect.Method.invoke(Method.java:498)&lt;BR /&gt;at org.apache.hadoop.hive.ql.exec.FunctionRegistry.invoke(FunctionRegistry.java:965)&lt;BR /&gt;... 
18 more&lt;BR /&gt;Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded&lt;BR /&gt;at java.util.Arrays.copyOf(Arrays.java:3181)&lt;BR /&gt;at java.text.DateFormatSymbols.copyMembers(DateFormatSymbols.java:850)&lt;BR /&gt;at java.text.DateFormatSymbols.initializeData(DateFormatSymbols.java:758)&lt;BR /&gt;at java.text.DateFormatSymbols.&amp;lt;init&amp;gt;(DateFormatSymbols.java:145)&lt;BR /&gt;at sun.util.locale.provider.DateFormatSymbolsProviderImpl.getInstance(DateFormatSymbolsProviderImpl.java:85)&lt;BR /&gt;at java.text.DateFormatSymbols.getProviderInstance(DateFormatSymbols.java:364)&lt;BR /&gt;at java.text.DateFormatSymbols.getInstance(DateFormatSymbols.java:340)&lt;BR /&gt;at java.util.Calendar.getDisplayName(Calendar.java:2110)&lt;BR /&gt;at java.text.SimpleDateFormat.subFormat(SimpleDateFormat.java:1125)&lt;BR /&gt;at java.text.SimpleDateFormat.format(SimpleDateFormat.java:966)&lt;BR /&gt;at java.text.SimpleDateFormat.format(SimpleDateFormat.java:936)&lt;BR /&gt;at java.text.DateFormat.format(DateFormat.java:345)&lt;BR /&gt;at org.apache.hadoop.hive.serde2.io.TimestampWritable.toString(TimestampWritable.java:383)&lt;BR /&gt;at org.apache.hadoop.hive.ql.udf.UDFToString.evaluate(UDFToString.java:150)&lt;BR /&gt;at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)&lt;BR /&gt;at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)&lt;BR /&gt;at java.lang.reflect.Method.invoke(Method.java:498)&lt;BR /&gt;at org.apache.hadoop.hive.ql.exec.FunctionRegistry.invoke(FunctionRegistry.java:965)&lt;BR /&gt;at org.apache.hadoop.hive.ql.udf.generic.GenericUDFBridge.evaluate(GenericUDFBridge.java:182)&lt;BR /&gt;at org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator._evaluate(ExprNodeGenericFuncEvaluator.java:186)&lt;BR /&gt;at org.apache.hadoop.hive.ql.exec.ExprNodeEvaluator.evaluate(ExprNodeEvaluator.java:77)&lt;BR /&gt;at 
org.apache.hadoop.hive.ql.exec.ExprNodeEvaluator.evaluate(ExprNodeEvaluator.java:65)&lt;BR /&gt;at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:77)&lt;BR /&gt;at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)&lt;BR /&gt;at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:97)&lt;BR /&gt;at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:157)&lt;BR /&gt;at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:497)&lt;BR /&gt;at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:170)&lt;BR /&gt;at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)&lt;BR /&gt;at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)&lt;BR /&gt;at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)&lt;BR /&gt;at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)&lt;/P&gt;</description>
    <pubDate>Fri, 16 Sep 2022 11:30:30 GMT</pubDate>
    <dc:creator>PJ1982</dc:creator>
    <dc:date>2022-09-16T11:30:30Z</dc:date>
    <item>
      <title>Hive: Union all and aggregation are failing with large parquet tables (150 col, 5 mil rows)</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-Union-all-and-aggregation-are-failing-with-large/m-p/54076#M60006</link>
      <description>&lt;P&gt;&lt;BR /&gt;I have the following query with 2 parquet tables (t_par_string, t_par_datatype).&lt;/P&gt;&lt;P&gt;select count(*)&lt;BR /&gt;from (&lt;BR /&gt;select max(source) source,&lt;BR /&gt;col1, col2, col3&lt;BR /&gt;.&lt;BR /&gt;.&lt;BR /&gt;.&lt;BR /&gt;col149,col150 , count(*)&lt;BR /&gt;from (&lt;BR /&gt;select 1 source,&lt;BR /&gt;col1, col2, col3&lt;BR /&gt;.&lt;BR /&gt;.&lt;BR /&gt;.&lt;BR /&gt;col149,col150&lt;BR /&gt;from t_par_string&lt;BR /&gt;union all&lt;BR /&gt;select 1 source,&lt;BR /&gt;col1, col2, col3&lt;BR /&gt;.&lt;BR /&gt;.&lt;BR /&gt;.&lt;BR /&gt;col149,col150&lt;BR /&gt;from t_par_datatype&lt;BR /&gt;) merged_data&lt;BR /&gt;group by&lt;BR /&gt;col1, col2, col3&lt;BR /&gt;.&lt;BR /&gt;.&lt;BR /&gt;.&lt;BR /&gt;col149,col150&lt;BR /&gt;having count(*) = 1&lt;BR /&gt;) minus_data&lt;BR /&gt;where source = 1&lt;/P&gt;&lt;P&gt;It fails with the following error:&lt;/P&gt;&lt;P&gt;Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row&lt;BR /&gt;at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:507)&lt;BR /&gt;at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:170)&lt;BR /&gt;... 
8 more&lt;BR /&gt;Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to execute method public org.apache.hadoop.io.Text org.apache.hadoop.hive.ql.udf.UDFToString.evaluate(org.apache.hadoop.hive.serde2.io.TimestampWritable) on object org.apache.hadoop.hive.ql.udf.UDFToString@134ff8f8 of class org.apache.hadoop.hive.ql.udf.UDFToString with arguments {2015-10-17 00:00:00:org.apache.hadoop.hive.serde2.io.TimestampWritable} of size 1&lt;BR /&gt;at org.apache.hadoop.hive.ql.exec.FunctionRegistry.invoke(FunctionRegistry.java:989)&lt;BR /&gt;at org.apache.hadoop.hive.ql.udf.generic.GenericUDFBridge.evaluate(GenericUDFBridge.java:182)&lt;BR /&gt;at org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator._evaluate(ExprNodeGenericFuncEvaluator.java:186)&lt;BR /&gt;at org.apache.hadoop.hive.ql.exec.ExprNodeEvaluator.evaluate(ExprNodeEvaluator.java:77)&lt;BR /&gt;at org.apache.hadoop.hive.ql.exec.ExprNodeEvaluator.evaluate(ExprNodeEvaluator.java:65)&lt;BR /&gt;at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:77)&lt;BR /&gt;at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)&lt;BR /&gt;at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:97)&lt;BR /&gt;at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:157)&lt;BR /&gt;at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:497)&lt;BR /&gt;... 9 more&lt;BR /&gt;Caused by: java.lang.reflect.InvocationTargetException&lt;BR /&gt;at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)&lt;BR /&gt;at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)&lt;BR /&gt;at java.lang.reflect.Method.invoke(Method.java:498)&lt;BR /&gt;at org.apache.hadoop.hive.ql.exec.FunctionRegistry.invoke(FunctionRegistry.java:965)&lt;BR /&gt;... 
18 more&lt;BR /&gt;Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded&lt;BR /&gt;at java.util.Arrays.copyOf(Arrays.java:3181)&lt;BR /&gt;at java.text.DateFormatSymbols.copyMembers(DateFormatSymbols.java:850)&lt;BR /&gt;at java.text.DateFormatSymbols.initializeData(DateFormatSymbols.java:758)&lt;BR /&gt;at java.text.DateFormatSymbols.&amp;lt;init&amp;gt;(DateFormatSymbols.java:145)&lt;BR /&gt;at sun.util.locale.provider.DateFormatSymbolsProviderImpl.getInstance(DateFormatSymbolsProviderImpl.java:85)&lt;BR /&gt;at java.text.DateFormatSymbols.getProviderInstance(DateFormatSymbols.java:364)&lt;BR /&gt;at java.text.DateFormatSymbols.getInstance(DateFormatSymbols.java:340)&lt;BR /&gt;at java.util.Calendar.getDisplayName(Calendar.java:2110)&lt;BR /&gt;at java.text.SimpleDateFormat.subFormat(SimpleDateFormat.java:1125)&lt;BR /&gt;at java.text.SimpleDateFormat.format(SimpleDateFormat.java:966)&lt;BR /&gt;at java.text.SimpleDateFormat.format(SimpleDateFormat.java:936)&lt;BR /&gt;at java.text.DateFormat.format(DateFormat.java:345)&lt;BR /&gt;at org.apache.hadoop.hive.serde2.io.TimestampWritable.toString(TimestampWritable.java:383)&lt;BR /&gt;at org.apache.hadoop.hive.ql.udf.UDFToString.evaluate(UDFToString.java:150)&lt;BR /&gt;at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)&lt;BR /&gt;at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)&lt;BR /&gt;at java.lang.reflect.Method.invoke(Method.java:498)&lt;BR /&gt;at org.apache.hadoop.hive.ql.exec.FunctionRegistry.invoke(FunctionRegistry.java:965)&lt;BR /&gt;at org.apache.hadoop.hive.ql.udf.generic.GenericUDFBridge.evaluate(GenericUDFBridge.java:182)&lt;BR /&gt;at org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator._evaluate(ExprNodeGenericFuncEvaluator.java:186)&lt;BR /&gt;at org.apache.hadoop.hive.ql.exec.ExprNodeEvaluator.evaluate(ExprNodeEvaluator.java:77)&lt;BR /&gt;at 
org.apache.hadoop.hive.ql.exec.ExprNodeEvaluator.evaluate(ExprNodeEvaluator.java:65)&lt;BR /&gt;at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:77)&lt;BR /&gt;at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)&lt;BR /&gt;at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:97)&lt;BR /&gt;at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:157)&lt;BR /&gt;at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:497)&lt;BR /&gt;at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:170)&lt;BR /&gt;at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)&lt;BR /&gt;at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)&lt;BR /&gt;at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)&lt;BR /&gt;at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)&lt;/P&gt;</description>
      <pubDate>Fri, 16 Sep 2022 11:30:30 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-Union-all-and-aggregation-are-failing-with-large/m-p/54076#M60006</guid>
      <dc:creator>PJ1982</dc:creator>
      <dc:date>2022-09-16T11:30:30Z</dc:date>
    </item>
    <item>
      <title>Re: Hive: Union all and aggregation are failing with large parquet tables (150 col, 5 mil rows)</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-Union-all-and-aggregation-are-failing-with-large/m-p/54085#M60007</link>
      <description>&lt;P&gt;&lt;SPAN&gt;Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Increase the container and heap sizes. I am not sure whether it is a mapper or a reducer that is failing, but here are the settings to look into.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;set &lt;/SPAN&gt;hive.exec.reducers.bytes.per.reducer=&lt;/P&gt;&lt;P&gt;set mapreduce.map.memory.mb=&lt;/P&gt;&lt;P&gt;set mapreduce.reduce.memory.mb=&lt;/P&gt;&lt;P&gt;set mapreduce.map.java.opts=&amp;lt;roughly 80% of container size&amp;gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;set mapreduce.reduce.java.opts=&amp;lt;roughly 80% of container size&amp;gt;&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 25 Apr 2017 14:54:38 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-Union-all-and-aggregation-are-failing-with-large/m-p/54085#M60007</guid>
      <dc:creator>mbigelow</dc:creator>
      <dc:date>2017-04-25T14:54:38Z</dc:date>
    </item>
  </channel>
</rss>