Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

Move data from one Hive table to another with LIMIT clause is very slow and fails

Highlighted

Move data from one Hive table to another with LIMIT clause is very slow and fails

Expert Contributor

I have a huge table table1 in Hive, which contains around 60 million rows (~500 GB ORC files in HDFS). It is partitioned by the column partCol.

Now I want to create a new table table2 in Hive, that has the same schema and shall contain only 50 million rows of table1.

Therefore I run this query:

set hive.exec.dynamic.partition=true;

INSERT OVERWRITE TABLE testdb.table2 partition(partCol) SELECT colA, colB, ..., partCol FROM testdb.table1 LIMIT 50000000;

This creates a lot of Tez Mapper tasks, which looks and works fine - The tasks take around 1 h to finish.

And now the problem: Afterwards there's only 1 Reducer Task, which runs for hours and then fails!

How to increase this number of Reducer tasks for this query? Is the LIMIT clause the issue?

System Information:

I'm using the Hortonworks Data Platform 2.6.5 with Hive 1.2.1

The following Hive settings are configured:

  • hive.execution.engine = TEZ
  • hive.tez.auto.reducer.parallelism = true
  • hive.exec.reducers.bytes.per.reducer = 64 MB
  • hive.exec.reducers.max = 1009

Tez settings:

  • tez.grouping.min-size = 16 MB
  • tez.grouping.max-size = 1 GB