Support Questions

fsm17 · 08-18-2022

I have created an external table in Hive; the table is a set of partitioned Parquet files (about 200 of them). After creating the table, I ran MSCK REPAIR TABLE and ANALYZE TABLE COMPUTE STATISTICS on it.

Then in the Hive CLI, if I run something like "select count(*) from table", it runs a single mapper in a single container. I have hive.exec.parallel set to true, and I am using Tez as the execution engine. I have also tried setting hive.exec.parallel.thread.number=32 (though I don't think that should matter?). What else am I missing?

rki_ · 08-19-2022

Hi @fsm17, since you are using Tez as the execution engine, I would suggest setting the properties below in Hive, which control the number of mappers.

tez.grouping.max-size (default 1073741824, which is 1 GB): the most data Tez will assign to a task. Decreasing this means more parallelism.
tez.grouping.min-size (default 52428800, which is 50 MB): the least data Tez will assign to a task. Increasing this means less parallelism.
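
As a rough illustration of how these two settings bound the number of tasks, here is a minimal Python sketch. It is not Tez's actual grouping logic, which also takes available cluster capacity into account; it only assumes that each task receives between min-size and max-size bytes. The function name `grouped_task_bounds` and the 200 GiB table size are hypothetical, chosen just for the example.

```python
def grouped_task_bounds(total_bytes, min_size=52428800, max_size=1073741824):
    """Rough (fewest, most) bounds on the number of grouped tasks.

    Simplifying assumption: every task gets between min_size and
    max_size bytes. This is NOT Tez's real split-grouping algorithm.
    """
    if total_bytes <= 0:
        return (0, 0)
    # Integer ceiling division avoids floating-point rounding.
    fewest = -(-total_bytes // max_size)  # every task filled up to max-size
    most = -(-total_bytes // min_size)    # every task holding only min-size
    return (fewest, most)


total = 200 * 1024**3  # hypothetical table size: 200 GiB

# With the default min/max sizes.
print(grouped_task_bounds(total))  # (200, 4096)

# Lowering max-size to 256 MiB raises the lower bound on task count.
print(grouped_task_bounds(total, max_size=256 * 1024**2))  # (800, 4096)
```

Under this simplified model, even the default settings would give a table of that size far more than one task, so a query that runs as a single mapper probably points at something other than these two limits alone.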

fsm17 · 08-19-2022

Thanks. The data in the table is 100s of GB, and each individual file (partition) is 200MB, so I think it should be dividing the tasks appropriately.

It seems the issue is at startup: the Hive CLI is assigned a single container by YARN, and then the execution proceeds only in that container (regardless of the setting of hive.exec.submitviachild or hive.exec.submit.local.task.via.child).

fsm17 · 08-19-2022

To follow up on the number of containers: if I were running spark-sql, I'd run spark-sql --num-executors=32 and the count query would run across the 32 executors. I cannot figure out a similar setting for the Hive CLI.

rki_ · 08-19-2022

Something similar is discussed in this post, but then again we discussed this already. Maybe tez.grouping.split-count can help. Some more info here, as well as in https://github.com/apache/tez/blob/master/tez-mapreduce/src/main/java/org/apache/tez/mapreduce/group...

VidyaSargur · 08-28-2022

@fsm17, has the reply helped resolve your issue? If so, please mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future.

Regards,

Vidya Sargur,
Community Manager


Hive cli uses only a single container