Created 08-18-2022 03:17 PM
I have created an external table in hive; the table is a set of partitioned parquet files (about 200 of them). After creating the table, I did an msck repair table and analyze table compute statistics.
Then in the hive cli if I run something like "select count(*) from table" it runs a single mapper in a single container. I have hive.exec.parallel set to true; I am using tez as the execution engine. I have also tried setting hive.exec.parallel.thread.number=32 (though I don't think that should matter?). What else am I missing?
Created 08-19-2022 12:18 AM
Hi @fsm17 , As you are using tez as an execution engine, I would suggest to set the below in hive which controls the number of mappers.
Created 08-19-2022 07:21 AM
Thanks. The data in the table is 100s of GB, and each individual file (partition) is 200MB, so I think it should be dividing the tasks appropriately.
It seems the issue is at startup; the hive cli is assigned a single container by yarn. And then the execution proceeds only in that container (regardless of the setting of hive.exec.submitviachild or
hive.exec.submit.local.task.via.child).
Created 08-19-2022 10:08 AM
To follow up the number of containers...e.g. if I were running spark-sql, I'd run spark-sql --num-executors=32 and run the count query would run across the 32 executors. I cannot figure out a similar things for hive cli.
Created 08-19-2022 10:40 AM
Something similar is discussed in this post but then again we discussed this already. maybe tez.grouping.split-count can help. Some more info here as well as in https://github.com/apache/tez/blob/master/tez-mapreduce/src/main/java/org/apache/tez/mapreduce/group...
Created 08-28-2022 10:07 PM
@fsm17, Has the reply helped resolve your issue? If so, please mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future.
Regards,
Vidya Sargur,