I'm using CDH 5.13.1.
I have two Parquet tables: p_t_customer (3 billion rows) and p_t_contract_product (4 billion rows).
I run this simple join query:
select count(*) from p_t_customer c inner join p_t_contract_product p on c.customer_id = p.insured_2;
With Hive on Spark, the query does not finish even after several hours, but with Spark SQL it completes successfully in about 15 minutes.
Here is a screenshot of the Hive on Spark job: it eventually begins to fail.
Why does this happen? Is my Hive configuration wrong? I manually disabled dynamic allocation to avoid timeouts (settings shown below).
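For reference, a minimal sketch of how I disabled it in the Hive session, assuming the standard Spark properties; the executor count here is an illustrative value, not my exact setting:

set spark.dynamicAllocation.enabled=false;
set spark.executor.instances=40; -- illustrative executor count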
You should ideally be running with dynamic allocation enabled.
Are you able to try the query again with a 16 GB executor heap, to test whether this is some sort of memory inefficiency? If so, what are the results?
set spark.executor.memory=16g;
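If the larger heap helps, it may also be worth raising the off-heap container overhead and re-enabling dynamic allocation with the external shuffle service; a sketch, assuming YARN and the property names used by the Spark version bundled with CDH 5 (the overhead value is illustrative):

set spark.yarn.executor.memoryOverhead=2048; -- in MB; illustrative value
set spark.shuffle.service.enabled=true; -- required for dynamic allocation on YARN
set spark.dynamicAllocation.enabled=true;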