Created 05-12-2018 03:21 AM
Hi, all.
I'm using CDH 5.13.1.
I have two Parquet tables: p_t_customer (3 billion rows) and p_t_contract_product (4 billion rows).
I run this simple join SQL:
select count(*) from p_t_customer c inner join p_t_contract_product p on c.customer_id = p.insured_2;
But when I use Hive on Spark, it does not finish even after hours, while with Spark SQL it executes successfully in 15 minutes.
Here is a Hive on Spark screenshot:
Created 05-12-2018 03:34 AM
It finally begins to fail...
Why does this happen? Is my Hive configuration wrong? I manually disabled dynamic allocation to avoid timeouts.
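For reference, disabling dynamic allocation for a Hive on Spark session is typically done with settings like the following; the fixed executor count here is only an illustrative value, not something stated in the post:

```sql
-- Run in the Hive session before the query.
-- Pin a fixed number of executors instead of letting YARN scale them up and down.
set hive.execution.engine=spark;
set spark.dynamicAllocation.enabled=false;
set spark.executor.instances=20;  -- illustrative value; size to your cluster
```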
Created 05-16-2018 09:56 AM
Hello,
You should ideally be running with dynamic allocation enabled.
Are you able to try the query again with a 16 GB executor heap to test whether this is some sort of memory inefficiency? If so, what are the results?
set spark.executor.memory = 16g;
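Putting the suggestion together, a session-level sketch might look like this; the cores and memory-overhead values are assumptions for illustration, not part of the original advice:

```sql
-- Session-level tuning before rerunning the join.
set spark.executor.memory=16g;                -- larger heap per executor, as suggested
set spark.yarn.executor.memoryOverhead=2048;  -- off-heap headroom in MB (assumed value)
set spark.executor.cores=4;                   -- assumed; tune to node capacity

select count(*)
from p_t_customer c
inner join p_t_contract_product p
  on c.customer_id = p.insured_2;
```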
https://www.cloudera.com/documentation/enterprise/latest/topics/admin_hos_tuning.html
Thanks.