05-12-2018 03:21 AM
I'm using CDH 5.13.1.
I have two Parquet tables: p_t_customer (3 billion rows) and p_t_contract_product (4 billion rows).
I use this simple join SQL:
select count(*) from p_t_customer c inner join p_t_contract_product p on c.customer_id = p.insured_2;
When I run it with Hive on Spark, it does not finish even after hours, but when I run it with Spark SQL, it completes successfully in 15 minutes.
Here is the Hive on Spark screenshot:
05-16-2018 09:56 AM
You should ideally be running with dynamic allocation enabled.
Are you able to try the query again with a 16 GB executor heap, to test whether this is some sort of memory inefficiency? If so, what are the results?
set spark.executor.memory = 16g;
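For completeness, a minimal set of session-level settings to test both suggestions together (the property names are standard Hive-on-Spark/Spark configuration keys; the values are illustrative, not tuned for your cluster):

```sql
-- Make sure the query runs through the Hive on Spark engine
set hive.execution.engine=spark;

-- Enable dynamic allocation so Spark can scale executors with the workload
-- (dynamic allocation requires the external shuffle service on the NodeManagers)
set spark.dynamicAllocation.enabled=true;
set spark.shuffle.service.enabled=true;

-- Larger executor heap to rule out memory pressure during the shuffle join
set spark.executor.memory=16g;
```

These can be set per session in the Hive CLI/Beeline before running the query, so you can compare runs without changing cluster-wide defaults.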