Created 10-25-2017 01:24 AM
We have a 2-node cluster (1 master with 4 CPU, 16 GB RAM + 1 data node with 8 CPU, 30 GB RAM), and the estimated amount of data processed through Hive tables is 100 GB. We are using an Ambari Hive View 2.0 instance running on the master, and the estimated number of support/analytics users is around 15-20.

When each user accesses the Hive instance in a separate session, all Hive queries (using Tez) are processed via the YARN default queue. We expect the Hive results to come back in parallel for each session, but the Tez jobs are executed in sequence, and performance is the major constraint here. We don't want to add more nodes because the data being processed is still in the GB range; we want to improve the parallelism of Hive query execution with the current hardware configuration.

We have also applied Hive tuning parameters such as set hive.cbo.enable=true; set hive.compute.query.using.stats=true; set hive.stats.fetch.column.stats=true; set hive.stats.fetch.partition.stats=true; and converted the tables to ORC format. Even then, neither query response time nor parallelism has improved.

Any help with this is highly appreciated. Thanks!
Created 10-25-2017 07:32 AM
"1 data node 8 CPU, 30 GB RAM"
Some assumptions: you have 8 containers in your cluster.
1. Even if you have only 2 GB of data, a single job can consume all 8 containers.
2. If two jobs run in parallel, each will slow down significantly if preemption happens.
You should tune the queue, but add at least one more node to get any significant benefit from parallelism.
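To make the container arithmetic above concrete, here is a rough Python sketch. It is only an illustration under stated assumptions: the 3 GB container size, one vcore per container, and the idea that each concurrent Tez session holds one container for its application master are assumptions, not values taken from this cluster, and the function names are made up for the example.

```python
def max_containers(node_mem_gb, node_vcores, container_mem_gb, container_vcores=1):
    """Containers one node can host, limited by both memory and vcores.

    Assumes YARN enforces both resources; with a memory-only resource
    calculator, only the memory limit would apply.
    """
    by_memory = node_mem_gb // container_mem_gb
    by_cpu = node_vcores // container_vcores
    return min(by_memory, by_cpu)


def task_containers_per_session(total_containers, sessions):
    """Task containers left for each session after every session
    takes one container for its Tez application master (assumed)."""
    if sessions >= total_containers:
        return 0  # the application masters alone fill (or exceed) the cluster
    return (total_containers - sessions) // sessions


# Data node from the question: 30 GB RAM, 8 CPU; 3 GB containers are hypothetical.
total = max_containers(node_mem_gb=30, node_vcores=8, container_mem_gb=3)

for sessions in (1, 2, 4, 15):
    print(sessions, "session(s):", task_containers_per_session(total, sessions),
          "task container(s) each")
```

Under these assumptions, the single data node cannot give 15-20 concurrent sessions any task containers at all, which is why queue tuning alone helps only up to a point.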