Will adding nodes improve performance?
- Labels: Apache Hive, Apache Spark
Created on ‎02-21-2017 01:18 PM - edited ‎09-16-2022 04:07 AM
Hi.
Would adding 3 nodes to my 3-node cluster at least double performance, or are there other parameters to consider to reach a 2x improvement?
Thanks
Created ‎02-21-2017 02:16 PM
This is a very hypothetical one-line question.
I don't think just adding a few extra nodes will double the performance. A few additional parameters you need to consider:
1. The way services are configured in the cluster is also very important. For example: you have 3 nodes now, with, say, 10 services configured across them. After 3 more nodes are added, you need to properly distribute the services across the new nodes as well.
On the existing cluster, without adding new nodes:
1. If possible, add RAM to the existing nodes.
2. Identify which particular services require better performance, such as Hive or Impala. You can tune the environment configuration for those services, for example by increasing the Java heap size.
3. Prioritize the jobs.
etc.
Created ‎02-22-2017 02:43 PM
Note: you will need to rebalance the data across the new nodes to see consistent improvement and to avoid burning out the old nodes.
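To illustrate why rebalancing matters, here is a rough Python sketch that compares each node's disk utilization with the cluster average, which is the general idea behind the HDFS balancer's threshold setting. It is not the balancer's actual code. The function name `utilization_report`, the node names, and the capacity figures are all made up for the example; the 10% threshold is assumed to match the balancer's usual default.

```python
def utilization_report(nodes, threshold_pct=10.0):
    """Classify nodes by how far their utilization is from the cluster average.

    nodes: dict mapping a node name to (used_bytes, capacity_bytes).
    threshold_pct: allowed deviation from the cluster average, in percentage
                   points (assumed here to be 10, as an illustration).
    """
    total_used = sum(used for used, _ in nodes.values())
    total_capacity = sum(cap for _, cap in nodes.values())
    cluster_pct = 100.0 * total_used / total_capacity

    report = {}
    for name, (used, cap) in nodes.items():
        node_pct = 100.0 * used / cap
        diff = node_pct - cluster_pct
        if diff > threshold_pct:
            status = "over-utilized"
        elif diff < -threshold_pct:
            status = "under-utilized"
        else:
            status = "balanced"
        report[name] = (round(node_pct, 1), status)
    return cluster_pct, report


# Hypothetical cluster: three old nodes at 80% and three freshly added empty nodes.
TB = 10**12
nodes = {f"old{i}": (8 * TB, 10 * TB) for i in range(1, 4)}
nodes.update({f"new{i}": (0, 10 * TB) for i in range(1, 4)})

cluster_pct, report = utilization_report(nodes)
print(f"cluster average: {cluster_pct:.1f}%")
for name, (pct, status) in sorted(report.items()):
    print(f"{name}: {pct}% ({status})")
```

Until the blocks are spread out, the old nodes keep holding all the existing data, so reads of that data still land on them and the new nodes mostly sit idle.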
Created ‎04-10-2017 01:57 PM
I've added 4 nodes to my 4-node cluster and I don't see any benefit. Queries against the 8-node cluster perform the same as against the 4-node cluster. All DataNodes have the same specifications.
Created ‎04-10-2017 02:10 PM
Try running the TeraSort test and you will see the difference.
You could possibly tune Hive and/or the query to use more of the cluster or otherwise run faster.
I wasn't as clear as I could have been in my previous answer, though. Adding nodes will not directly boost the performance of every query or job, but it allows the cluster to scale and improves overall cluster performance, i.e. you can now run twice as many jobs, or the same number of jobs on double the amount of data. There are other factors in Hive performance as well, such as the metastore and HiveServer2 (HS2).
Created ‎04-10-2017 02:15 PM
Created ‎04-10-2017 02:17 PM
Yes, we have also rebalanced the data.
Created on ‎04-10-2017 11:23 PM - edited ‎04-10-2017 11:24 PM
I would say that just adding nodes will not result in good performance all the time; as you said, there are some parameters that need to be taken care of. I would consider a few things, such as optimizing joins: put the largest table last in the join, or use a hint like STREAMTABLE. Other options are enabling local mode, tuning the number of mappers and reducers, enabling JVM reuse, and finally using the good old index, which sometimes helps speed up GROUP BY queries in Hive. I also agree with @saranvisa and @mbigelow on their thoughts.
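On the point about tuning reducers, the following Python sketch approximates how Hive on MapReduce estimates the reducer count when it is not set explicitly: input size divided by `hive.exec.reducers.bytes.per.reducer`, capped by `hive.exec.reducers.max`. The function name `estimate_reducers` is hypothetical, and the default values used here (256 MB and 1009) are assumptions that differ between Hive versions, so check your own configuration.

```python
import math


def estimate_reducers(input_bytes,
                      bytes_per_reducer=256 * 1024**2,  # assumed default, version dependent
                      max_reducers=1009):               # assumed default, version dependent
    """Approximate reducer count: ceil(input / bytes per reducer), capped, at least 1."""
    wanted = math.ceil(input_bytes / bytes_per_reducer)
    return max(1, min(max_reducers, wanted))


GB = 1024**3
for size_gb in (0.1, 10, 1024):
    reducers = estimate_reducers(int(size_gb * GB))
    print(f"{size_gb} GB of input -> about {reducers} reducers")
```

If the estimate is smaller than the number of containers the cluster can run at once, the extra nodes have nothing to do during the reduce phase, which is one reason tuning these values can matter more than adding hardware.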
Created ‎04-12-2017 12:42 PM
Even a simple SELECT without any joins does not benefit from doubling the number of workers. Performance remains the same on either 4 nodes or 8 nodes.
Created ‎04-14-2017 01:03 AM
That seems rather normal. Low-complexity queries tend to use a small number of YARN containers.
Adding containers when you don't have a shortage of containers will not speed things up.
But you will be able to handle more concurrent queries without slowing down.
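A back-of-the-envelope Python model of this effect: tasks run in "waves", and each wave can run at most as many tasks as there are free containers. This is a deliberately simplified sketch that ignores scheduling overhead, data locality, and skew. The function name `estimated_runtime`, the containers-per-node figure, and the task counts and durations are all hypothetical.

```python
import math


def estimated_runtime(num_tasks, task_seconds, containers_available):
    """Runtime if equal-length tasks run in waves of at most containers_available."""
    waves = math.ceil(num_tasks / containers_available)
    return waves * task_seconds


CONTAINERS_PER_NODE = 10  # hypothetical capacity per worker node
TASK_SECONDS = 30         # hypothetical duration of one task

for label, tasks in (("small query", 8), ("large query", 400)):
    for node_count in (4, 8):
        containers = node_count * CONTAINERS_PER_NODE
        runtime = estimated_runtime(tasks, TASK_SECONDS, containers)
        print(f"{label}, {node_count} nodes: {runtime} s")
```

In this model the small query already fits into a single wave on 4 nodes, so 8 nodes cannot make it faster, while the large query needs half as many waves. The spare containers on the bigger cluster are what let several small queries run side by side.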
