Explorer
Posts: 40
Registered: ‎01-13-2017

Adding nodes will improve performance ?

Hi.

 

Would adding 3 nodes to my 3-node cluster increase performance by at least 2x? Or are there more parameters to consider to reach a 2x improvement?

 

Thanks

Posts: 388
Topics: 11
Kudos: 60
Solutions: 34
Registered: ‎09-02-2016

Re: Adding nodes will improve performance ?

@MasterOfPuppets

 

This is a very hypothetical "one line" question.

 

I don't think that just adding a few extra nodes will double the performance. A few additional parameters that you need to consider:

 

1. How services are configured in the cluster is also very important. Example: you have 3 nodes now, with, say, 10 services configured across them. After 3 more nodes are added, you need to distribute the services properly across the new nodes as well.


On the existing cluster, without adding new nodes:
1. If possible, add RAM to the existing nodes.
2. Identify which particular services require better performance, such as Hive or Impala. You can tune the environment configuration for those services, for example by increasing the Java heap size.
3. Prioritize the jobs.
etc.

Posts: 642
Topics: 3
Kudos: 105
Solutions: 67
Registered: ‎08-16-2016

Re: Adding nodes will improve performance ?

If you are talking about worker nodes (DataNode, NodeManager, Impala daemon, etc.) and the resource allocation and configuration are identical, then in theory, yes, it should be a 2x improvement.

Note: you will need to rebalance the data across the new nodes to see consistent improvement and not burn out the old nodes.
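
To illustrate why the rebalance matters, here is a minimal sketch of the kind of check the HDFS balancer works from: a DataNode counts as balanced when its utilization is within a threshold (10 percentage points by default) of the cluster-wide utilization. The node names and sizes below are made up for illustration, and this toy model only flags nodes; it does not move any blocks.

```python
from dataclasses import dataclass


@dataclass
class DataNode:
    name: str
    used_gb: float
    capacity_gb: float

    @property
    def utilization(self) -> float:
        """Percentage of this node's capacity that is in use."""
        return 100.0 * self.used_gb / self.capacity_gb


def cluster_utilization(nodes):
    """Percentage of the total cluster capacity that is in use."""
    used = sum(n.used_gb for n in nodes)
    capacity = sum(n.capacity_gb for n in nodes)
    return 100.0 * used / capacity


def out_of_balance(nodes, threshold_pct=10.0):
    """Names of nodes whose utilization differs from the cluster-wide
    utilization by more than threshold_pct percentage points."""
    avg = cluster_utilization(nodes)
    return [n.name for n in nodes if abs(n.utilization - avg) > threshold_pct]


# Hypothetical figures: three old nodes at 60% and three empty new nodes.
old = [DataNode(f"old{i}", 600, 1000) for i in range(1, 4)]
new = [DataNode(f"new{i}", 0, 1000) for i in range(1, 4)]
nodes = old + new

print(cluster_utilization(nodes))  # 30.0
print(out_of_balance(nodes))       # all six nodes are flagged
```

Right after the expansion every node is outside the threshold: the old ones hold all the blocks and the new ones hold none, so tasks that prefer local data keep landing on the old nodes until the blocks are spread out.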
Explorer
Posts: 40
Registered: ‎01-13-2017

Re: Adding nodes will improve performance ?

I've added 4 nodes to my 4-node cluster and I don't see any benefit. Queries against the 8-node cluster perform the same as against the 4-node cluster. All DataNodes have the same specifications.

 

Posts: 642
Topics: 3
Kudos: 105
Solutions: 67
Registered: ‎08-16-2016

Re: Adding nodes will improve performance ?

Then the queries themselves do not use more than the existing cluster's capacity.

Try running the TeraSort test, as it will use the whole cluster, and you will see the difference.

Now, you could possibly tune Hive and/or the query to use more of the cluster or otherwise be faster.

I wasn't as clear as I could have been in my previous answer, though. Adding nodes will not directly boost the performance of every query or job, but it will allow the cluster to scale and improve overall cluster performance, i.e. you can now run twice as many jobs, or the same number of jobs on double the amount of data. There are other factors in Hive performance as well, such as the metastore and HS2.
Posts: 642
Topics: 3
Kudos: 105
Solutions: 67
Registered: ‎08-16-2016

Re: Adding nodes will improve performance ?

Oh, did you rebalance after adding the new nodes? If not, then the data being accessed is not on the new nodes, and therefore it is less likely that you will have containers running on them.
Explorer
Posts: 40
Registered: ‎01-13-2017

Re: Adding nodes will improve performance ?

Yes, we have also rebalanced the data.

Champion
Posts: 563
Registered: ‎05-16-2016

Re: Adding nodes will improve performance ?


I would say that just adding nodes will not result in good performance all the time; as you said, there are some parameters that need to be taken care of. I would consider a few things: optimizing joins, putting the largest table last in the join or using a hint such as STREAMTABLE, enabling local mode, tuning the number of mappers and reducers, enabling JVM reuse, and finally using the good old index, which sometimes helps speed up GROUP BY queries in Hive. I also agree with @saranvisa and @mbigelow on their thoughts.
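
As a rough sketch of what those knobs look like in a Hive session, the snippet below renders a few real Hive/MapReduce properties as SET statements and shows a join with a STREAMTABLE hint. The values, the helper name `hive_session_preamble`, and the tables `users` and `clicks` are assumptions made up for illustration, not recommendations; whether each setting helps depends on the Hive version and execution engine.

```python
def hive_session_preamble(settings):
    """Render a dict of Hive properties as one SET statement per line."""
    return "\n".join(f"SET {key}={value};" for key, value in settings.items())


# Illustrative values only; tune against your own workload.
TUNING = {
    # Let Hive convert joins against small tables into map-side joins.
    "hive.auto.convert.join": "true",
    # Let Hive run sufficiently small jobs in local mode.
    "hive.exec.mode.local.auto": "true",
    # Input bytes handled per reducer (256 MB here); affects reducer count.
    "hive.exec.reducers.bytes.per.reducer": 268435456,
    # JVM reuse for classic MapReduce tasks; assumed here, and it may have
    # no effect on YARN-based MapReduce.
    "mapreduce.job.jvm.numtasks": 4,
}

# Hypothetical tables: `clicks` is the large table, so it is listed last
# and also named in the STREAMTABLE hint.
QUERY = """SELECT /*+ STREAMTABLE(c) */ u.country, COUNT(*)
FROM users u JOIN clicks c ON u.id = c.user_id
GROUP BY u.country;"""

print(hive_session_preamble(TUNING))
print(QUERY)
```

The generated statements can be pasted at the top of a Hive script, which makes it easy to compare runs with and without a given setting.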

Explorer
Posts: 40
Registered: ‎01-13-2017

Re: Adding nodes will improve performance ?

Even a simple SELECT without any joins does not benefit from doubling the number of workers. Performance remains the same on either 4 nodes or 8 nodes.

Posts: 173
Topics: 8
Kudos: 19
Solutions: 19
Registered: ‎07-16-2015

Re: Adding nodes will improve performance ?

That seems rather normal. Low-complexity queries tend to use a small number of YARN containers.

Adding containers when you don't have a shortage of containers will not speed things up.

 

But you will be able to handle more concurrent queries without slowing down.
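
A back-of-the-envelope model makes this concrete. It assumes, purely for illustration, that a query is a set of equal-sized tasks that run in "waves" over the available containers; the container count per node, the task duration, and the task counts are made-up numbers, and real YARN scheduling is more involved.

```python
import math


def query_seconds(tasks, containers, task_seconds):
    """Runtime of one query whose equal-sized tasks run in waves."""
    waves = math.ceil(tasks / containers)
    return waves * task_seconds


def concurrent_queries(tasks_per_query, containers):
    """How many such queries fit in the cluster at once without queueing."""
    return containers // tasks_per_query


CONTAINERS_PER_NODE = 10  # hypothetical
TASK_SECONDS = 30         # hypothetical

for nodes in (4, 8):
    containers = nodes * CONTAINERS_PER_NODE
    small = query_seconds(8, containers, TASK_SECONDS)    # simple SELECT
    large = query_seconds(400, containers, TASK_SECONDS)  # heavy job
    print(nodes, small, large, concurrent_queries(8, containers))

# Output:
# 4 30 300 5
# 8 30 150 10
```

In this model the small query takes 30 seconds on both clusters because it never needed more than 8 containers, while the heavy job halves its runtime and the number of small queries that can run side by side doubles.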
