
Will adding nodes improve performance?

avatar

Hi.

 

Would adding 3 nodes to my 3-node cluster obviously increase performance by at least 2x? Or are there more parameters to consider in order to reach 2x?

 

Thanks

11 REPLIES

avatar
Champion

@MasterOfPuppets

 

That is a very hypothetical "one line" question.

I don't think just adding a few extra nodes will double the performance. Here are a few of the additional parameters you need to consider:

 

1. The way services are configured in the cluster is also very important. For example, suppose you have 3 nodes now and 10 services are configured across them. After 3 more nodes are added, you need to distribute the services properly across the new nodes as well.


On the existing cluster, without adding new nodes:
1. If possible, add RAM to the existing nodes.
2. Identify which particular services require better performance, such as Hive or Impala, and tune the configuration for those services, e.g. increase the Java heap size.
3. Prioritize the jobs.
etc.

avatar
Champion
If you are talking about worker nodes (DataNode, NodeManager, Impala daemon, etc.) and the resource allocation and configuration are identical, then in theory yes, it should be a 2x improvement.

Note: you will need to rebalance the data across the new nodes to see consistent improvement and not burn out the old nodes.
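
To illustrate why rebalancing matters after adding nodes, here is a minimal sketch of the kind of check the HDFS balancer performs: a DataNode counts as balanced when its disk utilization is within a threshold (10 percentage points is the default for `hdfs balancer -threshold`) of the cluster-wide utilization. The node names and sizes below are made up for illustration, and the function is a simplified model, not the balancer's actual code.

```python
def nodes_outside_threshold(used, capacity, threshold_pct=10.0):
    """Return (cluster utilization %, {node: deviation in percentage points})
    for every node whose utilization differs from the cluster average by
    more than threshold_pct. Simplified model of the balancer's criterion."""
    cluster_pct = 100.0 * sum(used.values()) / sum(capacity[n] for n in used)
    unbalanced = {}
    for node, node_used in used.items():
        node_pct = 100.0 * node_used / capacity[node]
        deviation = node_pct - cluster_pct
        if abs(deviation) > threshold_pct:
            unbalanced[node] = round(deviation, 1)
    return cluster_pct, unbalanced


# Hypothetical cluster: 3 old nodes at 8 TB used of 10 TB, 3 new empty nodes.
used_tb = {"old1": 8, "old2": 8, "old3": 8, "new1": 0, "new2": 0, "new3": 0}
capacity_tb = {name: 10 for name in used_tb}

cluster_pct, unbalanced = nodes_outside_threshold(used_tb, capacity_tb)
print(f"cluster utilization: {cluster_pct:.0f}%")
for node, deviation in unbalanced.items():
    print(f"{node}: {deviation:+.1f} points from average")
```

In this made-up case every node is outside the threshold: the old nodes hold all the blocks, so tasks that prefer local data keep landing on them until the data is rebalanced.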

avatar

I've added 4 nodes to my 4-node cluster and I don't see any benefit. Queries against the 8-node cluster perform the same as against the 4-node cluster. All DataNodes have the same specifications.

 

avatar
Champion
Then the queries themselves do not use more than the existing cluster's capacity.

Try running the TeraSort test, as it will use the whole cluster, and you will see the difference.

Now you could possibly tune Hive and/or the query to use more of the cluster or otherwise be faster.

I wasn't as clear in my previous answer though. Adding nodes will not directly boost the performance of every query or job, but it allows the cluster to scale and improves overall cluster performance, i.e. you can now run twice as many jobs, or the same number of jobs on double the amount of data. There are other factors in Hive performance as well, such as the metastore and HS2.
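
The difference between single-job latency and overall throughput can be sketched with a toy model. It assumes identical jobs that each hold a fixed number of containers for a fixed time and run in waves, and that a single job fits in the cluster; the container counts and durations are invented for illustration, not measured.

```python
import math


def makespan_seconds(num_jobs, containers_per_job, job_seconds, cluster_containers):
    """Time to finish num_jobs identical jobs under a simple wave model.
    Assumes containers_per_job <= cluster_containers."""
    concurrent_jobs = max(1, cluster_containers // containers_per_job)
    waves = math.ceil(num_jobs / concurrent_jobs)
    return waves * job_seconds


# Hypothetical numbers: 10 containers per node, each job needs 20 containers
# and takes 60 seconds.
for nodes in (4, 8):
    slots = nodes * 10
    one_job = makespan_seconds(1, 20, 60, slots)
    eight_jobs = makespan_seconds(8, 20, 60, slots)
    print(f"{nodes} nodes: 1 job = {one_job}s, 8 jobs = {eight_jobs}s")
```

Under these assumptions a single job takes the same time on 4 or 8 nodes, while a batch of 8 jobs finishes in half the time on 8 nodes.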

avatar
Champion
Oh, did you rebalance after adding the new nodes? If not, then the data being accessed is not on them, and therefore it is less likely that you will have containers running on the new nodes.

avatar

Yes we have also rebalanced the data.

avatar
Champion

I would say that just adding nodes will not always result in good performance; as you said, there are some parameters that need to be taken care of. I would consider a few things: optimizing joins, placing the largest table last in a join or using a hint like STREAMTABLE, enabling local mode, tuning the number of mappers and reducers, enabling JVM reuse, and finally using the good old index, which sometimes helps speed up GROUP BY queries in Hive. I also agree with @saranvisa and @mbigelow on their thoughts.
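
As a rough sketch of what some of these knobs look like in practice, the snippet below renders a few Hive session settings as `SET` statements and shows a STREAMTABLE hint. The property names are standard Hive properties, but the values, the table names `small_t` and `big_t`, and the helper function are illustrative assumptions; check them against your Hive version before use.

```python
# Illustrative values only; the right settings depend on the workload.
HIVE_SESSION_SETTINGS = {
    "hive.exec.mode.local.auto": "true",   # let Hive run small jobs in local mode
    "hive.auto.convert.join": "true",      # convert joins to map joins when one side is small
    "hive.exec.reducers.bytes.per.reducer": str(256 * 1024 * 1024),  # input bytes per reducer
}

# Hypothetical tables: big_t is the large table to be streamed through the join.
STREAMTABLE_QUERY = (
    "SELECT /*+ STREAMTABLE(big_t) */ small_t.val, big_t.val "
    "FROM small_t JOIN big_t ON (small_t.key = big_t.key)"
)


def render_set_statements(settings):
    """Turn a {property: value} mapping into Hive SET statements, sorted by name."""
    return [f"SET {name}={value};" for name, value in sorted(settings.items())]


statements = render_set_statements(HIVE_SESSION_SETTINGS)
for statement in statements:
    print(statement)
print(STREAMTABLE_QUERY + ";")
```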

avatar

Even a simple select without any joins does not benefit from doubling the number of workers. Performance remains the same on either 4 nodes or 8 nodes.

avatar
Super Collaborator

That seems rather normal. Low-complexity queries tend to use a small number of YARN containers.

Adding containers when you don't have a shortage of containers will not speed things up.

 

But you will be able to handle more concurrent queries without slowing down.
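
A back-of-the-envelope sketch of this point: if a query's parallelism is roughly its input size divided by the split size, a small scan needs only a handful of containers and finishes in one wave regardless of cluster size. The 128 MB split size and the slot counts are assumptions chosen for illustration.

```python
import math


def task_waves(input_mb, split_mb, cluster_slots):
    """Return (number of tasks, number of waves needed to run them),
    assuming one task per input split and one container slot per task."""
    tasks = math.ceil(input_mb / split_mb)
    waves = math.ceil(tasks / cluster_slots)
    return tasks, waves


# Hypothetical cluster sizes: 40 slots (4 nodes) versus 80 slots (8 nodes).
for label, input_mb in (("1 GB scan", 1024), ("50 GB scan", 51200)):
    for slots in (40, 80):
        tasks, waves = task_waves(input_mb, 128, slots)
        print(f"{label} on {slots} slots: {tasks} tasks, {waves} wave(s)")
```

Under these assumptions the 1 GB scan runs 8 tasks in a single wave on either cluster, so the extra nodes sit idle; only the 50 GB scan, with 400 tasks, drops from 10 waves to 5.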