
There are several areas where a traditional RDBMS platform is used within an HDP environment: Ambari uses one to store the cluster configuration, Hive stores its metastore information, Oozie stores its jobs and config, and Ranger stores its policies.

There is a range of DB options you can choose from for the different components; an example compatibility matrix is shown here:

One element that is not very well documented is how much database space may be required if you're starting down the path of building a fairly large cluster. There are a number of reasons for this, the main one being that it genuinely varies with how the cluster is used.

That said, I've gathered some database, cluster and time metrics from a number of production environments used by Hortonworks customers and come up with a simple formula that should at least give you a rough order-of-magnitude estimate of the database size required for each major component.

Two major variables seem to play a part in the calculations: the first is the node count within the cluster, the second is the length of time the cluster has been running. For simplicity's sake I'm using both for every component, and while that is not strictly accurate it should give you a rough estimate.

In these calculations, node count also serves as an indicator of environment complexity.

So, the numbers in this case, per node per month, are:

  • Ambari 0.7MB
  • Ranger 0.5MB
  • Oozie 0.5MB
  • Hive Metastore 5MB

Then all you need to do is take the relevant number above and multiply it by the number of nodes in the cluster and by the duration (in months) over which you want to estimate the database usage.

For example:

Ambari on a 100 node cluster over 2 years would be: 0.7 x 100 x 24 = 1680MB or 1.68GB approx

Hive Metastore on a 75 node cluster over 1 year would be: 5 x 75 x 12 = 4500MB or 4.5GB approx
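The same arithmetic can be sketched as a small Python script. This is only an illustration of the rule of thumb above: the function name `estimate_db_size_mb` and the dictionary name are made up for this example, and the figures are the rough per-node, per-month numbers listed earlier.

```python
# Rough growth figures in MB per node per month, copied from the list above.
# These are rule-of-thumb values, not guarantees.
MB_PER_NODE_PER_MONTH = {
    "Ambari": 0.7,
    "Ranger": 0.5,
    "Oozie": 0.5,
    "Hive Metastore": 5.0,
}


def estimate_db_size_mb(component, nodes, months):
    """Return a rough database size in MB: figure x nodes x months."""
    return MB_PER_NODE_PER_MONTH[component] * nodes * months


# Print an estimate for every component on a 100 node cluster over 2 years.
for component in MB_PER_NODE_PER_MONTH:
    size_mb = estimate_db_size_mb(component, nodes=100, months=24)
    print(f"{component}: {size_mb:.0f} MB (about {size_mb / 1000:.2f} GB)")
```

Swapping in your own node count and duration gives the equivalent of the two worked examples above for each component in one go.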

Now, please remember that this is a very rough approximation, built from a handful of data points from a small set of customers with real-world clusters, so don't take this simplistic estimate as a concrete promise. As always, how you use the cluster can severely skew any of these figures. For example, if you run thousands of jobs via Oozie every day, expect the Oozie database to grow significantly quicker; the same applies to Ambari if you are making continuous config changes, for example via the API.

However, I think the above is a reasonable start, and feedback would be very welcome. Once I've received some more feedback I'll look to get this into the formal Hortonworks documentation.

Hope this helps.

New Contributor

This is very helpful. Do you have any rough estimates around CPU and memory requirements?