02-01-2017 11:54 AM - edited 02-01-2017 01:26 PM
We have an application that is backed by a 'small' database (~1TB of data in Oracle SE). The database is growing slowly at the moment (about 1GB per month), but that rate will increase in the future. A lot of this data is historical / cold. We are considering using a product called Gluent to help us offload data to a Hadoop cluster. However, Hadoop is completely new to us, and at first glance it seems like overkill. That said, I can see many advantages to having a Hadoop cluster as a 'data lake' for both the database data and various application data that is not stored in the database at the moment (e.g. data files that currently reside on the application server).
Based on my (very limited) understanding of Gluent, in our case the majority of the processing would still happen in Oracle, with only occasional queries against the 'cold' data in Hadoop, so responsiveness is not a very high priority. In other words, I believe we might be able to get away with 'low end' specs.
My question is: what are the minimum hardware specs for a small cluster to fit our scenario? I'm sure there's an element of "it depends" in the answer, but I guess I want to verify that it is feasible to start a 'production' cluster with minimal resources (e.g. balanced nodes w/ ~4-6 CPU, ~16-32GB RAM, and ~500GB - 1TB disks), with the ability to scale up in the years to come.
02-06-2017 03:43 PM
Josh here, from Cloudera. Thanks for reaching out on this.
As for whether your outlined configuration would work, the short answer is: perhaps.
You might have already seen it, but I'll point to this blog post as a reference. It's a good read, and it includes a matrix for deciding the specs for your cluster's nodes. If you look, you'll see that the configuration you're proposing is in the neighborhood of a "Light Processing Configuration", but it falls short of every other configuration listed. As long as you don't build a fully stacked cluster with every service imaginable (it seems like you don't intend to do that), the "Light Processing" config could suffice. You can also check out this other community post to get a better idea of how many nodes you would want when speccing your cluster.
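For a rough feel of the node count from a storage point of view only, here is a back-of-the-envelope sketch using the numbers in the question. It is not an official sizing formula. It assumes the HDFS default replication factor of 3, that only about 75% of each disk is usable for HDFS blocks (an illustrative figure, not a recommendation), that 1TB is treated as 1000GB, that the full ~1TB is offloaded with no compression, and that growth stays at 1GB per month. The function name `nodes_needed` and its parameters are hypothetical.

```python
import math


def nodes_needed(data_gb, growth_gb_per_month, months, disk_gb_per_node,
                 replication=3, usable_fraction=0.75):
    """Estimate the data nodes needed to store the data after `months`.

    Storage only: CPU, RAM, and master/management nodes are not considered.
    """
    # Logical data volume at the end of the planning horizon.
    logical_gb = data_gb + growth_gb_per_month * months
    # Every block is stored `replication` times across the cluster.
    raw_gb = logical_gb * replication
    # Assumed share of each disk that is actually available for HDFS blocks.
    usable_per_node_gb = disk_gb_per_node * usable_fraction
    # Never go below `replication` nodes, so each replica can sit on its own node.
    return max(replication, math.ceil(raw_gb / usable_per_node_gb))


if __name__ == "__main__":
    # ~1TB today, 1GB/month growth, three-year horizon.
    for disk_gb in (500, 1000):
        print(f"{disk_gb}GB disks: {nodes_needed(1000, 1, 36, disk_gb)} data nodes")
```

Under these assumptions the estimate comes out to 9 data nodes with 500GB disks and 5 with 1TB disks. If only the cold portion of the data is offloaded, or the data compresses well, the real requirement could be noticeably lower.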
So, in short, perhaps. Let me know if this helps or if you have any other questions.
02-09-2017 03:45 AM