
Physical layout of architecture

Contributor

I've been mulling over a design architecture for a deployment, and I'm looking for some input on how to physically lay out the environment.

My current thought is to build a single data storage cluster with HDFS, made up of small machines with large storage, plus a separate cluster for my processing layer (Spark/YARN/Oozie/Elastic/etc.) and a DB cluster holding Hive. I don't know whether this model is actually efficient, though, or whether I should stick with a single cluster and just manage the services on each individual node.

What are everyone's thoughts on these two options?

1 ACCEPTED SOLUTION


Hi @Christopher Amatulli. I'd strongly advise against siloing your cluster into separate storage, processing, and service tiers. That goes against the concept of a cluster and moves you back into traditional application silos.

Think of it more as a single cluster with distributed and shared storage and processing. You may want to assign certain servers to certain services based on high-availability requirements or IO/CPU/memory requirements, but the cluster as a whole will be under a single operations and management service (Ambari) as well as a single resource layer (YARN).

For small clusters you may have two master servers, an edge node, and n data nodes. You should review our cluster planning guide http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.2/bk_cluster-planning-guide/content/ch_hardwar... as well as any number of good design articles on HCC.
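To make that concrete, here's a minimal sketch of how a single cluster with services pinned to host groups might be described as an Ambari Blueprint and registered through Ambari's REST API. The host-group split, component list, hostname, and credentials are illustrative assumptions, not a prescribed layout (a real blueprint would also need ZooKeeper, client components, and so on):

```python
import json
import requests

# Illustrative blueprint: ONE cluster, with services pinned to host groups.
# Component names follow Ambari's conventions; the grouping itself is an example.
blueprint = {
    "Blueprints": {"blueprint_name": "small-cluster",
                   "stack_name": "HDP", "stack_version": "2.4"},
    "host_groups": [
        {"name": "masters", "cardinality": "2",
         "components": [{"name": "NAMENODE"}, {"name": "RESOURCEMANAGER"}]},
        {"name": "edge", "cardinality": "1",
         "components": [{"name": "HIVE_SERVER"}, {"name": "OOZIE_SERVER"}]},
        {"name": "workers", "cardinality": "1+",
         "components": [{"name": "DATANODE"}, {"name": "NODEMANAGER"}]},
    ],
}

# Register the blueprint with the Ambari REST API
# (the host and credentials below are placeholders).
requests.post(
    "http://ambari-host:8080/api/v1/blueprints/small-cluster",
    auth=("admin", "admin"),
    headers={"X-Requested-By": "ambari"},
    data=json.dumps(blueprint),
)
```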

Hope this helps.


REPLIES


Contributor

Thanks, and I do agree. I'm still working on breaking out of the layered-architecture mindset. HDF muddled my progress on that, since in another thread someone recommended it live on a separate cluster from HDP. Any chance you have a link that outlines which components are recommended together on the same machine versus which are not? For example, I know it's not recommended to run an HDFS NameNode on the same machine as a DataNode. I was curious whether there are documents that break that out better.


Building out a cluster is a bit of a puzzle and gets especially hairy when the cluster is small, say fewer than 12 nodes. For good or bad, this is how I tend to generalize my approach:

1. There are master services (NN, RM) and there are client services (Spark, Hive). Think HA and redundancy for master services. It's best not to co-locate multiple master services, since that creates a single point of failure (SPOF). Do not co-locate master and worker (HDFS) services.

2. Services such as Storm, HBase, and Solr will do better on dedicated servers because of their high resource requirements. Not required of course, but be cognizant of the trade-offs.

3. Spark is memory-bound, Kafka is IO-bound, Storm is CPU-bound. When looking at co-locating services, try to mix and match. Don't put two memory-bound services on a single server.

4. I prefer to have a small, dedicated Ambari server. Seems cleaner to me but your mileage may vary.

5. Try to use existing database infrastructure for all your metastores, e.g. Oracle.

6. Never use a SAN.

7. Think about virtualizing master services, edge nodes, and dev.

This list is by no means exhaustive, and every architect will have additional details (e.g. placing the Spark History Server on the same server as HiveServer2). When it really comes down to it, you plan for the worst and hope for the best. Your cluster WILL change over time... guaranteed. A toy checker encoding a couple of these rules is sketched below.
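Here's that sketch: a minimal Python "layout linter" that encodes rules 1 and 3 from the list and flags violations in a proposed host-to-services mapping. The service names and the memory/IO/CPU tags are my own illustrative assumptions, not an official taxonomy:

```python
# Toy linter for two of the rules of thumb above. Names/tags are illustrative.
MASTERS = {"namenode", "resourcemanager", "hbase_master"}
WORKERS = {"datanode", "nodemanager"}
RESOURCE_BOUND = {"spark": "memory", "kafka": "io",
                  "storm": "cpu", "hbase_regionserver": "memory"}

def lint(layout):
    """layout maps hostname -> set of service names; returns rule violations."""
    problems = []
    for host, services in layout.items():
        masters_here = services & MASTERS
        # Rule 1: stacking masters creates a SPOF; masters shouldn't share
        # a host with worker services.
        if len(masters_here) > 1:
            problems.append(f"{host}: co-located masters {sorted(masters_here)}")
        if masters_here and services & WORKERS:
            problems.append(f"{host}: master and worker services together")
        # Rule 3: don't stack two services bound by the same resource.
        by_resource = {}
        for svc in services & RESOURCE_BOUND.keys():
            by_resource.setdefault(RESOURCE_BOUND[svc], []).append(svc)
        for resource, svcs in by_resource.items():
            if len(svcs) > 1:
                problems.append(f"{host}: {resource}-bound services {sorted(svcs)}")
    return problems

print(lint({
    "node1": {"namenode", "resourcemanager"},        # violates rule 1
    "node2": {"spark", "hbase_regionserver"},        # violates rule 3
    "node3": {"datanode", "nodemanager", "kafka"},   # fine: mixed workloads
}))
```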

Of course you could just deploy in Azure HDInsight and be done with it.... 😉

Contributor

Thanks, that answers all my questions.

I'd be all in HDInsight if MS would give me a free dev environment 🙂

Super Guru
@Christopher Amatulli

Hadoop was created to work with locally attached storage; the whole idea is to bring compute to the storage. This gives you failure redundancy, parallel processing on local data, and reliability, since a disk or node failure simply kicks off an automatic mechanism to re-replicate the lost data.

For best performance you should have your data local to where your compute is and where your job is running. So Spark should read data from a local partition rather than from remote storage. That being said, for cost purposes companies might put old data in low-cost storage like S3 and then run their jobs against that remote storage. This works, but with the expectation that it is going to be slow compared to reading data from local disk.
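As a minimal sketch of what that difference looks like from a Spark job's point of view (the paths and bucket below are hypothetical, and the S3 read assumes the s3a connector and credentials are configured):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("locality-example").getOrCreate()

# Reading from HDFS: blocks can be served by DataNodes co-located with the
# executors, so most reads never leave the node.
local_df = spark.read.parquet("hdfs:///data/events/recent/")  # hypothetical path

# Reading from S3: every byte crosses the network, so expect it to be slower.
# Requires the s3a connector (hadoop-aws) on the classpath.
archive_df = spark.read.parquet("s3a://my-archive-bucket/events/2014/")  # hypothetical bucket

print(local_df.count(), archive_df.count())
```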

So, it depends on what your requirements are. Are you going to have a PB of data? That might be a good reason to use remote low-cost storage like S3 to save money.

Depending on your requirements, you may keep storage separate, but that is not how you would usually go about it.

Also, Hive doesn't need to be separate. It runs on your compute nodes and reads data that is already in HDFS. It is not a database in the sense of having its own storage. You store data in HDFS (whether on local or remote storage), create Hive tables over that data, and run your queries. As you can imagine, this is more efficient when the data is local.
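To make the "tables over files" point concrete, here's a minimal sketch using Spark's Hive support; the table name, columns, and HDFS location are made-up examples:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-schema-on-read")
         .enableHiveSupport()      # use the Hive metastore for table metadata
         .getOrCreate())

# The table is just metadata: it points at files already sitting in HDFS,
# so no data is copied into a separate database.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS web_logs (
        ts STRING, url STRING, status INT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
    LOCATION 'hdfs:///data/raw/web_logs'
""")

# Queries run on the compute nodes, against the data where it lives.
spark.sql("SELECT status, COUNT(*) FROM web_logs GROUP BY status").show()
```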

Hope this helps. Please feel free to comment if you have additional questions.

Super Guru

@Christopher Amatulli

You can see a certified reference architecture for HDP here. This document shows the distribution of services across different machines; see page 5.

https://hortonworks.com/wp-content/uploads/2013/10/4AA5-9017ENW.pdf