Many organizations have come to rely on Hadoop for dealing with the ever-increasing quantities of data that they gather. Today, it is clear what problems Hadoop can solve, however, cloud is still not the first choice for Hadoop deployment. Pros and cons for Hadoop in the cloud have been shared across multiple blogs and books, but the question is always coming-up in discussions with enterprises considering Hadoop in the Cloud. Thus, I thought it would be useful to collate together a few pros and cons, as well as mention a pragmatic approach to consider a hybrid cloud for organizations that have made significant investments on-prem. For organizations going to Hadoop for the first time, Cloud is probably a better bet especially if they don't have a lot of IT expertise and a great stream of revenue exists and needs to be exploited immediately.
Lack of space. You don’t have space to keep racks of physical servers, along with the necessary power and cooling.
Flexibility. It is much easier to reorganize instances, or expand or contract your footprint, for changing business needs. Everything is controlled through cloud provider APIs and web consoles. Changes can be scripted and put into effect manually or even automatically and dynamically based on current conditions.
New usage patterns. Cloud providers abstract computing resources such that they are not tied to physical configurations, which means they can be managed in ways that are otherwise impractical. For example, individuals could have their own instances, clusters, and even networks to work with, without much managerial overhead. The overall budget for CPU cores in your cloud provider account can be concentrated in a set of large instances, a larger set of smaller instances, or some mixture, and can even change over time. When an instance malfunctions, instead of troubleshooting what went wrong, you can just tear it down and replace it.
Worldwide availability. The largest cloud providers have data centers around the world. You can use resources close to where you work, or close to where your customers are, for the best performance. You can set up redundant clusters, or even entire computing environments, in multiple data centers, so that if local problems occur in one data center, you can shift to working elsewhere.
Data retention restrictions. If you have data that is required by law to be stored within specific geographic areas, you can keep it in clusters that are hosted in data centers in those areas.
Cloud provider features. Each major cloud provider offers an ecosystem of features to support the core functions of computing, networking, and storage. To use those features most effectively, your clusters should run in the cloud provider as well.
Capacity. Very few customers tax the infrastructure of a major cloud provider. You can establish large systems in the cloud that are not nearly as easy to put together, not to mention maintain, on-prem.
Simplicity. Cloud providers start you off with reasonable defaults, but then it is up to you to figure out how all of their features work and when they are appropriate. It takes a lot of experience to become proficient at picking the right types of instances and arranging networks properly.
High levels of control. Beyond the general geographic locations of cloud provider data centers and the hardware specifications that providers reveal for their resources, it is not possible to have exacting, precise control over your cloud architecture. You cannot tell exactly where the physical devices sit, or what the devices near them are doing, or how data across them shares the same physical network1. When the cloud provider has internal problems such as network outages, there’s not much you can do but wait.
Unique hardware needs. You cannot have cloud providers attach specialized peripherals or dongles to their hardware for you. If your application requires resources that exceed what a cloud provider offers, you will need to host that part on-prem away from your Hadoop clusters.
Saving money. For one thing, you are still paying for the resources you use. The hope is that the economy of scale that a cloud provider can achieve makes it more economical for you to pay to “rent” their hardware than to run your own. You will also still need people who understand system administration and networking to take care of your cloud infrastructure. Inefficient architectures can cost a lot of money in storage and data transfer costs, or instances that are running idle.
Best of Both
Instead of running your clusters and associated applications completely in the cloud or completely on-prem, the overall system is split between the two - Hybrid Cloud.
Data channels are established between the cloud and on-prem worlds to connect the components needed to perform work.
Suppose there is a large, existing on-prem data processing system, perhaps using Hadoop clusters, which works well. In order to expand its capacity for running new analyses, rather than adding more on-prem hardware, Hadoop clusters can be created in the cloud. Data needed for the analyses is copied up to the Hadoop clusters where it is analyzed, and the results are sent back on-prem. The cloud clusters can be brought up and torn down in response to demand, which helps to keep costs lower.
Assume the management of vast amounts of incoming data that needs to be centralized and processed. To avoid having one single choke point where all of the raw data is sent, a set of cloud clusters can share the load, perhaps each in a geographic location convenient to where the data is generated. These clusters can perform pre-processing of the data, such as cleaning and summarization, to reduce the work that the final centralized system must perform.