
In recent years, many companies have come up with (Hybrid) Cloud strategies, and there is no shortage of recommendations for when and how to use Cloud providers such as AWS, Azure, and GCP. However, most advice is geared towards a generic setting and may not translate well to a Big Data context.


And here is the main reason why:


BIG Data requires BIG resources that have a BIG cost impact


This does not mean that any specific vendor is expensive, and it is not limited to a single deployment form such as PaaS or SaaS. It comes down to the fundamental economics: infrastructure providers must cover the costs they incur, and if you use their solution for a large amount of data, these costs will be larger than for a small, lightweight app.


In this context, hopefully, the following advice makes sense.


Avoid Network Usage

Perhaps the largest difference between a Big Data architecture and a regular architecture is how much it matters whether data is processed close to its source. This is often casually referred to as 'Data Gravity'.

If you get a large volume of data from a source in one location (e.g. on-premises or in a specific Cloud zone), it can be a very good idea to process the data close to that source. By filtering, aggregating, or otherwise reducing data before sending it across zones, data transfer fees can be reduced or avoided. Fun fact:


Buying a 10TB hard drive can cost you less than downloading 10TB from the Cloud.


Of course, the costs are less visible for uploading data to the cloud, but if your data comes from on-premises, you will easily find yourself needing to expand the internet capacity once the data volumes grow.
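
To make the effect of reducing data near the source concrete, here is a minimal sketch. The event feed, the function names, and especially the flat per-GB fee are assumptions made for illustration; actual transfer pricing is tiered and differs per provider, so check the current price list before relying on any number.

```python
from collections import defaultdict


def aggregate_near_source(events):
    """Sum readings per sensor so only one row per sensor leaves the source zone."""
    totals = defaultdict(float)
    for sensor_id, value in events:
        totals[sensor_id] += value
    return dict(totals)


def transfer_cost_usd(volume_gb, price_per_gb):
    """Cost of moving volume_gb across zones at a flat per-GB fee."""
    return volume_gb * price_per_gb


# Hypothetical raw feed: six events from two sensors.
raw_events = [
    ("s1", 1.0), ("s2", 2.0), ("s1", 3.0),
    ("s2", 4.0), ("s1", 5.0), ("s2", 6.0),
]
reduced = aggregate_near_source(raw_events)

# ASSUMPTION: illustrative flat fee, not a quote from any provider.
ASSUMED_PRICE_PER_GB = 0.09
RAW_VOLUME_GB = 10_000  # roughly 10 TB

# Assume the volume shrinks in proportion to the row count.
reduction_ratio = len(reduced) / len(raw_events)
raw_cost = transfer_cost_usd(RAW_VOLUME_GB, ASSUMED_PRICE_PER_GB)
reduced_cost = transfer_cost_usd(RAW_VOLUME_GB * reduction_ratio, ASSUMED_PRICE_PER_GB)

print(f"raw: {raw_cost:.2f} USD, reduced: {reduced_cost:.2f} USD")
```

The point is not the exact figures but the shape of the calculation: the fee scales with the volume that crosses the zone boundary, so every row dropped or merged before transfer lowers the bill.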

Flexible Cloud Infrastructure vs Economical On-Prem infrastructure

This may come as a shock, but even hyperscale Cloud providers such as AWS, Azure and GCP do not claim their infrastructure is cheaper. The careful observer can note that instead Cloud providers indicate that the TCO should be lower when using Cloud, rather than purely the infrastructure costs. Though this can certainly be true, one should realize that a cost analysis for a few lightweight applications will be different than for a heavy platform.


My personal rule of thumb is that the break-even point for a cheaper TCO is around 30% utilization. Though it depends on your company context and exact solution, I have never seen anyone assume a break-even point outside the 15%-60% range, which leads to the following disturbing point:


If your server utilization is near 100%, the TCO will increase when going to the Cloud


This is completely independent of the solution. The only exceptions that I found so far are if your on-premise licenses cost several times more than the underlying hardware, or if it is actually possible to shut down a poorly utilized data center. 
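
The break-even reasoning can be sketched with a deliberately simple model. It assumes, purely for illustration, that the on-premises TCO is fixed regardless of utilization and that the Cloud TCO scales linearly with utilization; the function names and the cost figures are hypothetical and chosen so the break-even lands at the 30% rule of thumb.

```python
def cloud_tco(utilization, cloud_tco_at_full_load):
    """Pay-per-use model: Cloud cost scales linearly with utilization (0.0 to 1.0)."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0.0 and 1.0")
    return utilization * cloud_tco_at_full_load


def break_even_utilization(onprem_tco, cloud_tco_at_full_load):
    """Utilization at which the linear Cloud cost equals the fixed on-prem cost."""
    if cloud_tco_at_full_load <= 0:
        raise ValueError("cloud_tco_at_full_load must be positive")
    return min(1.0, onprem_tco / cloud_tco_at_full_load)


def cheaper_option(utilization, onprem_tco, cloud_tco_at_full_load):
    """Return 'cloud' or 'on-prem', whichever has the lower TCO in this model."""
    if cloud_tco(utilization, cloud_tco_at_full_load) < onprem_tco:
        return "cloud"
    return "on-prem"


# ASSUMPTION: made-up yearly figures, not measured costs.
ONPREM_TCO = 300_000.0
CLOUD_TCO_FULL = 1_000_000.0

break_even = break_even_utilization(ONPREM_TCO, CLOUD_TCO_FULL)
print(f"break-even utilization: {break_even:.0%}")
print("at 10% utilization:", cheaper_option(0.10, ONPREM_TCO, CLOUD_TCO_FULL))
print("at 95% utilization:", cheaper_option(0.95, ONPREM_TCO, CLOUD_TCO_FULL))
```

A real comparison would add licenses, staffing, reserved-capacity discounts, and data center costs, but even this toy model shows why a near-100% utilized platform tends to be cheaper on-premises.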


Of course, the cost is often not the key objective when starting a Cloud journey. There are many reasons for using the Cloud, such as infrastructure flexibility and ease of use. The value of these may in fact outweigh the total cost of any scenario. In a (Hybrid) cloud strategy, the trick is to identify the key value points and meet these without incurring explosive costs. For example:

  • Do some use cases have low average and huge peak loads? --> These seem like excellent candidates for the Cloud.
  • Do some process steps, such as Development, require more flexibility? --> These can also be excellent candidates for the Cloud.

Of course, this last point only applies if you have a consistent solution in the Cloud and On-premises, such as the Cloudera Data Platform. In short, the key takeaway is:


Identify use cases and process steps that benefit from flexibility, and bring these to the Cloud while keeping the TCO under control. 

Stay in control of your data

A third reason why Cloud strategies do not automatically cover Big Data is that it is all about Data! In order to stay in control, it is important to ensure both accessibility and security.


A general security architecture will think about infrastructure-level security, perhaps even file-level security. However, in a Big Data world, we must go one level deeper and really nail down data-level security. It has become very common that business units may only see a limited set of rows (e.g. from their own unit) or columns (e.g. only non-sensitive data) from a single table. So it is great that one can define a security policy on a Cloud object storage bucket, or perhaps on files within it, but that is really not sufficient anymore in this ever-changing world.


It is no longer sufficient to give permissions on the file or table level; security MUST be applied on rows and columns
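
As a rough illustration of what row- and column-level security means in practice, the sketch below filters a small in-memory table. `AccessPolicy`, `apply_policy`, and the sample table are hypothetical names invented for this example; in a real platform the policy is enforced centrally by the security layer, not in application code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessPolicy:
    """Hypothetical policy: which units' rows and which columns a user may see."""
    allowed_units: frozenset
    visible_columns: frozenset


def apply_policy(rows, policy, unit_column="unit"):
    """Keep only rows of allowed units, then drop every column not marked visible."""
    return [
        {col: val for col, val in row.items() if col in policy.visible_columns}
        for row in rows
        if row[unit_column] in policy.allowed_units
    ]


# Made-up table with one sensitive column (tax_id).
table = [
    {"unit": "sales", "customer": "Acme", "revenue": 100, "tax_id": "X-1"},
    {"unit": "support", "customer": "Bolt", "revenue": 50, "tax_id": "X-2"},
    {"unit": "sales", "customer": "Core", "revenue": 75, "tax_id": "X-3"},
]

sales_policy = AccessPolicy(
    allowed_units=frozenset({"sales"}),
    visible_columns=frozenset({"unit", "customer", "revenue"}),
)

visible = apply_policy(table, sales_policy)
for row in visible:
    print(row)
```

A file- or bucket-level permission could only grant or deny the whole table; the policy above returns two of the three rows and hides the sensitive column from all of them.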


As a result one can choose between two solutions:


  1. Putting the data 'inside' a database solution so it cannot get accessed directly. This is what most Cloud-specific (and classical) on-premise database solutions do. However, in a Big Data context, this not only inflates run costs, but it also means there is no way to get data out except through the database engine. This makes integration possibilities limited and significantly increases the difficulty of ever leaving the solution behind.
  2. Using open formats for the data, letting it live in a Cloud storage with proper and detailed security policies in place. An open solution such as the Cloudera Data Platform can facilitate this.


When working with structured data (tables that can be queried with SQL), it can be very tempting to put the data in a database and assume it will always be extracted using the engine. However, in the Big Data world, the load that data processing solutions put on these engines is often so large that direct data access is preferred. Rather than sending a query to the engine, the solution reads the files directly from the (Cloud) storage layer, gaining much speed and cost-efficiency.


In a Big Data world, queries should NOT always need to hit the query engine

Therefore, it is strongly recommended to work with a (Database) solution that can write directly to accessible files on Cloud-native storage.
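
The idea of direct data access can be sketched with nothing but the standard library. Here a temporary directory stands in for the Cloud storage layer and CSV stands in for an open file format (in practice this would more likely be a columnar format); the file name and helper functions are assumptions made for this example.

```python
import csv
import tempfile
from pathlib import Path


def write_open_format(path, rows, fieldnames):
    """Producer side: persist the table as a plain file in an open format."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


def read_directly(path, columns):
    """Consumer side: read the file straight from storage, no query engine involved."""
    with open(path, newline="") as f:
        return [{c: row[c] for c in columns} for row in csv.DictReader(f)]


# The temporary directory is a stand-in for a Cloud storage location.
with tempfile.TemporaryDirectory() as storage:
    table_file = Path(storage) / "orders.csv"  # hypothetical table file
    write_open_format(
        table_file,
        [{"order_id": "1", "amount": "10"}, {"order_id": "2", "amount": "25"}],
        ["order_id", "amount"],
    )
    amounts = read_directly(table_file, ["amount"])

print(amounts)
```

Because the file format is open, any processing tool with access to the storage location can read the data this way, which is exactly what becomes impossible when the data only lives inside a database engine.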


Though far from exhaustive, this hopefully illustrates that when making a (Hybrid) Cloud strategy, it is important to realize that there are some key challenges to overcome when working with Big Data. The Cloudera Data Platform makes things easier from a technology perspective, and this article has hopefully at least identified the points that may require close attention. Of course, do reach out to your Cloudera contact if you have more detailed questions on how to enrich or fulfill the IT strategy of your company.

Last update: 08-18-2021 10:05 PM