Support Questions

prodgers125 · ‎09-17-2016

What is the biggest advantage of using Hadoop instead of SQL Server or ODI when we aren't in a Big Data scenario? Many thanks!

gkeys · ‎09-17-2016

So let's define a Big Data scenario. It is typically defined in terms of the 3 Vs: you are in a Big Data scenario when one or more of the following is true:

Volume: data exists in such large volumes (typically TB or PB) that a traditional relational db cannot store it physically, or cannot store it at a reasonable cost
Variety: in addition to structured data, data is also semi-structured (e.g. tweets) or unstructured (e.g. video)
Velocity: data arrives at extremely high rates, typically as a stream

If none of these is true, we are in the world of traditional data, which is the case your question asks about.

Hadoop still has advantages over SQL Server or ODI in this case, and it will often coexist with them. Advantages of Hadoop are:

Easy to ingest data: Hadoop does not need to know the data structure or schema at ingest time. You dump the data in the lake and structure it when you need to process it. You can structure the same data differently at different times, according to your needs. This is called schema-on-read. Traditional relational databases are schema-on-write: you have to define the schema before you write to the database, and you are stuck with that schema unless you transform the data into something else. These up-front design requirements and commitments make acquiring data slow and reusing it inflexible.
Batch processing: Hadoop processes data in parallel (MapReduce or Spark) and excels at fast, cheap batch processing.
Cheap storage: storing data on Hadoop is much cheaper than storing it in a relational db.
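To make schema-on-read concrete, here is a minimal sketch in plain Python rather than an actual Hadoop job. The raw events, the field names and the two schemas are all made up for illustration. The point is only that the raw lines are stored untouched and each consumer applies its own structure when reading.

```python
import json

# Hypothetical raw events, stored exactly as they arrived.
# No schema was declared when they were written.
RAW_EVENTS = [
    '{"user": "alice", "amount": "19.99", "country": "BR"}',
    '{"user": "bob", "amount": "5.00", "country": "US"}',
    '{"user": "alice", "amount": "7.50", "country": "BR", "coupon": "FALL"}',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema (field name -> converter) at read time.

    Fields missing from a record come back as None instead of failing,
    which is how loosely structured data is usually tolerated on read.
    """
    for line in raw_lines:
        record = json.loads(line)
        yield {
            field: (convert(record[field]) if field in record else None)
            for field, convert in schema.items()
        }


# Two different "views" of the same raw data (both schemas are assumptions).
SALES_SCHEMA = {"user": str, "amount": float}
MARKETING_SCHEMA = {"country": str, "coupon": str}

sales = list(read_with_schema(RAW_EVENTS, SALES_SCHEMA))
marketing = list(read_with_schema(RAW_EVENTS, MARKETING_SCHEMA))

# The sales view can be aggregated without ever touching the coupon field.
total_by_user = {}
for row in sales:
    total_by_user[row["user"]] = total_by_user.get(row["user"], 0.0) + row["amount"]
```

In a schema-on-write system the coupon field in the third event would have required a table change before the row could be loaded. Here it is simply ignored by one reader and picked up by the other.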

Note that the above leads to a common EDW Offloading use case. In a typical Enterprise Data Warehouse, 70% of the data is stored in temporary staging tables, where it waits to be ETL'd into the tables that are actually queried. It is much cheaper to store this staging data in Hadoop. Additionally, the ETL process typically uses 50-60% of the database CPU. This background processing slows down the queries that end users run for reports, Business Intelligence, etc. Organizations that offload the staged data to Hadoop, and the ETL to Hadoop batch processing, can save millions of dollars per year by not paying for expensive storage in the EDW. Additionally, the queries on the EDW become significantly faster.
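As a rough illustration of the offloading idea, the sketch below aggregates staged rows in parallel outside the warehouse and hands over only the small final result. It is plain Python with a thread pool standing in for a Hadoop batch job. The staged rows, the function names and the worker count are all hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical staged rows: (store_id, sale_amount_in_cents).
STAGED_ROWS = [
    ("s1", 1000), ("s2", 250), ("s1", 500),
    ("s3", 700), ("s2", 750), ("s1", 100),
]


def partition(rows, n):
    """Split the staged rows into n roughly equal partitions."""
    return [rows[i::n] for i in range(n)]


def map_partition(rows):
    """Map step: compute partial totals per store for one partition."""
    partial = Counter()
    for store_id, cents in rows:
        partial[store_id] += cents
    return partial


def reduce_partials(partials):
    """Reduce step: merge the partial totals into one result."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return total


def offloaded_etl(rows, workers=3):
    """Run the map step in parallel, then reduce to the final summary."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_partition, partition(rows, workers)))
    return dict(reduce_partials(partials))


# Only this small summary would be loaded into the warehouse.
warehouse_table = offloaded_etl(STAGED_ROWS)
```

The warehouse never sees the six staged rows, only the three summary rows, so neither the staging storage nor the aggregation CPU is paid for there.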

Other advantages to Hadoop in a non Big Data scenario are the following:

Central data store: storing data from various sources on the same platform creates new opportunities to analyze it and deliver business value. For example, it becomes possible to know more about a customer (i.e. achieve a Customer 360 view) and therefore to cross-sell, upsell, market, and recommend in ways that are otherwise not possible or not easy.
Great toolset: Hadoop has excellent tools for working with data, such as Hive, Spark, Zeppelin, HBase, and Phoenix. These tools all come out of the box with the Hortonworks HDP (Hadoop distribution) and are easily installed, managed, and monitored through Ambari, which is also part of the distribution.
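A Customer 360 view is essentially a join of several sources on a shared customer key. The sketch below shows the idea in plain Python. The three sources, their field names and the customer_360 function are invented for illustration, and on a real cluster this would more likely be a Hive or Spark join.

```python
# Hypothetical data from three separate source systems,
# all landed on the same platform.
CRM = [
    {"customer_id": 1, "name": "Ana"},
    {"customer_id": 2, "name": "Raj"},
]
ORDERS = [
    {"customer_id": 1, "total": 120.0},
    {"customer_id": 1, "total": 30.0},
    {"customer_id": 2, "total": 15.0},
]
SUPPORT_TICKETS = [
    {"customer_id": 2, "subject": "Late delivery"},
]


def customer_360(crm, orders, tickets):
    """Combine the sources into one record per customer."""
    view = {
        c["customer_id"]: {"name": c["name"], "order_total": 0.0, "tickets": []}
        for c in crm
    }
    for order in orders:
        if order["customer_id"] in view:  # skip orders with no CRM record
            view[order["customer_id"]]["order_total"] += order["total"]
    for ticket in tickets:
        if ticket["customer_id"] in view:
            view[ticket["customer_id"]]["tickets"].append(ticket["subject"])
    return view


view = customer_360(CRM, ORDERS, SUPPORT_TICKETS)
```

Once the combined record exists, questions like "which high-spending customers have an open complaint" become a simple filter instead of a cross-system project.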

Another advantage of Hadoop in a non Big Data scenario is that you will most likely move into a Big Data scenario later and need Hadoop anyway. Either you will be forced into Big Data by one or more of the 3 Vs above, or you will want new capabilities (like Customer 360) that Hadoop enables, often because your competitors are already doing this and you are falling behind.

I believe these cover the main advantages of using Hadoop even in a non Big Data scenario. I am sure others have more points to add... let's hear them!

prodgers125 · ‎09-17-2016

gkeys, many thanks! This was a fantastic answer and covered all of my doubts! 😄 😄
