Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

When to go with ETL on Hive using Tez VS When to go with Spark ETL ?

Solved Go to solution

When to go with ETL on Hive using Tez VS When to go with Spark ETL ?

New Contributor
 
1 ACCEPTED SOLUTION

Accepted Solutions
Highlighted

Re: When to go with ETL on Hive using Tez VS When to go with Spark ETL ?

Rising Star

@revan

Apache Hive Strengths:

The Apache Hive facilitates querying and managing large datasets residing in distributed storage. Built on top of Apache Hadoop, it provides:

  • Tools to enable easy data extract/transform/load (ETL)
  • A mechanism to impose structure on a variety of data formats
  • Access to files stored either directly in Apache HDFS or in other data storage systems such as Apache HBase Query execution via MapReduce
  • Hive defines a simple SQL-like query language, called QL, that enables users familiar with SQL to query the data. At the same time, this language also allows programmers who are familiar with the MapReduce framework to be able to plug in their custom mappers and reducers to perform more sophisticated analysis that may not be supported by the built-in capabilities of the language.
  • QL can also be extended with custom scalar functions (UDF's), aggregations (UDAF's), and table functions (UDTF's).
  • Indexing to provide acceleration, index type including compaction and Bitmap index as of 0.10.
  • Different storage types such as plain text, RCFile, HBase, ORC, and others.
  • Metadata storage in an RDBMS, significantly reducing the time to perform semantic checks during query execution.
  • Operating on compressed data stored into the Hadoop ecosystem using algorithms including DEFLATE, BWT, snappy, etc.
  • Built-in user defined functions (UDFs) to manipulate dates, strings, and other data-mining tools. Hive supports extending the UDF set to handle use-cases not supported by built-in functions.
  • SQL-like queries (HiveQL), which are implicitly converted into MapReduce, or Spark jobs.

Apache Spark Strengths:

Spark SQL has multiple interesting features:

  • it supports multiple file formats such as Parquet, Avro, Text, JSON, ORC
  • it supports data stored in HDFS, Apache HBase, Cassandra and Amazon S3
  • it supports classical Hadoop codecs such as snappy, lzo, gzip
  • it provides security through authentification via the use of a "shared secret" (spark.authenticate=true on YARN, or spark.authenticate.secret on all nodes if not YARN)
  • encryption, Spark supports SSL for Akka and HTTP protocols
  • it supports UDFs
  • it supports concurrent queries and manages the allocation of memory to the jobs (it is possible to specify the storage of RDD like in-memory only, disk only or memory and disk
  • it supports caching data in memory using a SchemaRDD columnar format (cacheTable(““))exposing ByteBuffer, it can also use memory-only caching exposing User object
  • it supports nested structures

When to use Spark or Hive-

  • Hive is still a great choice when low latency/multiuser support is not a requirement, such as for batch processing/ETL. Hive-on-Spark will narrow the time windows needed for such processing, but not to an extent that makes Hive suitable for BI
  • Spark SQL, lets Spark users selectively use SQL constructs when writing Spark pipelines. It is not intended to be a general-purpose SQL layer for interactive/exploratory analysis. However, Spark SQL reuses the Hive frontend and metastore, giving you full compatibility with existing Hive data, queries, and UDFs. Spark SQL includes a cost-based optimizer, columnar storage and code generation to make queries fast. At the same time, it scales to thousands of nodes and multi hour queries using the Spark engine, which provides full mid-query fault tolerance. The performance is biggest advantage of Spark SQL.
2 REPLIES 2
Highlighted

Re: When to go with ETL on Hive using Tez VS When to go with Spark ETL ?

Rising Star

@revan

Apache Hive Strengths:

The Apache Hive facilitates querying and managing large datasets residing in distributed storage. Built on top of Apache Hadoop, it provides:

  • Tools to enable easy data extract/transform/load (ETL)
  • A mechanism to impose structure on a variety of data formats
  • Access to files stored either directly in Apache HDFS or in other data storage systems such as Apache HBase Query execution via MapReduce
  • Hive defines a simple SQL-like query language, called QL, that enables users familiar with SQL to query the data. At the same time, this language also allows programmers who are familiar with the MapReduce framework to be able to plug in their custom mappers and reducers to perform more sophisticated analysis that may not be supported by the built-in capabilities of the language.
  • QL can also be extended with custom scalar functions (UDF's), aggregations (UDAF's), and table functions (UDTF's).
  • Indexing to provide acceleration, index type including compaction and Bitmap index as of 0.10.
  • Different storage types such as plain text, RCFile, HBase, ORC, and others.
  • Metadata storage in an RDBMS, significantly reducing the time to perform semantic checks during query execution.
  • Operating on compressed data stored into the Hadoop ecosystem using algorithms including DEFLATE, BWT, snappy, etc.
  • Built-in user defined functions (UDFs) to manipulate dates, strings, and other data-mining tools. Hive supports extending the UDF set to handle use-cases not supported by built-in functions.
  • SQL-like queries (HiveQL), which are implicitly converted into MapReduce, or Spark jobs.

Apache Spark Strengths:

Spark SQL has multiple interesting features:

  • it supports multiple file formats such as Parquet, Avro, Text, JSON, ORC
  • it supports data stored in HDFS, Apache HBase, Cassandra and Amazon S3
  • it supports classical Hadoop codecs such as snappy, lzo, gzip
  • it provides security through authentification via the use of a "shared secret" (spark.authenticate=true on YARN, or spark.authenticate.secret on all nodes if not YARN)
  • encryption, Spark supports SSL for Akka and HTTP protocols
  • it supports UDFs
  • it supports concurrent queries and manages the allocation of memory to the jobs (it is possible to specify the storage of RDD like in-memory only, disk only or memory and disk
  • it supports caching data in memory using a SchemaRDD columnar format (cacheTable(““))exposing ByteBuffer, it can also use memory-only caching exposing User object
  • it supports nested structures

When to use Spark or Hive-

  • Hive is still a great choice when low latency/multiuser support is not a requirement, such as for batch processing/ETL. Hive-on-Spark will narrow the time windows needed for such processing, but not to an extent that makes Hive suitable for BI
  • Spark SQL, lets Spark users selectively use SQL constructs when writing Spark pipelines. It is not intended to be a general-purpose SQL layer for interactive/exploratory analysis. However, Spark SQL reuses the Hive frontend and metastore, giving you full compatibility with existing Hive data, queries, and UDFs. Spark SQL includes a cost-based optimizer, columnar storage and code generation to make queries fast. At the same time, it scales to thousands of nodes and multi hour queries using the Spark engine, which provides full mid-query fault tolerance. The performance is biggest advantage of Spark SQL.

Re: When to go with ETL on Hive using Tez VS When to go with Spark ETL ?

Expert Contributor

I'd say whenever you need some Spark specific features like ML, GraphX or Streaming - use spark as ETL engine since it provides All-in-one solution for most usecases.

If you have no such requirements - use Hive on TEZ

If you have no TEZ - use Hive on MR

In any case Hive acts just like a metastore..