<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: When to go with ETL on Hive using Tez  VS When to go with Spark ETL ? in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/When-to-go-with-ETL-on-Hive-using-Tez-VS-When-to-go-with/m-p/146769#M109324</link>
    <description>&lt;P&gt;I'd say that whenever you need Spark-specific features such as ML, GraphX, or Streaming, use Spark as the ETL engine, since it provides an all-in-one solution for most use cases.&lt;/P&gt;&lt;P&gt;If you have no such requirements, use Hive on Tez.&lt;/P&gt;&lt;P&gt;If Tez is not available, use Hive on MapReduce.&lt;/P&gt;&lt;P&gt;In any case, Hive still acts as the metastore.&lt;/P&gt;</description>
    <pubDate>Mon, 20 Jun 2016 23:15:12 GMT</pubDate>
    <dc:creator>bluesmix</dc:creator>
    <dc:date>2016-06-20T23:15:12Z</dc:date>
    <item>
      <title>When to go with ETL on Hive using Tez  VS When to go with Spark ETL ?</title>
      <link>https://community.cloudera.com/t5/Support-Questions/When-to-go-with-ETL-on-Hive-using-Tez-VS-When-to-go-with/m-p/146767#M109322</link>
      <description />
      <pubDate>Mon, 20 Jun 2016 14:54:17 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/When-to-go-with-ETL-on-Hive-using-Tez-VS-When-to-go-with/m-p/146767#M109322</guid>
      <dc:creator>wabale_revan</dc:creator>
      <dc:date>2016-06-20T14:54:17Z</dc:date>
    </item>
    <item>
      <title>Re: When to go with ETL on Hive using Tez  VS When to go with Spark ETL ?</title>
      <link>https://community.cloudera.com/t5/Support-Questions/When-to-go-with-ETL-on-Hive-using-Tez-VS-When-to-go-with/m-p/146768#M109323</link>
      <description>&lt;P&gt;@revan&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Apache Hive Strengths:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;Apache Hive facilitates querying and managing large datasets residing in distributed storage. Built on top of Apache Hadoop, it provides:&lt;/P&gt;&lt;UL&gt;
&lt;LI&gt;Tools to enable easy data extract/transform/load (ETL)&lt;/LI&gt;
&lt;LI&gt;A mechanism to impose structure on a variety of data formats&lt;/LI&gt;
&lt;LI&gt;Access to files stored either directly in Apache HDFS or in other data storage systems such as Apache HBase&lt;/LI&gt;
&lt;LI&gt;Query execution via MapReduce&lt;/LI&gt;
&lt;LI&gt;A simple SQL-like query language, called QL, that enables users familiar with SQL to query the data. The language also allows programmers familiar with the MapReduce framework to plug in custom mappers and reducers for more sophisticated analysis than the built-in capabilities of the language support.&lt;/LI&gt;
&lt;LI&gt;QL can also be extended with custom scalar functions (UDFs), aggregations (UDAFs), and table functions (UDTFs).&lt;/LI&gt;
&lt;LI&gt;Indexing for acceleration; index types include compaction and bitmap indexes as of Hive 0.10.&lt;/LI&gt;
&lt;LI&gt;Different storage formats such as plain text, RCFile, HBase, ORC, and others.&lt;/LI&gt;
&lt;LI&gt;Metadata storage in an RDBMS, significantly reducing the time needed to perform semantic checks during query execution.&lt;/LI&gt;
&lt;LI&gt;Operation on compressed data stored in the Hadoop ecosystem, using algorithms including DEFLATE, BWT, Snappy, etc.&lt;/LI&gt;
&lt;LI&gt;Built-in user-defined functions (UDFs) to manipulate dates, strings, and other data-mining tools; Hive supports extending the UDF set to handle use cases not covered by the built-in functions.&lt;/LI&gt;
&lt;LI&gt;SQL-like queries (HiveQL), which are implicitly converted into MapReduce, Tez, or Spark jobs.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;STRONG&gt;Apache Spark Strengths:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;Spark SQL has several interesting features:&lt;/P&gt;&lt;UL&gt;
&lt;LI&gt;It supports multiple file formats such as Parquet, Avro, text, JSON, and ORC&lt;/LI&gt;
&lt;LI&gt;It supports data stored in HDFS, Apache HBase, Cassandra, and Amazon S3&lt;/LI&gt;
&lt;LI&gt;It supports classic Hadoop codecs such as Snappy, LZO, and gzip&lt;/LI&gt;
&lt;LI&gt;It provides security through authentication via a "shared secret" (spark.authenticate=true on YARN, or spark.authenticate.secret set on all nodes if not on YARN)&lt;/LI&gt;
&lt;LI&gt;For encryption, Spark supports SSL for the Akka and HTTP protocols&lt;/LI&gt;
&lt;LI&gt;It supports UDFs&lt;/LI&gt;
&lt;LI&gt;It supports concurrent queries and manages the allocation of memory to jobs (it is possible to specify the storage level of an RDD, such as in-memory only, disk only, or memory and disk)&lt;/LI&gt;
&lt;LI&gt;It supports caching data in memory using a SchemaRDD columnar format (cacheTable("")) exposing ByteBuffer; it can also use memory-only caching exposing the user object&lt;/LI&gt;
&lt;LI&gt;It supports nested structures&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;STRONG&gt;When to use Spark or Hive:&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;
&lt;LI&gt;Hive is still a great choice when low latency/multi-user support is not a requirement, such as for batch processing/ETL. Hive-on-Spark will narrow the time windows needed for such processing, but not to an extent that makes Hive suitable for BI.&lt;/LI&gt;
&lt;LI&gt;Spark SQL lets Spark users selectively use SQL constructs when writing Spark pipelines. It is not intended to be a general-purpose SQL layer for interactive/exploratory analysis. However, Spark SQL reuses the Hive frontend and metastore, giving you full compatibility with existing Hive data, queries, and UDFs. Spark SQL includes a cost-based optimizer, columnar storage, and code generation to make queries fast. At the same time, it scales to thousands of nodes and multi-hour queries using the Spark engine, which provides full mid-query fault tolerance. Performance is the biggest advantage of Spark SQL.&lt;/LI&gt;&lt;/UL&gt;</description>
      <pubDate>Mon, 20 Jun 2016 23:00:42 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/When-to-go-with-ETL-on-Hive-using-Tez-VS-When-to-go-with/m-p/146768#M109323</guid>
      <dc:creator>GeeKay2015</dc:creator>
      <dc:date>2016-06-20T23:00:42Z</dc:date>
    </item>
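The reply above notes that both HiveQL and Spark SQL can be extended with user-defined functions (UDFs). A minimal sketch of registering a Python function as a Spark SQL UDF is shown below; the function name `normalize_region` and its mapping are illustrative assumptions, not part of the thread, and the Spark portion is skipped gracefully when `pyspark` is not installed.

```python
def normalize_region(code):
    # Plain Python logic; the mapping here is a made-up example.
    return {"us": "NORTH_AMERICA", "de": "EMEA"}.get(code, "UNKNOWN")

def main():
    try:
        from pyspark.sql import SparkSession
        from pyspark.sql.types import StringType
    except ImportError:
        print("pyspark not installed; skipping the Spark demo")
        return
    spark = (SparkSession.builder
             .master("local[1]")
             .appName("udf-sketch")
             .getOrCreate())
    # Register the plain function as a SQL UDF, then call it from SQL,
    # much as a Hive UDF is called from HiveQL.
    spark.udf.register("normalize_region", normalize_region, StringType())
    df = spark.createDataFrame([("us",), ("jp",)], ["code"])
    df.createOrReplaceTempView("t")
    spark.sql("SELECT code, normalize_region(code) AS region FROM t").show()
    spark.stop()

if __name__ == "__main__":
    main()
```

The same Python function works unchanged outside Spark, which is what makes UDFs a convenient way to reuse existing transformation logic in SQL-driven ETL.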
    <item>
      <title>Re: When to go with ETL on Hive using Tez  VS When to go with Spark ETL ?</title>
      <link>https://community.cloudera.com/t5/Support-Questions/When-to-go-with-ETL-on-Hive-using-Tez-VS-When-to-go-with/m-p/146769#M109324</link>
      <description>&lt;P&gt;I'd say that whenever you need Spark-specific features such as ML, GraphX, or Streaming, use Spark as the ETL engine, since it provides an all-in-one solution for most use cases.&lt;/P&gt;&lt;P&gt;If you have no such requirements, use Hive on Tez.&lt;/P&gt;&lt;P&gt;If Tez is not available, use Hive on MapReduce.&lt;/P&gt;&lt;P&gt;In any case, Hive still acts as the metastore.&lt;/P&gt;</description>
      <pubDate>Mon, 20 Jun 2016 23:15:12 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/When-to-go-with-ETL-on-Hive-using-Tez-VS-When-to-go-with/m-p/146769#M109324</guid>
      <dc:creator>bluesmix</dc:creator>
      <dc:date>2016-06-20T23:15:12Z</dc:date>
    </item>
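The rule of thumb in the reply above can be made explicit as a tiny decision helper; the function name and parameters are hypothetical, introduced only to show the branching.

```python
def choose_etl_engine(needs_spark_libs, tez_available):
    """Hypothetical helper encoding the reply's rule of thumb:
    use Spark when Spark-specific libraries (ML, GraphX, Streaming)
    are needed, otherwise Hive on Tez, falling back to Hive on MR."""
    if needs_spark_libs:
        return "Spark"
    if tez_available:
        return "Hive on Tez"
    return "Hive on MR"

print(choose_etl_engine(True, True))    # Spark
print(choose_etl_engine(False, True))   # Hive on Tez
print(choose_etl_engine(False, False))  # Hive on MR
```

Note that in every branch the Hive metastore still provides the table definitions; only the execution engine changes.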
    <item>
      <title>Re: When to go with ETL on Hive using Tez  VS When to go with Spark ETL ?</title>
      <link>https://community.cloudera.com/t5/Support-Questions/When-to-go-with-ETL-on-Hive-using-Tez-VS-When-to-go-with/m-p/300560#M220294</link>
      <description>&lt;P&gt;Apache Hive Strengths:&lt;/P&gt;&lt;P&gt;Apache Hive facilitates querying and managing large datasets residing in distributed storage. Built on top of Apache Hadoop, it provides:&lt;/P&gt;&lt;P&gt;Tools to enable easy data extract/transform/load (ETL)&lt;/P&gt;&lt;P&gt;A mechanism to impose structure on a variety of data formats&lt;/P&gt;&lt;P&gt;Access to files stored either directly in Apache HDFS or in other data storage systems such as Apache HBase, with query execution via MapReduce&lt;/P&gt;&lt;P&gt;Hive defines a simple SQL-like query language, called QL, that enables users familiar with SQL to query the data. The language also allows programmers familiar with the MapReduce framework to plug in custom mappers and reducers for more sophisticated analysis than the built-in capabilities of the language support.&lt;/P&gt;&lt;P&gt;QL can also be extended with custom scalar functions (UDFs), aggregations (UDAFs), and table functions (UDTFs).&lt;/P&gt;&lt;P&gt;Indexing for acceleration; index types include compaction and bitmap indexes as of Hive 0.10.&lt;/P&gt;&lt;P&gt;Different storage formats such as plain text, RCFile, HBase, ORC, and others.&lt;/P&gt;&lt;P&gt;Metadata storage in an RDBMS, significantly reducing the time needed to perform semantic checks during query execution.&lt;/P&gt;&lt;P&gt;Operation on compressed data stored in the Hadoop ecosystem, using algorithms including DEFLATE, BWT, Snappy, etc.&lt;/P&gt;&lt;P&gt;Built-in user-defined functions (UDFs) to manipulate dates, strings, and other data-mining tools; Hive supports extending the UDF set to handle use cases not covered by the built-in functions.&lt;/P&gt;&lt;P&gt;SQL-like queries (HiveQL), which are implicitly converted into MapReduce or Spark jobs.&lt;/P&gt;&lt;P&gt;Apache Spark Strengths:&lt;/P&gt;&lt;P&gt;Spark SQL has several interesting features:&lt;/P&gt;&lt;P&gt;It supports multiple file formats such as Parquet, Avro, text, JSON, and ORC&lt;/P&gt;&lt;P&gt;It supports data stored in HDFS, Apache HBase, Cassandra, and Amazon S3&lt;/P&gt;&lt;P&gt;It supports classic Hadoop codecs such as Snappy, LZO, and gzip&lt;/P&gt;&lt;P&gt;It provides security through authentication via a "shared secret" (spark.authenticate=true on YARN, or spark.authenticate.secret on all nodes if not on YARN)&lt;/P&gt;&lt;P&gt;For encryption, Spark supports SSL for the Akka and HTTP protocols&lt;/P&gt;&lt;P&gt;It supports UDFs&lt;/P&gt;&lt;P&gt;It supports concurrent queries and manages the allocation of memory to jobs (it is possible to specify the storage level of an RDD, such as in-memory only, disk only, or memory and disk)&lt;/P&gt;&lt;P&gt;It supports caching data in memory using a SchemaRDD columnar format (cacheTable("")) exposing ByteBuffer; it can also use memory-only caching exposing the user object&lt;/P&gt;&lt;P&gt;It supports nested structures&lt;/P&gt;&lt;P&gt;When to use Spark or Hive:&lt;/P&gt;&lt;P&gt;Hive is still a great choice when low latency/multi-user support is not a requirement, such as for batch processing/ETL. Hive-on-Spark will narrow the time windows needed for such processing, but not to an extent that makes Hive suitable for BI&lt;/P&gt;&lt;P&gt;Spark SQL lets Spark users selectively use SQL constructs when writing Spark pipelines. It is not intended to be a general-purpose SQL layer for interactive/exploratory analysis. However, Spark SQL reuses the Hive frontend and metastore, giving you full compatibility with existing Hive data, queries, and UDFs. Spark SQL includes a cost-based optimizer, columnar storage, and code generation to make queries fast. At the same time, it scales to thousands of nodes and multi-hour queries using the Spark engine, which provides full mid-query fault tolerance. Performance is the biggest advantage of Spark SQL.&lt;/P&gt;</description>
      <pubDate>Wed, 29 Jul 2020 20:15:16 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/When-to-go-with-ETL-on-Hive-using-Tez-VS-When-to-go-with/m-p/300560#M220294</guid>
      <dc:creator>Henry2410</dc:creator>
      <dc:date>2020-07-29T20:15:16Z</dc:date>
    </item>
  </channel>
</rss>