
Using Spark to divide data into smaller data sets

Rising Star

Hello people, I'm working on a Big Data project and I have multiple data sets.

I have already done some transformations in Hive, and now I want to load the data into an analytical tool (like SAS, MicroStrategy, or another). My idea is to use Spark to apply further transformations, gain more knowledge about the data, and divide it into smaller data sets (clusters), which would then be loaded into the analytical tool. But I'm confused about the advantage of using Spark for that. Thanks!!!

1 ACCEPTED SOLUTION

Master Guru

You would use Spark for:

1 Data Preparation and Aggregation

Data preparation, cleansing, and aggregation (the same kind of work you would do in Pig, Hive, or MapReduce). The easiest approach is to save the aggregated results into Hive tables and access them from your analytical tool.

As an example: keep all transactions in Hive, crunch the daily transactions into an aggregation table, and export that to Tableau.
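A minimal sketch of that flow with the Spark 1.x DataFrame API, assuming a spark-shell or Zeppelin session where sc is already defined; the table and column names (transactions, tx_date, amount, daily_transactions) are made up:

```scala
import org.apache.spark.sql.hive.HiveContext

// sc (SparkContext) is provided by spark-shell / Zeppelin
val hc = new HiveContext(sc)

// read the raw transactions from the Hive metastore
val tx = hc.table("transactions")

// crunch daily transactions into a per-day aggregate
val daily = tx.groupBy("tx_date").agg(Map("amount" -> "sum"))

// save the aggregate back into Hive, where the BI tool can read it
daily.write.saveAsTable("daily_transactions")
```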

2 Advanced Analytics

Spark provides advanced analytics libraries such as MLlib for machine learning and GraphX for graph processing.
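For example, dividing the data into clusters can be done with MLlib's KMeans. A rough sketch, again in a shell session with sc defined; the table and column names are made up, and the columns are assumed to be numeric:

```scala
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val hc = new HiveContext(sc)

// build a feature vector per customer from two hypothetical numeric columns
val features = hc.table("customers")
  .select("age", "total_spend")
  .map(row => Vectors.dense(row.getDouble(0), row.getDouble(1)))
  .cache()

// train k-means with k = 5 clusters and up to 20 iterations
val model = KMeans.train(features, 5, 20)

// assign each record to one of the 5 clusters (RDD of cluster ids)
val clusterIds = model.predict(features)
```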

3 Analytics on all data

Advanced analysts can use Spark directly, for example from Zeppelin, to run queries on the full data set. That may not be as comfortable as the usual tools from point 1, but you are not limited to a pre-aggregated extract.
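For instance, from a Zeppelin notebook you can fire ad-hoc SQL at the full data set (sqlContext is pre-defined there and Hive-aware on HDP; the table and column names below are made up):

```scala
// ad-hoc query over all transactions, not just an aggregated extract
val top = sqlContext.sql("""
  SELECT customer_id, SUM(amount) AS total
  FROM transactions
  GROUP BY customer_id
  ORDER BY total DESC
  LIMIT 100
""")
top.show()
```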

4 Streaming

Spark Streaming lets you process data continuously as it arrives, in small batches, rather than only in bulk.
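A minimal Spark Streaming sketch, assuming a shell session where sc is defined; the socket source, host, and port are placeholders:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// process incoming data in 10-second micro-batches
val ssc = new StreamingContext(sc, Seconds(10))

// placeholder source: text lines arriving on a TCP socket
val lines = ssc.socketTextStream("localhost", 9999)

// running word count per batch
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()
ssc.awaitTermination()
```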


3 REPLIES


Rising Star

Hi Benjamin,

Many thanks for your response. Your first point was very important, because I was planning to do all the data preparation and aggregation in Hive and only afterwards use Spark for further transformations and aggregations. So you are telling me to:

1) load the files from HDFS into Hive;

2) have Spark read from the tables created in Hive;

3) export the final data back to Hive (in its final format).

Is that what you're saying?

Master Guru

It's Hadoop: you can do whatever you want and whatever you are most comfortable with. Functionally, Spark, Pig, and Hive are equivalent, and performance is also very close (if you use Tez for Pig and Hive). Complex queries will do much better in Hive, while transformations that require many distinct steps with a lot of data kept in memory are a strong suit of Spark.

But all in all it depends more on what you are comfortable with and what kind of data prep you want to do. Lots of people know SQL, so they should use Hive. Lots of people like Pig because it is well integrated with Oozie and very mature, and it is also really easy to write UDFs for it. Spark is a bit less stable and mature, but it has a ton of add-ons, and you can rapidly program functions in Scala or Python if you are so inclined. All of them can read and write Hive tables as well as unstructured files (Hive being better at the former, Pig/Spark better at the latter). Choose your poison.
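For illustration, here are both I/O paths side by side in Spark, assuming a shell session with sc defined; the HDFS path, delimiter, and table names are made up:

```scala
import org.apache.spark.sql.hive.HiveContext

val hc = new HiveContext(sc)
import hc.implicits._

// structured path: read straight from the Hive metastore
val fromHive = hc.table("transactions")

// unstructured path: read raw text files from HDFS (placeholder path)
val rawLines = sc.textFile("hdfs:///data/raw/logs")

// parse tab-separated lines into a DataFrame and persist it as a Hive table
val parsed = rawLines.map(_.split("\t")).map(a => (a(0), a(1))).toDF("key", "value")
parsed.write.saveAsTable("parsed_logs")
```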