
Using Spark to divide data into smaller data sets

Rising Star

Hello people, I'm working on a Big Data project and I have multiple data sets.

I have already done some transformations in Hive, and now I want to load the data into an analytical tool (like SAS, MicroStrategy, or another). My idea is to use Spark to apply further transformations, gain more knowledge about the data, and divide it into smaller data sets (clusters), which would then be loaded into the analytical tool. But I'm confused about the advantage of using Spark for that. Thanks!!!

1 ACCEPTED SOLUTION

Master Guru

You would use Spark for:

1 Data Preparation and Aggregation

Data preparation, cleansing, and aggregation (the same kind of work you would do in Pig, Hive, or MapReduce). The easiest approach is to save the aggregated results into Hive tables and access them from your analytical tool.

As an example: keep all transactions in Hive, crunch the daily transactions into an aggregation table, and export that to Tableau.
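A minimal sketch of that flow with the Spark 1.x DataFrame API, assuming a spark-shell or Zeppelin session where sc is already defined; the table and column names (transactions, tx_date, amount, daily_transactions) are made up:

```scala
import org.apache.spark.sql.hive.HiveContext

// sc (SparkContext) is provided by spark-shell / Zeppelin
val hc = new HiveContext(sc)

// read the raw transactions from the Hive metastore
val tx = hc.table("transactions")

// crunch daily transactions into a per-day aggregate
val daily = tx.groupBy("tx_date").agg(Map("amount" -> "sum"))

// save the aggregate back into Hive, where the BI tool can read it
daily.write.saveAsTable("daily_transactions")
```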

2 Advanced Analytics

Spark provides advanced analytics libraries such as MLlib for machine learning and GraphX for graph processing.
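For example, dividing the data into clusters can be done with MLlib's KMeans. A rough sketch, again in a shell session with sc defined; the table and column names are made up, and the columns are assumed to be numeric:

```scala
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val hc = new HiveContext(sc)

// build a feature vector per customer from two hypothetical numeric columns
val features = hc.table("customers")
  .select("age", "total_spend")
  .map(row => Vectors.dense(row.getDouble(0), row.getDouble(1)))
  .cache()

// train k-means with k = 5 clusters and up to 20 iterations
val model = KMeans.train(features, 5, 20)

// assign each record to one of the 5 clusters (RDD of cluster ids)
val clusterIds = model.predict(features)
```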

3 Analytics on all data

Advanced analysts can use Spark directly, for example from Zeppelin, to run queries on the full data set. That may not be as comfortable as the usual tools from point 1, but you are not limited to a pre-aggregated extract.
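For instance, from a Zeppelin notebook you can fire ad-hoc SQL at the full data set (sqlContext is pre-defined there and Hive-aware on HDP; the table and column names below are made up):

```scala
// ad-hoc query over all transactions, not just an aggregated extract
val top = sqlContext.sql("""
  SELECT customer_id, SUM(amount) AS total
  FROM transactions
  GROUP BY customer_id
  ORDER BY total DESC
  LIMIT 100
""")
top.show()
```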

4 Streaming

Spark Streaming lets you process data continuously as it arrives, in small batches, rather than only in bulk.
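A minimal Spark Streaming sketch, assuming a shell session where sc is defined; the socket source, host, and port are placeholders:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// process incoming data in 10-second micro-batches
val ssc = new StreamingContext(sc, Seconds(10))

// placeholder source: text lines arriving on a TCP socket
val lines = ssc.socketTextStream("localhost", 9999)

// running word count per batch
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()
ssc.awaitTermination()
```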


3 REPLIES


Rising Star

Hi Benjamin,

Many thanks for your response. Your first point was very important, because I was planning to do all the data preparation and aggregation in Hive and only afterwards use Spark for further transformations and aggregations. So you are telling me to:

1) load the files from HDFS into Hive;

2) have Spark read from the tables created in Hive;

3) export the final data back to Hive (in its final format).

Is that what you're saying?

Master Guru

It's Hadoop: you can do whatever you want and whatever you are most comfortable with. Functionally, Spark, Pig, and Hive are equivalent, and performance is also very close (if you use Tez for Pig and Hive). Complex queries will do much better in Hive, while transformations that require many distinct steps with a lot of data kept in memory are a strong suit of Spark.

But all in all it depends more on what you are comfortable with and what kind of data prep you want to do. Lots of people know SQL, so they should use Hive. Lots of people like Pig because it is well integrated with Oozie and very mature, and it is also really easy to write UDFs for it. Spark is a bit less stable and mature, but it has a ton of add-ons, and you can rapidly program functions in Scala or Python if you are so inclined. All of them can read and write Hive tables as well as unstructured files (Hive being better at the former, Pig/Spark better at the latter). Choose your poison.
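For illustration, here are both I/O paths side by side in Spark, assuming a shell session with sc defined; the HDFS path, delimiter, and table names are made up:

```scala
import org.apache.spark.sql.hive.HiveContext

val hc = new HiveContext(sc)
import hc.implicits._

// structured path: read straight from the Hive metastore
val fromHive = hc.table("transactions")

// unstructured path: read raw text files from HDFS (placeholder path)
val rawLines = sc.textFile("hdfs:///data/raw/logs")

// parse tab-separated lines into a DataFrame and persist it as a Hive table
val parsed = rawLines.map(_.split("\t")).map(a => (a(0), a(1))).toDF("key", "value")
parsed.write.saveAsTable("parsed_logs")
```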