Spark and Structured Data

Rising Star

Does it make sense to use Spark to divide a structured data set (I know the schema of my data) into clusters? I'm asking because I don't know whether there is any advantage in using Python instead of SQL (Hive) to divide the data into clusters.

1 ACCEPTED SOLUTION

Master Guru

What do you mean by clusters? A data mining clustering or segmentation algorithm? In that case you have several options, but Spark ML is definitely a strong contender.

http://spark.apache.org/docs/latest/ml-clustering.html


3 REPLIES

Rising Star

Hi Benjamin, yes, I'm talking about data mining clustering. So in your opinion, even if I know the schema, Spark is an excellent choice for that?

Master Guru

There is no clustering algorithm in Hive. I think Spark greatly oversells its story as an "unstructured" data analytics tool. To run a clustering algorithm you always need a schema, and you need to create one in Spark as well before you can run a clustering model. You can use Spark to read directly from a Hive/ORC table, for example.
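To illustrate why a schema is always needed, here is a minimal plain-Python sketch (no Spark involved). It assumes k-means as the clustering algorithm, which is only one possible choice, and the field names `age` and `monthly_spend` as well as the sample records are made up for the example. The point is that every record has to be turned into a fixed-length numeric vector before any distance can be computed, and that mapping is the schema.

```python
import random

# Hypothetical schema: the numeric fields every record must provide.
FEATURE_FIELDS = ("age", "monthly_spend")


def to_vector(record):
    """Turn a record (dict) into a fixed-length numeric vector using the schema."""
    return tuple(float(record[field]) for field in FEATURE_FIELDS)


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20, seed=0):
    """Plain Lloyd-style k-means; returns one cluster index per input vector."""
    rng = random.Random(seed)
    centers = rng.sample(vectors, k)  # pick k distinct input points as start centers
    assignment = []
    for _ in range(iterations):
        # Assign every vector to its nearest center.
        assignment = [
            min(range(k), key=lambda c: squared_distance(v, centers[c]))
            for v in vectors
        ]
        # Move every center to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment


# Made-up sample data with two obvious groups.
records = [
    {"age": 21, "monthly_spend": 40},
    {"age": 23, "monthly_spend": 45},
    {"age": 22, "monthly_spend": 50},
    {"age": 61, "monthly_spend": 300},
    {"age": 64, "monthly_spend": 310},
    {"age": 59, "monthly_spend": 290},
]
labels = kmeans([to_vector(r) for r in records], k=2)
```

A record that lacks one of the schema fields fails in `to_vector` with a `KeyError`, which is the small-scale version of the same requirement in Spark or any other framework.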

Frameworks with data mining algorithms in the Hadoop ecosystem:

SparkML (the cool kid on the block, and a lot of the algorithms are parallelized)

SparkR: a lot of data prep functions get pushed down to Spark, and you have the full power of R and can work with RStudio

R MapReduce frameworks (RMR ...): if you don't like Spark

...

Mahout (a bit out of vogue, I wouldn't use it)

And many more (like running Python MapReduce streaming ...)

If you ask for my opinion, I would put the tables in Hive (ORC) and use SparkML for the clustering. It simply has a lot of momentum behind it, and you can use Python or Scala (use Scala).
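A rough sketch of that setup in PySpark could look like the following. Python is used only because the question mentions it; the same classes exist in the Scala API. Everything specific here is an assumption for illustration: the helper names `numeric_columns` and `cluster_table`, the `id` column, the choice of k-means, and the rule of using every numeric column as a feature. The list of type names is deliberately short and not exhaustive.

```python
# Type names as they appear in DataFrame.dtypes for common numeric columns.
# Assumption: only these are treated as features; extend as needed.
NUMERIC_TYPES = {"tinyint", "smallint", "int", "bigint", "float", "double"}


def numeric_columns(dtypes, exclude=()):
    """Pick feature columns from (name, type) pairs such as DataFrame.dtypes."""
    return [
        name
        for name, dtype in dtypes
        if dtype in NUMERIC_TYPES and name not in exclude
    ]


def cluster_table(spark, table_name, k=3, id_column="id"):
    """Read a Hive table and add a 'cluster' column computed by Spark ML k-means.

    `spark` is expected to be a SparkSession created with Hive support.
    """
    # Imported here so the helper above can be used without Spark installed.
    from pyspark.ml.clustering import KMeans
    from pyspark.ml.feature import VectorAssembler

    df = spark.table(table_name)  # the schema comes from the Hive metastore
    features = numeric_columns(df.dtypes, exclude=(id_column,))

    # Spark ML expects all features in one vector column.
    assembler = VectorAssembler(inputCols=features, outputCol="features")
    assembled = assembler.transform(df)

    kmeans = KMeans(k=k, seed=1, featuresCol="features", predictionCol="cluster")
    model = kmeans.fit(assembled)
    return model.transform(assembled)
```

With a session built via `SparkSession.builder.enableHiveSupport().getOrCreate()` and a hypothetical table called `customers`, the call would be `cluster_table(spark, "customers", k=4)`. Rows with nulls in a feature column make `VectorAssembler` fail with its default settings, so those would have to be filtered or filled first, and in practice the features usually need scaling before k-means gives sensible clusters.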

If you know R better, something like SparkR might be the way to go.