Spark and Structured Data

Rising Star

Does it make sense to use Spark to divide a structured dataset (I know the schema of my data) into clusters? I ask because I don't know whether there is any advantage in using Python instead of SQL (Hive) to divide the data into clusters.

1 ACCEPTED SOLUTION

Master Guru

What do you mean by clusters? A data mining clustering or segmentation algorithm? In that case you have several options, but Spark ML is definitely a strong contender.

http://spark.apache.org/docs/latest/ml-clustering.html

3 REPLIES

Rising Star

Hi Benjamin, yes, I'm talking about data mining clustering. So, in your opinion, even if I know the schema, Spark is an excellent choice for this?

Master Guru

There is no clustering algorithm in Hive. I think Spark is greatly overselling its story as "unstructured" data analytics. To run a clustering algorithm you always need a schema, and you need to create one in Spark as well before you can run a clustering model. You can use Spark to read directly from a Hive/ORC table, for example.

Frameworks with data mining algorithms in the Hadoop ecosystem:

SparkML (the cool kid on the block, and a lot of the algorithms are parallelized)

SparkR: a lot of data prep functions get pushed down to Spark, and you have the full power of R and can work with RStudio

R MapReduce frameworks (RMR ...): if you don't like Spark

...

Mahout (a bit out of vogue, I wouldn't use it)

And many more (like running Python MapReduce streaming ...)

If you ask for an opinion, I would put the tables in Hive (ORC) and use SparkML for the clustering. It has a lot of momentum behind it, and you can use Python or Scala (use Scala).

If you know R better, something like SparkR might be the way to go.
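
To make the Hive (ORC) plus Spark ML route a bit more concrete, here is a minimal sketch in Python, since the question mentions it (the Scala API uses the same class names). It assumes a Spark installation with Hive support and that k-means is the clustering algorithm you want; the function names, the choice to scale the features, and the table and column names in the usage note are all hypothetical and not from this thread. The point it illustrates is the one made above: the schema comes from the Hive table, and the numeric columns of that schema become the feature vector.

```python
# Hive type names as they appear in DataFrame.dtypes
NUMERIC_TYPES = ("tinyint", "smallint", "int", "bigint", "float", "double")


def numeric_columns(dtypes, exclude=()):
    """Pick numeric column names from (name, type) pairs such as DataFrame.dtypes."""
    return [
        name
        for name, dtype in dtypes
        if name not in exclude
        and (dtype in NUMERIC_TYPES or dtype.startswith("decimal"))
    ]


def cluster_hive_table(table, k=3, id_col=None):
    """Hypothetical helper: run k-means on the numeric columns of a Hive table."""
    # Imported here so numeric_columns() can be used without a Spark install.
    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.clustering import KMeans
    from pyspark.ml.feature import StandardScaler, VectorAssembler

    spark = (
        SparkSession.builder
        .appName("hive-kmeans-sketch")
        .enableHiveSupport()  # lets spark.table() resolve Hive tables
        .getOrCreate()
    )

    df = spark.table(table)  # the schema is taken from the Hive metastore
    features = numeric_columns(df.dtypes, exclude=(id_col,) if id_col else ())
    df = df.na.drop(subset=features)  # assumption: rows with nulls are dropped

    pipeline = Pipeline(stages=[
        # Spark ML models expect a single vector column of features
        VectorAssembler(inputCols=features, outputCol="raw_features"),
        # Scaling is an assumption here; k-means is sensitive to column ranges
        StandardScaler(inputCol="raw_features", outputCol="features"),
        KMeans(k=k, seed=42, featuresCol="features", predictionCol="cluster"),
    ])
    # Returns the input rows with an added integer "cluster" column
    return pipeline.fit(df).transform(df)
```

A call such as `cluster_hive_table("sales.customers", k=5, id_col="customer_id")` (made-up table and column names) would return a DataFrame that you could write back to Hive or query with SQL. The right value of `k` and the choice of feature columns depend on your data.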