Spark and Structured Data

Stewart12586 — Fri, 10 Jun 2016 00:24:09 GMT

Does it make sense to use Spark to divide a structured model (I know the schema of my data) into clusters? I ask because I don't know whether there is any advantage in using Python instead of SQL (Hive) to divide the data into clusters.

Re: Spark and Structured Data

bleonhardi — Fri, 10 Jun 2016 00:39:33 GMT

What do you mean by clusters? A data mining clustering or segmentation algorithm? In that case you have different options, but Spark ML is definitely a strong contender.

http://spark.apache.org/docs/latest/ml-clustering.html

Re: Spark and Structured Data

Stewart12586 — Fri, 10 Jun 2016 00:46:54 GMT

Hi Benjamin, yes, I'm talking about data mining clustering. So, in your opinion, even if I know the schema, is Spark an excellent choice to achieve that?

Re: Spark and Structured Data

bleonhardi — Fri, 10 Jun 2016 01:11:05 GMT

There is no clustering algorithm in Hive. I think Spark is greatly overselling its story as "unstructured" data analytics. To run a clustering algorithm you always need a schema, and you need to create one in Spark as well to run a clustering model. You can use Spark to read directly from a Hive/ORC table, for example.
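To illustrate why a schema is always needed, here is a minimal sketch in plain Python (not Spark, and not any library's implementation): a hand-written k-means (Lloyd's algorithm) that can only run once numeric columns have been picked out by name. The column names, the sample rows, and the helper functions are all hypothetical and exist only for this illustration.

```python
import random

# Hypothetical schema and rows; in practice these would come from a table.
SCHEMA = ("customer_id", "age", "monthly_spend")
ROWS = [
    ("c1", 22.0, 30.0),
    ("c2", 25.0, 35.0),
    ("c3", 24.0, 28.0),
    ("c4", 61.0, 210.0),
    ("c5", 58.0, 190.0),
    ("c6", 63.0, 205.0),
]


def to_features(row, feature_columns):
    """Select numeric columns by name; this is the step that needs the schema."""
    return [float(row[SCHEMA.index(name)]) for name in feature_columns]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm; returns one cluster index per input point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct data points
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest center.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Move each center to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


features = [to_features(row, ["age", "monthly_spend"]) for row in ROWS]
labels = kmeans(features, k=2)
```

The clustering itself never sees `customer_id`; it only works on the numeric vectors, which is why the columns and their types have to be known up front whichever engine runs the algorithm.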

Frameworks with data mining algorithms in the Hadoop ecosystem:

SparkML (the cool kid on the block, and a lot of the algorithms are parallelized)

SparkR: a lot of data prep functions get pushed down to Spark, and you have the full power of R and can work with RStudio

R MapReduce frameworks (RMR ...), if you don't like Spark

...

Mahout (a bit out of vogue; I wouldn't use it)

And many more (like running Python MapReduce streaming ...)

If you ask for my opinion, I would put the tables in Hive (ORC) and use SparkML for the clustering. It just has a lot of momentum behind it, and you can use Python or Scala (use Scala).

If you know R better, something like SparkR might be the way to go.