Created 06-09-2016 05:24 PM
Does it make sense to use Spark to cluster a structured dataset (I know the schema of my data)? My question is whether I would gain any advantage by using Python instead of SQL (Hive) to divide the data into clusters.
Created 06-09-2016 05:39 PM
What do you mean by clusters? A data-mining clustering or segmentation algorithm? In that case you have different options, but Spark ML is definitely a strong contender.
Created 06-09-2016 05:46 PM
Hi Benjamin, yes, I'm talking about data-mining clustering. So, in your opinion, even if I already know the schema, is Spark an excellent choice for that?
Created 06-09-2016 06:11 PM
There is no clustering algorithm in Hive. I think Spark greatly oversells its story as "unstructured" data analytics. To run a clustering algorithm you always need a schema, and you need to define one in Spark as well to run a clustering model. You can use Spark to read directly from a Hive/ORC table, for example.
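As a minimal sketch (assuming Spark 2.x with Hive support enabled; the table name "sales" is just a placeholder), reading a Hive/ORC table into Spark gives you the schema straight from the metastore:

```scala
import org.apache.spark.sql.SparkSession

// Hive support lets Spark see tables registered in the Hive metastore.
val spark = SparkSession.builder()
  .appName("hive-read-example")
  .enableHiveSupport()
  .getOrCreate()

// "sales" is a hypothetical table; the schema is picked up automatically.
val df = spark.table("sales")
df.printSchema()   // confirm the columns you want to cluster on
```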
Frameworks with data mining algorithms in the Hadoop ecosystem:
SparkML (the cool kid on the block, and a lot of the algorithms are parallelized)
SparkR: a lot of data prep functions get pushed down to Spark, and you have the full power of R and can work with RStudio
R MapReduce frameworks (RMR ...) if you don't like Spark
...
Mahout (a bit out of vogue; I wouldn't use it)
And many more (like running Python MapReduce streaming ...)
If you ask for my opinion, I would put the tables in Hive (ORC) and use SparkML for the clustering; see the sketch after this post. It has a lot of momentum behind it, and you can use Python or Scala (use Scala).
If you know R better, something like SparkR might be the way to go.
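To make the SparkML recommendation concrete, here is a minimal sketch (assuming Spark 2.x in Scala; the table name "sales" and the feature columns "amount" and "quantity" are hypothetical placeholders you would replace with your own schema) of k-means clustering over a Hive/ORC table:

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("kmeans-example")
  .enableHiveSupport()
  .getOrCreate()

// Read the structured data directly from the Hive metastore.
val df = spark.table("sales")

// Spark ML expects the features as a single vector column.
val assembler = new VectorAssembler()
  .setInputCols(Array("amount", "quantity"))
  .setOutputCol("features")
val features = assembler.transform(df)

// Fit k-means with an arbitrary k; tune k for your own data.
val kmeans = new KMeans().setK(5).setFeaturesCol("features").setSeed(1L)
val model = kmeans.fit(features)

// Attach a cluster id ("prediction") to every row.
val clustered = model.transform(features)
clustered.select("amount", "quantity", "prediction").show(10)
```

The same pipeline is available from PySpark with essentially the same class names, so the Python-vs-Scala choice is mostly about which language you are more comfortable with.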