Support Questions

Stewart12586 · ‎06-09-2016

Does it make sense to use Spark to divide a structured data set (I know the schema of my data) into clusters? I ask because I don't know whether there is any advantage in using Python instead of SQL (Hive) to divide the data into clusters.

bleonhardi · ‎06-09-2016

What do you mean by clusters? A data mining clustering or segmentation algorithm? In that case you have several options, but Spark ML is definitely a strong contender.

http://spark.apache.org/docs/latest/ml-clustering.html

Stewart12586 · ‎06-09-2016

Hi Benjamin, yes, I'm talking about data mining clustering. So, in your opinion, even though I know the schema, is Spark an excellent choice for that?

bleonhardi · ‎06-09-2016

There is no clustering algorithm in Hive. I think Spark greatly oversells its story as "unstructured" data analytics. To run a clustering algorithm you always need a schema, and you need to create one in Spark as well before you can run a clustering model. You can use Spark to read directly from a Hive/ORC table, for example.
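To illustrate why a schema is always needed, here is a minimal plain-Python sketch (not Spark, and not any library's implementation): the clustering step only works once each row has been turned into a numeric vector, and that requires knowing which named columns are features. The rows, the column names `age` and `income`, and the helper names `to_vector` and `kmeans` are all hypothetical, chosen only for illustration.

```python
from math import dist  # Euclidean distance, Python 3.8+

# Hypothetical rows with a known schema: an id plus two numeric columns.
rows = [
    {"customer_id": 1, "age": 23.0, "income": 21.0},
    {"customer_id": 2, "age": 25.0, "income": 24.0},
    {"customer_id": 3, "age": 61.0, "income": 80.0},
    {"customer_id": 4, "age": 58.0, "income": 75.0},
    {"customer_id": 5, "age": 22.0, "income": 19.0},
    {"customer_id": 6, "age": 63.0, "income": 82.0},
]

# The schema tells us which columns are features; the id is not one of them.
FEATURES = ("age", "income")


def to_vector(row, features=FEATURES):
    """Turn one structured row into a numeric feature vector."""
    return tuple(float(row[name]) for name in features)


def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means with a deterministic start (the first k points)."""
    centroids = list(points[:k])
    assignment = []
    for _ in range(iterations):
        # Assign every point to the index of its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        # Move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assignment, centroids


points = [to_vector(r) for r in rows]
labels, centers = kmeans(points, k=2)
for row, label in zip(rows, labels):
    print(row["customer_id"], label)
```

Spark ML follows the same pattern at scale: the feature columns are named explicitly and assembled into a vector column before the clustering model is fit.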

Frameworks with data mining algorithms in the Hadoop ecosystem:

SparkML (the cool kid on the block, and a lot of the algorithms are parallelized)

SparkR: a lot of data prep functions get pushed down to Spark, and you have the full power of R and can work with RStudio

R MapReduce frameworks (RMR ...): if you don't like Spark

...

Mahout (a bit out of vogue; I wouldn't use it)

And many more (like running Python MapReduce streaming ...)

If you ask for my opinion, I would put the tables in Hive (ORC) and use SparkML for the clustering. It has a lot of momentum behind it, and you can use Python or Scala (use Scala).

If you know R better, something like SparkR might be the way to go.
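As a rough sketch of the Hive (ORC) plus SparkML suggestion, written in Python because the question asked about it (the answer prefers Scala): the code below assumes Spark 2.0 or later with Hive support enabled, and the table name `customers_orc`, the columns `customer_id`, `age`, `income`, `visits`, and the helpers `build_query` and `cluster_hive_table` are all hypothetical. It is one possible arrangement, not a recommended configuration.

```python
# Sketch only: table and column names are hypothetical placeholders.
FEATURE_COLS = ["age", "income", "visits"]  # assumed numeric columns


def build_query(table, feature_cols, id_col="customer_id"):
    """Build a SELECT for the id column plus the numeric feature columns."""
    cols = ", ".join([id_col] + list(feature_cols))
    return f"SELECT {cols} FROM {table}"


def cluster_hive_table(table="customers_orc", k=3, seed=42):
    """Read a Hive table, scale its features, and assign k-means clusters."""
    # Imported here so build_query can be used without Spark installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.clustering import KMeans
    from pyspark.ml.feature import StandardScaler, VectorAssembler
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("hive-kmeans-sketch")
        .enableHiveSupport()  # lets spark.sql() see tables in the Hive metastore
        .getOrCreate()
    )

    # The schema comes from Hive; rows with missing values are dropped.
    df = spark.sql(build_query(table, FEATURE_COLS)).na.drop()

    # Spark ML needs a single vector column, so the schema is made explicit here.
    assembler = VectorAssembler(inputCols=FEATURE_COLS, outputCol="raw_features")
    scaler = StandardScaler(
        inputCol="raw_features", outputCol="features", withMean=True, withStd=True
    )
    kmeans = KMeans(k=k, seed=seed, featuresCol="features", predictionCol="cluster")

    model = Pipeline(stages=[assembler, scaler, kmeans]).fit(df)
    return model.transform(df).select("customer_id", "cluster")
```

Calling `cluster_hive_table().show(10)` from a job submitted with `spark-submit` would print the first ten cluster assignments, provided the assumed table exists. Scaling before k-means is a design choice made here because the assumed columns have very different ranges; it is not something the thread prescribes.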

Cloudera Community

Spark and Structured Data
