<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Spark and Structured Data in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-and-Structured-Data/m-p/131019#M31371</link>
    <description>&lt;P&gt;Does it make sense to use Spark to divide a structured data set (I know the schema of my data) into clusters?

I ask because I don't know whether there is any advantage in using Python instead of SQL (Hive) to divide the data into clusters.&lt;/P&gt;</description>
    <pubDate>Fri, 10 Jun 2016 00:24:09 GMT</pubDate>
    <dc:creator>Stewart12586</dc:creator>
    <dc:date>2016-06-10T00:24:09Z</dc:date>
    <item>
      <title>Spark and Structured Data</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-and-Structured-Data/m-p/131019#M31371</link>
      <description>&lt;P&gt;Does it make sense to use Spark to divide a structured data set (I know the schema of my data) into clusters?

I ask because I don't know whether there is any advantage in using Python instead of SQL (Hive) to divide the data into clusters.&lt;/P&gt;</description>
      <pubDate>Fri, 10 Jun 2016 00:24:09 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-and-Structured-Data/m-p/131019#M31371</guid>
      <dc:creator>Stewart12586</dc:creator>
      <dc:date>2016-06-10T00:24:09Z</dc:date>
    </item>
    <item>
      <title>Re: Spark and Structured Data</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-and-Structured-Data/m-p/131020#M31372</link>
      <description>&lt;P&gt;What do you mean by clusters? A data mining clustering or segmentation algorithm? In that case you have several options, but Spark ML is definitely a strong contender.&lt;/P&gt;&lt;P&gt;&lt;A href="http://spark.apache.org/docs/latest/ml-clustering.html" target="_blank"&gt;http://spark.apache.org/docs/latest/ml-clustering.html&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 10 Jun 2016 00:39:33 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-and-Structured-Data/m-p/131020#M31372</guid>
      <dc:creator>bleonhardi</dc:creator>
      <dc:date>2016-06-10T00:39:33Z</dc:date>
    </item>
    <item>
      <title>Re: Spark and Structured Data</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-and-Structured-Data/m-p/131021#M31373</link>
      <description>&lt;P&gt;Hi Benjamin,

Yes, I'm talking about data mining clustering. So, in your opinion, even if I know the schema, Spark is an excellent choice for that?&lt;/P&gt;</description>
      <pubDate>Fri, 10 Jun 2016 00:46:54 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-and-Structured-Data/m-p/131021#M31373</guid>
      <dc:creator>Stewart12586</dc:creator>
      <dc:date>2016-06-10T00:46:54Z</dc:date>
    </item>
    <item>
      <title>Re: Spark and Structured Data</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-and-Structured-Data/m-p/131022#M31374</link>
      <description>&lt;P&gt;There is no clustering algorithm in Hive. I think Spark greatly oversells its story as an "unstructured" data analytics tool. To run a clustering algorithm you always need a schema, and you need to create one in Spark as well before you can run a clustering model. You can use Spark to read directly from a Hive/ORC table, for example.&lt;/P&gt;&lt;P&gt;Frameworks with data mining algorithms in the Hadoop ecosystem:&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;SparkML&lt;/STRONG&gt; (the cool kid on the block, and a lot of the algorithms are parallelized)&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;SparkR&lt;/STRONG&gt;: a lot of data prep functions get pushed down to Spark, and you have the full power of R and can work with RStudio&lt;/P&gt;&lt;P&gt;R MapReduce frameworks (&lt;STRONG&gt;RMR&lt;/STRONG&gt; ... &lt;span class="lia-unicode-emoji" title=":disappointed_face:"&gt;😞&lt;/span&gt;) if you don't like Spark&lt;/P&gt;&lt;P&gt;...&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Mahout&lt;/STRONG&gt; (a bit out of vogue; I wouldn't use it)&lt;/P&gt;&lt;P&gt;And many more (like running Python MapReduce streaming ...)&lt;/P&gt;&lt;P&gt;If you ask for an opinion, I would put the tables in Hive (ORC) and use SparkML for the clustering. It has a lot of momentum, and you can use Python or Scala (use Scala).&lt;/P&gt;&lt;P&gt;If you know R better, something like SparkR might be the way to go.&lt;/P&gt;</description>
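The suggestion above (keep the tables in Hive/ORC, cluster with Spark ML) might look roughly like the sketch below in Python. It is an illustration only, not code from the thread. It assumes a Spark 2.x or later SparkSession with Hive support (Spark 1.6 would use a HiveContext instead), and the table name, column names, and the helper cluster_hive_table are all hypothetical.

```python
def cluster_hive_table(spark, table_name, feature_cols, k=3, seed=42):
    """Cluster the rows of a Hive table with Spark ML KMeans.

    `spark` is assumed to be a SparkSession built with enableHiveSupport().
    `table_name` and `feature_cols` are placeholders for your own schema.
    """
    feature_cols = list(feature_cols)
    if not feature_cols:
        raise ValueError("feature_cols must name at least one numeric column")

    # Imported lazily so this module can be loaded without Spark installed.
    from pyspark.ml.clustering import KMeans
    from pyspark.ml.feature import VectorAssembler

    # The Hive schema already provides typed columns; drop rows with nulls
    # in the chosen features because VectorAssembler rejects them by default.
    df = spark.table(table_name).select(*feature_cols).na.drop()

    # Spark ML estimators expect a single vector column of features.
    assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
    features = assembler.transform(df)

    kmeans = KMeans(k=k, seed=seed, featuresCol="features", predictionCol="cluster")
    model = kmeans.fit(features)

    # Returns the input rows plus a "cluster" column with the assigned cluster id.
    return model.transform(features)


# Example call (hypothetical table and columns):
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.enableHiveSupport().getOrCreate()
#   clustered = cluster_hive_table(spark, "customers", ["age", "income"], k=4)
#   clustered.groupBy("cluster").count().show()
```

Knowing the schema up front helps here: the feature columns can be picked by name straight from the Hive table, with no separate parsing step before the vector is assembled.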
      <pubDate>Fri, 10 Jun 2016 01:11:05 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-and-Structured-Data/m-p/131022#M31374</guid>
      <dc:creator>bleonhardi</dc:creator>
      <dc:date>2016-06-10T01:11:05Z</dc:date>
    </item>
  </channel>
</rss>

