<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Text and Data Mining in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Text-and-Data-Mining/m-p/156049#M36334</link>
    <description>&lt;P&gt;
	I've placed a few PySpark scripts on my GitHub: &lt;A href="https://github.com/zaratsian/pyspark"&gt;https://github.com/zaratsian/pyspark&lt;/A&gt;. You can demo these projects by copying the note.json link into the &lt;A href="https://www.zeppelinhub.com/viewer"&gt;Zeppelin Hub Viewer&lt;/A&gt;.&lt;/P&gt;&lt;P&gt;
	When working with text / unstructured data, there are a few things to keep in mind:&lt;/P&gt;&lt;UL&gt;
	
&lt;LI&gt;&lt;STRONG&gt;Cleaning the text is important&lt;/STRONG&gt; (remove &lt;A href="http://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.StopWordsRemover"&gt;stopwords&lt;/A&gt;, remove punctuation, normalize case (typically by lowercasing all words), account for stemming, tag the part-of-speech, etc.). Part-of-speech tagging is an advanced option, but it can improve accuracy when applied to the right use case.&lt;/LI&gt;
&lt;LI&gt;Most text analytics projects involve creating a &lt;STRONG&gt;term-document matrix&lt;/STRONG&gt; (&lt;A href="http://spark.apache.org/docs/latest/mllib-feature-extraction.html#tf-idf"&gt;TFIDF&lt;/A&gt;, a term frequency / inverse document frequency matrix). In Spark, this is typically done with the HashingTF function.&lt;/LI&gt;
&lt;LI&gt;From here, you can feed the TFIDF vectors into a &lt;STRONG&gt;clustering algorithm&lt;/STRONG&gt; such as &lt;A href="http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.clustering.KMeansModel"&gt;kmeans&lt;/A&gt; or &lt;A href="http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.clustering.LDA"&gt;LDA&lt;/A&gt;, or apply &lt;A href="https://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html"&gt;SVD&lt;/A&gt; (singular value decomposition), which is often an excellent option.&lt;/LI&gt;
&lt;LI&gt;You could also pair the TFIDF matrix with structured data and feed it into a &lt;STRONG&gt;classification (or regression) algorithm&lt;/STRONG&gt; such as &lt;A href="http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.classification.NaiveBayesModel"&gt;Naive Bayes&lt;/A&gt;, a &lt;A href="http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.tree.DecisionTreeModel"&gt;Decision Tree&lt;/A&gt; model, a &lt;A href="https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.tree.RandomForestModel"&gt;Random Forest&lt;/A&gt;, etc.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;
	This process will help you understand your text by &lt;STRONG&gt;(1) finding data-driven topics&lt;/STRONG&gt; using the matrix reduction / clustering techniques, or by &lt;STRONG&gt;(2) using the term-document matrix to predict an outcome&lt;/STRONG&gt; (probability of failure, likelihood to churn, etc.).&lt;/P&gt;&lt;P&gt;
	You may also want to check out &lt;A href="http://spark.apache.org/docs/latest/mllib-feature-extraction.html#word2vec"&gt;Word2Vec&lt;/A&gt; (I have an example in my GitHub repo).&lt;/P&gt;&lt;P&gt;
	Hope this helps!&lt;/P&gt;</description>
    <pubDate>Fri, 29 Jul 2016 10:10:34 GMT</pubDate>
    <dc:creator>dzaratsian</dc:creator>
    <dc:date>2016-07-29T10:10:34Z</dc:date>
    <item>
      <title>Text and Data Mining</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Text-and-Data-Mining/m-p/156048#M36333</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I am looking for a reference and demos to show Text &amp;amp; Data mining capabilities on our platform.&lt;/P&gt;&lt;P&gt;I am trying to answer one of the RFP questions. &lt;/P&gt;&lt;P&gt;Any help is highly appreciated.&lt;/P&gt;&lt;P&gt;Thanks,&lt;/P&gt;&lt;P&gt;Sujitha&lt;/P&gt;</description>
      <pubDate>Fri, 29 Jul 2016 05:54:51 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Text-and-Data-Mining/m-p/156048#M36333</guid>
      <dc:creator>ssanku</dc:creator>
      <dc:date>2016-07-29T05:54:51Z</dc:date>
    </item>
    <item>
      <title>Re: Text and Data Mining</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Text-and-Data-Mining/m-p/156049#M36334</link>
      <description>&lt;P&gt;
	I've placed a few PySpark scripts on my GitHub: &lt;A href="https://github.com/zaratsian/pyspark"&gt;https://github.com/zaratsian/pyspark&lt;/A&gt;. You can demo these projects by copying the note.json link into the &lt;A href="https://www.zeppelinhub.com/viewer"&gt;Zeppelin Hub Viewer&lt;/A&gt;.&lt;/P&gt;&lt;P&gt;
	When working with text / unstructured data, there are a few things to keep in mind:&lt;/P&gt;&lt;UL&gt;
	
&lt;LI&gt;&lt;STRONG&gt;Cleaning the text is important&lt;/STRONG&gt; (remove &lt;A href="http://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.StopWordsRemover"&gt;stopwords&lt;/A&gt;, remove punctuation, normalize case (typically by lowercasing all words), account for stemming, tag the part-of-speech, etc.). Part-of-speech tagging is an advanced option, but it can improve accuracy when applied to the right use case.&lt;/LI&gt;
&lt;LI&gt;Most text analytics projects involve creating a &lt;STRONG&gt;term-document matrix&lt;/STRONG&gt; (&lt;A href="http://spark.apache.org/docs/latest/mllib-feature-extraction.html#tf-idf"&gt;TFIDF&lt;/A&gt;, a term frequency / inverse document frequency matrix). In Spark, this is typically done with the HashingTF function.&lt;/LI&gt;
&lt;LI&gt;From here, you can feed the TFIDF vectors into a &lt;STRONG&gt;clustering algorithm&lt;/STRONG&gt; such as &lt;A href="http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.clustering.KMeansModel"&gt;kmeans&lt;/A&gt; or &lt;A href="http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.clustering.LDA"&gt;LDA&lt;/A&gt;, or apply &lt;A href="https://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html"&gt;SVD&lt;/A&gt; (singular value decomposition), which is often an excellent option.&lt;/LI&gt;
&lt;LI&gt;You could also pair the TFIDF matrix with structured data and feed it into a &lt;STRONG&gt;classification (or regression) algorithm&lt;/STRONG&gt; such as &lt;A href="http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.classification.NaiveBayesModel"&gt;Naive Bayes&lt;/A&gt;, a &lt;A href="http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.tree.DecisionTreeModel"&gt;Decision Tree&lt;/A&gt; model, a &lt;A href="https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.tree.RandomForestModel"&gt;Random Forest&lt;/A&gt;, etc.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;
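&lt;/P&gt;&lt;P&gt;
	The TFIDF step above can be sketched in plain Python. This is a hypothetical, minimal illustration of the hashing-trick idea behind HashingTF and IDF, not actual Spark code; the IDF formula mirrors Spark's default, log((n + 1) / (df + 1)).&lt;/P&gt;

```python
# Minimal pure-Python sketch of hashed TF-IDF (illustrative, not Spark code).
import math

NUM_FEATURES = 16  # Spark's HashingTF defaults to a much larger size; small here for readability

def hashing_tf(tokens):
    """Map each token to a bucket via the hashing trick and count occurrences."""
    vec = [0.0] * NUM_FEATURES
    for tok in tokens:
        vec[hash(tok) % NUM_FEATURES] += 1.0
    return vec

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "spark makes text mining scalable".split(),
]
tf = [hashing_tf(d) for d in docs]

# Document frequency per bucket, then IDF weighting in the style of Spark's IDF.
n = len(tf)
df = [sum(1 for v in tf if v[j] > 0) for j in range(NUM_FEATURES)]
idf = [math.log((n + 1.0) / (df[j] + 1.0)) for j in range(NUM_FEATURES)]
tfidf = [[v[j] * idf[j] for j in range(NUM_FEATURES)] for v in tf]
```

&lt;P&gt;
	In Spark, the rough equivalent is HashingTF(numFeatures).transform(tokens) followed by IDF().fit(tf).transform(tf), and the resulting vectors can go straight into kmeans or a classifier.&lt;/P&gt;&lt;P&gt;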
	This process will help you understand your text by &lt;STRONG&gt;(1) finding data-driven topics&lt;/STRONG&gt; using the matrix reduction / clustering techniques, or by &lt;STRONG&gt;(2) using the term-document matrix to predict an outcome&lt;/STRONG&gt; (probability of failure, likelihood to churn, etc.).&lt;/P&gt;&lt;P&gt;
	You may also want to check out &lt;A href="http://spark.apache.org/docs/latest/mllib-feature-extraction.html#word2vec"&gt;Word2Vec&lt;/A&gt; (I have an example in my GitHub repo).&lt;/P&gt;&lt;P&gt;
	Hope this helps!&lt;/P&gt;</description>
      <pubDate>Fri, 29 Jul 2016 10:10:34 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Text-and-Data-Mining/m-p/156049#M36334</guid>
      <dc:creator>dzaratsian</dc:creator>
      <dc:date>2016-07-29T10:10:34Z</dc:date>
    </item>
  </channel>
</rss>

