<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Fetch distinct values of a column in Dataframe using Spark in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Fetch-distinct-values-of-a-column-in-Dataframe-using-Spark/m-p/174630#M136893</link>
    <description>&lt;P&gt;I am using Spark 1.6.1 and need to fetch the distinct values of a column using Spark DataFrames.
The column contains ~50 million records, and calling collect() slows down further operations on the resulting DataFrame because there is no parallelism.
&lt;STRONG&gt;The code below works fine in local mode, but in yarn-cluster mode I get "java.lang.NoClassDefFoundError".&lt;/STRONG&gt;&lt;/P&gt;&lt;PRE&gt;preProcessedDataFrame.registerTempTable("tTempTable")
preProcessedDataFrame.distinct().foreach(record =&amp;gt; {
  val applicationId = record.getAs[Int]("ApplicationId")
  val selectedApplicationDataFrame = sqlContext.sql("SELECT * FROM tTempTable WHERE ApplicationId = " + applicationId)
  selectedApplicationDataFrame.show(20)
  // Further calculations based on each ApplicationId
})&lt;/PRE&gt;&lt;P&gt;Can someone explain the cause of this error, or suggest a better approach to achieve the same result?&lt;/P&gt;</description>
    <pubDate>Mon, 15 Aug 2016 09:35:53 GMT</pubDate>
    <dc:creator>kaz_narasimhan</dc:creator>
    <dc:date>2016-08-15T09:35:53Z</dc:date>
  </channel>
</rss>

