<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Best way to select distinct values from multiple columns using Spark RDD? in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Best-way-to-select-distinct-values-from-multiple-columns/m-p/98575#M11939</link>
    <description>&lt;P&gt;- The RDD is read from a CSV file, and each line is split into a list.&lt;/P&gt;&lt;P&gt;- rawTrainData is cached.&lt;/P&gt;&lt;P&gt;- It has 2 partitions, both on the same node. The file is not large: 220 MB.&lt;/P&gt;&lt;P&gt;- I edited the original code to translate it to English; "valores" is distinctValues.&lt;/P&gt;</description>
    <pubDate>Thu, 10 Dec 2015 23:22:38 GMT</pubDate>
    <dc:creator>Vitor</dc:creator>
    <dc:date>2015-12-10T23:22:38Z</dc:date>
    <item>
      <title>Best way to select distinct values from multiple columns using Spark RDD?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Best-way-to-select-distinct-values-from-multiple-columns/m-p/98571#M11935</link>
      <description>&lt;P&gt;I'm trying to convert each distinct value in each column of my RDD to an integer index, but the code below is very slow. Is there any alternative?&lt;/P&gt;&lt;P&gt;The data is both numeric and categorical (string).&lt;/P&gt;&lt;PRE&gt;categories = {}
for i in idxCategories:  # idxCategories contains the indexes of the columns that hold categorical data
    # Each iteration runs a separate Spark job over rawTrainData
    distinctValues = rawTrainData.map(lambda x: x[i]).distinct().collect()
    # Map each distinct value to an integer index
    valuesMap = {key: value for (key, value) in zip(distinctValues, range(len(distinctValues)))}
    categories[i] = valuesMap&lt;/PRE&gt;</description>
      <pubDate>Thu, 10 Dec 2015 21:37:04 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Best-way-to-select-distinct-values-from-multiple-columns/m-p/98571#M11935</guid>
      <dc:creator>Vitor</dc:creator>
      <dc:date>2015-12-10T21:37:04Z</dc:date>
    </item>
    <item>
      <title>Re: Best way to select distinct values from multiple columns using Spark RDD?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Best-way-to-select-distinct-values-from-multiple-columns/m-p/98572#M11936</link>
      <description>&lt;P&gt;A few clarifying questions about rawTrainData:&lt;/P&gt;&lt;P&gt;- How is this RDD generated?&lt;/P&gt;&lt;P&gt;- Is it cached?&lt;/P&gt;&lt;P&gt;- How many partitions does it have?&lt;/P&gt;&lt;P&gt;Also, what is the variable "valores"?&lt;/P&gt;</description>
      <pubDate>Thu, 10 Dec 2015 22:23:06 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Best-way-to-select-distinct-values-from-multiple-columns/m-p/98572#M11936</guid>
      <dc:creator>ofermend</dc:creator>
      <dc:date>2015-12-10T22:23:06Z</dc:date>
    </item>
    <item>
      <title>Re: Best way to select distinct values from multiple columns using Spark RDD?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Best-way-to-select-distinct-values-from-multiple-columns/m-p/98573#M11937</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/1218/vabatista.html" nodeid="1218"&gt;@Vitor Batista&lt;/A&gt;&lt;/P&gt;&lt;P&gt;DataFrames are supposed to be faster than Python RDD operations; see slide 20 of this presentation:&lt;/P&gt;&lt;P&gt;&lt;A target="_blank" href="http://www.slideshare.net/databricks/spark-summit-eu-2015-spark-dataframes-simple-and-fast-analysis-of-structured-data"&gt;http://www.slideshare.net/databricks/spark-summit-eu-2015-spark-dataframes-simple-and-fast-analysis-of-structured-data&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Could you try the code below and check whether it's faster?&lt;/P&gt;&lt;PRE&gt;from pyspark.sql import SQLContext, Row


# Read the CSV file and split each line into a list of fields
input_file = "hdfs:///tmp/your_text_file"
raw_rdd = sc.textFile(input_file)
csv_rdd = raw_rdd.map(lambda x: x.split(","))

# Build a Row for each record so Spark can infer the DataFrame schema
row_data = csv_rdd.map(lambda p: Row(
    field1=p[0],
    field2=p[1],
    field3=p[2]
))

# Assumes sqlContext already exists; if not, create one with sqlContext = SQLContext(sc)
df = sqlContext.createDataFrame(row_data)

categories = {}
idxCategories = [0, 1, 2]  # indexes of the columns that contain categorical data
for i in idxCategories:
    distinctValues = df.map(lambda x: x[i]).distinct().collect()
    categories[i] = distinctValues

print(categories[0])
print(categories[1])
print(categories[2])


&lt;/PRE&gt;</description>
      <pubDate>Thu, 10 Dec 2015 22:28:52 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Best-way-to-select-distinct-values-from-multiple-columns/m-p/98573#M11937</guid>
      <dc:creator>gbraccialli3</dc:creator>
      <dc:date>2015-12-10T22:28:52Z</dc:date>
    </item>
    <item>
      <title>Re: Best way to select distinct values from multiple columns using Spark RDD?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Best-way-to-select-distinct-values-from-multiple-columns/m-p/98574#M11938</link>
      <description>&lt;P&gt;4x slower &lt;span class="lia-unicode-emoji" title=":disappointed_face:"&gt;😞&lt;/span&gt; I used .toDF() instead of your code. Is there any difference?&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="762-dataframe.png" style="width: 1014px;"&gt;&lt;img src="https://community.cloudera.com/t5/image/serverpage/image-id/23899iCB61B8C62791B526/image-size/medium?v=v2&amp;amp;px=400" role="button" title="762-dataframe.png" alt="762-dataframe.png" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 19 Aug 2019 12:39:26 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Best-way-to-select-distinct-values-from-multiple-columns/m-p/98574#M11938</guid>
      <dc:creator>Vitor</dc:creator>
      <dc:date>2019-08-19T12:39:26Z</dc:date>
    </item>
    <item>
      <title>Re: Best way to select distinct values from multiple columns using Spark RDD?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Best-way-to-select-distinct-values-from-multiple-columns/m-p/98575#M11939</link>
      <description>&lt;P&gt;- The RDD is read from a CSV file, and each line is split into a list.&lt;/P&gt;&lt;P&gt;- rawTrainData is cached.&lt;/P&gt;&lt;P&gt;- It has 2 partitions, both on the same node. The file is not large: 220 MB.&lt;/P&gt;&lt;P&gt;- I edited the original code to translate it to English; "valores" is distinctValues.&lt;/P&gt;</description>
      <pubDate>Thu, 10 Dec 2015 23:22:38 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Best-way-to-select-distinct-values-from-multiple-columns/m-p/98575#M11939</guid>
      <dc:creator>Vitor</dc:creator>
      <dc:date>2015-12-10T23:22:38Z</dc:date>
    </item>
    <item>
      <title>Re: Best way to select distinct values from multiple columns using Spark RDD?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Best-way-to-select-distinct-values-from-multiple-columns/m-p/98576#M11940</link>
      <description>&lt;P&gt;You could load your CSV directly, but I tested here and distinct does indeed take much longer with DataFrames.&lt;/P&gt;&lt;P&gt;Can you describe your environment?&lt;/P&gt;&lt;P&gt;- Hortonworks version&lt;/P&gt;&lt;P&gt;- Spark version&lt;/P&gt;&lt;P&gt;- hardware configuration&lt;/P&gt;&lt;P&gt;- Spark mode (local mode or Spark on YARN)&lt;/P&gt;&lt;P&gt;Lastly, since your file is small, Spark might be choosing a low level of parallelism. If you have enough cores, you can try increasing the parallelism, like this:&lt;/P&gt;&lt;PRE&gt;    distinctValues = rawTrainData.map(lambda x: x[i]).distinct(numPartitions=15).collect()
&lt;/PRE&gt;&lt;P&gt;Let me know if it got faster &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 11 Dec 2015 01:16:33 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Best-way-to-select-distinct-values-from-multiple-columns/m-p/98576#M11940</guid>
      <dc:creator>gbraccialli3</dc:creator>
      <dc:date>2015-12-11T01:16:33Z</dc:date>
    </item>
    <item>
      <title>Re: Best way to select distinct values from multiple columns using Spark RDD?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Best-way-to-select-distinct-values-from-multiple-columns/m-p/98577#M11941</link>
      <description>&lt;P&gt;Before calling this routine, I added the code below, and the execution time dropped to 1m8s, a 3x improvement.&lt;/P&gt;&lt;PRE&gt;rawTrainData = rawTrainData.repartition(8)
rawTrainData.cache()&lt;/PRE&gt;&lt;P&gt;But passing numPartitions=15 to the distinct method does not affect the result.&lt;/P&gt;&lt;P&gt;I'm running Spark 1.3.1 in standalone mode (spark://host:7077) with 12 cores and 20 GB per node allocated to Spark. The hardware is virtual, but I know it's top-tier hardware. The cluster has 4 nodes (3 Spark workers).&lt;/P&gt;</description>
      <pubDate>Fri, 11 Dec 2015 04:28:15 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Best-way-to-select-distinct-values-from-multiple-columns/m-p/98577#M11941</guid>
      <dc:creator>Vitor</dc:creator>
      <dc:date>2015-12-11T04:28:15Z</dc:date>
    </item>
    <item>
      <title>Re: Best way to select distinct values from multiple columns using Spark RDD?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Best-way-to-select-distinct-values-from-multiple-columns/m-p/98578#M11942</link>
      <description>&lt;P&gt;Awesome! You can check the current number of partitions with the command below:&lt;/P&gt;&lt;PRE&gt;print(csv_rdd.getNumPartitions())
&lt;/PRE&gt;</description>
      <pubDate>Fri, 11 Dec 2015 08:48:06 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Best-way-to-select-distinct-values-from-multiple-columns/m-p/98578#M11942</guid>
      <dc:creator>gbraccialli3</dc:creator>
      <dc:date>2015-12-11T08:48:06Z</dc:date>
    </item>
    <item>
      <title>Re: Best way to select distinct values from multiple columns using Spark RDD?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Best-way-to-select-distinct-values-from-multiple-columns/m-p/98579#M11943</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/1218/vabatista.html" nodeid="1218"&gt;@Vitor Batista&lt;/A&gt; can you accept the best answer to close this thread or post your own solution?&lt;/P&gt;</description>
      <pubDate>Tue, 02 Feb 2016 09:52:46 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Best-way-to-select-distinct-values-from-multiple-columns/m-p/98579#M11943</guid>
      <dc:creator>aervits</dc:creator>
      <dc:date>2016-02-02T09:52:46Z</dc:date>
    </item>
  </channel>
</rss>

