Best way to select distinct values from multiple columns using Spark RDD?

Vitor — Thu, 10 Dec 2015 21:37:04 GMT

I'm trying to map each distinct value in each categorical column of my RDD to an integer index, but the code below is very slow. Is there an alternative?

The data is both numeric and categorical (string).

categories = {}
for i in idxCategories:  # idxCategories holds the indexes of the columns that contain categorical data
    distinctValues = rawTrainData.map(lambda x : x[i]).distinct().collect()
    valuesMap = {key: value for (key,value) in zip(distinctValues, range(len(valores)))}
    categories[i] = valuesMap
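
For reference, here is a small illustration of the kind of mapping the loop is meant to build for one column. It is only a sketch: the list of distinct values is hypothetical, and `enumerate` is used in place of `zip` with `range`, which gives the same pairing.

```python
# Hypothetical distinct values for a single categorical column.
distinct_values = ["red", "green", "blue"]

# Pair each value with a consecutive integer code, in list order.
values_map = {value: code for code, value in enumerate(distinct_values)}
print(values_map)
```

Note that `distinct()` does not guarantee any particular order, so the codes assigned this way can differ between runs unless the values are sorted first.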

Re: Best way to select distinct values from multiple columns using Spark RDD?

ofermend — Thu, 10 Dec 2015 22:23:06 GMT

A few clarifying questions about rawTrainData:

- How is this RDD generated?

- Is it cached?

- How many partitions does it have?

Also, what is the variable "valores"?

Re: Best way to select distinct values from multiple columns using Spark RDD?

gbraccialli3 — Thu, 10 Dec 2015 22:28:52 GMT

@Vitor Batista

DataFrames are supposed to be faster than Python RDD operations; see slide 20 of this presentation:

http://www.slideshare.net/databricks/spark-summit-eu-2015-spark-dataframes-simple-and-fast-analysis-of-structured-data

Could you try the code below and check whether it's faster?

from pyspark.sql import SQLContext, Row

# Assumes an existing SparkContext `sc` (as in the pyspark shell).
sqlContext = SQLContext(sc)

input_file = "hdfs:///tmp/your_text_file"
raw_rdd = sc.textFile(input_file)
csv_rdd = raw_rdd.map(lambda x: x.split(","))

# Name each column so the RDD can be turned into a DataFrame.
row_data = csv_rdd.map(lambda p: Row(
    field1=p[0],
    field2=p[1],
    field3=p[2]
))

df = sqlContext.createDataFrame(row_data)

categories = {}
idxCategories = [0, 1, 2]  # indexes of the columns that contain categorical data
for i in idxCategories:
    # DataFrame.map is available in Spark 1.x; from 2.0 on, use df.rdd.map instead.
    distinctValues = df.map(lambda x: x[i]).distinct().collect()
    categories[i] = distinctValues

print(categories[0])
print(categories[1])
print(categories[2])

Re: Best way to select distinct values from multiple columns using Spark RDD?

Vitor — Mon, 19 Aug 2019 12:39:26 GMT

4x slower 😞 I used .toDF() instead of your code. Is there any difference?

Re: Best way to select distinct values from multiple columns using Spark RDD?

Vitor — Thu, 10 Dec 2015 23:22:38 GMT

- The RDD is read from a CSV file and each line is split into a list

- rawTrainData is cached

- It has 2 partitions on the same node. The file is not large: 220 MB.

- I edited the original code to translate it to English. valores = distinctValues

Re: Best way to select distinct values from multiple columns using Spark RDD?

gbraccialli3 — Fri, 11 Dec 2015 01:16:33 GMT

You could load your CSV directly, but I tested it here and distinct does indeed take much longer with DataFrames.

Can you describe your environment?

- Hortonworks version

- Spark version

- Hardware configuration

- Spark mode (local mode or Spark on YARN)

Lastly, if you have enough cores and since your file is small, Spark might be choosing a low level of parallelism. You can try increasing the parallelism, like this:

    distinctValues = rawTrainData.map(lambda x : x[i]).distinct(numPartitions = 15).collect()

Let me know if it got faster 🙂

Re: Best way to select distinct values from multiple columns using Spark RDD?

Vitor — Fri, 11 Dec 2015 04:28:15 GMT

Before calling this routine, I added the code below, and the execution time dropped to 1m8s, a 3x improvement.

rawTrainData = rawTrainData.repartition(8)
rawTrainData.cache()

However, passing numPartitions=15 to the distinct method made no difference to the execution time.

I'm running Spark 1.3.1 in standalone mode (spark://host:7077) with 12 cores and 20 GB per node allocated to Spark. The hardware is virtual, but I know it's high-end. The cluster has 4 nodes (3 Spark workers).
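
Since the loop launches one Spark job per categorical column, another option, not tried in this thread, would be to gather the distinct (column index, value) pairs for all columns in a single pass. Below is a minimal sketch of that idea. The helper names `column_value_pairs` and `build_categories` are hypothetical, the values within a column are assumed to be sortable, and the result is assumed to fit comfortably on the driver. Whether it is actually faster on this data has not been measured.

```python
from collections import defaultdict


def column_value_pairs(row, idx_categories):
    """Return one (column index, value) pair per categorical column of a row."""
    return [(i, row[i]) for i in idx_categories]


def build_categories(distinct_pairs):
    """Group distinct (column index, value) pairs into {column: {value: code}}."""
    values_by_column = defaultdict(set)
    for i, value in distinct_pairs:
        values_by_column[i].add(value)
    # Sort the values so the integer codes are reproducible between runs.
    return {
        i: {value: code for code, value in enumerate(sorted(values))}
        for i, values in values_by_column.items()
    }


# Intended use with Spark (assumes rawTrainData and idxCategories from the
# original post); one job replaces the per-column loop:
#
#   pairs = (rawTrainData
#            .flatMap(lambda row: column_value_pairs(row, idxCategories))
#            .distinct()
#            .collect())
#   categories = build_categories(pairs)

# Local illustration, with plain Python lists standing in for the RDD.
rows = [["a", 1.0, "x"], ["b", 2.0, "x"], ["a", 3.0, "y"]]
pairs = set(p for row in rows for p in column_value_pairs(row, [0, 2]))
categories = build_categories(pairs)
print(categories)
```

The trade-off is that all columns share one shuffle, so the number of distinct pairs across every categorical column has to be manageable in a single `collect()`.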

Re: Best way to select distinct values from multiple columns using Spark RDD?

gbraccialli3 — Fri, 11 Dec 2015 08:48:06 GMT

Awesome! You can check the current number of partitions with the command below:

print csv_rdd.getNumPartitions()

Re: Best way to select distinct values from multiple columns using Spark RDD?

aervits — Tue, 02 Feb 2016 09:52:46 GMT

@Vitor Batista Can you accept the best answer to close this thread, or post your own solution?