
How to remove the duplicated elements in a RDD transformation

Hi, I am a new PySpark user.

 

I want to create an output from the cartesian product of an RDD with itself.

 

Here is my code and output:

 

rdd = sc.parallelize([(1,5), (2,6), (3,7)])
rdd.cartesian(rdd).collect()

[((1, 5), (1, 5)),
((1, 5), (2, 6)),
((1, 5), (3, 7)),
((2, 6), (1, 5)),
((2, 6), (2, 6)),
((2, 6), (3, 7)),
((3, 7), (1, 5)),
((3, 7), (2, 6)),
((3, 7), (3, 7))]

 

However, my desired output should be:

 

[((1, 5), (2, 6)),
((1, 5), (3, 7)),
((2, 6), (1, 5)),
((2, 6), (3, 7)),
((3, 7), (1, 5)),
((3, 7), (2, 6))]

 

That is, I want to remove the self-paired elements such as ((1, 5), (1, 5)), ((2, 6), (2, 6)), and ((3, 7), (3, 7)).
