Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant.
Announcements
This board is archived and read-only for historical reference. To ask a new question, please post a new topic on the appropriate active board.

Spark Scala - Remove rows that have columns with same value

Rising Star

Hi,

I have this data in a text file (tab-separated):

1	4
2	5
2	2
1	5

Using Spark and Scala, how can I identify the rows where the same number appears in both columns, and how can I delete them? In this case I want to remove the third row...

Many thanks!

1 ACCEPTED SOLUTION

Master Guru
scala> // split each line on the tab, then keep only rows whose two columns differ
scala> val a = sc.textFile("/user/.../path/to/your/file").map(x => x.split("\t")).filter(x => x(0) != x(1))
scala> a.take(4)
res2: Array[Array[String]] = Array(Array(1, 4), Array(2, 5), Array(1, 5))

Try the snippet above; just substitute the path to your file on HDFS.
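If you want to try the same split-and-filter logic without a cluster, here is a minimal sketch on a plain Scala List standing in for the RDD. The sample lines are an assumption reconstructed from the tab-separated pairs in the question:

```scala
object DedupColumns {
  def main(args: Array[String]): Unit = {
    // Stand-in for sc.textFile: each element is one line of the file
    val lines = List("1\t4", "2\t5", "2\t2", "1\t5")

    // Split each line on the tab, keep only rows whose two columns differ
    val kept = lines.map(_.split("\t")).filter(cols => cols(0) != cols(1))

    // Prints the three surviving rows: 1,4 then 2,5 then 1,5
    kept.foreach(cols => println(cols.mkString(",")))
  }
}
```

Because RDD `map` and `filter` take the same function arguments as their Scala-collection counterparts, the logic verified here carries over to the Spark version unchanged.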

