
Iterate every row of a spark dataframe without using collect

I want to iterate over every row of a DataFrame without using collect. Here is my current implementation:

val df = spark.read.csv("/tmp/s0v00fc/test_dir")

import scala.collection.mutable.Map
var m1 = Map[Int, Int]()
var m4 = Map[Int, Int]()
var j = 1

// Assign each distinct value a sequential id, and keep the reverse mapping.
def Test(m: Int, n: Int): Unit = {
  if (!m1.contains(m)) {
    m1 += (m -> j)
    m4 += (j -> m)
    j += 1
  }
  if (!m1.contains(n)) {
    m1 += (n -> j)
    m4 += (j -> n)
    j += 1
  }
}

df.foreach { row => Test(row(0).toString.toInt, row(1).toString.toInt) }

This runs without any error, but m1 and m4 are still empty afterwards. I can get the result I am expecting if I do a df.collect, as shown below:

df.collect.foreach { row => Test(row(0).toString.toInt, row(1).toString.toInt) }
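A possible variant (a sketch, untested, assuming the same df and Test as above) is toLocalIterator, which still runs Test on the driver, so the maps get populated, but streams one partition at a time instead of materializing the whole DataFrame like collect does:

```scala
// Sketch: iterate on the driver, pulling one partition at a time.
// Dataset.toLocalIterator returns a java.util.Iterator, so convert it.
import scala.collection.JavaConverters._  // on Scala 2.13+: scala.jdk.CollectionConverters._

df.toLocalIterator().asScala.foreach { row =>
  Test(row(0).toString.toInt, row(1).toString.toInt)
}
```

This keeps the driver-side mutation semantics of the collect version while bounding memory to roughly one partition at a time.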

How do I execute the custom function Test on every row of the DataFrame without using collect?