Support Questions
Find answers, ask questions, and share your expertise

Spark - From RDD to CSV String with ordering


Spark - From RDD to CSV String with ordering


Hello :)


I try to convert a RDD[(Int,String]) to a simple String.

The goal is to concatenate all String values to a CSV string and the difficulty is to keep the order defined by the Int values.


Input RDD[(Int,String]) is like the following example: 

  • (166,"A")
  • (2,"B")
  • (200,"C")
  • (100,"D")

Expected output String should be like "B","D","A","C" 


Here is the code example with all details:



    val data = sc.parallelize(Seq((166,"A"),(2,"B"),(200,"C"),(100,"D")))
    // RDD[(Int, Any)]

    val keyedSortedData = data.keyBy(x => x._1).sortByKey()
    // RDD[(Int, (Int, Any))]

    val sortedData ={case (a,(b,c)) => c})
    // RDD[Any]

    val reducedData = sortedData.reduce((valueA, valueB) => valueA + "," + valueB).toString
    // String in CSV format

    // Execution 1 - Result is: B,D,C,B
    // Execution 2 - Result is: A,C,D,B
    // Execution 3 - Result is: C,B,A,D
    // ...



I cannot achieve to get a fixed order... According documentation, map() and reduce() operations should not alter order defined with the keyBy() method. But it seems to alter it...


Thanks for your help or your remarks.

Don't have an account?