Posts: 24
Registered: ‎08-14-2015

Spark - From RDD to CSV String with ordering

[ Edited ]

Hello :)


I try to convert a RDD[(Int,String]) to a simple String.

The goal is to concatenate all String values to a CSV string and the difficulty is to keep the order defined by the Int values.


Input RDD[(Int,String]) is like the following example: 

  • (166,"A")
  • (2,"B")
  • (200,"C")
  • (100,"D")

Expected output String should be like "B","D","A","C" 


Here is the code example with all details:



    val data = sc.parallelize(Seq((166,"A"),(2,"B"),(200,"C"),(100,"D")))
    // RDD[(Int, Any)]

    val keyedSortedData = data.keyBy(x => x._1).sortByKey()
    // RDD[(Int, (Int, Any))]

    val sortedData ={case (a,(b,c)) => c})
    // RDD[Any]

    val reducedData = sortedData.reduce((valueA, valueB) => valueA + "," + valueB).toString
    // String in CSV format

    // Execution 1 - Result is: B,D,C,B
    // Execution 2 - Result is: A,C,D,B
    // Execution 3 - Result is: C,B,A,D
    // ...



I cannot achieve to get a fixed order... According documentation, map() and reduce() operations should not alter order defined with the keyBy() method. But it seems to alter it...


Thanks for your help or your remarks.