Spark Dataframes: How can I change the order of columns in Java/Scala?

Contributor

After joining two dataframes, I find that the column order is different from what I expected.

Ex: Joining two data frames with columns [b,c,d,e] and [a,b] on b yields a column order of [b,a,c,d,e].

How can I change the order of the columns (e.g., [a,b,c,d,e])? I've found ways to do it in Python/R but not Scala or Java. Are there any methods that allow swapping or reordering of dataframe columns?

1 ACCEPTED SOLUTION

Contributor

@Jestin: Why do you need to reorder the columns in a dataframe? Could you please elaborate?

However, as far as I know, there is no dedicated built-in function in Java for reordering columns.


6 REPLIES

Super Guru

Your sorting should happen on the basis of the key. Here is an example in Scala:

val file = sc.textFile("some_local_text_file_pathname")
val wordCounts = file.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _, 1)  // 2nd arg configures one task (same as number of partitions)
  .map(item => item.swap) // interchanges position of entries in each tuple
  .sortByKey(true, 1) // 1st arg configures ascending sort, 2nd arg configures one task
  .map(item => item.swap)



Master Guru

Why does the order of columns matter?

Contributor

There are scenarios (though bad ones) where inserting data into a database over a JDBC connection requires the columns to be in lexicographic order. Not sure whether Jestin Ma is facing a similar issue.
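
If lexicographic column order is what the target table needs, one way to get it is to sort the column names and pass them to select. This is only a sketch: the object name `SortColumns` and the helper `sortColumnsByName` are made up for illustration, and it assumes plain column names (no dots or backticks, which `col` would interpret as nested fields).

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object SortColumns {
  // Hypothetical helper: returns a DataFrame with the same data,
  // but with its columns arranged in lexicographic (String) order.
  // Assumes simple column names without dots or special characters.
  def sortColumnsByName(df: DataFrame): DataFrame =
    df.select(df.columns.sorted.map(col): _*)
}
```

Calling `SortColumns.sortColumnsByName(df)` on a dataframe with columns [b, a, c, d, e] should give back [a, b, c, d, e]. Note that `sorted` uses plain String ordering, so uppercase names sort before lowercase ones.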

Contributor

To reorder tuples (columns) in Scala, I think you just use a map, like in PySpark. In Scala 2 the tuple has to be destructured with a case pattern:

val rdd2 = rdd.map { case (x, y, z) => (z, y, x) }

You should also be able to build key-value pairs this way:

val rdd2 = rdd.map { case (x, y, z) => (z, (y, x)) }

This is very handy if you want to follow it up with sortByKey().
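
Here is a small end-to-end sketch of that idea: re-key each triple by its last element, then sort by that key. The object name `ReorderTuples`, the helper `keyByLast`, the element types, and the sample data are all assumptions for illustration.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object ReorderTuples {
  // Hypothetical helper: turns (x, y, z) into the pair (z, (y, x)),
  // so that z becomes the key and sortByKey becomes available.
  def keyByLast(rdd: RDD[(Int, String, Double)]): RDD[(Double, (String, Int))] =
    rdd.map { case (x, y, z) => (z, (y, x)) }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("reorder-tuples")
      .getOrCreate()
    val sc = spark.sparkContext

    // Made-up sample triples.
    val rdd = sc.parallelize(Seq((1, "b", 3.0), (2, "a", 1.0), (3, "c", 2.0)))

    // Ascending sort on the new key (the former third element).
    val sorted = keyByLast(rdd).sortByKey().collect()

    sorted.foreach(println)
    spark.stop()
  }
}
```

Keep in mind this works on RDDs of tuples, not on DataFrame columns, so it only applies if you have dropped down to the RDD API.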

New Contributor

All you need to do is use select (worked for me). Do the following:

val new_df = df.select("a", "b", "c", "d", "e") // Assuming you want a, b, c, d, e to be your order
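
For anyone who wants to try this against the join from the question, here is a minimal sketch. The sample rows, the local SparkSession, and the names `ReorderColumns`, `reorder`, `left`, `right`, and `joined` are assumptions for illustration; only the `select` call itself comes from the answer above.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ReorderColumns {
  // Hypothetical helper: reorders columns using a list built at runtime.
  // select(String, String*) needs the first name passed separately.
  // Assumes `order` is non-empty and names only existing columns.
  def reorder(df: DataFrame, order: Seq[String]): DataFrame =
    df.select(order.head, order.tail: _*)

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("reorder-columns")
      .getOrCreate()
    import spark.implicits._

    // Made-up rows with the column sets from the question.
    val left  = Seq(("a1", 1)).toDF("a", "b")
    val right = Seq((1, "c1", "d1", "e1")).toDF("b", "c", "d", "e")

    // Joining on the column name keeps a single copy of "b";
    // the question reports the join column ending up first.
    val joined = left.join(right, "b")

    // select returns a new DataFrame with columns in the order listed.
    val byName = joined.select("a", "b", "c", "d", "e")
    val bySeq  = reorder(joined, Seq("a", "b", "c", "d", "e"))

    println(byName.columns.mkString(", ")) // a, b, c, d, e
    println(bySeq.columns.mkString(", "))  // a, b, c, d, e
    spark.stop()
  }
}
```

The same `select(String, String...)` method is exposed on `Dataset<Row>` in the Java API, so the approach should carry over to Java with only syntax changes.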

@venki2404
