
How to transpose a PySpark DataFrame?

Rising Star

dt1 = {'one': [0.3, 1.2, 1.3, 1.5, 1.4, 1], 'two': [0.6, 1.2, 1.7, 1.5, 1.4, 2]}

dt = sc.parallelize([(k,) + tuple(v) for k, v in dt1.items()]).toDF()

dt.show()

I want 'one' and 'two' to be the column headers, and the list values to fill the columns.
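In other words, the desired output would look something like this (values paired by their list position):

+---+---+
|one|two|
+---+---+
|0.3|0.6|
|1.2|1.2|
|1.3|1.7|
|1.5|1.5|
|1.4|1.4|
|1.0|2.0|
+---+---+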

1 ACCEPTED SOLUTION


@subhrajit mohanty

Here's an option using RDDs, and I'd like to see if anyone comes up with a good DataFrame solution.

Input Format:

+---+---+---+---+---+---+---+
| _1| _2| _3| _4| _5| _6| _7|
+---+---+---+---+---+---+---+
|two|0.6|1.2|1.7|1.5|1.4|2.0|
|one|0.3|1.2|1.3|1.5|1.4|1.0|
+---+---+---+---+---+---+---+

Output Format:

+---+---+
|two|one|
+---+---+
|0.6|0.3|
|1.2|1.2|
|1.7|1.3|
|1.5|1.5|
|1.4|1.4|
|2.0|1.0|
+---+---+

Code:

# `sc` is the SparkContext provided by the PySpark shell.
dt1 = {'one': [0.3, 1.2, 1.3, 1.5, 1.4, 1.0], 'two': [0.6, 1.2, 1.7, 1.5, 1.4, 2.0]}
dt = sc.parallelize([(k,) + tuple(v) for k, v in dt1.items()]).toDF()
dt.show()


# --- Start of my Transpose Code ---


# Grab the first column; its values become the new column headers
new_header = [i[0] for i in dt.select("_1").rdd.map(tuple).collect()]

# Remove the first column from the DataFrame
dt2 = dt.select([c for c in dt.columns if c != '_1'])

# Convert the DataFrame to an RDD of plain tuples
rdd = dt2.rdd.map(tuple)

# Transpose: tag each cell with its (row i, column j) position
rddT1 = rdd.zipWithIndex().flatMap(lambda xi: [(xi[1], j, e) for (j, e) in enumerate(xi[0])])
# Key by column index, so each group collects one input column
rddT2 = rddT1.map(lambda ije: (ije[1], (ije[0], ije[2]))).groupByKey().sortByKey()
# Restore the original row order within each group (groupByKey does not preserve it)
rddT3 = rddT2.map(lambda jx: sorted(jx[1], key=lambda ie: ie[0]))
# Strip the row indices, keeping only the values
rddT4 = rddT3.map(lambda row: [e for (i, e) in row])

# Convert back to DataFrame (along with header)
df = rddT4.toDF(new_header)

df.show()

Let me know if this helps.

Reference: http://www.data-intuitive.com/2015/01/transposing-a-spark-rdd/
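If the data is small enough to collect to the driver, a pandas round-trip is one possible DataFrame-side answer (a minimal, untested sketch; it assumes a SparkSession named `spark` and that everything fits in driver memory):

# Transpose via pandas on the driver; only safe for small data,
# since toPandas() collects the whole DataFrame locally
pdf = dt.toPandas().set_index('_1').T
df2 = spark.createDataFrame(pdf)  # `spark` is assumed to be the SparkSession
df2.show()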


2 REPLIES


New Contributor

Is there a way to get the row labels (_1, _2, _3, _4, _5, _6, _7) as the first column of the transposed DataFrame?
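One way that should work is to reuse the remaining column names (`_2` through `_7`; `_1` became the headers) as labels. A minimal sketch building on the code above, where the column name "label" is my own choice:

# Pair each transposed row with the column name it came from
row_labels = dt2.columns  # ['_2', ..., '_7']
rddT5 = rddT4.zipWithIndex().map(lambda xi: (row_labels[xi[1]],) + tuple(xi[0]))

df_labeled = rddT5.toDF(["label"] + new_header)
df_labeled.show()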