Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant.
Announcements
This board is archived and read-only for historical reference. To ask a new question, please post a new topic on the appropriate active board.

How to transpose a pyspark dataframe?

Rising Star

dt1 = {'one':[0.3, 1.2, 1.3, 1.5, 1.4, 1],'two':[0.6, 1.2, 1.7, 1.5,1.4, 2]}

dt = sc.parallelize([ (k,) + tuple(v[0:]) for k,v in dt1.items()]).toDF()

dt.show()

I want 'one' and 'two' to be the column headers, and all the list values should become the column values.
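(As a quick side note on the snippet in the question: the comprehension just builds one row tuple per dict key, which you can check in plain Python without Spark. `rows` below is only an illustrative name.)

```python
dt1 = {'one': [0.3, 1.2, 1.3, 1.5, 1.4, 1],
       'two': [0.6, 1.2, 1.7, 1.5, 1.4, 2]}

# one tuple per key: the key first, then all of its list values
rows = [(k,) + tuple(v) for k, v in dt1.items()]
```

These are exactly the rows that `sc.parallelize(...).toDF()` turns into the DataFrame shown below.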

1 ACCEPTED SOLUTION


@subhrajit mohanty

Here's an option using RDDs, and I'd like to see if anyone comes up with a good DataFrame solution.

Input Format:

+---+---+---+---+---+---+---+
| _1| _2| _3| _4| _5| _6| _7|
+---+---+---+---+---+---+---+
|two|0.6|1.2|1.7|1.5|1.4|2.0|
|one|0.3|1.2|1.3|1.5|1.4|1.0|
+---+---+---+---+---+---+---+

Output Format:

+---+---+
|two|one|
+---+---+
|0.6|0.3|
|1.2|1.2|
|1.7|1.3|
|1.5|1.5|
|1.4|1.4|
|2.0|1.0|
+---+---+

Code:

# Run in a PySpark shell, where `sc` (the SparkContext) is predefined.

dt1 = {'one':[0.3, 1.2, 1.3, 1.5, 1.4, 1.0],'two':[0.6, 1.2, 1.7, 1.5, 1.4, 2.0]}
dt = sc.parallelize([ (k,) + tuple(v[0:]) for k,v in dt1.items()]).toDF()
dt.show()


#--- Start of my Transpose Code ---


# Grab data from the first column, since it will be transposed into the new column headers
new_header = [i[0] for i in dt.select("_1").rdd.map(tuple).collect()]

# Remove first column from dataframe
dt2 = dt.select([c for c in dt.columns if c not in ['_1']])

# Convert DataFrame to RDD
rdd = dt2.rdd.map(tuple)

# Transpose Data
# (Python 3: tuple-unpacking lambdas are gone, so index into the tuples instead)
rddT1 = rdd.zipWithIndex().flatMap(lambda xi: [(xi[1], j, e) for (j, e) in enumerate(xi[0])])
rddT2 = rddT1.map(lambda ije: (ije[1], (ije[0], ije[2]))).groupByKey().sortByKey()
rddT3 = rddT2.map(lambda jx: sorted(jx[1], key=lambda ie: ie[0]))
rddT4 = rddT3.map(lambda x: [e for (_, e) in x])

# Convert back to DataFrame (along with header)
df = rddT4.toDF(new_header)

df.show()

Let me know if this helps.

Reference: http://www.data-intuitive.com/2015/01/transposing-a-spark-rdd/
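The core trick in rddT1 through rddT4 can be illustrated without Spark: tag every value with its (row, column) position, then regroup by column index. A plain-Python sketch of the same idea:

```python
# The two input rows, as they come out of dt.show() above
data = [("two", 0.6, 1.2, 1.7, 1.5, 1.4, 2.0),
        ("one", 0.3, 1.2, 1.3, 1.5, 1.4, 1.0)]

new_header = [row[0] for row in data]   # first column -> ['two', 'one']
body = [row[1:] for row in data]        # drop the label column

# emit (row_index, col_index, value) triples, as rddT1 does
triples = [(i, j, e)
           for i, row in enumerate(body)
           for j, e in enumerate(row)]

# regroup by column index j, ordering each group by row index i
ncols = len(body[0])
transposed = [[e for (i, j, e) in sorted(triples) if j == col]
              for col in range(ncols)]
```

Each element of `transposed` is one row of the output table: the first is `[0.6, 0.3]`, matching the first row under the `two|one` header.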


2 REPLIES 2

(duplicate of the accepted solution above)

New Member

Is there a way to get the row labels (_1, _2, _3, _4, _5, _6, _7) as the first column of the transposed dataframe?
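This archived thread has no reply to the question above, but as a hedged sketch one way is to zip the surviving column names (_2 through _7, since _1 becomes the header row) onto the transposed rows. Shown here in plain Python mirroring the RDD pipeline; `labels`, `labeled`, and `header` are illustrative names:

```python
new_header = ["two", "one"]
labels = ["_2", "_3", "_4", "_5", "_6", "_7"]   # dt2.columns in the solution
# stands in for the rows produced by rddT4 in the accepted solution
transposed = [[0.6, 0.3], [1.2, 1.2], [1.7, 1.3],
              [1.5, 1.5], [1.4, 1.4], [2.0, 1.0]]

# prepend the matching label to each transposed row
labeled = [[lab] + row for lab, row in zip(labels, transposed)]
header = ["label"] + new_header
# In PySpark this would be roughly:
#   labeled_rdd = rddT4.zipWithIndex().map(lambda xi: [labels[xi[1]]] + list(xi[0]))
#   df = labeled_rdd.toDF(["label"] + new_header)
```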