Created 01-17-2017 05:56 AM
dt1 = {'one':[0.3, 1.2, 1.3, 1.5, 1.4, 1],'two':[0.6, 1.2, 1.7, 1.5,1.4, 2]}
dt = sc.parallelize([ (k,) + tuple(v[0:]) for k,v in dt1.items()]).toDF()
dt.show()
I want 'one' and 'two' to be the column headers, with the list values filling the columns (one row per list position).
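To illustrate the shape I'm after, here is a rough sketch that builds the desired result directly from the dict with plain Python (assuming the sqlContext available in the PySpark shell):

# Sketch of the desired output: dict keys become headers, the i-th
# element of each value list becomes the i-th row
rows = list(zip(dt1['one'], dt1['two']))      # [(0.3, 0.6), (1.2, 1.2), ...]
expected = sqlContext.createDataFrame(rows, ['one', 'two'])
expected.show()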
Created 01-20-2017 02:32 AM
Here's an option using RDDs, and I'd like to see if anyone comes up with a good DataFrame solution.
Input Format:
+---+---+---+---+---+---+---+
| _1| _2| _3| _4| _5| _6| _7|
+---+---+---+---+---+---+---+
|two|0.6|1.2|1.7|1.5|1.4|2.0|
|one|0.3|1.2|1.3|1.5|1.4|1.0|
+---+---+---+---+---+---+---+
Output Format:
+---+---+
|two|one|
+---+---+
|0.6|0.3|
|1.2|1.2|
|1.7|1.3|
|1.5|1.5|
|1.4|1.4|
|2.0|1.0|
+---+---+
Code:
import numpy as np
from pyspark.sql import SQLContext
from pyspark.sql.functions import lit

dt1 = {'one': [0.3, 1.2, 1.3, 1.5, 1.4, 1.0], 'two': [0.6, 1.2, 1.7, 1.5, 1.4, 2.0]}
dt = sc.parallelize([(k,) + tuple(v) for k, v in dt1.items()]).toDF()
dt.show()

# --- Start of my Transpose Code ---

# Grab data from the first column, since it will be transposed into the new column headers
new_header = [i[0] for i in dt.select("_1").rdd.map(tuple).collect()]

# Remove the first column from the dataframe
dt2 = dt.select([c for c in dt.columns if c != '_1'])

# Convert DataFrame to an RDD of plain tuples
rdd = dt2.rdd.map(tuple)

# Transpose the data: tag each value with its (row, column) indices,
# regroup by column index, then sort each group by the original row index
rddT1 = rdd.zipWithIndex().flatMap(lambda xi: [(xi[1], j, e) for (j, e) in enumerate(xi[0])])
rddT2 = rddT1.map(lambda ije: (ije[1], (ije[0], ije[2]))).groupByKey().sortByKey()
rddT3 = rddT2.map(lambda jx: sorted(jx[1], key=lambda ie: ie[0]))
rddT4 = rddT3.map(lambda x: [e for (i, e) in x])

# Convert back to DataFrame (along with header)
df = rddT4.toDF(new_header)
df.show()
Let me know if this helps.
Reference: http://www.data-intuitive.com/2015/01/transposing-a-spark-rdd/
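On the DataFrame solution I mentioned: here is an untested sketch of a DataFrame-only variant, assuming Spark 2.1+ where posexplode is available. It melts each row into (label, position, value) triples and then pivots the labels back into columns:

from pyspark.sql import functions as F

# 'dt' is the input DataFrame from the code above
value_cols = [c for c in dt.columns if c != '_1']

# Long format: one row per (label, pos, value)
long_df = dt.select(F.col('_1').alias('label'),
                    F.posexplode(F.array(*[F.col(c) for c in value_cols])))

# Pivot the labels back into columns, one output row per original position
out = (long_df.groupBy('pos').pivot('label').agg(F.first('col'))
              .orderBy('pos').drop('pos'))
out.show()

Note that pivot sorts the distinct label values, so the columns may come out as (one, two) rather than (two, one).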
Created 06-07-2017 01:28 PM
Is there a way to get the row labels (_1, _2, _3, _4, _5, _6, _7) as the first column of the transposed DataFrame?
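Something along these lines is what I have in mind, building on the rddT4 and new_header variables from the code above (rough, untested sketch):

# Each transposed row corresponds to one original value column (_2.._7),
# so pair every row with its original column name via zipWithIndex
row_labels = [c for c in dt.columns if c != '_1']   # ['_2', ..., '_7']

labeled = rddT4.zipWithIndex().map(
    lambda ri: (row_labels[ri[1]],) + tuple(ri[0]))

df2 = labeled.toDF(['label'] + new_header)
df2.show()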