07-16-2020 11:40 PM
Hello, here is a way to do it using PySpark; it may not be optimal. I used this CSV to test my code:

column1,column2,column3
row1-1|row1-2|row1-3
row2-1|row2-2|row2-3
row3-1|row3-2|row3-3

Load the header only, giving the dataframe its structure:

header_dataframe = spark.read.format("csv").option("header", "true").load('/tmp/test.csv').limit(0)

+-------+-------+-------+
|column1|column2|column3|
+-------+-------+-------+
+-------+-------+-------+

Load the data as an RDD, remove the first line, and convert it to a dataframe:

data_rdd = sc.textFile('/tmp/test.csv')
header_row = data_rdd.first()
data_rdd = data_rdd.filter(lambda row: row != header_row)
data_dataframe = data_rdd.map(lambda x: x.split("|")).toDF()

+------+------+------+
|    _1|    _2|    _3|
+------+------+------+
|row1-1|row1-2|row1-3|
|row2-1|row2-2|row2-3|
|row3-1|row3-2|row3-3|
+------+------+------+

Append the dataframe containing the data to the dataframe holding the structure:

dataframe = header_dataframe.union(data_dataframe)

+-------+-------+-------+
|column1|column2|column3|
+-------+-------+-------+
| row1-1| row1-2| row1-3|
| row2-1| row2-2| row2-3|
| row3-1| row3-2| row3-3|
+-------+-------+-------+
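
As a possible shortcut, the union step can be skipped by naming the columns directly from the header dataframe's column list when converting the RDD. This is just a minimal sketch under the same assumptions (the test file at /tmp/test.csv, an active SparkSession spark and SparkContext sc), not a tested variant:

# Read only the comma-delimited header row to capture the column names
header_dataframe = spark.read.format("csv").option("header", "true").load('/tmp/test.csv').limit(0)

# Load the pipe-delimited rows and drop the header line
data_rdd = sc.textFile('/tmp/test.csv')
header_row = data_rdd.first()
data_rdd = data_rdd.filter(lambda row: row != header_row)

# Passing the column names as the schema names the columns up front,
# so no union with the empty header dataframe is needed
dataframe = data_rdd.map(lambda x: x.split("|")).toDF(header_dataframe.columns)

Either way, note that every column comes back as a string, since the values were split from raw text; casts would have to be applied afterwards if other types are needed. Also, both approaches match columns by position, so the header's column order must agree with the order of the split values.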