Support Questions
Find answers, ask questions, and share your expertise

How can I replace characters in the header of a Spark Dataframe?

How can I replace characters in the header of a Spark Dataframe?

New Contributor

I have a dozen or so large txt files (15GB) that I have been reading into Spark as csv, but the issue that I am running into is that the headers for these files are delimited by commas(,), whereas all of the data in said columns is delimited by pipes( | ). 

 

Does anyone know how I could replace the commas in the headers with pipes, or conversely replace the pipes in the rest of the data with commas?

 

The current layout of the files make it so that all of the data in the files fall into the first column, which messes up the filter I am trying to run on said files. 

 

Any help is greatly appreciated. 

2 REPLIES 2

Re: How can I replace characters in the header of a Spark Dataframe?

New Contributor

hi, in which file system, the file is based on? linux/unix or hdfs?

Re: How can I replace characters in the header of a Spark Dataframe?

Explorer

Hello,

 

here is a way to do it using pyspark, it may be not optimal.

 

I used this csv to test my code 

column1,column2,column3
row1-1|row1-2|row1-3
row2-1|row2-2|row2-3
row3-1|row3-2|row3-3

 

Load the header only, giving the dataframe structure

 

header_dataframe = spark.read.format("csv").option("header", "true").load('/tmp/test.csv').limit(0)
+-------+-------+-------+
|column1|column2|column3|
+-------+-------+-------+
+-------+-------+-------+

 

 

Load the data as RDD, remove the first line and convert it to dataframe

data_rdd = sc.textFile('/tmp/test.csv')
header_row =data_rdd.first()
data_rdd = data_rdd.filter(lambda row:row != header_row)
data_dataframe = data_rdd.map(lambda x: x.split("|")).toDF()
+------+------+------+
|    _1|    _2|    _3|
+------+------+------+
|row1-1|row1-2|row1-3|
|row2-1|row2-2|row2-3|
|row3-1|row3-2|row3-3|
+------+------+------+

 

Append the dataframe containing the data to the dataframe holding the structure

 

dataframe = header_dataframe.union(data_dataframe)

 

+-------+-------+-------+
|column1|column2|column3|
+-------+-------+-------+
| row1-1| row1-2| row1-3|
| row2-1| row2-2| row2-3|
| row3-1| row3-2| row3-3|
+-------+-------+-------+