- Subscribe to RSS Feed
- Mark Question as New
- Mark Question as Read
- Float this Question for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page
How can I replace characters in the header of a Spark Dataframe?
- Labels:
-
Apache Spark
Created ‎07-15-2020 07:54 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I have a dozen or so large txt files (15GB) that I have been reading into Spark as csv, but the issue that I am running into is that the headers for these files are delimited by commas(,), whereas all of the data in said columns is delimited by pipes( | ).
Does anyone know how I could replace the commas in the headers with pipes, or conversely replace the pipes in the rest of the data with commas?
The current layout of the files make it so that all of the data in the files fall into the first column, which messes up the filter I am trying to run on said files.
Any help is greatly appreciated.
Created ‎07-16-2020 09:15 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
hi, in which file system, the file is based on? linux/unix or hdfs?
Created ‎07-16-2020 11:40 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hello,
here is a way to do it using pyspark, it may be not optimal.
I used this csv to test my code
column1,column2,column3
row1-1|row1-2|row1-3
row2-1|row2-2|row2-3
row3-1|row3-2|row3-3
Load the header only, giving the dataframe structure
header_dataframe = spark.read.format("csv").option("header", "true").load('/tmp/test.csv').limit(0)
+-------+-------+-------+
|column1|column2|column3|
+-------+-------+-------+
+-------+-------+-------+
Load the data as RDD, remove the first line and convert it to dataframe
data_rdd = sc.textFile('/tmp/test.csv')
header_row =data_rdd.first()
data_rdd = data_rdd.filter(lambda row:row != header_row)
data_dataframe = data_rdd.map(lambda x: x.split("|")).toDF()
+------+------+------+
| _1| _2| _3|
+------+------+------+
|row1-1|row1-2|row1-3|
|row2-1|row2-2|row2-3|
|row3-1|row3-2|row3-3|
+------+------+------+
Append the dataframe containing the data to the dataframe holding the structure
dataframe = header_dataframe.union(data_dataframe)
+-------+-------+-------+
|column1|column2|column3|
+-------+-------+-------+
| row1-1| row1-2| row1-3|
| row2-1| row2-2| row2-3|
| row3-1| row3-2| row3-3|
+-------+-------+-------+
