Created 07-15-2020 07:54 AM
I have a dozen or so large txt files (15 GB) that I have been reading into Spark as CSV. The issue I am running into is that the headers of these files are delimited by commas (,), whereas the data in the columns is delimited by pipes (|).
Does anyone know how I could replace the commas in the headers with pipes, or conversely replace the pipes in the rest of the data with commas?
The current layout of the files makes all of the data fall into the first column, which breaks the filter I am trying to run on these files.
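For reference, this is a minimal version of the read I am doing (the path is a placeholder):
df = spark.read.format("csv").option("header", "true").load('/tmp/file1.txt')
Because the header line is comma-delimited, Spark picks up the right column names, but since the data rows contain no commas, each row ends up whole in the first column.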
Any help is greatly appreciated.
Created 07-16-2020 09:15 AM
Hi, which file system are the files stored on? Linux/Unix or HDFS?
Created 07-16-2020 11:40 PM
Hello,
Here is a way to do it using PySpark; it may not be optimal.
I used this CSV to test my code:
column1,column2,column3
row1-1|row1-2|row1-3
row2-1|row2-2|row2-3
row3-1|row3-2|row3-3
Load the header only, giving an empty DataFrame with the structure:
header_dataframe = spark.read.format("csv").option("header", "true").load('/tmp/test.csv').limit(0)  # limit(0) keeps the column names but no rows
+-------+-------+-------+
|column1|column2|column3|
+-------+-------+-------+
+-------+-------+-------+
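You can verify that only the structure was kept (all columns are strings since nothing was inferred):
header_dataframe.printSchema()
root
 |-- column1: string (nullable = true)
 |-- column2: string (nullable = true)
 |-- column3: string (nullable = true)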
Load the data as an RDD, remove the header line, and convert it to a DataFrame:
data_rdd = sc.textFile('/tmp/test.csv')
header_row = data_rdd.first()  # the comma-delimited header line
data_rdd = data_rdd.filter(lambda row: row != header_row)  # drop the header
data_dataframe = data_rdd.map(lambda x: x.split("|")).toDF()  # split each data row on the pipe delimiter
+------+------+------+
| _1| _2| _3|
+------+------+------+
|row1-1|row1-2|row1-3|
|row2-1|row2-2|row2-3|
|row3-1|row3-2|row3-3|
+------+------+------+
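As a side note, instead of the union step that follows, you could also pass the column names directly when converting, since header_row already holds the comma-delimited header:
data_dataframe = data_rdd.map(lambda x: x.split("|")).toDF(header_row.split(","))
Both approaches give the same result here.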
Append the DataFrame containing the data to the DataFrame holding the structure (union matches columns by position, so the result keeps the column names from header_dataframe):
dataframe = header_dataframe.union(data_dataframe)
+-------+-------+-------+
|column1|column2|column3|
+-------+-------+-------+
| row1-1| row1-2| row1-3|
| row2-1| row2-2| row2-3|
| row3-1| row3-2| row3-3|
+-------+-------+-------+
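With the columns split correctly, a filter works as expected, e.g. (using the column names from my test file):
dataframe.filter(dataframe.column1 == 'row2-1').show()
+-------+-------+-------+
|column1|column2|column3|
+-------+-------+-------+
| row2-1| row2-2| row2-3|
+-------+-------+-------+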