Support Questions

NoahB · ‎07-15-2020

I have a dozen or so large txt files (15GB) that I have been reading into Spark as csv, but the issue that I am running into is that the headers for these files are delimited by commas(,), whereas all of the data in said columns is delimited by pipes( | ).

Does anyone know how I could replace the commas in the headers with pipes, or conversely replace the pipes in the rest of the data with commas?

The current layout of the files make it so that all of the data in the files fall into the first column, which messes up the filter I am trying to run on said files.

Any help is greatly appreciated.

Gangabathula · ‎07-16-2020

hi, in which file system, the file is based on? linux/unix or hdfs?

Kezia · ‎07-16-2020

Hello,

here is a way to do it using pyspark, it may be not optimal.

I used this csv to test my code

column1,column2,column3
row1-1|row1-2|row1-3
row2-1|row2-2|row2-3
row3-1|row3-2|row3-3

Load the header only, giving the dataframe structure

header_dataframe = spark.read.format("csv").option("header", "true").load('/tmp/test.csv').limit(0)

+-------+-------+-------+
|column1|column2|column3|
+-------+-------+-------+
+-------+-------+-------+

Load the data as RDD, remove the first line and convert it to dataframe

data_rdd = sc.textFile('/tmp/test.csv')
header_row =data_rdd.first()
data_rdd = data_rdd.filter(lambda row:row != header_row)
data_dataframe = data_rdd.map(lambda x: x.split("|")).toDF()

+------+------+------+
|    _1|    _2|    _3|
+------+------+------+
|row1-1|row1-2|row1-3|
|row2-1|row2-2|row2-3|
|row3-1|row3-2|row3-3|
+------+------+------+

Append the dataframe containing the data to the dataframe holding the structure

dataframe = header_dataframe.union(data_dataframe)

+-------+-------+-------+
|column1|column2|column3|
+-------+-------+-------+
| row1-1| row1-2| row1-3|
| row2-1| row2-2| row2-3|
| row3-1| row3-2| row3-3|
+-------+-------+-------+

Cloudera Community

Support Questions

How can I replace characters in the header of a Spark Dataframe?

Spark RDDs vs DataFrames vs SparkSQL

Replacing characters within a nested json

Pyspark dataframe: How to replace

Replace Multiple Character with UpdateRecord in Ap...

nifi-replace file headers using replacetext proces...

Spark 3 legacy configurations list ( Spark 2 behav...

Accessing Hbase tables and querying on Dataframes ...

Spark Python Supportability Matrix

Replace Line Feed character in Nifi

TimestampType format for Spark DataFrames