Line Separator in Spark

New Contributor

Hi All,

I'm new to Spark and I'm looking for a way to import a CSV file with a custom line separator into a DataFrame. I'm using CDH 2.2.0.

 

Data:

ID\x0fRegion\x0e1\x0fUS\x0e2\x0fRussia\x0e

 

Expected DataFrame:

ID    Region
1     US
2     Russia

 

I tried to use spark.read.csv with the lineSep argument, but it seems my Spark version doesn't support it.

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader
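
For readers on a newer release, the lineSep route described above might look like the sketch below. It assumes Spark 3.0 or later, where the CSV reader accepts a single-character lineSep; the helper name read_custom_csv and the CSV_OPTIONS dictionary are illustrative, not part of the original question.

```python
# Assumption: Spark 3.0 or later, where the CSV reader accepts `lineSep`
# (limited to a single character).
CSV_OPTIONS = {
    "sep": "\x0f",      # field delimiter (0x0F)
    "lineSep": "\x0e",  # record separator (0x0E)
    "header": "true",   # first record holds the column names
}


def read_custom_csv(spark, path):
    """Read a 0x0F / 0x0E delimited file using an existing SparkSession."""
    return spark.read.options(**CSV_OPTIONS).csv(path)


# Usage (needs a running SparkSession; the path is a placeholder):
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.getOrCreate()
#   read_custom_csv(spark, "/folder/file.gz").show()
```

Keeping the options in one dictionary makes it easy to confirm that the separators match the ones used in the data before the reader is invoked.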

 

Any suggestions? Thanks.

 

4 REPLIES

New Contributor

Try setting the quote and escape options on the reader:

.option("quote", "\"")\
.option("escape", "\"")\

Example:

contractsDF = spark.read\
    .option("header", "true")\
    .option("inferSchema", "true")\
    .option("quote", "\"")\
    .option("escape", "\"")\
    .csv("gs://data/Major_Contract_Awards.csv")

New Contributor

Thanks for your reply, but it seems your script doesn't work for this file.

The dataset's field delimiter is shift-in (\x0f) and its line separator is shift-out (\x0e).

In pandas, I can simply load the data into a DataFrame using this command:

df1 = pd.read_csv("/folder/file.gz", sep='\x0f', lineterminator='\x0e')

 

May I know how to do this in Spark?

New Contributor

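1) Convert the custom separators to a standard CSV layout:

# Replace the 0x0F field delimiter with "," and the 0x0E record separator with "\n".
with open("./prueba.csv", newline="") as src:
    data = src.read().replace("\x0f", ",").replace("\x0e", "\n")

# Write the converted text to a new file, leaving the original untouched.
with open("./prueba2.csv", "w", newline="") as dst:
    dst.write(data)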

 

2) Load the converted file into a DataFrame:

df = spark.read.format("csv")\
    .option("delimiter", ",")\
    .option("header", "true")\
    .load("./prueba2.csv")

df.show()

 

New Contributor

Thanks for your answer, but I'd prefer not to change the data file, since the data fields may contain commas or line breaks.
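
A small illustration of that concern, using a made-up record (the value "Washington, DC" is hypothetical and not from the real dataset): once the separators are swapped for "," and "\n", a comma inside a field becomes indistinguishable from a delimiter.

```python
# Hypothetical sample: a header record plus one data record whose
# second field contains a comma.
raw = "ID\x0fRegion\x0e1\x0fWashington, DC\x0e"

# Naive conversion: swap the custom separators for "," and "\n".
converted = raw.replace("\x0f", ",").replace("\x0e", "\n")

# Fields of the data record, split with the original delimiter.
original_fields = raw.split("\x0e")[1].split("\x0f")

# Fields of the same record after conversion, split on commas.
converted_fields = converted.split("\n")[1].split(",")

print(original_fields)   # two fields, as intended
print(converted_fields)  # three fields: the embedded comma split the value
```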

 

Is there a way to import the file directly?

 

Thanks & Merry Christmas 🙂
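
One route that is often used on Spark 2.x, where lineSep is unavailable, is to read the file through the Hadoop text input format with a custom record delimiter and split the fields manually. The sketch below is untested against CDH; the function names are illustrative, and it assumes every record has the same number of fields, that the first record is the header, and that no data record is identical to the header.

```python
FIELD_SEP = "\x0f"   # field delimiter (0x0F)
RECORD_SEP = "\x0e"  # record separator (0x0E)


def split_record(record):
    """Split one raw record into its fields."""
    return record.split(FIELD_SEP)


def read_custom_delimited(spark, path):
    """Build a DataFrame of string columns from a 0x0F / 0x0E delimited file."""
    records = spark.sparkContext.newAPIHadoopFile(
        path,
        "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
        "org.apache.hadoop.io.LongWritable",
        "org.apache.hadoop.io.Text",
        # Tell the Hadoop reader to cut records at 0x0E instead of "\n".
        conf={"textinputformat.record.delimiter": RECORD_SEP},
    ).map(lambda kv: kv[1])  # drop the byte-offset key, keep the record text

    header = records.first()
    # Assumption: no data record is identical to the header record.
    rows = records.filter(lambda r: r != header).map(split_record)
    return rows.toDF(split_record(header))


# Usage (needs a running SparkSession; the path is a placeholder):
#   df = read_custom_delimited(spark, "/folder/file.gz")
#   df.show()
```

Because the file is never rewritten, commas and "\n" characters inside fields stay untouched; every column comes back as a string and would need casting afterwards.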