Line Separator in Spark

Frequent Visitor

Hi All,

I'm new to Spark and I'm looking for a way to import a CSV file with a custom line separator into a DataFrame. I'm using CDH 2.2.0.

 

Data:

ID\x0fRegion\x0e1\x0fUS\x0e2\x0fRussia\x0e

 

Expected DataFrame:

ID    Region
1     US
2     Russia

 

I tried to use spark.read.csv with the lineSep argument, but it seems my Spark version doesn't support it.

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader

 

Any suggestions? Thanks.

 

4 REPLIES

Visitor

.option("quote", "\"")\
.option("escape", "\"")\

 

Example:

 

contractsDF = spark.read\
    .option("header", "true")\
    .option("inferSchema", "true")\
    .option("quote", "\"")\
    .option("escape", "\"")\
    .csv("gs://data/Major_Contract_Awards.csv")

Frequent Visitor

Thanks for your reply, but your script doesn't seem to work for this data.

The dataset's delimiter is shift-in (\x0f) and its line separator is shift-out (\x0e).

In pandas, I can simply load the data into a DataFrame using this command:

df1 = pd.read_csv("/folder/file.gz", sep='\x0f', lineterminator='\x0e')

May I know how to do this in Spark?

Visitor

1)

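# Replace the custom separators (\x0f, \x0e) with a comma and a newline
with open("./prueba.csv") as file:
    data = file.read().replace("\x0f", ",").replace("\x0e", "\n")

# Write the converted text to a new file
with open("./prueba2.csv", "w") as f:
    f.write(data)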

 

2)

df = spark.read.format("csv")\
    .option("delimiter", ",")\
    .option("header", "true")\
    .load("./prueba2.csv")

df.show()
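A plain string replace is only safe when no field contains a comma or a line break. If that cannot be guaranteed, the conversion step could quote every field instead. The following is a minimal sketch under that assumption; the function name convert_to_standard_csv and its parameters are made up for illustration, and like the snippet above it reads the whole file into memory.

```python
import csv


def convert_to_standard_csv(src_path, dst_path, field_sep="\x0f", record_sep="\x0e"):
    """Hypothetical helper: rewrite a custom-delimited file as a fully quoted CSV."""
    # newline="" stops Python from translating line breaks inside fields
    with open(src_path, newline="") as src:
        records = src.read().split(record_sep)

    with open(dst_path, "w", newline="") as dst:
        # QUOTE_ALL wraps every field in double quotes, so embedded commas
        # and line breaks stay inside their field
        writer = csv.writer(dst, quoting=csv.QUOTE_ALL)
        for record in records:
            if record:  # skip the empty piece after the final separator
                writer.writerow(record.split(field_sep))
```

The output could then be read with the quote and escape options from the first reply. Fields that contain line breaks would additionally need the reader's multiLine option, assuming the Spark version in use supports it.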

 

Frequent Visitor

Thanks for your answer, but I'd prefer not to change the data file, as the data fields may contain commas or line breaks.

Is there a way to import the file directly?

 

Thanks & Merry Christmas 🙂
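
One possible way to read the file directly, without rewriting it, is to let Hadoop's TextInputFormat split the records on \x0e through its textinputformat.record.delimiter setting, and then split each record on \x0f in Spark. This is only a sketch and has not been tried on the CDH version mentioned above. The helper names split_fields and load_custom_delimited are made up for illustration, the first record is assumed to be the header, and every column comes back as a string.

```python
FIELD_SEP = "\x0f"   # field delimiter described in the thread
RECORD_SEP = "\x0e"  # record (line) separator described in the thread


def split_fields(record, field_sep=FIELD_SEP):
    """Split one record into a tuple of field values."""
    return tuple(record.split(field_sep))


def load_custom_delimited(spark, path):
    """Hypothetical helper: load a custom-delimited file into a DataFrame.

    `spark` is an existing SparkSession; `path` is the input file.
    """
    records = spark.sparkContext.newAPIHadoopFile(
        path,
        "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
        "org.apache.hadoop.io.LongWritable",
        "org.apache.hadoop.io.Text",
        # Split records on \x0e instead of the default newline
        conf={"textinputformat.record.delimiter": RECORD_SEP},
    ).values()  # keep the record text, drop the byte-offset keys

    rows = records.filter(lambda r: r != "").map(split_fields)
    header = rows.first()  # assumption: the first record holds column names
    # Note: this also drops any data row identical to the header
    data = rows.filter(lambda r: r != header)
    return data.toDF(list(header))
```

With the sample data from the question, load_custom_delimited(spark, "/folder/file.gz") would be expected to give the columns ID and Region. Columns can be cast to other types afterwards if needed.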