Line Separator in Spark

New Contributor

Hi All,

I'm new to Spark and I'm looking for a way to import a CSV file with a custom line separator into a DataFrame. I'm using CDH 2.2.0.

 

Data:

ID\x0fRegion\x0e1\x0fUS\x0e2\x0fRussia\x0e

 

Expected DataFrame:

ID    Region
1     US
2     Russia

 

I tried to use spark.read.csv with the lineSep argument, but it seems my Spark version doesn't support it.

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader

 

Any suggestions? Thanks.

 

4 REPLIES

New Contributor

.option("quote", "\"")\
.option("escape", "\"")\

 

Example:

 

contractsDF = spark.read\
.option("header", "true")\
.option("inferSchema", "true")\
.option("quote", "\"")\
.option("escape", "\"")\
.csv("gs://data/Major_Contract_Awards.csv")

New Contributor

Thanks for your reply, but it seems your script doesn't work.

The dataset delimiter is shift-in (\x0f) and the line separator is shift-out (\x0e).

In pandas, I can simply load the data into a DataFrame using this command:

df1 = pd.read_csv("/folder/file.gz", sep = '\x0f', lineterminator = '\x0e' )
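
For illustration, here is a minimal plain-Python sketch of the splitting that the pandas call above performs, applied to the sample string from the question. The helper name `split_records` is hypothetical, and the sketch assumes both separators are single control characters that never appear inside a field.

```python
FIELD_SEP = "\x0f"   # field delimiter (ASCII 0x0F)
RECORD_SEP = "\x0e"  # line separator (ASCII 0x0E)

def split_records(raw):
    """Split raw text into a header list and a list of row lists."""
    # Drop the empty piece left after the trailing record separator.
    records = [r for r in raw.split(RECORD_SEP) if r]
    rows = [r.split(FIELD_SEP) for r in records]
    return rows[0], rows[1:]

# Sample data from the question, written with real control characters.
sample = "ID\x0fRegion\x0e1\x0fUS\x0e2\x0fRussia\x0e"
header, rows = split_records(sample)
print(header)  # ['ID', 'Region']
print(rows)    # [['1', 'US'], ['2', 'Russia']]
```

All values come back as strings here; pandas additionally infers column types.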

 

May I know how to do this in spark?

New Contributor

1)

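# Read the whole file and convert the custom separators:
# \x0f (field delimiter) becomes a comma, \x0e (line separator) becomes a newline
with open("./prueba.csv") as file:
    data = file.read().replace("\x0f", ",").replace("\x0e", "\n")

# Write the converted data to a new file
with open("./prueba2.csv", "w") as f:
    f.write(data)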

 

2)

df = spark.read.format("csv")\
.option("delimiter",",")\
.option("header","true")\
.load("./prueba2.csv")

df.show()
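
Plain string replacement as in step 1 would produce a broken CSV if any field already contained a comma, a double quote, or a line break. Below is a hedged sketch of a safer conversion that uses Python's standard `csv.writer`, which quotes such fields. The function name `convert_to_csv` is made up for illustration, the sample adds a made-up comma to one field, and the sketch assumes the data fits in memory.

```python
import csv
import io

def convert_to_csv(raw, out):
    """Write the control-character-separated text to `out` as quoted CSV."""
    writer = csv.writer(out, lineterminator="\n")
    for record in raw.split("\x0e"):
        if record:  # skip the empty piece after the final separator
            writer.writerow(record.split("\x0f"))

# Hypothetical sample: one field contains a comma.
sample = "ID\x0fRegion\x0e1\x0fUS, East\x0e2\x0fRussia\x0e"
buf = io.StringIO()
convert_to_csv(sample, buf)
converted = buf.getvalue()
print(converted)
```

When reading the converted file in Spark, setting the `quote` and `escape` options to `"` matches the doubled-quote escaping that `csv.writer` produces. Fields with embedded line breaks would additionally need the `multiLine` option, if your Spark version supports it.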

 

New Contributor

Thanks for your answer, but I would prefer not to change the data file, as the data fields may contain commas or line breaks.

Is there a way to import the file directly?

 

Thanks & Merry Christmas 🙂
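
One approach sometimes used on Spark versions whose CSV reader has no `lineSep` option is to read the file through the Hadoop `TextInputFormat` with a custom `textinputformat.record.delimiter`, and then split each record on the field delimiter. The following is an untested sketch under those assumptions: the function names `parse_record` and `read_custom_delimited` are hypothetical, every column comes back as a string, and neither control character may occur inside a field.

```python
FIELD_SEP = "\x0f"   # field delimiter
RECORD_SEP = "\x0e"  # line separator

def parse_record(record):
    """Split one record into its fields."""
    return record.split(FIELD_SEP)

def read_custom_delimited(spark, path):
    """Read `path` into a DataFrame without rewriting the file.

    Assumes the first record is a header and `spark` is a SparkSession.
    """
    rdd = spark.sparkContext.newAPIHadoopFile(
        path,
        "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
        "org.apache.hadoop.io.LongWritable",
        "org.apache.hadoop.io.Text",
        conf={"textinputformat.record.delimiter": RECORD_SEP},
    )
    # Each element is an (offset, text) pair; keep the non-empty text parts.
    rows = rdd.map(lambda kv: kv[1]).filter(lambda r: r != "").map(parse_record)
    header = rows.first()
    # Drop the header record and use it for the column names.
    # Note: a data row identical to the header would be dropped as well.
    data = rows.filter(lambda row: row != header)
    return spark.createDataFrame(data, schema=header)

# Usage, assuming an existing SparkSession named `spark`:
# df = read_custom_delimited(spark, "/folder/file.gz")
# df.show()
```

Because records are split only on `\x0e`, commas and ordinary line breaks inside fields are left untouched. Columns can be cast to numeric types afterwards if needed.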