Line Separator in Spark

New Contributor

Hi All,

I'm new to Spark and I'm looking for a way to import a CSV file with a custom line separator into a DataFrame. I'm using CDH 2.2.0.

 

Data:

ID\x0fRegion\x0e1\x0fUS\x0e2\x0fRussia\x0e

 

Expected DataFrame:

ID  Region
1   US
2   Russia

 

I tried to use spark.read.csv with the lineSep argument, but it seems my Spark version doesn't support it.

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader

 

Any suggestions? Thanks.

 

4 REPLIES

New Contributor

.option("quote", "\"")\
.option("escape", "\"")\

 

Example:

 

contractsDF = spark.read\
.option("header", "true")\
.option("inferSchema", "true")\
.option("quote", "\"")\
.option("escape", "\"")\
.csv("gs://data/Major_Contract_Awards.csv")

New Contributor

Thanks for your reply, but it seems your script doesn't work for this file.

The dataset's field delimiter is shift-in (\x0f) and its line separator is shift-out (\x0e).

In pandas, I can simply load the data into a DataFrame using this command:

df1 = pd.read_csv("/folder/file.gz", sep = '\x0f', lineterminator = '\x0e' )

 

May I know how to do this in spark?

New Contributor

1)

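# Replace the custom separators (control characters) with standard CSV ones.
with open("./prueba.csv") as src:
    data = src.read().replace("\x0f", ",").replace("\x0e", "\n")

# Write the converted text to a new file for Spark to read.
with open("./prueba2.csv", "w") as dst:
    dst.write(data)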

 

2)

df = spark.read.format("csv")\
.option("delimiter",",")\
.option("header","true")\
.load("./prueba2.csv")

df.show()
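
If rewriting the file is acceptable, a variant of step 1 could use Python's csv module so that fields containing commas, quotes, or line breaks get quoted instead of breaking the layout. This is only a sketch: the function name convert_to_csv is hypothetical, and it assumes the file fits in memory and that \x0f and \x0e never occur inside a field.

```python
import csv


def convert_to_csv(src_path, dst_path, field_sep="\x0f", record_sep="\x0e"):
    """Rewrite a file with custom separators as a standard, quoted CSV file.

    Hypothetical helper: assumes the whole file fits in memory and that
    neither separator character appears inside a field.
    """
    # newline="" keeps any line breaks inside fields exactly as they are.
    with open(src_path, newline="") as src:
        records = src.read().split(record_sep)

    with open(dst_path, "w", newline="") as dst:
        # csv.writer quotes any field containing a comma, a quote, or a line break.
        writer = csv.writer(dst)
        for record in records:
            if record.strip() == "":
                continue  # skip the empty piece after the final separator
            writer.writerow(record.split(field_sep))
```

The converted file could then be loaded as in step 2. If fields really do contain line breaks or quotes, the reader would probably also need .option("multiLine", "true") and .option("escape", "\""), provided the Spark version in use supports them.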

 

New Contributor

Thanks for your answer, but I'd prefer not to change the data file, as the data fields may contain commas or line breaks.

 

Is there a way to import the file directly?

 

Thanks & Merry Christmas 🙂
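
One possible way to read the file directly on Spark 2.x, without rewriting it, is to let Hadoop's TextInputFormat split records on \x0e through the textinputformat.record.delimiter setting, and then split each record on \x0f. The sketch below is untested against CDH: the helper names parse_record and read_delimited are hypothetical, every column comes back as a string, the first record is assumed to be the header, and every record is assumed to have the same number of fields. On Spark 3.0 and later, the CSV reader's own lineSep option should make this workaround unnecessary.

```python
FIELD_SEP = "\x0f"   # field delimiter used in the data
RECORD_SEP = "\x0e"  # record (line) separator used in the data


def parse_record(record, field_sep=FIELD_SEP):
    """Split one raw record into a tuple of string fields (hypothetical helper)."""
    return tuple(record.split(field_sep))


def read_delimited(spark, path, field_sep=FIELD_SEP, record_sep=RECORD_SEP):
    """Build a DataFrame of string columns from a file with custom separators.

    Assumes `spark` is an existing SparkSession, the first record is the
    header, and no field contains either separator character.
    """
    records = (
        spark.sparkContext.newAPIHadoopFile(
            path,
            "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
            "org.apache.hadoop.io.LongWritable",
            "org.apache.hadoop.io.Text",
            # Tell Hadoop to end a record at record_sep instead of "\n".
            conf={"textinputformat.record.delimiter": record_sep},
        )
        .map(lambda kv: kv[1])                    # keep the text, drop the byte offset
        .filter(lambda rec: rec.strip() != "")    # skip empty trailing records
    )

    header = records.first()
    columns = list(parse_record(header, field_sep))

    # Note: this also drops any data record identical to the header.
    rows = (
        records.filter(lambda rec: rec != header)
        .map(lambda rec: parse_record(rec, field_sep))
    )
    return spark.createDataFrame(rows, columns)
```

With an active session this would be called as df = read_delimited(spark, "/folder/file.gz"), after which df.show() should list the ID and Region columns. Because line breaks and commas are never treated specially, fields that contain them stay intact.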
