Created 12-21-2020 08:03 PM
Hi All,
I'm new to Spark and I'm looking into how to import a CSV with a custom line separator into a DataFrame. I'm using CDH 2.2.0.
Data:
ID\x0fRegion\x0e1\x0fUS\x0e2\x0fRussia\x0e
Expected DataFrame:
ID | Region
1  | US
2  | Russia
I tried spark.read.csv with the lineSep argument, but it seems my Spark version doesn't support it.
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader
Any suggestion? Thanks
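From the documentation it looks like newer Spark releases (3.0+) accept a lineSep option alongside sep, roughly like the sketch below (the path here is just illustrative), but that option isn't available on my version:

df = spark.read \
    .option("sep", "\x0f") \
    .option("lineSep", "\x0e") \
    .option("header", "true") \
    .csv("/folder/file.gz")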
Created 12-22-2020 02:46 AM
.option("quote", "\"")\
.option("escape", "\"")\
-Example:
contractsDF = spark.read\
.option("header", "true")\
.option("inferSchema", "true")\
.option("quote", "\"")\
.option("escape", "\"")\
.csv("gs://data/Major_Contract_Awards.csv")
Created 12-22-2020 07:57 PM
Thanks for your reply, but it seems your script doesn't work for this case.
The dataset's field delimiter is shift-in (\x0f) and the line separator is shift-out (\x0e).
In pandas, I can simply load the data into a DataFrame with this command:
df1 = pd.read_csv("/folder/file.gz", sep='\x0f', lineterminator='\x0e')
May I know how to do this in Spark?
Created on 12-23-2020 02:38 AM - edited 12-23-2020 02:38 AM
1)
# Pre-process the file: replace the custom separators with standard CSV ones
with open("./prueba.csv") as file:
    data = file.read().replace("\x0f", ",").replace("\x0e", "\n")

with open("./prueba2.csv", "w") as f:
    f.write(data)
2)
df = spark.read.format("csv")\
.option("delimiter",",")\
.option("header","true")\
.load("./prueba2.csv")
df.show()
Created 12-23-2020 06:27 PM
Thanks for your answer, but I'd prefer not to change the data file, as the fields may contain commas or line breaks.
Is there a way to import the file directly?
Thanks & Merry Christmas 🙂