Support Questions

Alans · 12-21-2020

Hi All,

I'm new to Spark and I'm looking for a way to import a CSV with a custom line separator into a DataFrame. I'm using CDH 2.2.0.

Data:

ID\x0fRegion\x0e1\x0fUS\x0e2\x0fRussia\x0e

Expected DataFrame:

ID	Region
1	US
2	Russia

I tried to use spark.read.csv with the lineSep argument, but it seems my Spark version doesn't support it.

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader

Any suggestions? Thanks.
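For readers on a newer release: as far as I know, the lineSep option for CSV only arrived in Spark 3.0 and accepts a single character, which would explain why a 2.x reader ignores it. A minimal sketch under that assumption follows. The helper name read_custom_csv and the READ_OPTIONS dict are hypothetical, and `spark` is assumed to be an existing SparkSession.

```python
# Separators taken from the question. "sep", "lineSep" and "header" are
# DataFrameReader.csv options; lineSep for CSV is assumed to need Spark 3.0+.
READ_OPTIONS = {
    "sep": "\x0f",      # field delimiter
    "lineSep": "\x0e",  # record (line) separator, one character only
    "header": "true",   # first record holds the column names
}


def read_custom_csv(spark, path):
    """Hypothetical helper: read a CSV that uses the separators above.

    `spark` is an existing SparkSession; `path` is the input file.
    """
    return spark.read.options(**READ_OPTIONS).csv(path)


# Usage (needs a running Spark 3.0+ session):
#   df = read_custom_csv(spark, "/folder/file.gz")
#   df.show()
```

On Spark 2.x this call would not split records on \x0e, so it is not a fix for the version in the question.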

Gr4vi7y · 12-22-2020

.option("quote", "\"")\
.option("escape", "\"")\

Example:

contractsDF = spark.read\
    .option("header", "true")\
    .option("inferSchema", "true")\
    .option("quote", "\"")\
    .option("escape", "\"")\
    .csv("gs://data/Major_Contract_Awards.csv")

Alans · 12-22-2020

Thanks for your reply, but it seems your script doesn't work for this file.

The dataset's field delimiter is shift-in (\x0f) and its line separator is shift-out (\x0e).

In pandas, I can simply load the data into a DataFrame using this command:

df1 = pd.read_csv("/folder/file.gz", sep='\x0f', lineterminator='\x0e')

May I know how to do this in Spark?

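Gr4vi7y · 12-23-2020

1)

# Rewrite the file with "," as the delimiter and "\n" as the line separator.
# newline="" stops Python from translating line endings on read and write.
with open("./prueba.csv", newline="") as src:
    data = src.read().replace("\x0f", ",").replace("\x0e", "\n")

with open("./prueba2.csv", "w", newline="") as dst:
    dst.write(data)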

2)

df = spark.read.format("csv")\
    .option("delimiter", ",")\
    .option("header", "true")\
    .load("./prueba2.csv")

df.show()
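If rewriting the file is acceptable, a quoting-aware variant of step 1 may be safer than a plain string replace, because fields that themselves contain commas or line breaks get quoted instead of corrupting the layout. This is only a sketch: the function name convert is hypothetical, and the sample string is the data from the question.

```python
import csv
import io

FIELD_SEP = "\x0f"   # field delimiter used in the source file
RECORD_SEP = "\x0e"  # record (line) separator used in the source file


def convert(text):
    """Return standard CSV text for data that uses the custom separators."""
    out = io.StringIO()
    # csv.writer quotes any field containing a comma, quote or newline.
    writer = csv.writer(out, lineterminator="\n")
    for record in text.split(RECORD_SEP):
        if record:  # skip the empty piece after the final separator
            writer.writerow(record.split(FIELD_SEP))
    return out.getvalue()


sample = "ID\x0fRegion\x0e1\x0fUS\x0e2\x0fRussia\x0e"
print(convert(sample))
```

Reading the converted file in Spark would then need quote-aware settings such as the quote and escape options shown earlier in the thread, plus multiLine if fields contain line breaks (assuming your Spark version supports that option).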

Alans · 12-23-2020

Thanks for your answer, but I'd prefer not to change the data file, as the data fields may contain commas or line breaks.

Is there a way to import the file directly?

Thanks & Merry Christmas 🙂
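One approach that may work on Spark 2.x without touching the file is to let Hadoop's TextInputFormat split records on \x0e through the textinputformat.record.delimiter setting, then split each record on \x0f. The sketch below is untested against CDH and rests on assumptions: `spark` is an existing SparkSession, the first record is the header, all columns are read as strings, and the names split_record and load_custom_csv are hypothetical.

```python
FIELD_SEP = "\x0f"   # field delimiter from the question
RECORD_SEP = "\x0e"  # record (line) separator from the question


def split_record(record):
    """Split one raw record into its fields on the field delimiter."""
    return record.split(FIELD_SEP)


def load_custom_csv(spark, path):
    """Hypothetical helper: read `path` into a DataFrame of string columns."""
    sc = spark.sparkContext
    # Ask TextInputFormat to cut records on RECORD_SEP instead of "\n".
    records = sc.newAPIHadoopFile(
        path,
        "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
        "org.apache.hadoop.io.LongWritable",
        "org.apache.hadoop.io.Text",
        conf={"textinputformat.record.delimiter": RECORD_SEP},
    ).map(lambda kv: kv[1])  # keep the text, drop the byte-offset key

    rows = records.filter(lambda r: r != "").map(split_record)
    header = rows.first()  # assumes the first record holds column names
    # Drop the header; assumes no data row is identical to the header row.
    data = rows.filter(lambda fields: fields != header)
    return spark.createDataFrame(data, header)


# Usage (needs a running SparkSession):
#   df = load_custom_csv(spark, "/folder/file.gz")
#   df.show()
```

Because fields are split only on \x0f, commas and ordinary line breaks inside a field are left alone. Records with a different number of fields than the header would make createDataFrame fail, so real data may need a validation step first.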
