Created 12-21-2020 08:03 PM
Hi All,
I'm new to Spark and I'm looking into how to import a CSV with a custom line separator into a DataFrame. I'm using CDH 2.2.0.
Data:
ID\x0fRegion\x0e1\x0fUS\x0e2\x0fRussia\x0e
Expected DataFrame:
ID | Region
1  | US
2  | Russia
I tried spark.read.csv with the lineSep argument, but it seems my Spark version doesn't support it.
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader
Any suggestion? Thanks
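From the documentation it looks like newer Spark releases (3.0+) accept a lineSep option alongside sep, roughly like the sketch below (the path here is just illustrative), but that option isn't available on my version:

df = spark.read \
    .option("sep", "\x0f") \
    .option("lineSep", "\x0e") \
    .option("header", "true") \
    .csv("/folder/file.gz")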
Created 12-22-2020 02:46 AM
.option("quote", "\"")\
.option("escape", "\"")\
-Example:
contractsDF = spark.read\
.option("header", "true")\
.option("inferSchema", "true")\
.option("quote", "\"")\
.option("escape", "\"")\
.csv("gs://data/Major_Contract_Awards.csv")
Created 12-22-2020 07:57 PM
Thanks for your reply, but it seems your script doesn't work for this case.
The dataset's field delimiter is shift-in (\x0f) and the line separator is shift-out (\x0e).
In pandas, I can simply load the data into a DataFrame with this command:
df1 = pd.read_csv("/folder/file.gz", sep='\x0f', lineterminator='\x0e')
May I know how to do this in Spark?
Created on 12-23-2020 02:38 AM - edited 12-23-2020 02:38 AM
1)
# Pre-process the file: replace the custom separators with standard CSV ones
with open("./prueba.csv") as file:
    data = file.read().replace("\x0f", ",").replace("\x0e", "\n")

with open("./prueba2.csv", "w") as f:
    f.write(data)
2)
df = spark.read.format("csv")\
.option("delimiter",",")\
.option("header","true")\
.load("./prueba2.csv")
df.show()
Created 12-23-2020 06:27 PM
Thanks for your answer, but I'd prefer not to change the data file, as the fields may contain commas or line breaks.
Is there a way to import the file directly?
Thanks & Merry Christmas 🙂