
What's the best way to read a multiline CSV and transpose it to columns?

Expert Contributor

I have a requirement where I need to ingest a multiline CSV with semi-structured records: some rows need to be converted to columns, and some rows need to become both rows and columns.

Below is what the input CSV file looks like:

a,a1,a11,7/1/2008
b,b1,b11,8:53:00
c,c1,c11,25
d,d1,d11,1
e,e1,e11, ABCDEF
f,f1,f11,
sn1,msg,ref_sn_01,abc
sn2,msg,ref_sn_02,def
sn3,msg,ref_sn_02,ghi
sn4,msg,ref_sn_04,jkl
sn5,msg,ref_sn_05,mno
sn6,msg,ref_sn_06,pqr
sn7,msg,ref_sn_07,stu
sn8,msg,ref_sn_08,vwx
sn9,msg,ref_sn_09,yza
sn9,msg,ref_sn_09,yza
sn10,msg,ref_sn_010,
sn11,msg,ref_sn_011
cp1,ana,pw01,1.1
cp2,ana,pw02,1.1
cp3,ana,pw03,1.1
cp4,ana,pw04,1.1
cp5,ana,pw05,1.1
cp6,ana,pw06,1.1
cp7,ana,pw07,1.1
cp8,ana,pw08,1.1
cp9,ana,pw09,1.1
cp10,ana,pw10,1.1
cp11,ana,pw11,1.1

Below is the expected output:

[screenshot: expected output table]

Please let me know the best way to read it and load it into HDFS/Hive.


1 ACCEPTED SOLUTION

Super Guru

This is quite a custom requirement: you are converting some rows to columns and other rows to both rows and columns. You'll have to write a fair amount of your own code, but you can take advantage of the pivot functionality in Spark. Check the following link.

https://databricks.com/blog/2016/02/09/reshaping-data-with-pivot-in-apache-spark.html
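
As a rough spark-shell sketch of what that pivot could look like for the repeating "msg" rows (the column names key/kind/ref/value and the input path are placeholders I made up, not from the post or the file):

import org.apache.spark.sql.functions.{col, first}

// Read the raw CSV and give the four positional columns working names.
val df = spark.read.csv("hdfs:///data/input.csv")
  .toDF("key", "kind", "ref", "value")

// Pivot the "msg" section so each ref key becomes its own column.
val pivoted = df.filter(col("kind") === "msg")
  .groupBy(col("kind"))
  .pivot("ref")
  .agg(first(col("value")))

pivoted.show()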

// Collect to the driver, transpose rows and columns, then re-distribute
sc.parallelize(rdd.collect.toSeq.transpose)
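
A minimal, self-contained sketch of that one-liner in context (the path is hypothetical; this assumes every row has the same number of fields, so ragged rows would need padding first, and collect only works when the file fits in driver memory):

// Read the raw CSV and split each line into its fields.
val rdd = sc.textFile("hdfs:///data/input.csv")
  .filter(_.trim.nonEmpty)
  .map(_.split(",", -1).toSeq)

// Row i, column j becomes row j, column i.
val transposed = sc.parallelize(rdd.collect.toSeq.transpose)
transposed.take(4).foreach(row => println(row.mkString(",")))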



2 REPLIES


Expert Contributor

@mqureshi

Thanks for your response. Yes, it is quite a custom requirement. I thought it was better to check with the community in case anyone has implemented this kind of thing.

I am trying to use either a custom Hadoop input format or Python UDFs to get this done. There seems to be no straightforward way of doing this in Spark. I also cannot use Spark's pivot, as it only supports pivoting on a column as of now, right?
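
As a rough spark-shell sketch of the row-to-column step I'm after, without pivot (the path and the fixed key list are placeholders):

// Build the one-off a..f rows into a single wide record:
// collect them as a key -> value map, then emit values in a fixed order.
val keys = Seq("a", "b", "c", "d", "e", "f")
val kv = sc.textFile("hdfs:///data/input.csv")
  .map(_.split(",", -1))
  .filter(f => keys.contains(f(0)))
  .map(f => f(0) -> f(3))
  .collectAsMap()

println(keys.map(k => kv.getOrElse(k, "")).mkString(","))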