What's the best way to read a multiline CSV and transpose it to columns?

Rising Star

I have a requirement to ingest a multiline CSV file with semi-structured records. Some rows need to be converted to columns, and some rows need to appear as both rows and columns.

Below is what the input CSV file looks like:

a,a1,a11,7/1/2008

b,b1,b11,8:53:00

c,c1,c11,25

d,d1,d11,1

e,e1,e11, ABCDEF

f,f1,f11,

sn1,msg,ref_sn_01,abc

sn2,msg,ref_sn_02,def

sn3,msg,ref_sn_02,ghi

sn4,msg,ref_sn_04,jkl

sn5,msg,ref_sn_05,mno

sn6,msg,ref_sn_06,pqr

sn7,msg,ref_sn_07,stu

sn8,msg,ref_sn_08,vwx

sn9,msg,ref_sn_09,yza

sn9,msg,ref_sn_09,yza

sn10,msg,ref_sn_010,

sn11,msg,ref_sn_011

cp1,ana,pw01,1.1

cp2,ana,pw02,1.1

cp3,ana,pw03,1.1

cp4,ana,pw04,1.1

cp5,ana,pw05,1.1

cp6,ana,pw06,1.1

cp7,ana,pw07,1.1

cp8,ana,pw08,1.1

cp9,ana,pw09,1.1

cp10,ana,pw10,1.1

cp11,ana,pw11,1.1

Below is the expected output:

[Screenshot of the expected output: 5241-screen-shot-2016-06-25-at-43154-pm.png]

Please let me know the best way to read it and load it into HDFS/Hive.


screen-shot-2016-06-25-at-44836-pm.png
Accepted Solutions

Re: What's the best way to read a multiline CSV and transpose it to columns?

Super Guru

This is quite a custom requirement: you are converting some rows to columns and other rows to both rows and columns. You'll have to write a lot of your own code, but you can take advantage of the pivot functionality in Spark. Check the following link.

https://databricks.com/blog/2016/02/09/reshaping-data-with-pivot-in-apache-spark.html

sc.parallelize(rdd.collect.toSeq.transpose)

See the link here for more details.
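
To make the transpose step concrete, here is a minimal plain-Python sketch (no Spark involved) of what collecting the rows and transposing them does, using the first six lines of the sample file. It is only an illustration: `parsed` and `transpose` are hypothetical names, and padding short rows with `None` is an assumption about how ragged lines such as `sn11,msg,ref_sn_011` should be handled. Like the `rdd.collect` one-liner above, it holds everything in local memory, so it only suits small files.

```python
from itertools import zip_longest

# First six lines of the sample file from the question.
rows = [
    "a,a1,a11,7/1/2008",
    "b,b1,b11,8:53:00",
    "c,c1,c11,25",
    "d,d1,d11,1",
    "e,e1,e11, ABCDEF",
    "f,f1,f11,",
]
# Split each line on commas and trim stray spaces around the fields.
parsed = [[field.strip() for field in line.split(",")] for line in rows]


def transpose(table, fill=None):
    """Turn rows into columns; short rows are padded with `fill` (an assumption)."""
    return [list(col) for col in zip_longest(*table, fillvalue=fill)]


columns = transpose(parsed)
for col in columns:
    print(col)
```

Each printed list is one former column, so the fourth list holds the values (`7/1/2008`, `8:53:00`, and so on) that would become fields of a single output record.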

Re: What's the best way to read a multiline CSV and transpose it to columns?

Rising Star

@mqureshi

Thanks for your response. Yes, it is quite a custom requirement. I thought it would be better to check with the community in case anyone has implemented this kind of thing.

I am trying to use either a custom Hadoop InputFormat or Python UDFs to get this done. There seems to be no straightforward way of doing this in Spark. I also cannot use Spark pivot, since it only pivots on a column as of now, right?
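
The per-file logic that a Python UDF or custom reader would need can be sketched in plain Python first. This is a sketch under stated assumptions, not a confirmed solution: the expected-output screenshot is not reproduced in the text, so the target layout (single-letter keys such as `a` to `f` become columns, and each `sn*`/`cp*` line stays a row with those columns attached) is a guess based on the description. The function name `reshape` and the column names `id`, `type`, `ref`, and `value` are hypothetical.

```python
import csv
import io


def reshape(text):
    """Split one file into header fields and detail rows, then join them.

    Assumption: lines whose first field is a single letter (a..f) are
    file-level attributes that become columns; every other line (sn*, cp*)
    stays a row and gets those attributes appended.
    """
    header = {}
    details = []
    for fields in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not fields:
            continue  # csv.reader yields [] for blank lines
        key = fields[0]
        if len(key) == 1:
            # Use the fourth field as the attribute value (may be empty).
            header[key] = fields[-1] if len(fields) > 3 else ""
        else:
            # Pad short lines such as "sn11,msg,ref_sn_011" to four fields.
            details.append((fields + [""] * 4)[:4])
    header_keys = sorted(header)
    # Hypothetical column names for the four detail fields.
    columns = ["id", "type", "ref", "value"] + header_keys
    rows = [d + [header[k] for k in header_keys] for d in details]
    return columns, rows


sample = """a,a1,a11,7/1/2008
b,b1,b11,8:53:00
sn1,msg,ref_sn_01,abc
sn11,msg,ref_sn_011
cp1,ana,pw01,1.1
"""
columns, rows = reshape(sample)
print(columns)
for row in rows:
    print(row)
```

Because the header lines must be seen together with the detail lines, a function like this would have to run once per whole file rather than once per line. In PySpark, reading each file as a single record (for example with `SparkContext.wholeTextFiles`) is one possible way to arrange that, though whether it fits depends on file sizes.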