
What's the best way to read a multiline CSV and transpose it to columns?

Expert Contributor

I have a requirement where I need to ingest a multiline CSV with semi-structured records, where some rows need to be converted to columns and some rows need to become both rows and columns.

Below is what the input CSV file looks like:

a,a1,a11,7/1/2008
b,b1,b11,8:53:00
c,c1,c11,25
d,d1,d11,1
e,e1,e11, ABCDEF
f,f1,f11,
sn1,msg,ref_sn_01,abc
sn2,msg,ref_sn_02,def
sn3,msg,ref_sn_02,ghi
sn4,msg,ref_sn_04,jkl
sn5,msg,ref_sn_05,mno
sn6,msg,ref_sn_06,pqr
sn7,msg,ref_sn_07,stu
sn8,msg,ref_sn_08,vwx
sn9,msg,ref_sn_09,yza
sn9,msg,ref_sn_09,yza
sn10,msg,ref_sn_010,
sn11,msg,ref_sn_011
cp1,ana,pw01,1.1
cp2,ana,pw02,1.1
cp3,ana,pw03,1.1
cp4,ana,pw04,1.1
cp5,ana,pw05,1.1
cp6,ana,pw06,1.1
cp7,ana,pw07,1.1
cp8,ana,pw08,1.1
cp9,ana,pw09,1.1
cp10,ana,pw10,1.1
cp11,ana,pw11,1.1

Below is the expected output:

[screenshot: expected output table]

Please let me know the best way to read it and load it into HDFS/Hive.


1 ACCEPTED SOLUTION

Super Guru

This is quite a custom requirement: you are converting some rows to columns and other rows to both rows and columns. You'll have to write a lot of your own code, but you can take advantage of the pivot functionality in Spark. Check the following link.

https://databricks.com/blog/2016/02/09/reshaping-data-with-pivot-in-apache-spark.html
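For illustration, here is a minimal sketch of the DataFrame pivot that post describes, applied to toy data shaped like the a..f block in your question (the SparkSession entry point and the column names are my own assumptions, not something from the blog):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.first

val spark = SparkSession.builder.appName("pivot-sketch").getOrCreate()
import spark.implicits._

// Toy key/value pairs shaped like the a..f records
val kv = Seq(("a", "7/1/2008"), ("b", "8:53:00"), ("c", "25")).toDF("key", "value")

// Pivot the distinct keys into columns, yielding a single wide row
// with columns a, b, c
val wide = kv.groupBy().pivot("key").agg(first("value"))
wide.show()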

Alternatively, to flip rows and columns wholesale:

// collect the rows to the driver, transpose, and re-parallelize
sc.parallelize(rdd.collect.toSeq.transpose)

See the link above for more details.
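To make the transpose one-liner concrete: a minimal sketch, assuming a hypothetical HDFS path and records padded to four fields, since Scala's transpose only works on equal-length rows:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("transpose-sketch"))

// Hypothetical input path; each line is one CSV record
val lines = sc.textFile("hdfs:///tmp/input.csv")

// Split on commas, keeping trailing empty fields, and pad short
// records (such as sn11,msg,ref_sn_011) to four fields
val rows = lines.map(line => line.split(",", -1).toSeq.padTo(4, ""))

// Small data only: collect() pulls everything to the driver before
// transposing and re-distributing
val columns = sc.parallelize(rows.collect().toSeq.transpose)
columns.collect().foreach(println)

Because of the collect(), this only makes sense for files small enough to fit on the driver.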


2 REPLIES 2

Expert Contributor

@mqureshi

Thanks for your response. Yes, it is quite a custom requirement. I thought it was better to check with the community in case anyone has implemented this kind of thing.

I am trying to use either a Hadoop custom input format or Python UDFs to get this done. There seems to be no straightforward way of doing this in Spark. I also can't use Spark's pivot, since it only supports pivoting on a single column as of now, right?
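As a starting point, this is roughly how I am thinking of splitting the record types, sketched in plain Scala (the grouping rule, that single-letter keys become columns, is just my working assumption):

import scala.io.Source

// Hypothetical local path; the real file would come from HDFS
val records = Source.fromFile("input.csv").getLines()
  .filter(_.nonEmpty)
  .map(_.split(",", -1).toList)
  .toList

// Single-character keys (a..f) are key/value pairs that become
// columns; sn*/cp* records stay as repeated rows
val (singles, repeated) = records.partition(_.head.length == 1)

// Flatten the key/value block into a header row and one data row
println(singles.map(_.head).mkString(","))
println(singles.map(_.last).mkString(","))

// Repeated records pass through unchanged
repeated.foreach(r => println(r.mkString(",")))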