Support Questions

GeeKay2015 · ‎06-25-2016

I have a requirement where in I need to ingest multiline CSV with semistructured records with some rows need to be converted to column and some rows needs to be both rows and column.

below is the input CSV file look like:

a,a1,a11,7/1/2008

b,b1,b11,8:53:00

c,c1,c11,25

d,d1,d11,1

e,e1,e11, ABCDEF

f,f1,f11,

sn1,msg,ref_sn_01,abc

sn2,msg,ref_sn_02,def

sn3,msg,ref_sn_02,ghi

sn4,msg,ref_sn_04,jkl

sn5,msg,ref_sn_05,mno

sn6,msg,ref_sn_06,pqr

sn7,msg,ref_sn_07,stu

sn8,msg,ref_sn_08,vwx

sn9,msg,ref_sn_09,yza

sn10,msg,ref_sn_010,

sn11,msg,ref_sn_011

cp1,ana,pw01,1.1

cp2,ana,pw02,1.1

cp3,ana,pw03,1.1

cp4,ana,pw04,1.1

cp5,ana,pw05,1.1

cp6,ana,pw06,1.1

cp7,ana,pw07,1.1

cp8,ana,pw08,1.1

cp9,ana,pw09,1.1

cp10,ana,pw10,1.1

cp11,ana,pw11,1.1

Below is the expected output:

please let me know whats the best to read it and load it in HDFS/Hive.

mqureshi · ‎06-26-2016

This is quite a custom requirement that you are converting some rows to column and other rows to both rows and column. You'll have to write a lot of your code but take advantage of pivot functionality in Spark. Check following link.

https://databricks.com/blog/2016/02/09/reshaping-data-with-pivot-in-apache-spark.html

sc.parallelize(rdd.collect.toSeq.transpose)

See the link here for more details.

View solution in original post

mqureshi · ‎06-26-2016

This is quite a custom requirement that you are converting some rows to column and other rows to both rows and column. You'll have to write a lot of your code but take advantage of pivot functionality in Spark. Check following link.

https://databricks.com/blog/2016/02/09/reshaping-data-with-pivot-in-apache-spark.html

sc.parallelize(rdd.collect.toSeq.transpose)

See the link here for more details.

GeeKay2015 · ‎06-27-2016

@mqureshi

Thanks for your response. yes it is quite a custom requirement. I thought its better to check with the community if anyone has implemented this kinda stuff.

I am trying to use either hadoop custom input format or python UDF's to get this done. There seems to be no straightforward way of doing this in spark. I can not use spark pivot also as it supports only column as of now right?.

Cloudera Community

Support Questions

Whats the best way to read multiline cvs and transpose it to columns

Hive transpose concatenated data in columns to row...

How to transpose a pyspark dataframe?

Write / Read Parquet File in Spark

Transpose columns to rows

How to read fsimage

How to change column Type in SparkSQL?

What is Tungsten for Apache Spark?

What is HDFS Ozone?

Spark to read the Hive table sub-directory data

Adding new columns to an already partitioned Hive ...