
Pyspark Parsing Text File

New Contributor

Hello,

 

I have a data file like this:

a

b

c

d:;e

g:;h

 

/pqr:

column1;column2;column3

column1_unit;column2_unit;column3_unit

 

data1;data2;data3

data4;data5;data6

 

The columns and their associated data elements could vary from file to file. Hence, I want to store each record as key-value pairs, something like this:

d:e;g:h     column1:column1_unit:data1;column2:column2_unit:data2;column3:column3_unit:data3

d:e;g:h     column1:column1_unit:data4;column2:column2_unit:data5;column3:column3_unit:data6

 

If you notice, I've ignored the first three lines before reading the text file. I've also ignored "/pqr" before reading the column names, units, and actual data.

Any directions or thoughts on how I could achieve this using PySpark?

 

My idea is that if I can convert the incoming data files to this format using PySpark, then I can put a Hive layer on top of them and read each record as a string.

I can't define the columns statically in Hive because the number of columns and their order could vary with each file.
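One possible approach, sketched below under a few assumptions: since the header lines (the key/value pairs and the column/unit rows) must stay together with their data rows, read each file as a whole with `sc.wholeTextFiles` rather than line by line, then parse it with a plain Python function and `flatMap` the resulting records. The function name `parse_file`, the `:;` delimiter for the metadata pairs, and the fixed three ignored lines at the top are all taken from the sample above; adjust them if the real files differ.

```python
def parse_file(text):
    """Parse one file's full text into tab-separated key-value records.

    Assumes (per the sample): 3 ignorable leading lines, then "k:;v"
    metadata pairs until a "/pqr" marker, then a column-name row, a
    unit row, and finally the data rows, all ";"-delimited.
    """
    # Drop blank lines so the layout gaps in the file don't matter.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]

    # Skip the first three lines (a, b, c), then collect the "k:;v"
    # pairs (d:;e, g:;h) until the "/pqr" section marker.
    i = 3
    pairs = []
    while i < len(lines) and not lines[i].startswith("/pqr"):
        key, value = lines[i].split(":;", 1)
        pairs.append(f"{key}:{value}")
        i += 1
    header = ";".join(pairs)           # -> "d:e;g:h"

    i += 1                             # skip the "/pqr:" marker itself
    columns = lines[i].split(";")      # column1;column2;column3
    units = lines[i + 1].split(";")    # column1_unit;...

    # One output record per data row: header, tab, then
    # column:unit:value triples joined by ";".
    records = []
    for row in lines[i + 2:]:
        values = row.split(";")
        body = ";".join(f"{c}:{u}:{v}"
                        for c, u, v in zip(columns, units, values))
        records.append(f"{header}\t{body}")
    return records

# In PySpark (hypothetical paths), each file is parsed independently:
#   rdd = sc.wholeTextFiles("/input/dir") \
#           .flatMap(lambda kv: parse_file(kv[1]))
#   rdd.saveAsTextFile("/output/dir")
```

Because the column names and units travel inside each record, a Hive external table over the output can be a single string column (or two, split on the tab), with no per-file schema needed.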
