Posts: 11
Registered: ‎05-01-2014

Load tab delimited key value data

Is there a good way to load tab-delimited key-value data where each row has different keys depending on the value of one of the keys?


Data example:

type:A     field1:valueA1     field2:valueA2     field3:valueA3

type:A     field1:valueA1     field2:valueA2     field3:valueA3

type:B     field1:valueB1     field4:valueB4


I would like 2 different tables, one that contains type A records and one with type B records.

Posts: 14
Registered: ‎12-19-2013

Re: Load tab delimited key value data

Hey Matt,


Here's what I would do in your situation:

  1. Put the TSV file onto HDFS
  2. Use Spark to read in the TSV file with the textFile function
  3. Write some Scala to perform the conditional logic and field/value extraction
  4. Write out the results into two separate Parquet files, each with an appropriate Avro schema
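For illustration, steps 2 to 4 might look roughly like this in the Spark shell. This is only a sketch under some assumptions: it targets Spark 2.x or later (where `spark` is predefined in the shell), the two paths are made-up placeholders, and it lets Spark derive the Parquet schema from the DataFrame columns instead of supplying an explicit Avro schema.

```scala
// Sketch for spark-shell; assumes Spark 2.x or later, where `spark` is predefined.
import spark.implicits._

// Hypothetical paths: substitute your own HDFS locations.
val inputPath = "hdfs:///tmp/records.tsv"
val outputDir = "hdfs:///tmp/records_out"

// Step 2: read the file as an RDD of lines.
val lines = spark.sparkContext.textFile(inputPath)

// Step 3: turn each non-empty line into a Map of key -> value.
// Pairs are split on tabs, and each pair on its first ':' only.
val records = lines.filter(_.trim.nonEmpty).map { line =>
  line.split("\t").map(_.trim).filter(_.contains(":")).map { pair =>
    val Array(key, value) = pair.split(":", 2)
    key -> value
  }.toMap
}
records.cache() // reused once per record type below

// One table per type; a key missing from a row becomes null in that column.
val aRows = records.filter(_.get("type").contains("A"))
val typeA = aRows.map(r => (r.get("field1"), r.get("field2"), r.get("field3"))).toDF("field1", "field2", "field3")

val bRows = records.filter(_.get("type").contains("B"))
val typeB = bRows.map(r => (r.get("field1"), r.get("field4"))).toDF("field1", "field4")

// Step 4: write each table to its own Parquet directory.
typeA.write.parquet(s"$outputDir/type_a")
typeB.write.parquet(s"$outputDir/type_b")
```

The column lists are taken from your sample rows; if a type can carry other keys, the tuples and `toDF` names would need to be extended to match.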


Here are some links to get you started:

There may be a better way to do things, but that's how I'd do it! The nice thing about Spark is that you can experiment with step 3 using the Spark shell.
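If it helps with the experimenting, here is a small plain-Scala sketch of the step 3 logic that can be pasted into the Spark shell (or any Scala REPL) without touching HDFS. The names `parseLine`, `sample`, and `byType` are made up for this example, and it assumes the fields really are separated by tab characters.

```scala
// Parse one tab-delimited line of key:value pairs into a Map.
// Each pair is split on its first ':' only, so values may contain colons.
def parseLine(line: String): Map[String, String] = {
  val pairs = line.split("\t").map(_.trim).filter(_.contains(":"))
  pairs.map { pair =>
    val Array(key, value) = pair.split(":", 2)
    key -> value
  }.toMap
}

// The sample rows from the question, with explicit tab separators.
val sample = Seq(
  "type:A\tfield1:valueA1\tfield2:valueA2\tfield3:valueA3",
  "type:A\tfield1:valueA1\tfield2:valueA2\tfield3:valueA3",
  "type:B\tfield1:valueB1\tfield4:valueB4"
)

// Group the parsed records by the value of their "type" key.
val byType: Map[String, Seq[Map[String, String]]] =
  sample.map(parseLine).groupBy(r => r.getOrElse("type", "unknown"))
```

Once the parsing behaves the way you want on a few sample lines, the same function can be passed to `map` on the RDD that `textFile` returns.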


Let me know how it goes!