Load tab delimited key value data


Is there a good way to load tab-delimited key-value data where each row has different keys depending on the value of one of the keys?


Data example:

type:A     field1:valueA1     field2:valueA2     field3:valueA3

type:A     field1:valueA1     field2:valueA2     field3:valueA3

type:B     field1:valueB1     field4:valueB4


I would like two different tables: one containing the type A records and one containing the type B records.


Re: Load tab delimited key value data


Hey Matt,


Here's what I would do in your situation:

  1. Put the TSV file onto HDFS
  2. Use Spark to read in the TSV file with the textFile function
  3. Write some Scala to perform the conditional logic and field/value extraction
  4. Write out the results into two separate Parquet files, each with an appropriate Avro schema
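To make step 3 a bit more concrete, here is a rough sketch of the parsing and splitting logic in plain Scala, with no Spark dependency so it can be pasted into any Scala REPL. The names `KeyValueRows`, `parseLine` and `splitByType` are made up for illustration, and it assumes every token looks like `key:value` and that the key called `type` decides which table a row belongs to.

```scala
// Hypothetical helper, not part of Spark or any library.
object KeyValueRows {

  // Parse one tab-delimited line of key:value tokens into a Map.
  // Each token is split on the first ':' only, so values may contain colons.
  // Tokens without a ':' are silently skipped (an assumption; you may prefer to fail).
  def parseLine(line: String): Map[String, String] =
    line.split("\t")
      .map(_.trim)
      .filter(_.nonEmpty)
      .map(_.split(":", 2))
      .collect { case Array(k, v) => k -> v }
      .toMap

  // Group parsed rows by the value of their "type" key.
  // Rows that have no "type" key are dropped.
  def splitByType(lines: Seq[String]): Map[String, Seq[Map[String, String]]] =
    lines
      .map(parseLine)
      .filter(_.contains("type"))
      .groupBy(row => row("type"))
}
```

In the Spark shell, the same `parseLine` function could presumably be applied to the RDD returned by `sc.textFile(...)` with `map`, followed by one `filter` per type before writing each result out. The exact output schema for each table is something you would still need to define yourself.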


Here are some links to get you started:

There may be a better way to do things, but that's how I'd do it! The nice thing about Spark is that you can experiment with step 3 using the Spark shell.


Let me know how it goes!


