
Can I assign Hive metadata to files with different column formats and column counts?

Explorer

I have different systems that feed our Cloudera hub with common data. However, the data arrives differently from each one: some files are tab-delimited and some are CSV, and the files differ in both the number of fields and the positions of the key fields I care about. Consider the files below:

 

System1 (tab-delimited)

KeyID1    KeyID2    Garbage     Garbage1    Garbage2
123       456       sdafasdf    asdfasdf    gfgsdf
987       157       sdf         sdf         sdf

 

System2 (csv)

Crap,KeyID1,Crap1,KeyID2
nada,123,zip,345
zilch,246,none,432

 

I'd like to be able to drop all of these files into a common HDFS folder and then define a Hive table on only the key fields I care about. There could be 15-20 different formats per system, but each will have KeyID1 and KeyID2.
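For illustration, this is roughly the single table I was hoping to define over the shared folder (the path and table name here are made up):

CREATE EXTERNAL TABLE all_keys (
  keyid1 STRING,
  keyid2 STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/landing/common';  -- the one folder receiving files from every system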

 

Is there a simple way to do this, so that I can dump files of different types and field counts into one place and see the data easily?

 

Thank you!

 

1 REPLY

Master Guru
I don't believe this is currently possible with the plaintext formats (text/CSV/TSV). When a directory is defined as a table location, Hive reads every file under it, and the reader expects a consistent layout across those files; rows that don't match the declared format come back as NULLs.
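As a hypothetical illustration, take a table like the one sketched in your question, declared tab-delimited. The CSV rows contain no tab characters, so the reader parses the entire line into the first column and returns NULL for the rest:

SELECT keyid1, keyid2 FROM all_keys;
-- A tab-delimited row parses as expected: keyid1=123, keyid2=456.
-- A CSV row has no tabs, so the whole line "nada,123,zip,345"
-- lands in keyid1 and keyid2 comes back NULL.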

Is there a reason you cannot keep these files separate, and instead use a program to clean and merge the data fields into a common-format file, which can then be utilised more gracefully? Formats like Parquet and Avro, for example, embed their schemas internally, allowing you to define a consistent directory-based table on the fly with no real knowledge of the schema beforehand: http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/impala_tutorial.html#tu...
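As a sketch of that approach entirely in HiveQL (the paths and table names below are hypothetical), you could keep one directory and one external table per source layout, then project just the key fields into a common Parquet table:

-- One external table per source layout:
CREATE EXTERNAL TABLE system1_raw (
  keyid1 STRING, keyid2 STRING,
  garbage STRING, garbage1 STRING, garbage2 STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/system1'
TBLPROPERTIES ('skip.header.line.count'='1');  -- skip the header row

CREATE EXTERNAL TABLE system2_raw (
  crap STRING, keyid1 STRING, crap1 STRING, keyid2 STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/system2'
TBLPROPERTIES ('skip.header.line.count'='1');

-- A common Parquet table that keeps only the key fields:
CREATE TABLE keys_common (keyid1 STRING, keyid2 STRING)
STORED AS PARQUET;

INSERT INTO TABLE keys_common SELECT keyid1, keyid2 FROM system1_raw;
INSERT INTO TABLE keys_common SELECT keyid1, keyid2 FROM system2_raw;

Each new layout then costs only one more raw table and one more INSERT, while every query goes against keys_common.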