
Can I assign Hive metadata to files with different column formats and column counts?


I have different systems that feed our Cloudera hub with common data. However, the data arrives in a different layout from each one. Some files are tab-delimited and some are CSV. The files also differ in the number of fields and in the positions of the key fields I care about. Consider the files below:


System1 (tab-delimited)

KeyID1    KeyID2    Garbage     Garbage1    Garbage2
123       456       sdafasdf    asdfasdf    gfgsdf
987       157       sdf         sdf         sdf


System2   (csv)





I'd like to be able to drop all of these files into a common HDFS folder and then define a Hive table only on the Key fields I care about.  There could be up to 15-20 different formats per system, but each will have KeyID1 and KeyID2.  


Is there a simple way to do this, so that I can drop in files of different types and widths and still query the key fields easily?


Thank you!



Master Guru
I don't believe this is currently possible with the plaintext formats (text/csv/tsv). When a directory is defined as a table location, Hive reads every file under it, and the reader expects all of those files to share one consistent layout. Files that don't match the table's delimiter and column order will come back with many NULL values.

Is there a reason you cannot keep these files separate, and instead use a program to clean and merge these data fields into a common-format file that Hive can then read cleanly? Formats like Parquet and Avro, for example, embed their schemas internally, allowing you to define a consistent directory-based table on the fly with no real knowledge of the schema beforehand:
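As a rough illustration of the clean-and-merge step suggested above, here is a minimal Python sketch. It assumes every input file starts with a header row that names KeyID1 and KeyID2 (as the System1 sample does), and that you know each file's delimiter. The System2 content, the function names and the tab-delimited output layout are all hypothetical, since the original post does not show them.

```python
import csv
import io

# The only columns we want to keep, in the order the Hive table would expect.
KEY_FIELDS = ("KeyID1", "KeyID2")


def extract_keys(text, delimiter):
    """Yield (KeyID1, KeyID2) tuples from one delimited file with a header row.

    Columns are looked up by header name, so their position and the number
    of other columns in the file do not matter.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    for row in reader:
        yield tuple(row[name].strip() for name in KEY_FIELDS)


def merge_sources(sources):
    """Merge (text, delimiter) pairs into one tab-delimited string."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for text, delimiter in sources:
        writer.writerows(extract_keys(text, delimiter))
    return out.getvalue()


# System1 sample from the question (tab-delimited).
system1 = (
    "KeyID1\tKeyID2\tGarbage\tGarbage1\tGarbage2\n"
    "123\t456\tsdafasdf\tasdfasdf\tgfgsdf\n"
    "987\t157\tsdf\tsdf\tsdf\n"
)

# Hypothetical System2 layout (CSV): the post does not show its real columns.
system2 = "Other,KeyID2,KeyID1\nx,222,111\n"

merged = merge_sources([(system1, "\t"), (system2, ",")])
print(merged, end="")
```

The merged output has one fixed layout (two tab-separated columns), so a single plain-text Hive table could be defined over the directory it is written to. In practice you would read the real files from disk or HDFS rather than from in-memory strings, and you might write Parquet or Avro instead of text.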