I have several different systems that feed our Cloudera hub with common data, but the data arrives in a different shape from each one. Some files are tab-delimited, some are CSV, and the files also differ in the number of fields and in the positions of the key fields I care about. Consider a file like the one below:
KeyID1 KeyID2 Garbage Garbage1 Garbage2
123 456 sdafasdf asdfasdf gfgsdf
987 157 sdf sdf sdf
I'd like to be able to drop all of these files into a common HDFS folder and then define a Hive table over only the key fields I care about. There could be up to 15-20 different formats per system, but every file will contain KeyID1 and KeyID2.
Is there a simple way to do this, so that I can dump files of different types and lengths into one folder and still see the data easily?
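For concreteness, this is roughly the single table definition I would try first, though a fixed delimiter obviously only matches one of the incoming formats (the table name and path here are just placeholders):

    -- Hypothetical sketch: one external table over the shared landing folder.
    CREATE EXTERNAL TABLE all_keys (
      KeyID1 STRING,
      KeyID2 STRING
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    LOCATION '/data/landing';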
I don't believe this is currently possible with the plain-text formats (text/CSV/TSV). When a directory is defined as a table location, Hive reads every file under it, and the plain-text reader expects all of those files to be consistent with the table's row format; files that don't match will instead yield a lot of NULL results.
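To make the failure mode concrete, here is a minimal sketch assuming a tab-delimited table like the one in the question, pointed at a folder that mixes tab-delimited and CSV files (the path and names are hypothetical):

    -- Hypothetical: /data/landing holds both tab-delimited and CSV files.
    SELECT KeyID1, KeyID2 FROM all_keys;
    -- Tab-delimited rows parse cleanly:
    --   123     456
    --   987     157
    -- A CSV line contains no tabs, so the whole line lands in KeyID1
    -- and KeyID2 comes back NULL:
    --   123,456,sdafasdf,asdfasdf,gfgsdf     NULL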