We want to use Hive to analyze the kind of data described below. These are the challenges.
The source data are flat files from different sources, with multiple files arriving every day. There are no fixed columns: each file has its own set of columns, and both the number and the order of the columns differ from file to file. Each file has a very large number of rows. Fields are comma-separated, but a field value might be enclosed in double quotes ("").
Please suggest what the ideal approach would be here. Should we load the data into HBase and create a Hive table on top of that? Or is it possible to create a Hive table with a dynamic schema?
If, on the other hand, each file has a different header, then you cannot do this, since Hive/Pig UDFs work row by row and never see the header that belongs to the row. You would need to do the transformations outside of Hadoop, or use something like the below to run a custom InputFormat that reads and processes every file on its own. The output, again, should be Hive/Pig maps, Avro, or whatever else you decide on.
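To illustrate the first option (transforming outside of Hadoop, not the custom InputFormat), here is a minimal Python sketch. It reads one file's header, pairs every value with its column name, and writes each row as a single map-style text line. The function name rows_to_map_lines and the two-column layout (source file name, then a map of fields) are my own assumptions for illustration. The separators are the control characters Hive text tables use by default for fields, collection items and map keys; adjust them if your table declares different ones.

```python
import csv
import io

# Assumed separators: Hive's default text-format delimiters.
FIELD_SEP = "\x01"  # between top-level columns
ITEM_SEP = "\x02"   # between entries of a map
KEY_SEP = "\x03"    # between a map key and its value


def rows_to_map_lines(source_name, text):
    """Yield one line per data row: source name, then column->value pairs.

    The csv module takes care of quoted values that contain commas.
    Empty values and values without a matching header are skipped.
    """
    reader = csv.DictReader(io.StringIO(text))
    for row in reader:
        pairs = [
            f"{key}{KEY_SEP}{value}"
            for key, value in row.items()
            if key is not None and value not in (None, "")
        ]
        yield source_name + FIELD_SEP + ITEM_SEP.join(pairs)


if __name__ == "__main__":
    # Hypothetical sample file content with a quoted field.
    sample = 'id,name\n1,"Smith, John"\n'
    for line in rows_to_map_lines("a.csv", sample):
        print(repr(line))
```

Files with different headers then all produce the same two-column shape, so a single Hive table with a string column and a MAP<STRING,STRING> column could sit on top of the output, and queries would pick values by column name from the map. For files that are too large to hold in memory, pass an open file object to csv.DictReader instead of wrapping a string.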