I am relatively new to Hadoop and understand that there are possibly hundreds of ways and tools to solve any given problem. I am looking for someone to walk me through the generally accepted best-practice approach to a problem that I hope is a relatively common scenario.
I have a number of CSV files generated by a process that I don't control. The files are related, but each one might have a different set of columns. Below is a made-up example of two files; in real life there will be many thousands of files and potentially thousands of columns.
#COL1, COL3, COL4
a1, c1, d1
#COL1, COL2, COL4
a1, b1, d6
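To make the shape of the problem concrete, here is a minimal sketch of how the two sample files could be aligned on the union of their columns. It is plain Python, not Hadoop code, and only illustrates the idea at toy scale. The function name `union_rows` and the inline sample strings are hypothetical, and it assumes the header line starts with `#`, fields are comma-separated with optional spaces, and the merged columns are simply sorted by name.

```python
import csv
import io


def union_rows(csv_texts):
    """Parse each CSV text and align its rows on the union of all column names."""
    parsed = []   # one dict per data row, keyed by that file's own header
    columns = []  # every distinct column name seen so far
    for text in csv_texts:
        reader = csv.reader(io.StringIO(text), skipinitialspace=True)
        # Assumption: the first line is the header and is prefixed with '#'.
        header = [name.strip().lstrip("#") for name in next(reader)]
        for name in header:
            if name not in columns:
                columns.append(name)
        for row in reader:
            if row:  # skip blank lines
                parsed.append(dict(zip(header, (value.strip() for value in row))))
    columns.sort()  # assumption: sorted column order is acceptable
    # Columns missing from a given file become None (NULL in SQL terms).
    return columns, [[record.get(col) for col in columns] for record in parsed]


file1 = "#COL1, COL3, COL4\na1, c1, d1\n"
file2 = "#COL1, COL2, COL4\na1, b1, d6\n"

columns, rows = union_rows([file1, file2])
# columns -> ['COL1', 'COL2', 'COL3', 'COL4']
# rows    -> [['a1', None, 'c1', 'd1'], ['a1', 'b1', None, 'd6']]
```

With thousands of files and columns this obviously would not run on a single machine, which is why I am asking about the Hadoop way of doing it.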
I would eventually like to expose some kind of tabular view of this data, accessible via SQL/JDBC, with a theoretical structure like: