04-05-2016 05:29 PM
I am relatively new to Hadoop and understand that there are possibly hundreds of ways and tools to solve any given problem. I am looking for someone to walk me through the generally accepted best-practice approach to a problem which I am hoping is a relatively common scenario.
I have a number of CSV files being generated by a process which I don't control. The CSV files are related, but each CSV file might have a different set of columns in it. I am providing a made-up example of two files below; in real life there will be many thousands of files and potentially thousands of columns.
#COL1,COL3,COL4
a1,c1,d1
a2,,d2
a3,c3,d3
#COL1,COL2,COL4
a1,b1,d6
a5,b5,
a3,,
I would eventually like to expose some kind of tabular view, accessible via SQL/JDBC, over this data, which would have a theoretical structure like:
#COL1,COL2,COL3,COL4
a1,,c1,d1
a1,b1,,d6
a3,,c3,d3
a3,c3,,
a5,b5,,
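The schema-merge I'm after can be sketched in plain Python to make the intent concrete. This is only an illustration using the made-up sample files from above (the real pipeline would run on Hadoop, e.g. via Spark or Hive, not on local strings): take the union of all headers seen across files, then fill missing columns with empty values.

```python
import csv
import io

# Hypothetical in-memory stand-ins for the two sample files above;
# in practice these would be read from HDFS. The first line of each
# file is the header, prefixed with "#".
file_a = "#COL1,COL3,COL4\na1,c1,d1\na2,,d2\na3,c3,d3\n"
file_b = "#COL1,COL2,COL4\na1,b1,d6\na5,b5,\n"

def read_rows(text):
    """Parse one CSV file into dicts keyed by its own header."""
    reader = csv.reader(io.StringIO(text))
    header = [h.strip().lstrip("#") for h in next(reader)]
    for row in reader:
        yield dict(zip(header, (v.strip() for v in row)))

rows = list(read_rows(file_a)) + list(read_rows(file_b))

# Union of all columns seen across every file, sorted for a stable layout.
columns = sorted({c for r in rows for c in r})

# Emit the unified view: columns absent from a source file become empty.
print(",".join(columns))
for r in rows:
    print(",".join(r.get(c, "") for c in columns))
```

Under these assumptions the output starts with the unified header `COL1,COL2,COL3,COL4` and pads each row with empty strings for the columns its source file lacked, which is exactly the tabular shape shown above.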
Key points are: