
Hadoop - Exposing SQL interface on collection of CSV files with variable columns

I am relatively new to Hadoop and understand that there are probably hundreds of ways and tools to solve any given problem. I am looking for someone to walk me through the generally accepted, best-practice approach to a problem that I hope is a relatively common scenario.

 

I have a number of CSV files being generated by a process which I don't control. The CSV files are related, but each file may contain a different set of columns. I am providing a made-up example of two files below; in real life there will be many thousands of files and potentially thousands of columns.

 

FILE1

#COL1, COL3, COL4
a1, c1, d1
a2,,d2
a3,c3,d3

FILE2

#COL1, COL2, COL4
a1, b1, d6
a5, b5, a3

I would eventually like to expose some kind of tabular view of this data, accessible via SQL/JDBC, which would have a theoretical structure like:

#COL1, COL2, COL3, COL4
a1,,c1,d1
a1,b1,,d6
a2,,,d2
a3,,c3,d3
a5,b5,,a3
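
To make the goal concrete, below is a rough sketch of the kind of thing I picture, written in PySpark purely as an illustration. The file paths, the view name, and the choice of Spark itself are just my assumptions, not something I know to be the right approach.

# Illustration only -- paths and names are made up, and I am assuming the
# leading '#' has already been stripped from each header row.
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.appName("csv-union-sketch").getOrCreate()

df1 = spark.read.csv("/data/incoming/FILE1.csv",
                     header=True, ignoreLeadingWhiteSpace=True)  # COL1, COL3, COL4
df2 = spark.read.csv("/data/incoming/FILE2.csv",
                     header=True, ignoreLeadingWhiteSpace=True)  # COL1, COL2, COL4

# Build the union of all column names, add the missing ones as nulls,
# then select the columns in the same order so the rows can be unioned.
all_cols = sorted(set(df1.columns) | set(df2.columns))

def pad(df):
    for c in all_cols:
        if c not in df.columns:
            df = df.withColumn(c, lit(None).cast("string"))
    return df.select(all_cols)

merged = pad(df1).union(pad(df2))

# Register the combined data so it can be queried with SQL.
merged.createOrReplaceTempView("all_files")
spark.sql("SELECT COL1, COL2, COL3, COL4 FROM all_files").show()

Whether something like the Spark Thrift JDBC server is then the right way to expose this to users is exactly the kind of guidance I'm after.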

Key points are:

  1. I would like a simple process for updating the set of available columns as new ones are discovered in newer files.
  2. I would not want to pay a performance penalty when users analyze or work with only 10-20 columns at a time, even though the full list of columns across all files runs into the thousands (a rough sketch of the kind of pipeline I imagine follows this list).
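
Purely as an illustration of those two points, here is the sort of pipeline I imagine, again sketched in PySpark with made-up paths and names. Converting to a columnar format such as Parquet and merging schemas on read is only my assumption, not something I know to be the standard practice.

# Illustration only -- paths and names are invented for the example.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet-sketch").getOrCreate()

# Land each incoming batch of CSV files as Parquet. Files with different
# column sets simply produce Parquet files with different schemas.
batch = spark.read.csv("/landing/2016-04-05/*.csv",
                       header=True, ignoreLeadingWhiteSpace=True)
batch.write.mode("append").parquet("/warehouse/merged_table")

# Read everything back with schema merging, so columns discovered in newer
# files appear automatically (point 1). Because Parquet is columnar, a query
# touching only 10-20 columns does not scan the other thousands (point 2).
all_data = spark.read.option("mergeSchema", "true").parquet("/warehouse/merged_table")
all_data.createOrReplaceTempView("merged_table")
spark.sql("SELECT COL1, COL4 FROM merged_table WHERE COL1 = 'a1'").show()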