What are some best practices/techniques for ingesting malformed data into a Hive table?
For example, the CSV should have 15 columns (and thus 14 commas per row), but some rows have only 3 columns, without the trailing commas that would mark the 12 missing fields as blank.
I don't think Hive will have a problem in general. Here is a simple experiment:
```
$ cat /tmp/test.dat
1,2,3
2,3,4
2
4,5,6
4,5
6,7,8

hive> create table abc(a string, b string, c string)
    > row format delimited fields terminated by ','
    > stored as textfile;
OK
Time taken: 0.712 seconds
hive> load data local inpath '/tmp/test.dat' into table abc;
Loading data to table default.abc
Table default.abc stats: [numFiles=1, totalSize=30]
OK
Time taken: 0.271 seconds
hive> select * from abc;
OK
1    2     3
2    3     4
2    NULL  NULL
4    5     6
4    5     NULL
6    7     8
Time taken: 0.56 seconds, Fetched: 6 row(s)
```
Observe that some of the rows are missing comma delimiters: Hive still loads them and fills the missing trailing columns with NULL.
What would be a good use case for ingesting malformed data with the technique above? What is the value? The assignment above is completely blind: in row 3, the value 2 always lands in column a, but it could just as well belong in b or c. The same goes for row 5: it could be (a,b), but also (a,c) or (b,c). You will never know which of a, b, or c was supposed to get the value 2. How would you use data you don't understand?

A genuinely good practice is to have a second table that captures all the rows with data issues so you can do something about them, or at least to skip them during the load, which Hive can do. That is a valuable technique.

Looking at the response above, what I read is "Hive can do it." That is correct, but I understand your question differently: you are not asking whether Hive can do it, you are asking about best practices.
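One way to implement the second-table approach is to split the file before loading: well-formed rows go to the main table's location and malformed rows to a quarantine file that backs a second table. A minimal sketch in Python, assuming a simple comma-delimited file; the `split_records` helper and file names are illustrative, not from the original post:

```python
import csv

def split_records(input_path, good_path, bad_path, expected=3):
    """Route rows with the expected column count to good_path and
    everything else to bad_path. Returns (good_count, bad_count).

    expected=3 matches the experiment above; it would be 15 for the
    CSV described in the question.
    """
    good = bad = 0
    with open(input_path, newline="") as src, \
         open(good_path, "w", newline="") as good_out, \
         open(bad_path, "w", newline="") as bad_out:
        good_writer = csv.writer(good_out)
        bad_writer = csv.writer(bad_out)
        for row in csv.reader(src):
            if len(row) == expected:
                good_writer.writerow(row)
                good += 1
            else:
                bad_writer.writerow(row)
                bad += 1
    return good, bad
```

Each output file can then be loaded into its own Hive table, so the bad rows stay queryable instead of silently turning into NULLs.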
My 2c: I "Unaccepted" the above response and suggest considering Dhruv's response instead.
As a general rule, checking for malformed records and doing something about them should be the first step in any data processing pipeline. For example:

- In Cascading, it is standard practice to plug a Filter function in at the very beginning of the main input pipe.
- In Spark, create the input RDD and immediately filter out bad records.
- For the Hive CSV case, column inference happens at read time, so make sure your SerDe implementation can handle missing columns, or better, remove bad records from the data stored in HDFS before creating the Hive table.
- If you used Flume to move the CSV into HDFS, you can add the filtering logic to an Interceptor; otherwise, implement the error handling in your SerDe.
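The filter-first pattern above can be sketched as a plain predicate applied at the head of a record stream. This is a hedged Python illustration (the `is_well_formed` and `clean_records` names are my own, not from the answer); in PySpark the same predicate would be passed to `rdd.filter(...)` on the input RDD:

```python
def is_well_formed(line, expected_fields=15, delimiter=","):
    """Predicate for the filter step: the record must have exactly
    the expected number of delimited fields (15 per the question)."""
    return len(line.split(delimiter)) == expected_fields

def clean_records(lines, expected_fields=15):
    """First stage of the pipeline: drop malformed records before
    any other processing sees them."""
    return (line for line in lines
            if is_well_formed(line, expected_fields))
```

Because the predicate runs before anything else, downstream stages can assume every record has the full column count.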