
Techniques for dealing with malformed data (Hive)

What are some best practices/techniques for ingesting malformed data into a Hive table?

For example, the CSV should have 15 columns (and thus 15 commas) but some rows only have 3 columns without having the additional 12 blank commas.


Re: Techniques for dealing with malformed data (Hive)

Master Collaborator

I don't think Hive will have a problem in general. Here is a simple experiment:

$ cat /tmp/test.dat
1,2,3
2,3,4
2
4,5,6
4,5
6,7,8

hive> create table abc(a string, b string, c string) row format delimited fields terminated by ',' stored as textfile;
Time taken: 0.712 seconds
hive> load data local inpath '/tmp/test.dat' into table abc;
Loading data to table
Table stats: [numFiles=1, totalSize=30]
Time taken: 0.271 seconds
hive> select * from abc;
1	2	3
2	3	4
2	NULL	NULL
4	5	6
4	5	NULL
6	7	8
Time taken: 0.56 seconds, Fetched: 6 row(s)

Observe that some of the rows have missing comma delimiters; Hive pads the missing trailing columns with NULL.
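The padding behavior seen in the experiment can be sketched in a few lines of Python (this is an illustration of the behavior, not Hive's actual SerDe code):

```python
def read_row(line, num_columns, delimiter=","):
    """Split a delimited line into a fixed number of columns,
    padding missing trailing fields with None (shown as NULL in Hive)
    and dropping any fields beyond the declared columns."""
    fields = line.rstrip("\n").split(delimiter)
    fields += [None] * (num_columns - len(fields))
    return fields[:num_columns]

for line in ["1,2,3", "4,5", "2"]:
    print(read_row(line, num_columns=3))
# → ['1', '2', '3']
# → ['4', '5', None]
# → ['2', None, None]
```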

Re: Techniques for dealing with malformed data (Hive)

@Andrew Watson

What would be a good use case for ingesting malformed data with the technique above? What is the value? The assignment above is completely blind: in row 3, the single value always lands in column a, but it could belong to b or c. The same goes for row 5: the two values land in (a, b), but they could just as well be (a, c) or (b, c). You will never know which column was supposed to get a given value, so how would you use data you don't understand? A genuinely good practice is to capture all the rows with data issues in a second table and do something about them, or at least skip them during the load, which Hive can do. That is a valuable technique. Looking at the response above, what I read is "Hive can do it". That is correct, but I read your question differently: you are not asking whether Hive can do it, you are asking about best practices.

My 2c. I "Unaccepted" the above response and suggest considering Dhruv's response instead.
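The "second table for rows with data issues" practice could be sketched like this (a minimal Python example; the expected column count comes from the original question, and the routing of the two lists into Hive tables is left to the pipeline):

```python
import csv

EXPECTED_COLUMNS = 15  # the CSV schema from the original question

def split_good_and_bad(lines, expected=EXPECTED_COLUMNS):
    """Route well-formed rows to the main table's staging data and
    malformed rows to a second 'quarantine' data set for inspection."""
    good, bad = [], []
    for row in csv.reader(lines):
        (good if len(row) == expected else bad).append(row)
    return good, bad

lines = ["a,b,c", ",".join(str(i) for i in range(15))]
good, bad = split_good_and_bad(lines)
# good holds the one 15-column row; bad holds the 3-column row
```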

Re: Techniques for dealing with malformed data (Hive)


As a general rule, checking for malformed records and doing something about them should be the first step in any data processing pipeline. For example, in Cascading it is standard practice to plug a Filter function in at the very beginning of the main input pipe; in Spark, we create the input RDD and immediately filter out bad records. For the Hive CSV case, column inference happens at read time, so make sure your SerDe implementation can handle missing columns, or better, remove the bad records from the data stored in HDFS before creating the Hive table on top of it. If you used Flume to move the CSV into HDFS, you can add the filtering logic to an Interceptor; otherwise, implement the error handling in your SerDe.
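The filter-first step described above boils down to a small predicate (sketched here in plain Python; it is the kind of function you would pass to Spark's RDD.filter, and the 15-column count and HDFS path in the comment are illustrative, not from the thread):

```python
def is_well_formed(line, expected_cols=15, delimiter=","):
    """Predicate for the filter-first step: keep only lines whose
    field count matches the expected schema."""
    return len(line.split(delimiter)) == expected_cols

# Plain-Python usage; in Spark this would be roughly:
#   clean = sc.textFile("hdfs:///data/input.csv").filter(is_well_formed)
lines = ["1,2,3", ",".join(str(i) for i in range(15))]
clean = [l for l in lines if is_well_formed(l)]
# clean keeps only the 15-column line
```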