Support Questions

Find answers, ask questions, and share your expertise

How to handle nulls when exporting from Hive?

avatar
Super Collaborator

I often need to export data from Hive to CSV files so that I can share with folks - usually they will ultimately import the CSV data into some sort of standard DB.

Currently, I use a CLI command like this:

hive -e 'set hive.cli.print.header=true; select * from blah where condition ' | sed 's/[\t]/,/g' > myfile.csv

However, when I do it this way, null values actually get printed as "NULL".

For example, an output row might be:

0|true|NULL|1|0|'my string'|NULL|etc

So, my question: What can I add to my command to replace those NULL entries with just an empty character?

In other words, how do I instead get this:

0|true||1|0|'my string'||etc

?

1 ACCEPTED SOLUTION

avatar
Guru

Using your sed approach, this should replace all NULL with empty character

sed 's/[\t]/,/g; s/NULL//g'  > myfile.csv

If there is a chance that NULL is a substring of a value you will need to do the following where ^ is beginning of line and $ is end of line and , is your field delimiter

sed 's/[\t]/,/g; s/^NULL,/,/g; s/,NULL,/,,/g; s/,NULL$/,/g;'  > myfile.csv

Note that if your resultset is large, it is probably best to use Pig on HDFS and not sed (to leverage the parallel processing of hadoop and save yourself a lot of time.

Note also: To use empty character as nulls in the actual hive table, use the following in the DDL

TBLPROPERTIES('serialization.null.format'='');

View solution in original post

1 REPLY 1

avatar
Guru

Using your sed approach, this should replace all NULL with empty character

sed 's/[\t]/,/g; s/NULL//g'  > myfile.csv

If there is a chance that NULL is a substring of a value you will need to do the following where ^ is beginning of line and $ is end of line and , is your field delimiter

sed 's/[\t]/,/g; s/^NULL,/,/g; s/,NULL,/,,/g; s/,NULL$/,/g;'  > myfile.csv

Note that if your resultset is large, it is probably best to use Pig on HDFS and not sed (to leverage the parallel processing of hadoop and save yourself a lot of time.

Note also: To use empty character as nulls in the actual hive table, use the following in the DDL

TBLPROPERTIES('serialization.null.format'='');