Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

How to handle nulls when exporting from Hive?

avatar
Super Collaborator

I often need to export data from Hive to CSV files so that I can share with folks - usually they will ultimately import the CSV data into some sort of standard DB.

Currently, I use a CLI command like this:

hive -e 'set hive.cli.print.header=true; select * from blah where condition ' | sed 's/[\t]/,/g' > myfile.csv

However, when I do it this way, null values actually get printed as "NULL".

For example, an output row might be:

0|true|NULL|1|0|'my string'|NULL|etc

So, my question: What can I add to my command to replace those NULL entries with just an empty character?

In other words, how do I instead get this:

0|true||1|0|'my string'||etc

?

1 ACCEPTED SOLUTION

avatar
Guru

Using your sed approach, this should replace all NULL with empty character

sed 's/[\t]/,/g; s/NULL//g'  > myfile.csv

If there is a chance that NULL is a substring of a value you will need to do the following where ^ is beginning of line and $ is end of line and , is your field delimiter

sed 's/[\t]/,/g; s/^NULL,/,/g; s/,NULL,/,,/g; s/,NULL$/,/g;'  > myfile.csv

Note that if your resultset is large, it is probably best to use Pig on HDFS and not sed (to leverage the parallel processing of hadoop and save yourself a lot of time.

Note also: To use empty character as nulls in the actual hive table, use the following in the DDL

TBLPROPERTIES('serialization.null.format'='');

View solution in original post

1 REPLY 1

avatar
Guru

Using your sed approach, this should replace all NULL with empty character

sed 's/[\t]/,/g; s/NULL//g'  > myfile.csv

If there is a chance that NULL is a substring of a value you will need to do the following where ^ is beginning of line and $ is end of line and , is your field delimiter

sed 's/[\t]/,/g; s/^NULL,/,/g; s/,NULL,/,,/g; s/,NULL$/,/g;'  > myfile.csv

Note that if your resultset is large, it is probably best to use Pig on HDFS and not sed (to leverage the parallel processing of hadoop and save yourself a lot of time.

Note also: To use empty character as nulls in the actual hive table, use the following in the DDL

TBLPROPERTIES('serialization.null.format'='');