Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

Hive merge of CSV files incorrectly formats NULL values

Hive merge of CSV files incorrectly formats NULL values

Contributor

Hi,

I am trying to convert Hive ORC table into CSV file for further processing, using this Hive script

INSERT OVERWRITE DIRECTORY '${hivevar:TARGET_FOLDER}'
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' WITH SERDEPROPERTIES ('serialization.encoding'='UTF-8')
STORED AS TEXTFILE SELECT * FROM ${hivevar:TABLE_NAME};

However the table is large ( > 10G) and there are created several CSV fragments in the given directory.Becouse of their it is not very practical to merge then (using hdfs dfs -cat) and thus I tried to force Hive to create just one.

I had some luck with the following settings

SET hive.execution.engine=mr;
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
SET hive.merge.smallfiles.avgsize=1000000000000;
SET hive.merge.size.per.task=1000000000000;

but alas the merge operation changed formatting of the resulting CSV.

Without merge the one line in one CVS fragment looked like this:

"100",,,"0010044990","00010","NB","CARRAM","2011-06-14 00:00:00" 

but the in the single merged file the same line was changed to

"100","","","0010044990","00010","NB","CARRAM","2011-06-14 00:00:00" 

and the second and third column containing NULL values are now formatted differently ( ,"", instead of ,,)

Possibly the merge is ignoring the given SerDe. It worked same way when I tried it with a table instead of the directory.

I will be very grateful for any hints or clues.

Thanks,

Pavel

2 REPLIES 2
Highlighted

Re: Hive merge of CSV files incorrectly formats NULL values

Super Guru

@Pavel Benes Since this happens, during the merge step, I am not sure if this will help, but have tried the following?

SERDEPROPERTIES ('serialization.null.format' = '')

Highlighted

Re: Hive merge of CSV files incorrectly formats NULL values

Contributor

@mqureshi: Thanks for answer. I have already tried this with the directory, but with no effect. My understanding is that this setting determines what character in input CSV should be treated as NULL, but has no effect in export to CSV (i.e. serialization). According the doc the org.apache.hadoop.hive.serde2.OpenCSVSerde does not seem to support this setting at all.

Regards,

Pavel

Don't have an account?
Coming from Hortonworks? Activate your account here