Explorer
Posts: 8
Registered: ‎01-12-2016

Hive SERDEPROPERTIES clarification

Could anybody tell me the purpose of the code highlighted in bold in the CREATE TABLE statement below?

CREATE EXTERNAL TABLE intermediate_access_logs (
    ip STRING,
    date STRING,
    method STRING,
    url STRING,
    http_version STRING,
    code1 STRING,
    code2 STRING,
    dash STRING,
    user_agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
    'input.regex' = '([^ ]*) - - \\[([^\\]]*)\\] "([^\ ]*) ([^\ ]*) ([^\ ]*)" (\\d*) (\\d*) "([^"]*)" "([^"]*)"',
    'output.format.string' = "%1$$s %2$$s %3$$s %4$$s %5$$s %6$$s %7$$s %8$$s %9$$s")
LOCATION '/user/hive/warehouse/original_access_logs';

Cloudera Employee
Posts: 34
Registered: ‎08-16-2016

Re: Hive SERDEPROPERTIES clarification

The bold text tells Hive how to read and interpret the data for the table (located at '/user/hive/warehouse/original_access_logs' in this case).
1) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' -- tells Hive to use this class to serialize and deserialize rows to and from the file.
2) input.regex is a property that this class (RegexSerDe) uses to deserialize rows read from the table data. The regex is applied to each row read from the file, and its capture groups split the row into the columns defined in the metadata for this Hive table.
3) output.format.string is a property that this class (RegexSerDe) uses to serialize rows written to the table data. It is a format string that builds a row value from its column values before the row is written back to the output file for this Hive table.
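
To make point 2 concrete, here is a minimal Java sketch of how a regex like this one splits a log line into nine columns. It is not the RegexSerDe source. The pattern is my assumption of what input.regex looks like after Hive unescapes the string literal, and the class name, method name, and sample log line are all hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexSplitSketch {
    // Assumed form of input.regex once Hive has unescaped the DDL string literal.
    static final Pattern INPUT_REGEX = Pattern.compile(
        "([^ ]*) - - \\[([^\\]]*)\\] \"([^\\ ]*) ([^\\ ]*) ([^\\ ]*)\" "
        + "(\\d*) (\\d*) \"([^\"]*)\" \"([^\"]*)\"");

    // Hypothetical access-log line, invented for illustration only.
    static final String SAMPLE =
        "10.0.0.1 - - [14/Jun/2014:10:30:13 -0400] \"GET /home HTTP/1.1\" "
        + "200 1234 \"-\" \"Mozilla/5.0\"";

    // Returns one string per capture group, or null if the whole line does not match.
    static String[] split(String line) {
        Matcher m = INPUT_REGEX.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] cols = new String[m.groupCount()];
        for (int i = 0; i < cols.length; i++) {
            cols[i] = m.group(i + 1); // capture groups are 1-based
        }
        return cols;
    }

    public static void main(String[] args) {
        String[] names = {"ip", "date", "method", "url", "http_version",
                          "code1", "code2", "dash", "user_agent"};
        String[] cols = split(SAMPLE);
        for (int i = 0; i < cols.length; i++) {
            System.out.println(names[i] + " = " + cols[i]);
        }
    }
}
```

Each capture group lines up, in order, with one column of the table definition, which is why the regex has nine groups for nine columns.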

Hope this helps. Thanks
New Contributor
Posts: 5
Registered: ‎09-16-2018

Re: Hive SERDEPROPERTIES clarification

Naveen,

Thanks for the exhaustive answer. I am a newbie, so I might be wrong, but after some experiments I believe that the output.format.string as written in the tutorial is wrong.

Currently it is:

"%1$$s %2$$s %3$$s %4$$s %5$$s %6$$s %7$$s %8$$s %9$$s"

I believe it should be:

"%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"

What makes me think so?

Just for fun and as an experiment, I tried inserting a new row into the intermediate_access_logs table in Hive. With the original output.format.string, the statement failed. After I changed the format string, the new row was inserted without problems.
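
The difference can be reproduced outside Hive. The sketch below assumes that RegexSerDe hands output.format.string to Java's standard formatter (String.format / java.util.Formatter), which I have not verified against the SerDe source. The class and method names are hypothetical.

```java
import java.util.IllegalFormatException;

public class FormatStringSketch {
    // Formats the column values with a java.util.Formatter-style pattern.
    static String render(String fmt, Object... cols) {
        return String.format(fmt, cols);
    }

    // Returns true if Java's formatter accepts the pattern for these values.
    static boolean isValidFormat(String fmt, Object... cols) {
        try {
            String.format(fmt, cols);
            return true;
        } catch (IllegalFormatException e) {
            // "%1$$s" ends up here: after the argument index "1$", the next
            // character "$" is not a valid conversion character.
            return false;
        }
    }

    public static void main(String[] args) {
        Object[] cols = {"10.0.0.1", "GET", "/home"};
        System.out.println(render("%1$s %2$s %3$s", cols));
        System.out.println(isValidFormat("%1$$s %2$$s %3$$s", cols));
    }
}
```

Under that assumption, the single-dollar form is the one the formatter accepts, which matches what I saw when inserting a row.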
