
Hive SERDEPROPERTIES clarification


Contributor

Could anybody tell me the purpose of the code highlighted in bold in this CREATE TABLE statement?

CREATE EXTERNAL TABLE intermediate_access_logs (
    ip STRING,
    date STRING,
    method STRING,
    url STRING,
    http_version STRING,
    code1 STRING,
    code2 STRING,
    dash STRING,
    user_agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
    'input.regex' = '([^ ]*) - - \\[([^\\]]*)\\] "([^\ ]*) ([^\ ]*) ([^\ ]*)" (\\d*) (\\d*) "([^"]*)" "([^"]*)"',
    'output.format.string' = "%1$$s %2$$s %3$$s %4$$s %5$$s %6$$s %7$$s %8$$s %9$$s")
LOCATION '/user/hive/warehouse/original_access_logs';


Re: Hive SERDEPROPERTIES clarification

Contributor
The bold text tells Hive how to read and interpret the data for the table (located at '/user/hive/warehouse/original_access_logs' in this case).
1) ROW FORMAT SERDE "org.apache.hadoop.hive.contrib.serde2.RegexSerDe" -- tells Hive to use this class to serialize and deserialize rows to and from the file.
2) input.regex is a property that this class (RegexSerDe) uses to deserialize rows read from the table data. The regex is applied to each row read from the file, and each capture group becomes the value of one of the columns defined in the metadata for this Hive table.
3) output.format.string is a property that this class (RegexSerDe) uses to serialize rows written to the table. It is a format string used to build a row value from its column values before that value is written to the output file for this Hive table.
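
To make point 2 concrete, here is a small Java sketch of what the deserialization step amounts to. It is not the RegexSerDe source; it only applies the same regex with java.util.regex, assuming Hive unescapes the string literal so that "\\[" becomes "\[" and "[^\ ]" behaves like "[^ ]". The class name, the splitRow helper, and the sample log line are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InputRegexSketch {
    // Assumed form of 'input.regex' after Hive has unescaped the string literal.
    static final Pattern INPUT_REGEX = Pattern.compile(
        "([^ ]*) - - \\[([^\\]]*)\\] \"([^ ]*) ([^ ]*) ([^ ]*)\" (\\d*) (\\d*) \"([^\"]*)\" \"([^\"]*)\"");

    // Returns one value per capture group, or null when the whole row does not match.
    static List<String> splitRow(String row) {
        Matcher m = INPUT_REGEX.matcher(row);
        if (!m.matches()) {
            return null;
        }
        List<String> columns = new ArrayList<>();
        for (int i = 1; i <= m.groupCount(); i++) {
            columns.add(m.group(i));
        }
        return columns;
    }

    public static void main(String[] args) {
        // Hypothetical log line, not taken from the tutorial data.
        String row = "10.0.0.1 - - [14/Jun/2014:10:30:13 -0400] "
            + "\"GET /home HTTP/1.1\" 200 1234 \"-\" \"Mozilla/5.0\"";
        // Column names in the order declared in the CREATE TABLE statement.
        String[] names = {"ip", "date", "method", "url", "http_version",
                          "code1", "code2", "dash", "user_agent"};
        List<String> columns = splitRow(row);
        for (int i = 0; i < names.length; i++) {
            System.out.println(names[i] + " = " + columns.get(i));
        }
    }
}
```

The regex has nine capture groups, which is why the table declares nine STRING columns: group 1 feeds ip, group 2 feeds date, and so on in order.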

Hope this helps. Thanks

Re: Hive SERDEPROPERTIES clarification

New Contributor

Naveen,

Thanks for the exhaustive answer. I am a newbie, so I might be wrong, but after some experiments I believe that the output.format.string, as written in the tutorial, is wrong.

Currently it is:

"%1$$s %2$$s %3$$s %4$$s %5$$s %6$$s %7$$s %8$$s %9$$s"

I believe it should be:

"%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"

What makes me think so?

Just for fun and experimentation, I tried inserting a new row into the intermediate_access_logs table in Hive. With the original output.format.string, the statement failed. After I changed the format string, the new row was inserted without problems.
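
The difference between the two strings can be reproduced outside Hive. The sketch below assumes that RegexSerDe passes output.format.string to Java's String.format (java.util.Formatter syntax) when it serializes a row; that is my understanding of the contrib class, not something confirmed in this thread. The class and method names are made up for illustration.

```java
import java.util.IllegalFormatException;

class OutputFormatSketch {
    // Builds a row string from column values, the way a format-based serializer would.
    static String render(String format, Object... columns) {
        return String.format(format, columns);
    }

    // Reports whether java.util.Formatter accepts the format string for these columns.
    static boolean isValidFormat(String format, Object... columns) {
        try {
            String.format(format, columns);
            return true;
        } catch (IllegalFormatException e) {
            // "%1$$s" ends up here: after the argument index "1$", the next
            // character must be a conversion such as 's', not another '$'.
            return false;
        }
    }

    public static void main(String[] args) {
        // Hypothetical column values.
        Object[] row = {"10.0.0.1", "GET", "/home"};
        System.out.println(isValidFormat("%1$$s %2$$s %3$$s", row)); // tutorial-style string
        System.out.println(isValidFormat("%1$s %2$s %3$s", row));    // corrected string
        System.out.println(render("%1$s %2$s %3$s", row));
    }
}
```

In Formatter syntax, "%1$s" means "argument 1, formatted as a string", so the single-dollar version is the well-formed one. Under the assumption above, this is consistent with the INSERT failing with the double-dollar string and succeeding after the change.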