Hive SERDEPROPERTIES clarification
- Labels: Apache Hive
Created on 10-21-2016 09:11 AM - edited 09-16-2022 03:45 AM
Could anybody tell me the purpose of the code highlighted in bold (the ROW FORMAT SERDE ... WITH SERDEPROPERTIES clause) in this CREATE TABLE statement?
CREATE EXTERNAL TABLE intermediate_access_logs (
    ip STRING,
    date STRING,
    method STRING,
    url STRING,
    http_version STRING,
    code1 STRING,
    code2 STRING,
    dash STRING,
    user_agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
    'input.regex' = '([^ ]*) - - \\[([^\\]]*)\\] "([^\ ]*) ([^\ ]*) ([^\ ]*)" (\\d*) (\\d*) "([^"]*)" "([^"]*)"',
    'output.format.string' = "%1$$s %2$$s %3$$s %4$$s %5$$s %6$$s %7$$s %8$$s %9$$s")
LOCATION '/user/hive/warehouse/original_access_logs';
Created 11-09-2016 12:31 PM
1) ROW FORMAT SERDE "org.apache.hadoop.hive.contrib.serde2.RegexSerDe" tells Hive to use this class to serialize and deserialize rows to and from the table's files.
2) input.regex is a property that this class (RegexSerDe) uses to deserialize rows read from the table data. The regex is applied to each line read from the file, and its capture groups supply the values for the columns defined in the table's metadata, in order.
3) output.format.string is a property that this class (RegexSerDe) uses to serialize rows written to the table. It is a format string that builds one line of text from the row's column values, and that line is written to the table's output file.
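To make point 2 concrete, here is a minimal Java sketch of what applying input.regex to one log line looks like. It is not the RegexSerDe source, only an illustration under some assumptions: the class name RegexRowSketch, the helper toColumns, and the sample log line are made up for this example, and the pattern is written as it should look after Hive has unescaped the string literal from the CREATE TABLE statement (so `\\[` becomes `\[` and `[^\ ]` becomes `[^ ]`).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexRowSketch {
    // The 'input.regex' value as a Java string literal (assumed unescaped form).
    // It has nine capture groups, one per table column.
    static final Pattern INPUT_REGEX = Pattern.compile(
        "([^ ]*) - - \\[([^\\]]*)\\] \"([^ ]*) ([^ ]*) ([^ ]*)\" (\\d*) (\\d*) \"([^\"]*)\" \"([^\"]*)\"");

    // Hypothetical helper: returns one value per capture group,
    // or null when the whole line does not match the pattern.
    static List<String> toColumns(String line) {
        Matcher m = INPUT_REGEX.matcher(line);
        if (!m.matches()) {
            return null;
        }
        List<String> columns = new ArrayList<>();
        for (int i = 1; i <= m.groupCount(); i++) {
            columns.add(m.group(i));
        }
        return columns;
    }

    public static void main(String[] args) {
        // Made-up access log line in the layout the pattern expects.
        String line = "10.0.0.1 - - [14/Jun/2014:10:30:13 -0400] "
            + "\"GET /home HTTP/1.1\" 200 1234 \"-\" \"Mozilla/5.0\"";
        // Values line up with: ip, date, method, url, http_version,
        // code1, code2, dash, user_agent
        System.out.println(toColumns(line));
    }
}
```

The number of capture groups (nine) matches the number of columns in the table, which is why each group can be assigned to one column by position.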
Hope this helps. Thanks
Created 09-22-2018 01:17 PM
Naveen,
Thanks for the exhaustive answer. I am a newbie so I might be wrong, but after some experiments I tend to believe that the current output.format.string, as written in the tutorial, is wrong.
Currently it is:
"%1$$s %2$$s %3$$s %4$$s %5$$s %6$$s %7$$s %8$$s %9$$s"
I believe it should be:
"%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
What makes me think so?
Just for fun and to experiment, I tried inserting a new row into the intermediate_access_logs table in Hive. With the original output.format.string, the statement failed. After I changed the format string, the new row was inserted without problems.
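This behaviour is consistent with how Java format strings work, assuming (as I understand it, not verified against the Hive source here) that RegexSerDe hands output.format.string to String.format. In java.util.Formatter syntax, `%1$s` means "first argument, as a string", while `%1$$s` is not a valid specifier because `$` is not a conversion character. A small sketch, with a made-up class name OutputFormatSketch, hypothetical helpers, and only three columns to keep it short:

```java
import java.util.IllegalFormatException;

public class OutputFormatSketch {
    // Hypothetical helper: builds one output line from column values.
    static String serialize(String format, Object... columns) {
        return String.format(format, columns);
    }

    // Hypothetical helper: true if Java accepts the format string.
    static boolean isValidFormat(String format, Object... columns) {
        try {
            String.format(format, columns);
            return true;
        } catch (IllegalFormatException e) {
            // Thrown for malformed specifiers such as "%1$$s".
            return false;
        }
    }

    public static void main(String[] args) {
        // Single '$': positional arguments, accepted by the formatter.
        System.out.println(serialize("%1$s %2$s %3$s", "10.0.0.1", "GET", "200"));
        // prints: 10.0.0.1 GET 200

        // Double '$$': rejected by the formatter.
        System.out.println(isValidFormat("%1$$s %2$$s %3$$s", "10.0.0.1", "GET", "200"));
        // prints: false
    }
}
```

Reading from the table only uses input.regex, which would explain why the doubled `$$` goes unnoticed until something is written to the table.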
