- Subscribe to RSS Feed
- Mark Question as New
- Mark Question as Read
- Float this Question for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page
Hive- escaping field delimiter in column value
- Labels:
-
Apache Hive
Created ‎08-24-2017 09:42 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
My dataset set is as below:
700,Angus (1995),Comedy
702,Faces (1968),Drama
703,Boys (1996),Drama
704,"Quest, The (1996)",Action|Adventure
705,Cosi (1996),Comedy
747,"Stupids, The (1996)",Comedy
Create table movies( movieid int, title String, genre string) row format delimited fields terminated by ',' ; Select title , genre from movies;
Since the rows have comma separated values, the records like 704,"Quest, The (1996)",Action|Adventure returns Quest as title, instead of Quest,The (1996).
And Genre value is shown as The(1996) instead of Action|Adventure.
How to load such data correctly by escaping the delimiter in the value ?
Created ‎08-25-2017 07:33 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Adding to @Sonu Sahi's reply, the CSVSerde is available in Hive 0.14 and greater. The following example creates a TSV (Tab-separated) file.
<code>CREATE
TABLE
my_table(a string, b string, ...)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH
SERDEPROPERTIES (
"separatorChar"
=
"\t"
,
"quoteChar"
=
"'"
,
"escapeChar"
=
"\\"
)
STORED
AS
TEXTFILE;
Default properties for SerDe is Comma-Separated (CSV) fileDEFAULT_ESCAPE_CHARACTER \
DEFAULT_QUOTE_CHARACTER "
DEFAULT_SEPARATOR ,
This SerDe works for most CSV data, but does not handle embedded newlines. To use the SerDe, specify the fully qualified class name org.apache.hadoop.hive.serde2.OpenCSVSerde.
If you want to use the TextFile format, then use 'ESCAPED BY' in the DDL.
"Enable escaping for the delimiter characters by using the 'ESCAPED BY' clause (such as ESCAPED BY '\')
Escaping is needed if you want to work with data that can contain these delimiter characters.
A custom NULL format can also be specified using the 'NULL DEFINED AS' clause (default is '\N')."
Created ‎08-24-2017 10:49 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Try the following:
create table movies(
movieid int,
title String,
genre string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
This should resolve your issue as the OpenCSVSerde's default properties should work for your use case.
Created ‎08-27-2017 08:05 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thank You @Sonu Sahi for the solution.
Created ‎08-25-2017 07:33 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Adding to @Sonu Sahi's reply, the CSVSerde is available in Hive 0.14 and greater. The following example creates a TSV (Tab-separated) file.
<code>CREATE
TABLE
my_table(a string, b string, ...)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH
SERDEPROPERTIES (
"separatorChar"
=
"\t"
,
"quoteChar"
=
"'"
,
"escapeChar"
=
"\\"
)
STORED
AS
TEXTFILE;
Default properties for SerDe is Comma-Separated (CSV) fileDEFAULT_ESCAPE_CHARACTER \
DEFAULT_QUOTE_CHARACTER "
DEFAULT_SEPARATOR ,
This SerDe works for most CSV data, but does not handle embedded newlines. To use the SerDe, specify the fully qualified class name org.apache.hadoop.hive.serde2.OpenCSVSerde.
If you want to use the TextFile format, then use 'ESCAPED BY' in the DDL.
"Enable escaping for the delimiter characters by using the 'ESCAPED BY' clause (such as ESCAPED BY '\')
Escaping is needed if you want to work with data that can contain these delimiter characters.
A custom NULL format can also be specified using the 'NULL DEFINED AS' clause (default is '\N')."
Created ‎08-27-2017 08:04 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thank you @Sindhu for the elaborate explanation and the solution..!!
Created ‎02-27-2018 03:39 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I have the same situation in log file. My data set looks like this
2017-11-29 16:19:39,217 DEBUG Conn -> 8=FIXT.1.1|9=76|35=A
2017-11-29 16:19:43,618 DEBUG Conn -> 8=FIXT.1.1|9=52|35=0|49=1151|56=BSEEQ|34=2
How to handle this Regex serde. my create table query is:
CREATE TABLE capital_log1 ( date STRING, filed2 STRING, filed3 STRING, filed4 STRING,filed7 map<bigint,string>) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' WITH SERDEPROPERTIES ( "input.regex" = "([^ ]*) ([^,]*)([^ ]*) (.*\->) (.*\|)", "output.format.string" = "%1$s %2$s %3$s %4$s %5$s" ) STORED AS TEXTFILE; It throws me an exception.
Created ‎11-06-2019 02:26 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
as you said,the escapeChar '\n' doesn't work? such as "the great wall
in china",although used escapeChar '\n' ,the result were resolved into "the great wall",NULL.
rather than "the great wall in china" in hive table?
Created ‎08-19-2022 04:26 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hello @ssubhas , the above worked however, when we try the same with LazySerde, it is able to escape the delimiter but loads few NULL values at the end. PFB snippet of statement I used: CREATE TABLE test1(5columns string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES(
'separatorChar'='|',
'escapeChar'='\\'
)
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
NOTE: also tried field.delim=|, format.serialization=|.
It works when serde properties are not mentioned and we use escape by Clause as you suggested, any way to make it work with LazySerde as well? (Data is Pipe delimited & may also have pipe in the data). Please suggest and help.
Created ‎03-25-2019 08:11 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hello Sir,
How do we store if via ETL layer '\n' is manipulated as '\\n'. org.apache.hadoop.hive.serde2.OpenCSVSerde does not work as data gets misaligned. Is there any way to load it without preprocessing?
Created ‎01-24-2020 09:04 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi,
You can use (ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde') as well.
Create Table tablename (columns) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde';
Thanks,
Manoj
