Support Questions

Find answers, ask questions, and share your expertise

Hive and comma delimited fields in managed table

avatar
Contributor

Hi everyone,

 

I am uploading Navigator logs into Hive for analysis. I used this: 

ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"quoteChar" = '"',
"escapeChar" = "\\"
)
STORED AS TEXTFILE

 

Based off: https://community.cloudera.com/t5/Support-Questions/Hive-escaping-field-delimiter-in-column-value/m-... to get an external table to handle the commas in queries. I think it's working fine (but will test tomorrow to make sure).

 

However, I then do a move to a managed partitioned table, but it is not handling the commas in queries correctly. I created it as:

PARTITIONED BY(event_day INT, event_month INT, event_year INT);

 

How do I handle the commas in queries as part of this move?

Does the managed table need the same delimter & escape info as the external table?

If that's the case, what's the syntax for it?

 

Schwifty

1 ACCEPTED SOLUTION

avatar
Contributor

Hi,

 

This isn't meant to be a blog post. Here's the answer:

 

csv file has lots of new line characters. Probably (I'm guessing) from where the devs were writing their queries? Who knows. Either way, it looked like this:

 

Timestamp,Username,"IP Address","Service Name",Operation,Resource,Allowed,Impersonator,sub_operation,entity_id,stored_object_name,additional_info,collection_name,solr_version,operation_params,service,operation_text,url,operation_text,table_name,resource_path,database_name,object_type,Source,Destination,Permissions,"Delegation Token ID","Table Name",Family,Qualifier,"Operation Text","Database Name","Table Name","Object Type","Resource Path","Usage Type","Operation Text","Database Name","Table Name","Object Type","Resource Path","Usage Type","Operation Text","Query ID","Session ID",Status,"Database Name","Table Name","Object Type",Privilege$

2020-07-01T22:49:13.000Z,user1,::ffff:xx.xx.xx.xx,IMPALA,QUERY,env_db:table_

$

name,true,"hue/host19.fqdn@DOMAIN",,,,,,,,,,,"select * fro$

m table_name",table_name2 etc etc",,,

 

Crazy. New line characters in the middle of words. Found this online:

 

awk 'NR == 1{ printf $0; next }; { printf "%s%s", (/^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]T[0-9][0-9]:[0-9][0-9]+/? ORS : ""), $0 } END{ print "" }' inputfile.csv > outputfile.csv

 

I am terrible at regex. Can't tell you why it's doing what it's doing. But it works. Check against the timestamp, which each line starts with: 2020-07-08T23:49:13.000Z

 

Strip off the header line first, then run the newline stripper code:

sed -i 1d inputfile.csv

 

Testing looks good. Time for lunch.

View solution in original post

2 REPLIES 2

avatar
Contributor

So, this:

ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"quoteChar" = '"',
"escapeChar" = "\\"
)
STORED AS TEXTFILE

does not handle commas in csv records, even if they are enclosed with " " (ie: "CREATE TABLE db.table AS ( SELECT db.col1, db.col2, db.col3, db.col3,... etc"

 

There has to be a way to do this. Does anyone know?

avatar
Contributor

Hi,

 

This isn't meant to be a blog post. Here's the answer:

 

csv file has lots of new line characters. Probably (I'm guessing) from where the devs were writing their queries? Who knows. Either way, it looked like this:

 

Timestamp,Username,"IP Address","Service Name",Operation,Resource,Allowed,Impersonator,sub_operation,entity_id,stored_object_name,additional_info,collection_name,solr_version,operation_params,service,operation_text,url,operation_text,table_name,resource_path,database_name,object_type,Source,Destination,Permissions,"Delegation Token ID","Table Name",Family,Qualifier,"Operation Text","Database Name","Table Name","Object Type","Resource Path","Usage Type","Operation Text","Database Name","Table Name","Object Type","Resource Path","Usage Type","Operation Text","Query ID","Session ID",Status,"Database Name","Table Name","Object Type",Privilege$

2020-07-01T22:49:13.000Z,user1,::ffff:xx.xx.xx.xx,IMPALA,QUERY,env_db:table_

$

name,true,"hue/host19.fqdn@DOMAIN",,,,,,,,,,,"select * fro$

m table_name",table_name2 etc etc",,,

 

Crazy. New line characters in the middle of words. Found this online:

 

awk 'NR == 1{ printf $0; next }; { printf "%s%s", (/^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]T[0-9][0-9]:[0-9][0-9]+/? ORS : ""), $0 } END{ print "" }' inputfile.csv > outputfile.csv

 

I am terrible at regex. Can't tell you why it's doing what it's doing. But it works. Check against the timestamp, which each line starts with: 2020-07-08T23:49:13.000Z

 

Strip off the header line first, then run the newline stripper code:

sed -i 1d inputfile.csv

 

Testing looks good. Time for lunch.