Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

Hive and comma delimited fields in managed table

avatar
Contributor

Hi everyone,

 

I am uploading Navigator logs into Hive for analysis. I used this: 

ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"quoteChar" = '"',
"escapeChar" = "\\"
)
STORED AS TEXTFILE

 

Based off: https://community.cloudera.com/t5/Support-Questions/Hive-escaping-field-delimiter-in-column-value/m-... to get an external table to handle the commas in queries. I think it's working fine (but will test tomorrow to make sure).

 

However, I then do a move to a managed partitioned table, but it is not handling the commas in queries correctly. I created it as:

PARTITIONED BY(event_day INT, event_month INT, event_year INT);

 

How do I handle the commas in queries as part of this move?

Does the managed table need the same delimter & escape info as the external table?

If that's the case, what's the syntax for it?

 

Schwifty

1 ACCEPTED SOLUTION

avatar
Contributor

Hi,

 

This isn't meant to be a blog post. Here's the answer:

 

csv file has lots of new line characters. Probably (I'm guessing) from where the devs were writing their queries? Who knows. Either way, it looked like this:

 

Timestamp,Username,"IP Address","Service Name",Operation,Resource,Allowed,Impersonator,sub_operation,entity_id,stored_object_name,additional_info,collection_name,solr_version,operation_params,service,operation_text,url,operation_text,table_name,resource_path,database_name,object_type,Source,Destination,Permissions,"Delegation Token ID","Table Name",Family,Qualifier,"Operation Text","Database Name","Table Name","Object Type","Resource Path","Usage Type","Operation Text","Database Name","Table Name","Object Type","Resource Path","Usage Type","Operation Text","Query ID","Session ID",Status,"Database Name","Table Name","Object Type",Privilege$

2020-07-01T22:49:13.000Z,user1,::ffff:xx.xx.xx.xx,IMPALA,QUERY,env_db:table_

$

name,true,"hue/host19.fqdn@DOMAIN",,,,,,,,,,,"select * fro$

m table_name",table_name2 etc etc",,,

 

Crazy. New line characters in the middle of words. Found this online:

 

awk 'NR == 1{ printf $0; next }; { printf "%s%s", (/^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]T[0-9][0-9]:[0-9][0-9]+/? ORS : ""), $0 } END{ print "" }' inputfile.csv > outputfile.csv

 

I am terrible at regex. Can't tell you why it's doing what it's doing. But it works. Check against the timestamp, which each line starts with: 2020-07-08T23:49:13.000Z

 

Strip off the header line first, then run the newline stripper code:

sed -i 1d inputfile.csv

 

Testing looks good. Time for lunch.

View solution in original post

2 REPLIES 2

avatar
Contributor

So, this:

ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"quoteChar" = '"',
"escapeChar" = "\\"
)
STORED AS TEXTFILE

does not handle commas in csv records, even if they are enclosed with " " (ie: "CREATE TABLE db.table AS ( SELECT db.col1, db.col2, db.col3, db.col3,... etc"

 

There has to be a way to do this. Does anyone know?

avatar
Contributor

Hi,

 

This isn't meant to be a blog post. Here's the answer:

 

csv file has lots of new line characters. Probably (I'm guessing) from where the devs were writing their queries? Who knows. Either way, it looked like this:

 

Timestamp,Username,"IP Address","Service Name",Operation,Resource,Allowed,Impersonator,sub_operation,entity_id,stored_object_name,additional_info,collection_name,solr_version,operation_params,service,operation_text,url,operation_text,table_name,resource_path,database_name,object_type,Source,Destination,Permissions,"Delegation Token ID","Table Name",Family,Qualifier,"Operation Text","Database Name","Table Name","Object Type","Resource Path","Usage Type","Operation Text","Database Name","Table Name","Object Type","Resource Path","Usage Type","Operation Text","Query ID","Session ID",Status,"Database Name","Table Name","Object Type",Privilege$

2020-07-01T22:49:13.000Z,user1,::ffff:xx.xx.xx.xx,IMPALA,QUERY,env_db:table_

$

name,true,"hue/host19.fqdn@DOMAIN",,,,,,,,,,,"select * fro$

m table_name",table_name2 etc etc",,,

 

Crazy. New line characters in the middle of words. Found this online:

 

awk 'NR == 1{ printf $0; next }; { printf "%s%s", (/^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]T[0-9][0-9]:[0-9][0-9]+/? ORS : ""), $0 } END{ print "" }' inputfile.csv > outputfile.csv

 

I am terrible at regex. Can't tell you why it's doing what it's doing. But it works. Check against the timestamp, which each line starts with: 2020-07-08T23:49:13.000Z

 

Strip off the header line first, then run the newline stripper code:

sed -i 1d inputfile.csv

 

Testing looks good. Time for lunch.