Created on 07-14-2020 12:38 AM - edited 09-16-2022 07:38 AM
Hi everyone,
I am uploading Navigator logs into Hive for analysis. I used this:
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"quoteChar" = '"',
"escapeChar" = "\\"
)
STORED AS TEXTFILE
This is based on https://community.cloudera.com/t5/Support-Questions/Hive-escaping-field-delimiter-in-column-value/m-... and is meant to let an external table handle the commas inside the query text. I think it's working fine (but will test tomorrow to make sure).
However, when I then move the data into a managed, partitioned table, the commas inside the queries are not handled correctly. I created that table with:
PARTITIONED BY(event_day INT, event_month INT, event_year INT);
How do I handle the commas in queries as part of this move?
Does the managed table need the same delimiter & escape info as the external table?
If that's the case, what's the syntax for it?
Schwifty
Created 07-14-2020 08:08 PM
Hi,
This isn't meant to be a blog post. Here's the answer:
The CSV file has lots of newline characters inside its records. Probably (I'm guessing) from where the devs were writing their queries? Who knows. Either way, it looked like this:
Timestamp,Username,"IP Address","Service Name",Operation,Resource,Allowed,Impersonator,sub_operation,entity_id,stored_object_name,additional_info,collection_name,solr_version,operation_params,service,operation_text,url,operation_text,table_name,resource_path,database_name,object_type,Source,Destination,Permissions,"Delegation Token ID","Table Name",Family,Qualifier,"Operation Text","Database Name","Table Name","Object Type","Resource Path","Usage Type","Operation Text","Database Name","Table Name","Object Type","Resource Path","Usage Type","Operation Text","Query ID","Session ID",Status,"Database Name","Table Name","Object Type",Privilege$
2020-07-01T22:49:13.000Z,user1,::ffff:xx.xx.xx.xx,IMPALA,QUERY,env_db:table_
$
name,true,"hue/host19.fqdn@DOMAIN",,,,,,,,,,,"select * fro$
m table_name",table_name2 etc etc",,,
Crazy. Newline characters in the middle of words. Found this online:
awk 'NR == 1{ printf "%s", $0; next }; { printf "%s%s", (/^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]T[0-9][0-9]:[0-9][0-9]+/? ORS : ""), $0 } END{ print "" }' inputfile.csv > outputfile.csv
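I am terrible at regex, so I can't explain every detail, but it works. The idea is to check each line against the timestamp that every record starts with, e.g. 2020-07-08T23:49:13.000Z: a line that starts with a timestamp begins a new record, and any other line is appended to the previous one.
Strip off the header line first, then run the newline-stripping awk command above: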
sed -i 1d inputfile.csv
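For anyone who wants to see the two steps work on something small, here is a minimal sketch. The file names (sample.csv, joined.csv) and the sample rows are made up for illustration and are not from the real Navigator export. It uses sed 1d in a pipe rather than sed -i, so the original file is left untouched, and it uses printf "%s", $0 rather than printf $0 so that a % inside a query is not treated as a format specifier.

```shell
# Hypothetical sample: a header plus two records, the first record
# broken across two physical lines in the middle of a word.
printf '%s\n' \
  'Timestamp,Username,"Service Name","Operation Text",Allowed' \
  '2020-07-01T22:49:13.000Z,user1,IMPALA,"select * fro' \
  'm table_name",true' \
  '2020-07-01T22:50:00.000Z,user2,HIVE,"select 1",true' > sample.csv

# Step 1: drop the header line (sed 1d).
# Step 2: start a new output line only when a line begins with a
#         timestamp; any other line is appended to the current record.
sed 1d sample.csv | awk '
  NR == 1 { printf "%s", $0; next }
  { printf "%s%s", (/^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]T[0-9][0-9]:[0-9][0-9]/ ? ORS : ""), $0 }
  END { print "" }
' > joined.csv

cat joined.csv
```

With this sample, joined.csv should end up with two lines, one per record, and the first should contain "select * from table_name" as a single quoted field. This only works because every record starts with a timestamp; a continuation line that itself happened to start with one would be split off as a new record.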
Testing looks good. Time for lunch.
Created 07-14-2020 04:58 PM
So, this:
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"quoteChar" = '"',
"escapeChar" = "\\"
)
STORED AS TEXTFILE
does not handle commas in CSV records, even when they are enclosed in double quotes (e.g. "CREATE TABLE db.table AS ( SELECT db.col1, db.col2, db.col3, db.col3,... etc").
There has to be a way to do this. Does anyone know?