Support Questions

getschwifty · ‎07-14-2020

Hi everyone,

I am uploading Navigator logs into Hive for analysis. I used this:

ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"quoteChar" = '"',
"escapeChar" = "\\"
)
STORED AS TEXTFILE

Based off: https://community.cloudera.com/t5/Support-Questions/Hive-escaping-field-delimiter-in-column-value/m-... to get an external table to handle the commas in queries. I think it's working fine (but will test tomorrow to make sure).

However, I then do a move to a managed partitioned table, but it is not handling the commas in queries correctly. I created it as:

PARTITIONED BY(event_day INT, event_month INT, event_year INT);

How do I handle the commas in queries as part of this move?

Does the managed table need the same delimter & escape info as the external table?

If that's the case, what's the syntax for it?

Schwifty

getschwifty · ‎07-14-2020

Hi,

This isn't meant to be a blog post. Here's the answer:

csv file has lots of new line characters. Probably (I'm guessing) from where the devs were writing their queries? Who knows. Either way, it looked like this:

Timestamp,Username,"IP Address","Service Name",Operation,Resource,Allowed,Impersonator,sub_operation,entity_id,stored_object_name,additional_info,collection_name,solr_version,operation_params,service,operation_text,url,operation_text,table_name,resource_path,database_name,object_type,Source,Destination,Permissions,"Delegation Token ID","Table Name",Family,Qualifier,"Operation Text","Database Name","Table Name","Object Type","Resource Path","Usage Type","Operation Text","Database Name","Table Name","Object Type","Resource Path","Usage Type","Operation Text","Query ID","Session ID",Status,"Database Name","Table Name","Object Type",Privilege$

2020-07-01T22:49:13.000Z,user1,::ffff:xx.xx.xx.xx,IMPALA,QUERY,env_db:table_

$

name,true,"hue/host19.fqdn@DOMAIN",,,,,,,,,,,"select * fro$

m table_name",table_name2 etc etc",,,

Crazy. New line characters in the middle of words. Found this online:

awk 'NR == 1{ printf $0; next }; { printf "%s%s", (/^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]T[0-9][0-9]:[0-9][0-9]+/? ORS : ""), $0 } END{ print "" }' inputfile.csv > outputfile.csv

I am terrible at regex. Can't tell you why it's doing what it's doing. But it works. Check against the timestamp, which each line starts with: 2020-07-08T23:49:13.000Z

Strip off the header line first, then run the newline stripper code:

sed -i 1d inputfile.csv

Testing looks good. Time for lunch.

View solution in original post

getschwifty · ‎07-14-2020

So, this:

ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"quoteChar" = '"',
"escapeChar" = "\\"
)
STORED AS TEXTFILE

does not handle commas in csv records, even if they are enclosed with " " (ie: "CREATE TABLE db.table AS ( SELECT db.col1, db.col2, db.col3, db.col3,... etc"

There has to be a way to do this. Does anyone know?

getschwifty · ‎07-14-2020

Hi,

This isn't meant to be a blog post. Here's the answer:

csv file has lots of new line characters. Probably (I'm guessing) from where the devs were writing their queries? Who knows. Either way, it looked like this:

Timestamp,Username,"IP Address","Service Name",Operation,Resource,Allowed,Impersonator,sub_operation,entity_id,stored_object_name,additional_info,collection_name,solr_version,operation_params,service,operation_text,url,operation_text,table_name,resource_path,database_name,object_type,Source,Destination,Permissions,"Delegation Token ID","Table Name",Family,Qualifier,"Operation Text","Database Name","Table Name","Object Type","Resource Path","Usage Type","Operation Text","Database Name","Table Name","Object Type","Resource Path","Usage Type","Operation Text","Query ID","Session ID",Status,"Database Name","Table Name","Object Type",Privilege$

2020-07-01T22:49:13.000Z,user1,::ffff:xx.xx.xx.xx,IMPALA,QUERY,env_db:table_

$

name,true,"hue/host19.fqdn@DOMAIN",,,,,,,,,,,"select * fro$

m table_name",table_name2 etc etc",,,

Crazy. New line characters in the middle of words. Found this online:

awk 'NR == 1{ printf $0; next }; { printf "%s%s", (/^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]T[0-9][0-9]:[0-9][0-9]+/? ORS : ""), $0 } END{ print "" }' inputfile.csv > outputfile.csv

I am terrible at regex. Can't tell you why it's doing what it's doing. But it works. Check against the timestamp, which each line starts with: 2020-07-08T23:49:13.000Z

Strip off the header line first, then run the newline stripper code:

sed -i 1d inputfile.csv

Testing looks good. Time for lunch.

Cloudera Community

Support Questions

Hive and comma delimited fields in managed table

Using Regular Expressions to Extract Fields for Hi...

Escaping field delimiters

Hive Temporary Tables.

Hive ORC or AVRO format with field delimiters

comma in between data of csv mapped to external ta...

how to load double quotes data of fields in hive t...

Is there a way to convert locally managed table to...

HIVE MANAGED TABLE

How to drop delimiter in hive table

Trouble loading CSV data with embedded double quot...