10-04-2017 10:14 AM
I created an external table in hive with Avro as serde format but I found that if new line characters exist in the data hive returns incorrect results when executing any query that returns that information.
The specification of the table is:
CREATE TABLE IF NOT EXISTS `employment`( `fid` int COMMENT '', `name` string COMMENT '' PARTITIONED BY ( `year` string, `month` string, `day` string) CLUSTERED BY ( fid) INTO 100 BUCKETS ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' LOCATION 'hdfs:///workarea/productos/pf/db/origenes/employment';
The data is loaded through sqoop and in my test case the name column contains new line chars so when I execute the query:
SELECT fid, name FROM employment WHERE year='2017' AND month='03' AND day='31' AND fid=123;
I get these results:
Assuming that the second row contains break line chars between the two words.
The same query on impala works fine.
And the solutions I've found related to this problem advice you to use binary formats such as SequenceFile or Avro, but that's what i'm already using.
And the other solution is to use some replacement function like
But that solution works on the specific case of a known column with the "problem", and maybe other columns or tables can have the same "problem" in the future.
So I think I'd be easy to recreate the table with the appropiate configuration than changing queries on several processes that use the emerging problematic table.
Is there any other option?
My CDH version is 5.8.0