Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

Hive join fails with different table formats

Hive join fails with different table formats

Contributor

Hi,

I have complex Hive join using ORC tables, that uses some sub-query twice. I tried to optimize it by storing result of subquery into temporary table. I forgot to define storage format for the temp table so it was created as CSV by default. The query seemed to work (completed with no error) but the results were wrong (few extra lines added with some NULL columns). After I changed the temp table storage type to ORC as the other tables, the query took significantly longer time, but the data are correct.

The temp table creation (in CSV) :

CREATE TABLE mmd_test_glims1_global_raw.lot_temp_union AS SELECT *,'0000-00-00-00-00-00' as bdp_version FROM mmd_test_glims1_global.lot
                UNION ALL SELECT * FROM mmd_test_glims1_global_raw.lot WHERE bdp_version='2016-06-24-18-47-05';

And the view with join:

CREATE OR REPLACE VIEW mmd_test_glims1_global_raw.lot_temp_view AS
    SELECT t.* FROM mmd_test_glims1_global_raw.lot_temp_union t
    JOIN
        (SELECT LOT_NUMBER, max(bdp_version) max_modified FROM mmd_test_glims1_global_raw.lot_temp_union GROUP BY LOT_NUMBER) s
    ON t.LOT_NUMBER<=>s.LOT_NUMBER AND  t.bdp_version=s.max_modified;

Have you seen such behavior? I thought it is safe to mix storage formats in a single query.

Thanks for any input.

Regards,

Pavel

3 REPLIES 3

Re: Hive join fails with different table formats

Super Guru
  • what version of hive are you on?
  • Do you mind posting your DDL?

Re: Hive join fails with different table formats

New Contributor

Hi, I have a similar problem on hive 1.2.

My ddls:

'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'

CREATE TABLE `dask_tech.orc_table`(
`vat_nip` varchar(15),
`firmano` decimal(10,0),
`mailadr_nazwa` varchar(255))
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'

\

CREATE TABLE `dask_tech.txt_table`(
`val_orig` varchar(100),
`val_masked` varchar(100))
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'

Query from orc - 3 lines in result.

select mailadr_nazwa from dask_tech.orc_table
where FIRMANO=45601;

Designers' Baby P Paula Cz
Designers' Baby P Paula Cz
Designers' Baby P Paula Cz

Query with join to txt - 6 lines in result.

select mailadr_nazwa from dask_tech.orc_table
left join dask_tech.txt_table m1 on m1.val_orig = vat_nip
where FIRMANO=45601;


Designers' Baby P
Paula Cz
Designers' Baby P
Paula Cz
Designers' Baby P
Paula Cz

Do you have any idea what might be the reason?

Re: Hive join fails with different table formats

New Contributor

I solved the problem. The results are correct after changing the parameter:
hive.query.result.fileformat=sequencefile
I had the default value for my HIVE version: textFile.