Support Questions

Find answers, ask questions, and share your expertise

Impala CREATE TABLE from SELECT is dropping/missing rows

New Contributor

Consider the following query in Impala:

 

SELECT count(*) FROM (
SELECT
.
.
.
sum(revenue) as revenue,
date_time_hours,
date_time_day,
date_time_month,
date_time_year
FROM customer_records
WHERE
date_time_day=day(DAYS_SUB(now(), 1)) AND
date_time_month=month(DAYS_SUB(now(), 1)) AND
date_time_year=year(DAYS_SUB(now(), 1)) AND
cust_id='25'
GROUP BY
.
.
.
date_time_hours,
date_time_day,
date_time_month,
date_time_year
) A1

 

This gives me a count of 3,458,017 rows.

 

However, if I do the following:

 

CREATE TABLE C1 AS
SELECT
.
.
.
sum(revenue) as revenue,
date_time_hours,
date_time_day,
date_time_month,
date_time_year
FROM customer_records
WHERE
date_time_day=day(DAYS_SUB(now(), 1)) AND
date_time_month=month(DAYS_SUB(now(), 1)) AND
date_time_year=year(DAYS_SUB(now(), 1)) AND
cust_id='25'
GROUP BY
.
.
.
date_time_hours,
date_time_day,
date_time_month,
date_time_year

 

This gives me the following output:

 

Inserted 817391 row(s)

 

Why am I missing rows?

 

I am performing these queries in the Hue dashboard.

3 REPLIES 3

Expert Contributor

Can you attach the query profile of both these queries?

 

CM > Impala > Queries. Ensure that time range on the top right covers the duration of the query. (Click on "Query Details" on the drop down at the right hand side) > "Download Text Profile".

New Contributor
Hello,

Here is the profile for the first case:

https://pastebin.com/JTHKxqbY

And for the second one:

https://pastebin.com/bv3msxyq

Thank you
Regards
Nandeep

Expert Contributor

Only difference i see between working and problematic query is

 

    Errors: Snappy: GetUncompressedLength failed
File 'hdfs://itchil800.inteliquent.com:8020/user/inteliquent/data/cdr/database/call_date_time_year=2020/call_date_time_month=2/call_date_time_day=10/call_date_time_hours=16/part-00192-7937e39f-9e89-4247-8561-f0491d26c970.c000.snappy.parquet' is corrupt: error decoding value of type DECIMAL(12,12) at offset 1766 (1 of 286 similar)

 

00:SCAN HDFS               3   52.015ms   62.367ms    4.47M          -1    3.15 GB      440.00 MB  call_detail_records.raw

-->

00:SCAN HDFS               3  221.143ms  297.334ms  54.04M          -1    2.02 GB      320.00 MB  call_detail_records.raw

 

When run as CTAS select, I can see multiple files are reported to have compression error. And in good run there is 54.04 million rows read and in bad run 4.47 million rows are read. 

 

Can we know what data format is used for this table call_detail_records.raw (please provide describe formatted call_detail_records.raw result)

 

Do you have a text formatted version of the same table and can replicate the issue in text format?

Take a Tour of the Community
Don't have an account?
Your experience may be limited. Sign in to explore more.