Explorer
Posts: 24
Registered: ‎11-15-2016

Impala can't query gz file

Hi,

 

I have an application with an external table backed by a "big" gz file: 2.9 GB compressed, 30 GB uncompressed, with ~75M rows (I know it's an antipattern and it's slow, but I only manage the cluster and don't write the application).

 

Even simple queries (SELECT * FROM table LIMIT 1 or SELECT COUNT(*) FROM table) on this table fail with the following errors (from /var/log/impalad):

 

I0724 12:30:10.699849 31703 runtime-state.cc:209] Error from query 954bc4c192defb20:183b125985dcb690: For better performance, snappy-, gzip-, and bzip-compressed files should not be split into multiple HDFS blocks. file=hdfs://nameservice/path/to/file.gz offset 2818572288
I0724 12:30:10.723881 31707 status.cc:44] Gzip Data error, likely data corrupted in this block.
    @           0x80ff0a  (unknown)
    @           0xb25369  (unknown)
    @           0xc09135  (unknown)
    @           0xc098a4  (unknown)
    @           0xc0a2ed  (unknown)
    @           0xc0b6b5  (unknown)
    @           0xc0c005  (unknown)
    @           0xbeed16  (unknown)
    @           0xbefdfe  (unknown)
    @           0xb8f4c7  (unknown)
    @           0xb8fe04  (unknown)
    @           0xdf5f3a  (unknown)
    @     0x7fa6653adaa1  start_thread
    @     0x7fa6642feaad  clone
I0724 12:30:10.723917 31707 hdfs-scan-node.cc:1240] Scan node (id=0) ran into a parse error for scan range hdfs://nameservice/path/to/file.gz(0:2930219325). Processed 8388608 bytes.
I0724 12:30:10.724810 31703 runtime-state.cc:209] Error from query 954bc4c192defb20:183b125985dcb690: Gzip Data error, likely data corrupted in this block.
I0724 12:30:10.725317   742 coordinator.cc:1434] Cancel() query_id=954bc4c192defb20:183b125985dcb690

I've tested the file with all the classic unix tools and the gz file is NOT corrupted and the same queries work fine using Hive. I'm using CDH 5.8.3.

 

Is this a known issue?

 

 

Posts: 519
Topics: 14
Kudos: 92
Solutions: 45
Registered: ‎09-02-2016

Re: Impala can't query gz file

@parnigot

 

There are a couple of possible reasons:

 

1. Can you please share your table structure? If you have too many columns, omit the column list and share the remaining parts, like ROW FORMAT SERDE, LOCATION, etc.

 

This is possible because Hive supports some additional file formats compared to Impala. Hive's SerDes can also be extended to support custom file formats, but Impala may not support some of the Hive SerDes.

 

2. Where do you keep the source file, on the local filesystem or in HDFS? If it is local, your unix tools are sufficient to check for corruption, but to check for corrupted blocks in HDFS you have to use the hdfs fsck command.
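To illustrate the second point, this is roughly how a block-level check could look (the path is a placeholder; this is a sketch of the standard HDFS tooling, not commands taken from the thread):

```shell
# Report block-level health of the file as HDFS sees it
# (missing, corrupt, or under-replicated blocks show up here
# even when the local copy of the file decompresses fine).
hdfs fsck /path/to/file.gz -files -blocks -locations

# Optionally compare a freshly fetched copy against the original
# upload to rule out a damaged write:
hdfs dfs -get /path/to/file.gz local_copy.gz
md5sum local_copy.gz
```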

 

Explorer
Posts: 24
Registered: ‎11-15-2016

Re: Impala can't query gz file

Hi saranvisa,

 

This is the combined output of SHOW CREATE TABLE and DESCRIBE FORMATTED:

 

CREATE EXTERNAL TABLE db_name.table_name (
  col00 STRING,
  col01 STRING,
  col02 STRING,
  col03 STRING,
  col04 STRING,
  col05 STRING,
  col06 STRING,
  col07 STRING,
  col08 STRING,
  col09 STRING,
  col10 STRING,
  col11 STRING,
  col12 STRING,
  col13 STRING,
  col14 STRING,
  col15 STRING,
  col16 STRING,
  col17 STRING,
  col18 STRING,
  col19 STRING,
  col20 STRING,
  col21 STRING,
  col22 STRING,
  col23 STRING,
  col24 STRING,
  col25 STRING,
  col26 STRING,
  col27 STRING,
  col28 STRING,
  col29 STRING,
  col30 STRING,
  col31 STRING,
  col32 STRING,
  col33 STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ';'
WITH SERDEPROPERTIES ('serialization.format'=';', 'field.delim'=';')
STORED AS TEXTFILE
LOCATION 'hdfs://nameservice/path'
TBLPROPERTIES ('numFiles'='0', 'COLUMN_STATS_ACCURATE'='false', 'skip.header.line.count'='1', 'transient_lastDdlTime'='1500561063', 'numRows'='-1', 'totalSize'='0', 'rawDataSize'='-1')


| # Detailed Table Information    | NULL                                                       | NULL                 |
| Database:                       | db_name                                                    | NULL                 |
| Owner:                          | hive                                                       | NULL                 |
| CreateTime:                     | Thu Jul 20 16:31:03 CEST 2017                              | NULL                 |
| LastAccessTime:                 | UNKNOWN                                                    | NULL                 |
| Protect Mode:                   | None                                                       | NULL                 |
| Retention:                      | 0                                                          | NULL                 |
| Location:                       | hdfs://nameservice/path                                    | NULL                 |
| Table Type:                     | EXTERNAL_TABLE                                             | NULL                 |
| Table Parameters:               | NULL                                                       | NULL                 |
|                                 | COLUMN_STATS_ACCURATE                                      | false                |
|                                 | EXTERNAL                                                   | TRUE                 |
|                                 | numFiles                                                   | 0                    |
|                                 | numRows                                                    | -1                   |
|                                 | rawDataSize                                                | -1                   |
|                                 | skip.header.line.count                                     | 1                    |
|                                 | totalSize                                                  | 0                    |
|                                 | transient_lastDdlTime                                      | 1500561063           |
|                                 | NULL                                                       | NULL                 |
| # Storage Information           | NULL                                                       | NULL                 |
| SerDe Library:                  | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe         | NULL                 |
| InputFormat:                    | org.apache.hadoop.mapred.TextInputFormat                   | NULL                 |
| OutputFormat:                   | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat | NULL                 |
| Compressed:                     | No                                                         | NULL                 |
| Num Buckets:                    | -1                                                         | NULL                 |
| Bucket Columns:                 | []                                                         | NULL                 |
| Sort Columns:                   | []                                                         | NULL                 |
| Storage Desc Params:            | NULL                                                       | NULL                 |
|                                 | field.delim                                                | ;                    |
|                                 | serialization.format                                       | ;                    |

For the second question: the file is in HDFS. After the errors in Impala, I performed the following tests:

 

  1. I downloaded the file from HDFS to the local fs and extracted it successfully with gzip
  2. I executed some queries with Hive and they worked
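For reference, the gzip-level part of those checks can be sketched like this, using a small generated file as a stand-in (the real file would first be fetched with something like `hdfs dfs -get /path/to/file.gz .`):

```shell
set -e
# Sample gz file standing in for the real 2.9 GB export
seq 1 100 | gzip > file.gz

# gzip -t exits non-zero if the compressed stream is damaged
gzip -t file.gz && echo "gzip integrity OK"

# Row count of the uncompressed data, without writing it to disk
zcat file.gz | wc -l
```

Both checks passing, as in this thread, points away from file corruption and toward how the reader (here Impala) handles the file.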
Posts: 519
Topics: 14
Kudos: 92
Solutions: 45
Registered: ‎09-02-2016

Re: Impala can't query gz file

@parnigot

 

It seems Impala has some restrictions, e.g. the maximum size that Impala can accommodate for an individual bzip file is 1 GB (after uncompression). Please refer to the Note section in the link below:

 

https://www.cloudera.com/documentation/enterprise/5-6-x/topics/impala_txtfile.html#text_performance
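If a per-file size limit is indeed the cause, one possible workaround is to recompress the export as several smaller gzip files in the table's directory; since Impala cannot split a gzip file anyway, multiple files also let the scan parallelize. A minimal sketch, with illustrative line counts and filenames:

```shell
set -e
# Sample data standing in for the real export
seq 1 1000 > data.txt
gzip data.txt                      # data.txt.gz

# Recompress as fixed-size chunks, each its own .gz file
zcat data.txt.gz | split -l 300 - part_
for p in part_a?; do gzip "$p"; done

# Verify no rows were lost across the chunks
zcat part_*.gz | wc -l
```

In practice the chunk size would be chosen so each file stays well under the documented limit after decompression, and the `skip.header.line.count` header handling would need to be rechecked, since only the first chunk would carry the header.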

 

 
