I have an application that uses an external table backed by a single "big" gzip file: 2.9GB compressed, ~30GB uncompressed, ~75M rows (I know this is an antipattern and it's slow, but I only manage the cluster; I don't write the application).
Even simple queries on this table (SELECT * FROM table LIMIT 1 or SELECT COUNT(*) FROM table) fail with the following errors (from /var/log/impalad):
I0724 12:30:10.699849 31703 runtime-state.cc:209] Error from query 954bc4c192defb20:183b125985dcb690: For better performance, snappy-, gzip-, and bzip-compressed files should not be split into multiple HDFS blocks. file=hdfs://nameservice/path/to/file.gz offset 2818572288
I0724 12:30:10.723881 31707 status.cc:44] Gzip Data error, likely data corrupted in this block.
    @           0x80ff0a  (unknown)
    @           0xb25369  (unknown)
    @           0xc09135  (unknown)
    @           0xc098a4  (unknown)
    @           0xc0a2ed  (unknown)
    @           0xc0b6b5  (unknown)
    @           0xc0c005  (unknown)
    @           0xbeed16  (unknown)
    @           0xbefdfe  (unknown)
    @           0xb8f4c7  (unknown)
    @           0xb8fe04  (unknown)
    @           0xdf5f3a  (unknown)
    @     0x7fa6653adaa1  start_thread
    @     0x7fa6642feaad  clone
I0724 12:30:10.723917 31707 hdfs-scan-node.cc:1240] Scan node (id=0) ran into a parse error for scan range hdfs://nameservice/path/to/file.gz(0:2930219325). Processed 8388608 bytes.
I0724 12:30:10.724810 31703 runtime-state.cc:209] Error from query 954bc4c192defb20:183b125985dcb690: Gzip Data error, likely data corrupted in this block.
I0724 12:30:10.725317   742 coordinator.cc:1434] Cancel() query_id=954bc4c192defb20:183b125985dcb690
I've tested the file with all the classic unix tools and the gz file is NOT corrupted, and the same queries work fine in Hive. I'm using CDH 5.8.3.
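The "should not be split into multiple HDFS blocks" warning is the key hint here: a gzip stream can only be decompressed from its beginning, so a scanner handed a byte range starting mid-file cannot decode it, which surfaces as a "data corrupted" error even though the file itself is fine. A small sketch illustrating this (synthetic data, not the actual table file):

```python
import gzip
import io
import zlib

# Build a small gzip file in memory as a stand-in for the real 2.9 GB file.
raw = b"a;b;c;d\n" * 10000
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as f:
    f.write(raw)
data = buf.getvalue()

# Decompressing from offset 0 works fine -- the file is not corrupted.
assert gzip.decompress(data) == raw

# Starting from an arbitrary offset (as a scanner assigned a later HDFS
# block would) fails: there is no gzip header mid-stream.
try:
    zlib.decompress(data[len(data) // 2:], 16 + zlib.MAX_WBITS)
    mid_stream_failed = False
except zlib.error:
    mid_stream_failed = True
print("mid-stream decompress failed as expected:", mid_stream_failed)
```

This is exactly why a perfectly valid gzip file can still trigger "Gzip Data error" when an engine tries to process HDFS blocks past the first in parallel.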
Is this a known issue?
There are a couple of possible reasons:
1. Can you please share your table structure? If you have too many columns, omit the column list and share the remaining parts (ROW FORMAT SERDE, LOCATION, etc.).
This is possible because Hive supports some additional file formats compared to Impala. Hive SerDes can also be extended to support custom file formats, and Impala may not support some of those Hive SerDes.
2. Where do you keep the source file, local or HDFS? If it is local, your unix tools are sufficient to check for corruption, but for corrupted blocks in HDFS you have to use the hdfs fsck command.
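To illustrate the two checks, here's a minimal sketch. The local part is runnable anywhere; the HDFS commands are shown as comments with placeholder paths, since they need a live cluster:

```shell
# Local check: gzip -t walks the whole stream and verifies its CRC.
printf 'col00;col01;col02\n' > sample.txt
gzip -c sample.txt > sample.txt.gz
gzip -t sample.txt.gz && echo "local gzip OK"

# For a file already in HDFS (placeholder paths, needs a cluster):
#   hdfs fsck /path/to/file.gz -files -blocks -locations
# or stream it out and run the same integrity test client-side:
#   hdfs dfs -cat /path/to/file.gz | gzip -t && echo "HDFS copy OK"
```

Note that hdfs fsck checks block-level health (missing/corrupt replicas), while piping through gzip -t checks the logical gzip stream; they catch different failure modes.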
This is the combined output of SHOW CREATE TABLE and DESCRIBE FORMATTED:
CREATE EXTERNAL TABLE db_name.table_name (
  col00 STRING, col01 STRING, col02 STRING, col03 STRING, col04 STRING,
  col05 STRING, col06 STRING, col07 STRING, col08 STRING, col09 STRING,
  col10 STRING, col11 STRING, col12 STRING, col13 STRING, col14 STRING,
  col15 STRING, col16 STRING, col17 STRING, col18 STRING, col19 STRING,
  col20 STRING, col21 STRING, col22 STRING, col23 STRING, col24 STRING,
  col25 STRING, col26 STRING, col27 STRING, col28 STRING, col29 STRING,
  col30 STRING, col31 STRING, col32 STRING, col33 STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ';'
WITH SERDEPROPERTIES ('serialization.format'=';', 'field.delim'=';')
STORED AS TEXTFILE
LOCATION 'hdfs://nameservice/path'
TBLPROPERTIES (
  'numFiles'='0',
  'COLUMN_STATS_ACCURATE'='false',
  'skip.header.line.count'='1',
  'transient_lastDdlTime'='1500561063',
  'numRows'='-1',
  'totalSize'='0',
  'rawDataSize'='-1'
)

# Detailed Table Information
Database:           db_name
Owner:              hive
CreateTime:         Thu Jul 20 16:31:03 CEST 2017
LastAccessTime:     UNKNOWN
Protect Mode:       None
Retention:          0
Location:           hdfs://nameservice/path
Table Type:         EXTERNAL_TABLE
Table Parameters:
  COLUMN_STATS_ACCURATE     false
  EXTERNAL                  TRUE
  numFiles                  0
  numRows                   -1
  rawDataSize               -1
  skip.header.line.count    1
  totalSize                 0
  transient_lastDdlTime     1500561063

# Storage Information
SerDe Library:      org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:        org.apache.hadoop.mapred.TextInputFormat
OutputFormat:       org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:         No
Num Buckets:        -1
Bucket Columns:
Sort Columns:
Storage Desc Params:
  field.delim               ;
  serialization.format      ;
For the second question: the file is in HDFS. After the errors in Impala, I performed the following tests:
It seems Impala has some restrictions, e.g. "The maximum size that Impala can accommodate for an individual bzip file is 1 GB (after uncompression)." Please refer to the Note section of the link below.
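If the restriction (or the non-splittable nature of gzip) is the problem, one practical workaround is to recompress the data as many smaller gzip files instead of one 30 GB one, so each file fits a scanner and files can be processed in parallel. A minimal local sketch of the idea (line counts, names, and sizes are illustrative; in practice you would pick chunks of a few hundred MB and then hdfs dfs -put them into the table's LOCATION):

```shell
# Stand-in for the large uncompressed extract.
seq 1 100000 > big.txt

# Split into fixed-size chunks, then compress each chunk independently.
split -l 25000 big.txt part_          # -> part_aa .. part_ad
for p in part_a?; do gzip "$p"; done  # -> part_aa.gz .. part_ad.gz

# Each .gz is now independently decompressible, and concatenating them
# reproduces the original data (gzip supports multi-member streams).
cat part_*.gz | gunzip -c | wc -l     # same line count as big.txt

# Placeholder upload step (needs a cluster):
#   hdfs dfs -put part_*.gz hdfs://nameservice/path/
```

Since the table is plain delimited text, no DDL change is needed; Impala and Hive will simply read all the files under LOCATION. Just keep skip.header.line.count in mind if the original file's header line ends up replicated or dropped during the split.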