Explorer
Posts: 10
Registered: ‎12-05-2017

Impala not working with .tar.gz files

I have a dataset consisting of 100+ CSV files. The total size of all the files is about 50 GB.

I tarred all the 100+ csv files into a single tar file and used gzip to compress the tar file into a single myData.gz file.

I loaded it into HDFS and defined an Impala table over myData.gz.

I did not specify any compression codec in the table definition, as the documentation states here: https://www.cloudera.com/documentation/enterprise/5-8-x/topics/impala_txtfile.html#gzip

 

When querying, I get the error "Gzip Data error, likely data corrupted in this block.". I have verified with "gzip -t" that it is a valid gzip file.

 

Is a tarred gzip file not supported by Impala? 

If I create an individual gzip file for each of the 100+ CSV files, should I expect it to work?

 

Thanks for answering.

 

Cloudera Employee
Posts: 332
Registered: ‎07-29-2015

Re: Impala not working with .tar.gz files

No, Impala does not support tar files.


Yes, if you have many individual .gz files in a directory, that will work.
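
For illustration only, here is a minimal Python sketch of the "one .gz per CSV" approach described above. The function name gzip_each_csv and the directory arguments are hypothetical, not something from this thread or from Impala; it simply compresses each CSV file on the local disk into its own gzip file.

```python
import gzip
import shutil
from pathlib import Path


def gzip_each_csv(src_dir, dst_dir):
    """Compress every *.csv in src_dir into its own .gz file in dst_dir.

    Hypothetical helper: the name and arguments are assumptions made for
    this example. Returns the list of .gz paths that were written.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for csv_path in sorted(src.glob("*.csv")):
        # a.csv becomes a.csv.gz, so the file keeps a .gz extension
        gz_path = dst / (csv_path.name + ".gz")
        with open(csv_path, "rb") as fin, gzip.open(gz_path, "wb") as fout:
            shutil.copyfileobj(fin, fout)  # stream the bytes, no tar wrapper
        written.append(gz_path)
    return written
```

The resulting files could then be copied into the table's HDFS directory, for example with "hdfs dfs -put".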

 

Explorer
Posts: 10
Registered: ‎12-05-2017

Re: Impala not working with .tar.gz files

I gzipped the individual CSV files as suggested and uploaded them to HDFS.

When I query any data, I receive binary results containing lots of special characters, and NULL for almost all of the columns.

 

My DDL:

 

create external table fi_mgr_raw.FI_H0A0_BOND_orig (
  Col1 TIMESTAMP,
  Col2 STRING,
  Col3 STRING
)
row format delimited
fields terminated by ','
location "/data/fi/MY_BOND"
tblproperties('skip.header.line.count'='1');

 

Please shed some light on what is going wrong here.

Cloudera Employee
Posts: 332
Registered: ‎07-29-2015

Re: Impala not working with .tar.gz files

Explorer
Posts: 10
Registered: ‎12-05-2017

Re: Impala not working with .tar.gz files

All the gzipped files have the .gz extension and are located under the directory "/data/fi/MY_BOND" on my server.

 

I went through the link https://www.cloudera.com/documentation/enterprise/latest/topics/impala_txtfile.html#gzip and couldn't find anything that differs from my implementation.

 

Are there any other possible causes for the issues I am facing?

Is there a config parameter that I should enable?

Cloudera Employee
Posts: 332
Registered: ‎07-29-2015

Re: Impala not working with .tar.gz files

That should work then, assuming everything is normal. It would be helpful to see the output of "show files in table" for your table. Here's what it looks like for me on a working gzipped text table.

 

[localhost:21000] default> show files in functional_text_gzip.alltypes;
Query: show files in functional_text_gzip.alltypes
+-----------------------------------------------------------------------------------------+--------+--------------------+
| Path                                                                                    | Size   | Partition          |
+-----------------------------------------------------------------------------------------+--------+--------------------+
| hdfs://localhost:20500/test-warehouse/alltypes_text_gzip/year=2009/month=1/000013_0.gz  | 3.27KB | year=2009/month=1  |
| hdfs://localhost:20500/test-warehouse/alltypes_text_gzip/year=2009/month=2/000023_0.gz  | 3.00KB | year=2009/month=2  |
| hdfs://localhost:20500/test-warehouse/alltypes_text_gzip/year=2009/month=3/000012_0.gz  | 3.31KB | year=2009/month=3  |
| hdfs://localhost:20500/test-warehouse/alltypes_text_gzip/year=2009/month=4/000021_0.gz  | 3.20KB | year=2009/month=4  |
| hdfs://localhost:20500/test-warehouse/alltypes_text_gzip/year=2009/month=5/000002_0.gz  | 3.30KB | year=2009/month=5  |
| hdfs://localhost:20500/test-warehouse/alltypes_text_gzip/year=2009/month=6/000015_0.gz  | 3.22KB | year=2009/month=6  |
| hdfs://localhost:20500/test-warehouse/alltypes_text_gzip/year=2009/month=7/000003_0.gz  | 3.31KB | year=2009/month=7  |
| hdfs://localhost:20500/test-warehouse/alltypes_text_gzip/year=2009/month=8/000004_0.gz  | 3.31KB | year=2009/month=8  |
| hdfs://localhost:20500/test-warehouse/alltypes_text_gzip/year=2009/month=9/000016_0.gz  | 3.21KB | year=2009/month=9  |
| hdfs://localhost:20500/test-warehouse/alltypes_text_gzip/year=2009/month=10/000000_0.gz | 3.31KB | year=2009/month=10 |
| hdfs://localhost:20500/test-warehouse/alltypes_text_gzip/year=2009/month=11/000014_0.gz | 3.20KB | year=2009/month=11 |
| hdfs://localhost:20500/test-warehouse/alltypes_text_gzip/year=2009/month=12/000001_0.gz | 3.31KB | year=2009/month=12 |
| hdfs://localhost:20500/test-warehouse/alltypes_text_gzip/year=2010/month=1/000005_0.gz  | 3.31KB | year=2010/month=1  |
| hdfs://localhost:20500/test-warehouse/alltypes_text_gzip/year=2010/month=2/000022_0.gz  | 2.98KB | year=2010/month=2  |
| hdfs://localhost:20500/test-warehouse/alltypes_text_gzip/year=2010/month=3/000008_0.gz  | 3.31KB | year=2010/month=3  |
| hdfs://localhost:20500/test-warehouse/alltypes_text_gzip/year=2010/month=4/000018_0.gz  | 3.21KB | year=2010/month=4  |
| hdfs://localhost:20500/test-warehouse/alltypes_text_gzip/year=2010/month=5/000009_0.gz  | 3.30KB | year=2010/month=5  |
| hdfs://localhost:20500/test-warehouse/alltypes_text_gzip/year=2010/month=6/000019_0.gz  | 3.20KB | year=2010/month=6  |
| hdfs://localhost:20500/test-warehouse/alltypes_text_gzip/year=2010/month=7/000010_0.gz  | 3.31KB | year=2010/month=7  |
| hdfs://localhost:20500/test-warehouse/alltypes_text_gzip/year=2010/month=8/000011_0.gz  | 3.30KB | year=2010/month=8  |
| hdfs://localhost:20500/test-warehouse/alltypes_text_gzip/year=2010/month=9/000020_0.gz  | 3.20KB | year=2010/month=9  |
| hdfs://localhost:20500/test-warehouse/alltypes_text_gzip/year=2010/month=10/000006_0.gz | 3.30KB | year=2010/month=10 |
| hdfs://localhost:20500/test-warehouse/alltypes_text_gzip/year=2010/month=11/000017_0.gz | 3.21KB | year=2010/month=11 |
| hdfs://localhost:20500/test-warehouse/alltypes_text_gzip/year=2010/month=12/000007_0.gz | 3.30KB | year=2010/month=12 |
+-----------------------------------------------------------------------------------------+--------+--------------------+
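
Separately from the file listing, a quick local check of the files before upload might help narrow things down. The sketch below is only an illustration under stated assumptions: inspect_gz is a hypothetical helper, not an Impala or HDFS tool, and it runs against local copies of the files. It relies on two facts: a gzip stream starts with the bytes 1f 8b, and a POSIX (ustar) or GNU tar archive carries the string "ustar" at byte offset 257 of its first header block. It will not explain every cause of garbled output, and very old tar formats without the "ustar" marker would not be detected.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def inspect_gz(path):
    """Classify a local file as 'not-gzip', 'gzip-of-tar', or 'gzip'.

    Hypothetical helper for illustration; the name and return values are
    assumptions made for this example.
    """
    with open(path, "rb") as f:
        if f.read(2) != GZIP_MAGIC:
            return "not-gzip"
    with gzip.open(path, "rb") as f:
        head = f.read(512)  # a tar header block is 512 bytes long
    # ustar and GNU tar headers both store "ustar" at offset 257
    if head[257:262] == b"ustar":
        return "gzip-of-tar"
    return "gzip"
```

A file reported as "gzip-of-tar" would still contain tar headers after decompression, and one reported as "not-gzip" is not compressed the way its .gz name suggests.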