compression issue

Contributor

Hello

 

I created some tables using the following command sequence:

set compression_codec=gzip;
create table table1 stored as parquet as select col1, col2, col3, col4 from table2;

But when I checked the output of show files in table1, I could not see any compression.

 

|                              | transient_lastDdlTime                                                       | 1449221112           |
|                              | NULL                                                                        | NULL                 |
| # Storage Information        | NULL                                                                        | NULL                 |
| SerDe Library:               | org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe                 | NULL                 |
| InputFormat:                 | org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat               | NULL                 |
| OutputFormat:                | org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat              | NULL                 |
| Compressed:                  | No                                                                          | NULL                 |

 

I also checked the output of the set command, and compression_codec was gzip. I then ran compute stats table1 and checked again, but nothing changed.

 

I could not figure out my error. Could you help me create a gzip-compressed Parquet table?

 

Thanks

Re: compression issue

Contributor

Hi, 

 

The compression you see in the output is a table-level property, not a property of the files. If you run a query (e.g. "select * from table") and check the runtime profile, the HDFS_SCAN_NODE section should list the formats of the files that were scanned. I tried an example similar to the one you posted and the output was "File Formats: PARQUET/GZIP". 

 

Dimitris

Re: compression issue

Contributor

Hello Dimitris

 

Thank you so much for your reply


I have lots of tables in Impala and I have to use gzip for storage reasons. I ran select * from table and then the profile command to check the HDFS_SCAN_NODE output. This took about 3 hours for a single table, which is too long for my operation.

Is there any other, faster way to check the compression status and codec information?

By the way, I have a couple more questions :)

Is there any way to change the default compression to gzip? I always want to work in gzip mode. I accept gzip's overhead.

For example: I created a Parquet table and inserted lots of records with gzip compression last weekend. What is Impala's behavior if I forget to set gzip compression for next weekend's insert operation on the same table?

 

Best regards

 

Re: compression issue

Contributor

I agree, running a query and looking at the profile in order to get the compression info is not very "user friendly". I've filed a JIRA to output the compression codec of files in a 'show files' statement (see https://issues.cloudera.org/browse/IMPALA-2748 ). 

 

Unfortunately, I am not aware of a way to change the default compression codec so that you don't have to explicitly set it every time. 

 

Dimitris

Re: compression issue

Contributor
Hello Dimitris

Thanks for your inputs and JIRA case :)

Thanks
Suluhan

Re: compression issue

Contributor
Hello alex.behm

Do you have any idea about my issue?

Regards

Re: compression issue

Master Collaborator

The double quote (") is in the wrong place: -default_query_options is not inside the double quotes.
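
A minimal sketch of that fix, based on the IMPALA_SERVER_ARGS fragment posted in this thread. The placeholder values at the top are assumptions so the snippet can be sourced on its own; the real /etc/default/impala is expected to define those variables already. The inner single quotes are dropped here because a single key=value pair needs no extra quoting, and how the init script expands the variable is not shown in the thread.

```shell
# Placeholder values (assumptions for illustration only); the real
# /etc/default/impala normally sets these earlier in the file.
: "${IMPALA_LOG_DIR:=/var/log/impala}"
: "${IMPALA_CATALOG_SERVICE_HOST:=127.0.0.1}"
: "${IMPALA_STATE_STORE_HOST:=127.0.0.1}"
: "${IMPALA_STATE_STORE_PORT:=24000}"
: "${IMPALA_BACKEND_PORT:=22000}"

# The closing double quote now comes after -default_query_options,
# so the option is part of the IMPALA_SERVER_ARGS value.
IMPALA_SERVER_ARGS=" \
    -log_dir=${IMPALA_LOG_DIR} \
    -catalog_service_host=${IMPALA_CATALOG_SERVICE_HOST} \
    -state_store_port=${IMPALA_STATE_STORE_PORT} \
    -use_statestore \
    -state_store_host=${IMPALA_STATE_STORE_HOST} \
    -be_port=${IMPALA_BACKEND_PORT} \
    -default_query_options=compression_codec=gzip"

# Print the resulting value to confirm the option is included.
echo "${IMPALA_SERVER_ARGS}"
```

After restarting the daemons, running set in impala-shell should show whether COMPRESSION_CODEC picked up the new default.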

Re: compression issue

Master Collaborator

The "set compression_codec=gzip" modifies the value for the compression_codec query option at a session level. You can set "gzip" as the default value for that query option to avoid setting it for every insert.

 

Search for "default_query_option" on this docs page:

http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/impala_config_optio...

 

 

 

Re: compression issue

Contributor

@alex.behm wrote:

The "set compression_codec=gzip" modifies the value for the compression_codec query option at a session level. You can set "gzip" as the default value for that query option to avoid setting it for every insert.

 

Search for "default_query_option" on this docs page:

http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/impala_config_optio...

 

 

 


Hello

 

Thanks for your reply


I changed the /etc/default/impala file on all Impala servers as shown in the following lines and then restarted the Impala cluster:

 

IMPALA_SERVER_ARGS=" \
    -log_dir=${IMPALA_LOG_DIR} \
    -catalog_service_host=${IMPALA_CATALOG_SERVICE_HOST} \
    -state_store_port=${IMPALA_STATE_STORE_PORT} \
    -use_statestore \
    -state_store_host=${IMPALA_STATE_STORE_HOST} \
    -be_port=${IMPALA_BACKEND_PORT}"\
    -default_query_options='compression_codec=gzip'

After restarting the daemons on all servers, I checked the settings with the set command, and the compression codec is still none:

 

 

COMPRESSION_CODEC: [NONE]

What is wrong with my settings?

 

Thanks