
Parquet table snappy compressed by default

Contributor

Hi,

 

1) If we create a table (in both Hive and Impala) and just specify STORED AS PARQUET, will it be Snappy compressed by default in CDH?

 

2) If not, how do I identify a Parquet table with Snappy compression versus a Parquet table without it?

 

3) Also, how do I specify Snappy compression at the table level while creating the table, and at the global level so that all tables stored as Parquet are Snappy compressed even if nothing is specified at the table level?

 

Please help

 

11 Replies

Champion

1) If we create a table (in both Hive and Impala) and just specify STORED AS PARQUET, will it be Snappy compressed by default in CDH?

 

Currently the default compression for Impala Parquet tables is Snappy.
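
If you want to confirm or override this per session, here is a minimal sketch, assuming you are in impala-shell (COMPRESSION_CODEC is the Impala query option that governs Parquet writes):

SET;                            -- lists all query options and their current values
SET COMPRESSION_CODEC=snappy;   -- explicit, same as the documented default for Parquet
SET COMPRESSION_CODEC=none;     -- write uncompressed Parquet instead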

 

2) If not, how do I identify a Parquet table with Snappy compression versus a Parquet table without it?

 

DESCRIBE FORMATTED tableName;

Note - you will always see the compression shown as NO, because the compression format is not stored in the table metadata. The best way is to run dfs -ls -R against the table location and look at the data files to see how they are compressed.
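
A minimal sketch of both checks from the Hive shell (the table name and warehouse path below are made up); for a definitive answer, parquet-tools' meta command run against one of the data files also reports the codec used per column chunk:

-- Metadata view: for Parquet tables this typically still reports "Compressed: No"
DESCRIBE FORMATTED mydb.sales_parquet;

-- List the actual data files under the table location
dfs -ls -R /user/hive/warehouse/mydb.db/sales_parquet;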

 

3) Also, how do I specify Snappy compression at the table level while creating the table, and at the global level so that all tables stored as Parquet are Snappy compressed even if nothing is specified at the table level?

 

CREATE TABLE external_parquet (c1 INT, c2 STRING)
  STORED AS PARQUET LOCATION ' '
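
The DDL above only picks the Parquet file format; to ask for Snappy explicitly at the table level, here is a hedged sketch (table and column names are made up; the parquet.compression table property is honored by Hive's Parquet writer in recent versions, and COMPRESSION_CODEC is Impala's session option):

-- Hive: request Snappy via a table property (sketch)
CREATE TABLE sales_parquet_snappy (c1 INT, c2 STRING)
  STORED AS PARQUET
  TBLPROPERTIES ('parquet.compression'='SNAPPY');

-- Impala: compression is decided at write time, so set the option before inserting
SET COMPRESSION_CODEC=snappy;
INSERT INTO sales_parquet_snappy SELECT c1, c2 FROM some_staging_table;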

 or 

 Session basis 

SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;

Globally - i.e. a file that is executed when you launch the Hive shell:

Put the above settings into a .hiverc file under /etc/hive/conf.cloudera.hive1 in CDH; if you don't find one there, you can always create it.
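
Note that the mapred.* settings above mainly drive MapReduce output compression; for Parquet tables specifically, here is a hedged sketch of the session-level property (assuming your Hive version honors it), which could also be placed in the same .hiverc to make it effectively global:

-- Hive session (or .hiverc) setting applied to subsequent writes into Parquet tables
SET parquet.compression=SNAPPY;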

 

 

Please refer to this link for more CREATE TABLE properties:

https://www.cloudera.com/documentation/enterprise/5-6-x/topics/impala_create_table.html

 

Champion

1. List of compression codecs and the default

 

gzip - org.apache.hadoop.io.compress.GzipCodec
bzip2 - org.apache.hadoop.io.compress.BZip2Codec
LZO - com.hadoop.compression.lzo.LzopCodec
Snappy - org.apache.hadoop.io.compress.SnappyCodec
Deflate - org.apache.hadoop.io.compress.DeflateCodec

 

From the above list, Snappy is NOT the default; DeflateCodec is the default.
You can confirm this by running:


hive> SET mapred.output.compression.codec;

 

Refer to this link to get the list of compression types:
https://www.cloudera.com/documentation/enterprise/5-9-x/topics/introduction_compression.html#concept...

 

2. Refer to the link below to understand how to set up Snappy:

 

https://www.cloudera.com/documentation/enterprise/5-9-x/topics/introduction_compression_snappy.html#...

Champion
I think it is Snappy by default.
Refer to this link:
https://www.cloudera.com/documentation/enterprise/5-6-x/topics/impala_parquet.html
Could you please correct me if I am wrong.

Thanks

Champion

Interesting!! Both links are from Cloudera.

 

For Hive, as mentioned above, there is no need to assume anything; we can confirm it by running

hive> SET mapred.output.compression.codec;

 

For Impala, since it doesn't use MapReduce, we have to go by the link that you mentioned.

 

Champion

Indeed.

 

To sum up, the default compression codecs are as stated below -

 

Hive - the default compression is DeflateCodec
Impala - the default compression is Snappy

Thanks mate

Contributor

@csguna @saranvisa Thanks for the detailed response. I have 2 follow-up questions (sorry, I am just learning).

 

1) Since Snappy is not that strong at compression (disk space), what would be the difference in disk space for a 1 TB table stored as plain Parquet versus Parquet with Snappy compression?

 

2) Is it possible to compress a non-compressed Parquet table later with Snappy?

 

 

Champion

1) Since Snappy is not that strong at compression (disk space), what would be the difference in disk space for a 1 TB table stored as plain Parquet versus Parquet with Snappy compression?

 

I created three tables with different scenarios. Please take a peek at them; it will give you some idea.

 

TABLE 1 - Parquet format with no compression

 

+-------+--------+--------+---------+
| #Rows | #Files | Size   | Format  |
+-------+--------+--------+---------+
| -1    | 4      | 3.73MB | PARQUET |
+-------+--------+--------+---------+

TABLE 2 - TEXT format with default compression (Snappy)

 

+-------+--------+---------+--------+
| #Rows | #Files | Size    | Format |
+-------+--------+---------+--------+
| 0     | 8      | 22.04MB | TEXT   |
+-------+--------+---------+--------+

TABLE 3 - Parquet format with compression enabled as Snappy

 

 

+-------+--------+--------+---------+
| #Rows | #Files | Size   | Format  |
+-------+--------+--------+---------+
| -1    | 4      | 3.71MB | PARQUET |
+-------+--------+--------+---------+
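
The summaries above look like impala-shell output; here is a sketch of how to pull the same numbers for your own table (the table name is hypothetical), which also explains why #Rows shows -1 until statistics are computed:

COMPUTE STATS sales_parquet;      -- fills in #Rows, which otherwise shows as -1
SHOW TABLE STATS sales_parquet;   -- prints #Rows, #Files, Size and Format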

 

2) Is it possible to compress a non-compressed Parquet table later with Snappy?

 

ALTER TABLE is a logical operation that only updates the table metadata in the metastore database, so it will not recompress the existing data files.

 

However, you can fire a CTAS (CREATE TABLE ... AS SELECT) to perform the compression, and then rename the result if you want using

 

ALTER TABLE d1.X RENAME TO Y;
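
Putting the two steps together, here is a hedged sketch in Impala (the new and backup table names are made up; d1.X is from the example above):

SET COMPRESSION_CODEC=snappy;                                       -- compress the rewritten data
CREATE TABLE d1.X_snappy STORED AS PARQUET AS SELECT * FROM d1.X;   -- CTAS rewrites the data files
ALTER TABLE d1.X RENAME TO d1.X_old;                                -- keep the original as a backup
ALTER TABLE d1.X_snappy RENAME TO d1.X;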

 

Contributor

Thanks @csguna for the detailed explanation. Much appreciated. So I think there is not much difference in size between a Snappy-compressed and a non-compressed Parquet table.

New Contributor

I am still unclear on this. If tables are created STORED AS PARQUET in Hive, will they use the Snappy codec or not? It seems like that behaviour only occurs in Impala, not in Hive. Do we have to explicitly state the compression codec in the DDL if we are creating tables via Hive?