
Parquet table snappy compressed by default

Contributor

Hi,

 

1) If we create a table (in both Hive and Impala) and just specify STORED AS PARQUET, will it be Snappy compressed by default in CDH?

 

2) If not, how do I identify a Parquet table with Snappy compression versus a Parquet table without it?

 

3) Also, how do I specify Snappy compression at the table level while creating the table, and at the global level so that all tables stored as Parquet are Snappy compressed even if nothing is specified at the table level?

 

Please help

 

11 Replies

Champion

1) If we create a table (in both Hive and Impala) and just specify STORED AS PARQUET, will it be Snappy compressed by default in CDH?

 

Currently the default compression for Impala Parquet tables is Snappy.
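
If you want to confirm or override this per session, here is a minimal sketch, assuming you are in impala-shell (COMPRESSION_CODEC is the Impala query option that governs Parquet writes):

SET;                            -- lists all query options and their current values
SET COMPRESSION_CODEC=snappy;   -- explicit, same as the documented default for Parquet
SET COMPRESSION_CODEC=none;     -- write uncompressed Parquet instead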

 

2) If not, how do I identify a Parquet table with Snappy compression versus a Parquet table without it?

 

DESCRIBE FORMATTED tableName;

Note - you will always see the compression shown as NO, because the compression format is not stored in the table metadata. The best way is to run dfs -ls -R against the table location and look at the data files to see how they are compressed.
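
A minimal sketch of both checks from the Hive shell (the table name and warehouse path below are made up); for a definitive answer, parquet-tools' meta command run against one of the data files also reports the codec used per column chunk:

-- Metadata view: for Parquet tables this typically still reports "Compressed: No"
DESCRIBE FORMATTED mydb.sales_parquet;

-- List the actual data files under the table location
dfs -ls -R /user/hive/warehouse/mydb.db/sales_parquet;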

 

3) Also, how do I specify Snappy compression at the table level while creating the table, and at the global level so that all tables stored as Parquet are Snappy compressed even if nothing is specified at the table level?

 

CREATE TABLE external_parquet (c1 INT, c2 STRING)
  STORED AS PARQUET LOCATION ' '
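
The DDL above only picks the Parquet file format; to ask for Snappy explicitly at the table level, here is a hedged sketch (table and column names are made up; the parquet.compression table property is honored by Hive's Parquet writer in recent versions, and COMPRESSION_CODEC is Impala's session option):

-- Hive: request Snappy via a table property (sketch)
CREATE TABLE sales_parquet_snappy (c1 INT, c2 STRING)
  STORED AS PARQUET
  TBLPROPERTIES ('parquet.compression'='SNAPPY');

-- Impala: compression is decided at write time, so set the option before inserting
SET COMPRESSION_CODEC=snappy;
INSERT INTO sales_parquet_snappy SELECT c1, c2 FROM some_staging_table;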

 or 

 Session basis 

SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;

Globally - i.e. a file that is executed when you launch the Hive shell:

Put the above settings into a .hiverc file under /etc/hive/conf.cloudera.hive1 in CDH; if you don't find one there, you can always create it.
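
Note that the mapred.* settings above mainly drive MapReduce output compression; for Parquet tables specifically, here is a hedged sketch of the session-level property (assuming your Hive version honors it), which could also be placed in the same .hiverc to make it effectively global:

-- Hive session (or .hiverc) setting applied to subsequent writes into Parquet tables
SET parquet.compression=SNAPPY;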

 

 

Please refer to this link for more CREATE TABLE properties:

https://www.cloudera.com/documentation/enterprise/5-6-x/topics/impala_create_table.html

 

Champion

1. List of compression codecs and the default

 

gzip - org.apache.hadoop.io.compress.GzipCodec
bzip2 - org.apache.hadoop.io.compress.BZip2Codec
LZO - com.hadoop.compression.lzo.LzopCodec
Snappy - org.apache.hadoop.io.compress.SnappyCodec
Deflate - org.apache.hadoop.io.compress.DeflateCodec

 

From the above list, Snappy is NOT the default; DeflateCodec is the default.
You can confirm this by running:


hive> SET mapred.output.compression.codec;

 

Refer to this link to get the list of compression types:
https://www.cloudera.com/documentation/enterprise/5-9-x/topics/introduction_compression.html#concept...

 

2. Refer to the link below to understand how to set up Snappy:

 

https://www.cloudera.com/documentation/enterprise/5-9-x/topics/introduction_compression_snappy.html#...

Champion
I think it is Snappy by default.
Refer to this link:
https://www.cloudera.com/documentation/enterprise/5-6-x/topics/impala_parquet.html
Could you please correct me if I am wrong.

Thanks

Champion

Interesting!! Both links are from Cloudera.

 

For Hive, as mentioned above, there is no need to assume anything; we can confirm it by running

hive> SET mapred.output.compression.codec;

 

For Impala, since it doesn't use MapReduce, we have to go by the link that you mentioned.

 

Champion

Indeed.

 

To sum up, the default compression codecs are as stated below -

 

Hive - the default compression is DeflateCodec
Impala - the default compression is Snappy

Thanks mate

Contributor

@csguna @saranvisa Thanks for the detailed response. I have 2 follow-up questions (sorry, I am just learning).

 

1) Since Snappy is not that strong at compression (disk space), what would be the difference in disk space for a 1 TB table stored as plain Parquet versus Parquet with Snappy compression?

 

2) Is it possible to compress a non-compressed Parquet table later with Snappy?

 

 

Champion

1) Since Snappy is not that strong at compression (disk space), what would be the difference in disk space for a 1 TB table stored as plain Parquet versus Parquet with Snappy compression?

 

I created three tables with different scenarios. Please take a peek at them; it will give you some idea.

 

TABLE 1 - Parquet format with no compression

 

+-------+--------+--------+---------+
| #Rows | #Files | Size   | Format  |
+-------+--------+--------+---------+
| -1    | 4      | 3.73MB | PARQUET |
+-------+--------+--------+---------+

TABLE 2 - TEXT format with default compression (Snappy)

 

+-------+--------+---------+--------+
| #Rows | #Files | Size    | Format |
+-------+--------+---------+--------+
| 0     | 8      | 22.04MB | TEXT   |
+-------+--------+---------+--------+

TABLE 3 - Parquet format with compression enabled as Snappy

 

 

+-------+--------+--------+---------+
| #Rows | #Files | Size   | Format  |
+-------+--------+--------+---------+
| -1    | 4      | 3.71MB | PARQUET |
+-------+--------+--------+---------+
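
The summaries above look like impala-shell output; here is a sketch of how to pull the same numbers for your own table (the table name is hypothetical), which also explains why #Rows shows -1 until statistics are computed:

COMPUTE STATS sales_parquet;      -- fills in #Rows, which otherwise shows as -1
SHOW TABLE STATS sales_parquet;   -- prints #Rows, #Files, Size and Format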

 

2) Is it possible to compress a non-compressed Parquet table later with Snappy?

 

ALTER TABLE is a logical operation that only updates the table metadata in the metastore database, so it will not recompress the existing data files.

 

However, you can fire a CTAS (CREATE TABLE ... AS SELECT) to perform the compression, and then rename the result if you want using

 

ALTER TABLE d1.X RENAME TO Y;
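
Putting the two steps together, here is a hedged sketch in Impala (the new and backup table names are made up; d1.X is from the example above):

SET COMPRESSION_CODEC=snappy;                                       -- compress the rewritten data
CREATE TABLE d1.X_snappy STORED AS PARQUET AS SELECT * FROM d1.X;   -- CTAS rewrites the data files
ALTER TABLE d1.X RENAME TO d1.X_old;                                -- keep the original as a backup
ALTER TABLE d1.X_snappy RENAME TO d1.X;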

 

Contributor

Thanks @csguna for the detailed explanation. Much appreciated. So I think there is not much difference in size between a Snappy-compressed and a non-compressed Parquet table.

New Contributor

I am still unclear on this. If tables are created STORED AS PARQUET in Hive, will they use the Snappy codec or not? It seems like that behaviour only occurs in Impala, not in Hive. Do we have to explicitly state the compression codec in the DDL if we are creating tables via Hive?