Created on 03-08-2017 07:59 AM - edited 09-16-2022 04:13 AM
Hi,
1) If we create a table (both Hive and Impala) and just specify STORED AS PARQUET, will it be Snappy compressed by default in CDH?
2) If not, how do I identify a Parquet table with Snappy compression and a Parquet table without Snappy compression?
Also, how do I specify Snappy compression at table level while creating a table, and at global level so that all tables stored as Parquet are Snappy compressed even if nobody specifies it at table level?
Please help
Created on 03-08-2017 10:50 AM - edited 03-08-2017 06:45 PM
1) If we create a table (both Hive and Impala) and just specify STORED AS PARQUET, will it be Snappy compressed by default in CDH?
For Impala tables, the default compression is currently Snappy.
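You can confirm and change this per session in impala-shell via the COMPRESSION_CODEC query option (a minimal sketch; supported values include NONE, GZIP, and SNAPPY):
set;                          -- lists all query options, including COMPRESSION_CODEC
set COMPRESSION_CODEC=snappy; -- applies to subsequent INSERT / CREATE TABLE AS SELECT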
2) If not, how do I identify a Parquet table with Snappy compression and a Parquet table without Snappy compression?
DESCRIBE FORMATTED tableName;
Note - you will always see the compression listed as NO there, because the compression format is not stored in the table metadata. The best way is to run hdfs dfs -ls -R on the table location and inspect the data files themselves.
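For example, a minimal sketch assuming the parquet-tools utility is available on your node (paths are hypothetical):
# list the table's data files
hdfs dfs -ls -R /user/hive/warehouse/mydb.db/mytable
# pull one data file down and dump its metadata;
# each column chunk will show SNAPPY or UNCOMPRESSED
hdfs dfs -get /user/hive/warehouse/mydb.db/mytable/000000_0 .
parquet-tools meta 000000_0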
3) Also, how do I specify Snappy compression at table level while creating a table, and at global level so that all tables stored as Parquet are Snappy compressed even if nobody specifies it at table level?
CREATE TABLE external_parquet (c1 INT, c2 STRING) STORED AS PARQUET LOCATION ' ' TBLPROPERTIES ('parquet.compression'='SNAPPY');
or
On a session basis:
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;
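Note that the mapred.* settings above apply to MapReduce output in general; for Parquet tables specifically, Hive takes the codec from the parquet.compression property, so (depending on your Hive version) the session-level equivalent is:
SET parquet.compression=SNAPPY;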
Globally - i.e., in a file that is executed when you launch the Hive shell:
Put the above SET statements in a .hiverc file under /etc/hive/conf.cloudera.hive1 in CDH; if you don't find one there, you can always create it.
Please refer to this link for more CREATE TABLE properties:
https://www.cloudera.com/documentation/enterprise/5-6-x/topics/impala_create_table.html
Created 03-08-2017 11:58 AM
1. List of compression codecs & the default
gzip - org.apache.hadoop.io.compress.GzipCodec
bzip2 - org.apache.hadoop.io.compress.BZip2Codec
LZO - com.hadoop.compression.lzo.LzopCodec
Snappy - org.apache.hadoop.io.compress.SnappyCodec
Deflate - org.apache.hadoop.io.compress.DeflateCodec
From the above list, Snappy is NOT the default; DeflateCodec is the default.
You can confirm this by running
hive> SET mapred.output.compression.codec;
Refer to this link for the list of compression types:
https://www.cloudera.com/documentation/enterprise/5-9-x/topics/introduction_compression.html#concept...
2. Refer to the link below to understand how to set up Snappy.
Created 03-08-2017 12:05 PM
Created 03-08-2017 12:27 PM
Interesting! Both the links are from Cloudera.
For Hive, as mentioned above, there is no need to assume anything; we can confirm it by running
hive> SET mapred.output.compression.codec;
For Impala, since it doesn't use MapReduce, we need to go by the link that you mentioned.
Created 03-08-2017 06:43 PM
Indeed.
To sum up, the default compression codecs are -
Hive - DeflateCodec
Impala - Snappy
Thanks mate
Created 03-09-2017 09:12 PM
@csguna @saranvisa Thx for the detailed response. I have 2 follow-up questions (sorry, I am just learning).
1) Since Snappy is not too good at compression (disk), what would the difference in disk space be for a 1 TB table stored as Parquet only vs. Parquet with Snappy compression?
2) Is it possible to compress a non-compressed parquet table later with snappy?
Created on 03-13-2017 11:08 AM - edited 03-13-2017 11:09 AM
1) Since Snappy is not too good at compression (disk), what would the difference in disk space be for a 1 TB table stored as Parquet only vs. Parquet with Snappy compression?
I created three tables with different scenarios. Please take a peek into them; it will give you some idea.
TABLE 1 - No compression parquet format
+-------+--------+--------+---------+
| #Rows | #Files | Size   | Format  |
+-------+--------+--------+---------+
| -1    | 4      | 3.73MB | PARQUET |
+-------+--------+--------+---------+
TABLE 2 - TEXT format with Snappy compression
+-------+--------+---------+--------+
| #Rows | #Files | Size    | Format |
+-------+--------+---------+--------+
| 0     | 8      | 22.04MB | TEXT   |
+-------+--------+---------+--------+
TABLE 3 - With parquet + compression enabled as Snappy
+-------+--------+--------+---------+
| #Rows | #Files | Size   | Format  |
+-------+--------+--------+---------+
| -1    | 4      | 3.71MB | PARQUET |
+-------+--------+--------+---------+
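The outputs above look like (trimmed) output of Impala's SHOW TABLE STATS. A minimal sketch of how such a comparison can be reproduced in impala-shell (table and source names are hypothetical):
set COMPRESSION_CODEC=none;
create table t1_plain stored as parquet as select * from src_table;
set COMPRESSION_CODEC=snappy;
create table t3_snappy stored as parquet as select * from src_table;
show table stats t1_plain;
show table stats t3_snappy;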
2) Is it possible to compress a non-compressed parquet table later with snappy?
ALTER TABLE is a logical operation that only updates the table metadata in the metastore database, so it cannot recompress data files that already exist.
However, you can fire a CTAS (CREATE TABLE ... AS SELECT) to perform the compression, and then rename if you want using:
ALTER TABLE d1.X RENAME TO Y;
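Putting that together, a minimal sketch in Hive (database and table names are hypothetical, and parquet.compression support depends on your Hive version):
USE d1;
-- rewrite the data with Snappy compression via CTAS
SET parquet.compression=SNAPPY;
CREATE TABLE X_snappy STORED AS PARQUET AS SELECT * FROM X;
-- swap names so queries keep using the original table name
ALTER TABLE X RENAME TO X_plain;
ALTER TABLE X_snappy RENAME TO X;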
Created on 03-16-2017 02:30 AM - edited 03-16-2017 02:32 AM
Thx @csguna for the detailed explanation. Much appreciated. So I think there is not much difference in terms of size between Snappy compressed and non-compressed Parquet tables.
Created 09-25-2017 09:33 AM
I am still unclear on this. If the tables are created STORED AS PARQUET in Hive, will they be using the Snappy codec or not? It seems like that behaviour only occurs in Impala and not Hive. Do we have to explicitly state the compression codec in the DDL if we are creating tables via Hive?