I have read many articles about the Parquet file format, but I'm still struggling to generate a Parquet table properly for Impala.
Here is the scenario and some info about the data.
Info about Table1 - `yearmonth_product_rcfile` (existing)
- 2 billion rows
- 27 columns ( no columns with complex type )
- generated by MR
- 1300 partitions
- compressed with Snappy
I'd like to convert this table to a new table stored as Parquet.
- HDFS block 128MB
- PARQUET file size 128MB
- around 40 partitions
- using MR to generate this data
- using Impala to access data in this table.
- compressed with Snappy
This is how I create the table.
```
CREATE EXTERNAL TABLE `yearmonth_product` (
  `product_name` string,
  ...
)
PARTITIONED BY (`yearmonth` int, `product_prefix` string)
STORED AS PARQUET
LOCATION '/user/hive/external_warehouse/yearmonth_product';
```
This is the insertion command I used.
- HDFS default block size is 134217728 bytes (128 MB)
```
SET hive.auto.convert.join=false;
SET hive.exec.compress.output=true;
SET mapred.output.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions=500000;
SET hive.exec.max.dynamic.partitions.pernode=50000;
SET hive.exec.max.created.files=1000000;
SET parquet.block.size=134217728;

INSERT INTO yearmonth_product PARTITION(yearmonth, product_prefix)
SELECT product_name,
       201512,
       COALESCE(substr(product_name, 0, 2), NULL) AS product_prefix
FROM yearmonth_product_rcfile
WHERE yearmonth = 201512
DISTRIBUTE BY product_prefix;
```
This insertion query finished successfully.
However, when I check one of the partitions of the table, it looks like this.
```
hdfs dfs -ls /user/hive/external_warehouse/yearmonth_product/yearmonth=2015121/product_prefix=c
Found 1 items
-rwxrwxrwx   3 root supergroup 27667988364 2016-01-21 19:53 /user/hive/external_warehouse/yearmonth_product/yearmonth=2015121/product_prefix=c/000099_0
```
```
hdfs fsck /user/hive/external_warehouse/yearmonth_product/yearmonth=2015121/product_prefix=c -blocks
...
Status: HEALTHY
 Total size:                    27667988364 B
 Total dirs:                    1
 Total files:                   1
 Total symlinks:                0
 Total blocks (validated):      207 (avg. block size 133661779 B)
 Minimally replicated blocks:   207 (100.0 %)
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       0 (0.0 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    3
 Average block replication:     3.0
 Corrupt blocks:                0
 Missing replicas:              0 (0.0 %)
 Number of data-nodes:          6
 Number of racks:               1
FSCK ended at Fri Jan 22 07:08:47 PST 2016 in 2 milliseconds
```
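The block count in the fsck output follows directly from the file size. Here is a minimal Python sketch of the arithmetic, using the numbers from the output above:

```python
import math

FILE_SIZE = 27667988364   # bytes, total size from fsck
BLOCK_SIZE = 134217728    # 128 MB HDFS block size

blocks = math.ceil(FILE_SIZE / BLOCK_SIZE)
avg_block = FILE_SIZE // blocks

print(blocks)     # -> 207, matching "Total blocks (validated): 207"
print(avg_block)  # -> 133661779, matching "avg. block size 133661779 B"
```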
I expected to see many files of around 128 MB under this partition.
Here are the questions I have.
1. In order to generate a Parquet table with SnappyCodec, am I using the right settings?
(Somehow, the new Parquet table is bigger than the RCFile one.)
2. I'm getting `WARNINGS: Parquet files should not be split into multiple hdfs-blocks.`
Is this message limited to Impala, or could Hive also suffer the same performance impact that Impala can?
3. Can you tell me if I understand this correctly?
If there is a big table, say 200 GB, stored as Parquet, there could be multiple files, and each file should be no larger than an HDFS block. For instance, if the HDFS block size is 128 MB, each Parquet file should be less than or equal to 128 MB.
Not sure about the Snappy codec question, but I can help with the other two.
Question 2: That warning should not appear with the latest Impala; it was removed as of Impala 2.3.0. There are no major performance implications to using multi-block Parquet files.
Question 3: A single file can be made up of multiple blocks. With the configuration you've set, Hive just creates one file with multiple blocks (207 in your case), where each block is around 128 MB.
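If you want several smaller files per partition instead of one large one, you can fan the rows out over more reducers. This is only a sketch of that idea against your insert statement; the extra `rand()`-based bucket in the `DISTRIBUTE BY` clause and the fan-out of 20 are assumptions, not something from your post:

```sql
SET parquet.block.size=134217728;

INSERT INTO yearmonth_product PARTITION(yearmonth, product_prefix)
SELECT product_name,
       201512,
       COALESCE(substr(product_name, 0, 2), NULL) AS product_prefix
FROM yearmonth_product_rcfile
WHERE yearmonth = 201512
-- distributing on product_prefix alone sends all rows of a prefix to one
-- reducer, hence one file; adding a random bucket spreads each partition's
-- rows across up to 20 reducers, so each writes its own smaller file
DISTRIBUTE BY product_prefix, CAST(rand() * 20 AS INT);
```

The trade-off is more, smaller files per partition, so pick the fan-out based on how close to one HDFS block you want each file to be.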
Thank you for your response.
I have been waiting for a reply to this post. :)
The reason I asked about the Snappy codec was that I had read that the Parquet file format provides better compression than RCFile.
And yes, I saw in the release notes that the warning is removed as of 2.3, which ships in CDH 5.5.x.
I'm currently using CDH 5.4.4. I planned to upgrade to 5.5.1, but paused the upgrade because I saw a CDH user post here that they were seeing performance issues after upgrading to 5.5.1 from 5.4.x.
Oh, by the way, you said that "there are no major performance implications to using multi-blocked parquet files." Is this still valid for Impala 2.2.0 as well?
And even if there are no major performance implications to using multi-block Parquet files, is it still recommended to use a bigger block size?
If I understand correctly, the number of files in HDFS doesn't say anything about the number of blocks being used for a Parquet table. A single file could reside on multiple blocks or a single block, depending on the block size I configure.
Sorry for the late reply. I didn't notice your follow up post.
If there was a performance issue in 5.5.1, it would have been fixed by now.
A1: In Impala 2.2.0, there are performance implications to using multi-block Parquet files: it will be slower. In Impala 2.3.0, however, this is fixed.
A2: If you're going to continue using Impala 2.2.0, I would suggest using multiple single-block files, with a larger block size for each.
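As a minimal sketch of that suggestion, assuming a 256 MB target (the exact size is an assumption; the point is keeping the HDFS block size and the Parquet block size equal so each file fits in a single HDFS block):

```sql
-- hypothetical session settings for single-block Parquet files
SET dfs.blocksize=268435456;        -- 256 MB HDFS block for newly written files
SET parquet.block.size=268435456;   -- 256 MB Parquet row-group/block size
```

With the two sizes equal, a Parquet writer that rolls to a new file at the block boundary produces files whose row groups never straddle an HDFS block.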
And yes, a Parquet table can span multiple files, each made up of multiple blocks or a single block. Or it could just be one file with multiple blocks or a single block.