New Contributor
Posts: 8
Registered: ‎04-25-2014

Impala/Parquet Disk Space - Reorg increases usage 50%?

I'm working up disk space estimates for a new application for a client and ran into this situation. From everything I've read about storing data in HDFS, the objective should always be "fewer files, bigger files". I have a partitioned Parquet table populated by INSERTs from 15 external tables based on tsv files (which turned into 30 files in /hive/warehouse), average file size about 22MB and total size about 672MB - fsck output below under testdb.db/event.


According to the Impala INSERT reference, Impala creates 1GB blocks in HDFS, so I reasoned that I must be wasting a lot of space.
I thought I would effectively reorg the data by creating a new partitioned Parquet table and doing an INSERT OVERWRITE <toTable> SELECT ... FROM <fromTableWithMultipleFiles>, which (I thought) would create a new table with a single 1GB block file, saving space and hopefully making the queries even faster. But that didn't happen. Instead, the new table is using almost 50% more space - 1.155GB -
and is spread across 10 files - see fsck output below under testdb.db/event_accum. I did a handful of simple queries against both tables and the response times were roughly the same. I'm running CDH 5.0.0 and Impala 1.3.0.

 

I have a couple of questions:

  • Am I doing something wrong? I don't understand how creating a single table file from multiple table files would actually increase the disk space used. I also don't understand the number of files - since the total space is 672MB, why were 10 files created instead of a single 1GB file?
  • Why are so many Parquet files created in the first place (30)? It looks like 2 files per insert. Since I did 15 inserts, and each was < 1GB, I assumed there would be 15 files, each a single 1GB block.
  • Does Parquet change the block size on each file as it is created?
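The arithmetic behind these questions can be sketched quickly (a hedged check in Python; the byte counts are taken from the fsck "Total size" / "Total files" lines later in this post):

```python
# Sizes from the fsck output for testdb.db/event (pre-replication).
total_bytes = 704_492_468          # "Total size"
num_files = 30                     # "Total files"
avg_mib = total_bytes / num_files / 2**20

# With a 1GB Parquet block target, the whole table should fit in one file.
parquet_block = 1 * 2**30
files_needed = -(-total_bytes // parquet_block)  # ceiling division

print(f"average file size: {avg_mib:.1f} MiB")        # ~22.4 MiB
print(f"files needed at 1GB blocks: {files_needed}")  # 1
```

So at face value a single ~672MB file would fit comfortably inside one 1GB Parquet block, which is why the 10-file result is surprising.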

 

Thanks
Pete Zybrick

++++++++++++++++++++++++++++++++++++
+++ fsck output for testdb.event +++
++++++++++++++++++++++++++++++++++++
[ipcdev@node1 ~]$ hdfs fsck /user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29 -files -blocks
Connecting to namenode via http://node1.ipc-global.com:50070
FSCK started by ipcdev (auth:SIMPLE) from /192.168.1.241 for path /user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29 at Sat May 03 14:58:46 EDT 2014
/user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29 <dir>
/user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29/2749302544c3cc81-476dded44718ef97_27784437_data.0 18562544 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073749671_8863 len=18562544 repl=3

/user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29/2749302544c3cc81-476dded44718ef98_764271366_data.0 38415896 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073749672_8864 len=38415896 repl=3

/user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29/2e44ac7f4828ea58-1c5361f3d889079f_108854480_data.0 15591794 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073749689_8881 len=15591794 repl=3

/user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29/2e44ac7f4828ea58-1c5361f3d88907a0_1363940349_data.0 20338472 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073749690_8882 len=20338472 repl=3

/user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29/3146df097b7dce69-fa2312eab5cb8c0_1266491270_data.0 9809213 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073749695_8887 len=9809213 repl=3

/user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29/3146df097b7dce69-fa2312eab5cb8c1_215817145_data.0 18651333 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073749696_8888 len=18651333 repl=3

/user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29/5e4dff03c860412e-4fcebeddc2a8bc9d_1516344242_data.0 18721656 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073749667_8859 len=18721656 repl=3

/user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29/5e4dff03c860412e-4fcebeddc2a8bc9e_1844720457_data.0 34591376 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073749668_8860 len=34591376 repl=3

/user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29/744e067ee663c51b-1be1c91e6f805987_1553519263_data.0 13473301 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073749693_8885 len=13473301 repl=3

/user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29/744e067ee663c51b-1be1c91e6f805988_759468439_data.0 18963742 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073749694_8886 len=18963742 repl=3

/user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29/864a57eb2cb34df8-5491c5f2b5f781b1_1041955503_data.0 38011674 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073749677_8869 len=38011674 repl=3

/user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29/864a57eb2cb34df8-5491c5f2b5f781b2_1547752638_data.0 21918497 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073749676_8868 len=21918497 repl=3

/user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29/8a4952d97d354baf-b7f4175d894db8b0_2077568951_data.0 21583907 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073749683_8875 len=21583907 repl=3

/user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29/8a4952d97d354baf-b7f4175d894db8b1_22983394_data.0 21358035 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073749682_8874 len=21358035 repl=3

/user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29/9b415271a265a51f-cd0ef6b8edfef79f_1975128969_data.0 37749053 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073749670_8862 len=37749053 repl=3

/user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29/9b415271a265a51f-cd0ef6b8edfef7a0_904643073_data.0 18921391 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073749669_8861 len=18921391 repl=3

/user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29/c64c8dd9eed1a13e-ce368c38b46783a3_429980002_data.0 21334339 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073749680_8872 len=21334339 repl=3

/user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29/c64c8dd9eed1a13e-ce368c38b46783a4_1347346617_data.0 29050702 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073749681_8873 len=29050702 repl=3

/user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29/c84263763f16053a-1ff679360bed5abd_681533929_data.0 20807361 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073749684_8876 len=20807361 repl=3

/user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29/c84263763f16053a-1ff679360bed5abe_685537178_data.0 21699313 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073749685_8877 len=21699313 repl=3

/user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29/df4fddd5a4747562-b9355d3752cb8098_1289445892_data.0 38632291 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073749674_8866 len=38632291 repl=3

/user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29/df4fddd5a4747562-b9355d3752cb8099_1295450792_data.0 19267145 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073749673_8865 len=19267145 repl=3

/user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29/e547b7211807fd01-4e5105ccff1e5e83_1813281128_data.0 19632386 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073749661_8853 len=19632386 repl=3

/user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29/e547b7211807fd01-4e5105ccff1e5e84_1287696553_data.0 37818967 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073749662_8854 len=37818967 repl=3

/user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29/f1472fdec18dab1d-49f3a85efd23fb85_542044747_data.0 17296655 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073749687_8879 len=17296655 repl=3

/user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29/f1472fdec18dab1d-49f3a85efd23fb86_1606899886_data.0 21362238 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073749688_8880 len=21362238 repl=3

/user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29/f945169204587db2-b8e7eeee071080a0_479376400_data.0 19253474 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073749678_8870 len=19253474 repl=3

/user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29/f945169204587db2-b8e7eeee071080a1_1568644014_data.0 38664451 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073749679_8871 len=38664451 repl=3

/user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29/fb4bfe8f54bbf34d-24841c6433b9bdb0_679637527_data.0 14209231 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073749691_8883 len=14209231 repl=3

/user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29/fb4bfe8f54bbf34d-24841c6433b9bdb1_1581016085_data.0 18802031 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073749692_8884 len=18802031 repl=3

Status: HEALTHY
Total size: 704492468 B
Total dirs: 1
Total files: 30
Total symlinks: 0
Total blocks (validated): 30 (avg. block size 23483082 B)
Minimally replicated blocks: 30 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 3.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 3
Number of racks: 1
FSCK ended at Sat May 03 14:58:46 EDT 2014 in 2 milliseconds


The filesystem under path '/user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29' is HEALTHY

++++++++++++++++++++++++++++++++++++++++++
+++ fsck output for testdb.event_accum +++
++++++++++++++++++++++++++++++++++++++++++
[ipcdev@node1 ~]$ hdfs fsck /user/hive/warehouse/testdb.db/event_accum/udc_target_date=2014-03-29 -files -blocks
Connecting to namenode via http://node1.ipc-global.com:50070
FSCK started by ipcdev (auth:SIMPLE) from /192.168.1.241 for path /user/hive/warehouse/testdb.db/event_accum/udc_target_date=2014-03-29 at Sat May 03 15:02:50 EDT 2014
/user/hive/warehouse/testdb.db/event_accum/udc_target_date=2014-03-29 <dir>
/user/hive/warehouse/testdb.db/event_accum/udc_target_date=2014-03-29/2e4409edc5a499df-4d100b8be8f330ba_285908234_data.0 136310840 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073753175_12375 len=136310840 repl=3

/user/hive/warehouse/testdb.db/event_accum/udc_target_date=2014-03-29/2e4409edc5a499df-4d100b8be8f330ba_285908234_data.1 137597016 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073753177_12377 len=137597016 repl=3

/user/hive/warehouse/testdb.db/event_accum/udc_target_date=2014-03-29/2e4409edc5a499df-4d100b8be8f330ba_285908234_data.2 137114257 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073753179_12379 len=137114257 repl=3

/user/hive/warehouse/testdb.db/event_accum/udc_target_date=2014-03-29/2e4409edc5a499df-4d100b8be8f330ba_285908234_data.3 136838616 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073753182_12382 len=136838616 repl=3

/user/hive/warehouse/testdb.db/event_accum/udc_target_date=2014-03-29/2e4409edc5a499df-4d100b8be8f330ba_285908234_data.4 36494878 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073753184_12384 len=36494878 repl=3

/user/hive/warehouse/testdb.db/event_accum/udc_target_date=2014-03-29/2e4409edc5a499df-4d100b8be8f330bb_417893698_data.0 137093507 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073753176_12376 len=137093507 repl=3

/user/hive/warehouse/testdb.db/event_accum/udc_target_date=2014-03-29/2e4409edc5a499df-4d100b8be8f330bb_417893698_data.1 137524781 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073753178_12378 len=137524781 repl=3

/user/hive/warehouse/testdb.db/event_accum/udc_target_date=2014-03-29/2e4409edc5a499df-4d100b8be8f330bb_417893698_data.2 138676793 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073753180_12380 len=138676793 repl=3

/user/hive/warehouse/testdb.db/event_accum/udc_target_date=2014-03-29/2e4409edc5a499df-4d100b8be8f330bb_417893698_data.3 136688653 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073753183_12383 len=136688653 repl=3

/user/hive/warehouse/testdb.db/event_accum/udc_target_date=2014-03-29/2e4409edc5a499df-4d100b8be8f330bb_417893698_data.4 76307973 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073753185_12385 len=76307973 repl=3

Status: HEALTHY
Total size: 1210647314 B
Total dirs: 1
Total files: 10
Total symlinks: 0
Total blocks (validated): 10 (avg. block size 121064731 B)
Minimally replicated blocks: 10 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 3.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 3
Number of racks: 1
FSCK ended at Sat May 03 15:02:50 EDT 2014 in 1 milliseconds


The filesystem under path '/user/hive/warehouse/testdb.db/event_accum/udc_target_date=2014-03-29' is HEALTHY

Cloudera Employee
Posts: 27
Registered: ‎09-27-2013

Re: Impala/Parquet Disk Space - Reorg increases usage 50%?

I'm a little confused about what your setup looks like. In the tsv files, how many partitions do you have and how much data is in each partition?

 

Take a look at our docs here and see if that helps you. http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Im...

New Contributor
Posts: 8
Registered: ‎04-25-2014

Re: Impala/Parquet Disk Space - Reorg increases usage 50%?

1. All of the data is already in a Parquet table named "testdb.event" with a single partition - udc_target_date=2014-03-29

2. I'm creating another Parquet table, "testdb.event_accum", same format as "testdb.event"

3. Run: INSERT OVERWRITE testdb.event_accum SELECT ... FROM testdb.event

4. testdb.event_accum is using roughly 50% more disk space than testdb.event

5. Regarding "how much data in each partition", the number of rows in testdb.event and testdb.event_accum are exactly the same.  The amount of disk space used is in the fsck output at the bottom of the original post.
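For the record, the growth in step 4 can be computed directly from the two fsck "Total size" lines in the original post (a small Python sketch; it comes out a bit above the ~50% estimate):

```python
# "Total size" values from the fsck output in the original post.
event_bytes = 704_492_468          # testdb.db/event (30 files)
event_accum_bytes = 1_210_647_314  # testdb.db/event_accum (10 files)

growth = event_accum_bytes / event_bytes - 1
print(f"event_accum is {growth:.0%} larger")  # ~72% larger
```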

Thanks

Pete

Cloudera Employee
Posts: 27
Registered: ‎09-27-2013

Re: Impala/Parquet Disk Space - Reorg increases usage 50%?

I see. Can you run this tool to print the footer metadata? It will print the size of each column.

 

https://github.com/Parquet/parquet-mr/blob/master/parquet-hadoop/src/main/java/parquet/hadoop/PrintF...

 

 

New Contributor
Posts: 8
Registered: ‎04-25-2014

Re: Impala/Parquet Disk Space - Reorg increases usage 50%?

Results below.

Thanks

Pete

 

[ipcdev@node1 ~]$ hadoop jar mrjobjars/catalyst-1.0.0-shaded.jar com.ipcglobal.catalyst.utils.PrintFooter /user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29
listing files in hdfs://node1.ipc-global.com:8020/user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29
opening 30 files
0% [SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".0%
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
************************************************************
read all footers in 985 ms
[udc_in_file_name] BINARY 0.0% of all space [PLAIN, RLE, PLAIN_DICTIONARY] min: 390 max: 1.138K average: 750 total: 22.512K (raw data: 21.216K saving -6%)
values: min: 4.836M max: 16.358M average: 10.384M total: 311.52M
uncompressed: min: 368 max: 1.072K average: 707 total: 21.216K
[udc_in_file_offset] INT64 22.4% of all space [PLAIN, RLE, PLAIN_DICTIONARY] min: 2.127M max: 8.619M average: 5.273M total: 158.212M (raw data: 2.005G saving 92%)
values: min: 4.836M max: 16.358M average: 10.384M total: 311.52M
uncompressed: min: 22.427M max: 114.186M average: 66.834M total: 2.005G
[date_time] INT96 4.9% of all space [PLAIN, RLE, PLAIN_DICTIONARY] min: 530.522K max: 1.781M average: 1.158M total: 34.747M (raw data: 35.526M saving 2%)
values: min: 4.836M max: 16.358M average: 10.384M total: 311.52M
uncompressed: min: 549.345K max: 1.817M average: 1.184M total: 35.526M
[event_type] BINARY 1.5% of all space [PLAIN, RLE, PLAIN_DICTIONARY] min: 155.184K max: 527.673K average: 352.787K total: 10.583M (raw data: 32.746M saving 67%)
values: min: 4.836M max: 16.358M average: 10.384M total: 311.52M
uncompressed: min: 513.939K max: 1.777M average: 1.091M total: 32.746M
[event_seqn] INT32 4.9% of all space [PLAIN, RLE, PLAIN_DICTIONARY] min: 528.486K max: 1.817M average: 1.165M total: 34.964M (raw data: 196.236M saving 82%)
values: min: 4.836M max: 16.358M average: 10.384M total: 311.52M
uncompressed: min: 3.034M max: 10.262M average: 6.541M total: 196.236M
[event_id] BINARY 10.5% of all space [PLAIN, RLE, PLAIN_DICTIONARY] min: 1.098M max: 3.771M average: 2.482M total: 74.471M (raw data: 375.767M saving 80%)
values: min: 4.836M max: 16.358M average: 10.384M total: 311.52M
uncompressed: min: 5.49M max: 19.993M average: 12.525M total: 375.767M
[event_value] BINARY 53.4% of all space [PLAIN, RLE, PLAIN_DICTIONARY] min: 5.147M max: 24.339M average: 12.544M total: 376.341M (raw data: 1.555G saving 75%)
values: min: 4.836M max: 16.358M average: 10.384M total: 311.52M
uncompressed: min: 11.336M max: 129.761M average: 51.842M total: 1.555G
[is_secondary_lookup] INT32 2.1% of all space [PLAIN, RLE, PLAIN_DICTIONARY] min: 220.195K max: 739.523K average: 503.788K total: 15.113M (raw data: 58.915M saving 74%)
values: min: 4.836M max: 16.358M average: 10.384M total: 311.52M
uncompressed: min: 919.783K max: 3.068M average: 1.963M total: 58.915M
number of blocks: 30
total data size: 704.457M (raw 4.259G)
total record: 311.52M
average block size: 23.481M (raw 141.983M)
average record count: 10.384M
[ipcdev@node1 ~]$ hadoop jar mrjobjars/catalyst-1.0.0-shaded.jar com.ipcglobal.catalyst.utils.PrintFooter /user/hive/warehouse/testdb.db/event_accum/udc_target_date=2014-03-29
listing files in hdfs://node1.ipc-global.com:8020/user/hive/warehouse/testdb.db/event_accum/udc_target_date=2014-03-29
opening 10 files
0% [SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".0%
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
************************************************************
read all footers in 724 ms
[udc_in_file_name] BINARY 0.0% of all space [PLAIN, RLE, PLAIN_DICTIONARY] min: 3.862K max: 16.706K average: 11.94K total: 119.409K (raw data: 2.609M saving 95%)
values: min: 13.072M max: 35.794M average: 31.152M total: 311.52M
uncompressed: min: 60.876K max: 417.862K average: 260.922K total: 2.609M
[udc_in_file_offset] INT64 14.2% of all space [PLAIN, RLE, PLAIN_DICTIONARY] min: 6.894M max: 19.954M average: 17.255M total: 172.552M (raw data: 7.607G saving 97%)
values: min: 13.072M max: 35.794M average: 31.152M total: 311.52M
uncompressed: min: 263.001M max: 1.203G average: 760.751M total: 7.607G
[date_time] INT96 3.1% of all space [PLAIN, RLE, PLAIN_DICTIONARY] min: 1.599M max: 4.374M average: 3.848M total: 38.481M (raw data: 127.714M saving 69%)
values: min: 13.072M max: 35.794M average: 31.152M total: 311.52M
uncompressed: min: 4.44M max: 20.305M average: 12.771M total: 127.714M
[event_type] BINARY 0.9% of all space [PLAIN, RLE, PLAIN_DICTIONARY] min: 477.92K max: 1.343M average: 1.17M total: 11.707M (raw data: 107.203M saving 89%)
values: min: 13.072M max: 35.794M average: 31.152M total: 311.52M
uncompressed: min: 3.695M max: 16.975M average: 10.72M total: 107.203M
[event_seqn] INT32 3.1% of all space [PLAIN, RLE, PLAIN_DICTIONARY] min: 1.557M max: 4.37M average: 3.799M total: 37.999M (raw data: 644.963M saving 94%)
values: min: 13.072M max: 35.794M average: 31.152M total: 311.52M
uncompressed: min: 21.886M max: 101.658M average: 64.496M total: 644.963M
[event_id] BINARY 7.1% of all space [PLAIN, RLE, PLAIN_DICTIONARY] min: 3.423M max: 10.185M average: 8.632M total: 86.322M (raw data: 1.34G saving 93%)
values: min: 13.072M max: 35.794M average: 31.152M total: 311.52M
uncompressed: min: 45.663M max: 214.925M average: 134.022M total: 1.34G
[event_value] BINARY 69.9% of all space [PLAIN, RLE, PLAIN_DICTIONARY] min: 21.865M max: 98.556M average: 84.707M total: 847.07M (raw data: 17.0G saving 95%)
values: min: 13.072M max: 35.794M average: 31.152M total: 311.52M
uncompressed: min: 582.655M max: 2.67G average: 1.7G total: 17.0G
[is_secondary_lookup] INT32 1.3% of all space [PLAIN, RLE, PLAIN_DICTIONARY] min: 670.874K max: 1.869M average: 1.638M total: 16.381M (raw data: 190.516M saving 91%)
values: min: 13.072M max: 35.794M average: 31.152M total: 311.52M
uncompressed: min: 6.598M max: 30.212M average: 19.051M total: 190.516M
number of blocks: 10
total data size: 1.21G (raw 27.021G)
total record: 311.52M
average block size: 121.063M (raw 2.702G)
average record count: 31.152M
[ipcdev@node1 ~]$

Cloudera Employee
Posts: 27
Registered: ‎09-27-2013

Re: Impala/Parquet Disk Space - Reorg increases usage 50%?

There's something weird going on.

 

Before:

total data size: 704.457M (raw 4.259G)
total record: 311.52M

 

After:

total data size: 1.21G (raw 27.021G)
total record: 311.52M

 

 

The compression ratio has improved quite a bit but for some reason the raw data seems to have gotten much bigger.

 

I'd look at the event_value column, which went from 1.5GB to 17GB (unencoded size).
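The event_value numbers from the PrintFooter output can be put side by side with a quick sketch. Note the record counts are identical in both tables, so the bigger "raw" figure presumably reflects how the unencoded size is being accounted (e.g., dictionary-encoded pages counted at their plain-encoded size), not extra rows - that's an assumption worth verifying:

```python
# event_value column totals from the PrintFooter output above.
# "raw" = unencoded/uncompressed size, "enc" = encoded on-disk size.
before_raw, before_enc = 1.555e9, 376.341e6   # testdb.event (30 files)
after_raw, after_enc = 17.0e9, 847.07e6       # testdb.event_accum (10 files)

print(f"before: {before_raw / before_enc:.1f}x compression")  # ~4.1x
print(f"after:  {after_raw / after_enc:.1f}x compression")    # ~20.1x
print(f"raw size grew {after_raw / before_raw:.1f}x")         # ~10.9x
```

So the encoded column only roughly doubled (376MB to 847MB) while the reported raw size grew almost 11x, which is why the compression ratio looks so much better after the reorg.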
