
Impala/Parquet Disk Space - Reorg increases usage 50%?


New Contributor

I'm working up disk space estimates for a new application for a client and ran into this situation. From everything I've read about storing data in HDFS, the objective should always be "fewer files, bigger files". I have a partitioned Parquet table populated by INSERTs from 15 external tables based on tsv files (which turned into 30 files under /hive/warehouse), with an average file size of about 22MB and a total size of about 672MB - see the fsck output below under testdb.db/event.


According to the Impala INSERT reference, Impala creates 1GB blocks in HDFS, so I reasoned that I must be wasting a lot of space.
I thought I would effectively reorg the data by creating a new partitioned Parquet table and doing an INSERT INTO <toTable> SELECT ... FROM <fromTableWithMultipleFiles>, which (I thought) would create a new table with a single 1GB block file, saving space and hopefully making the queries even faster. But that didn't happen. Instead, the new table is using almost 50% more space - 1,155MB -
and is spread across 10 files - see the fsck output below under testdb.db/event_accum. I did a handful of simple queries against both tables and the response times were roughly the same. I'm running CDH 5.0.0 and Impala 1.3.0.
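
In other words, the reorg amounted to roughly the following (a sketch, not the verbatim statements - the actual DDL for the target table isn't shown here, and the SELECT list just spells out the table's columns):

CREATE TABLE testdb.event_accum LIKE testdb.event;  -- same schema, partitioning, and file format as the source
INSERT OVERWRITE testdb.event_accum PARTITION (udc_target_date='2014-03-29')
  SELECT udc_in_file_name, udc_in_file_offset, date_time, event_type,
         event_seqn, event_id, event_value, is_secondary_lookup
  FROM testdb.event
  WHERE udc_target_date = '2014-03-29';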

 

I have a few questions:

  • Am I doing something wrong? I don't understand how creating a single table file from multiple table files could actually increase the disk space used. I also don't understand the number of files - since the total space is 672MB, why were 10 files created instead of a single 1GB file?
  • Why are so many Parquet files created in the first place (30)? It looks like 2 files per insert; since I did 15 inserts, and each was < 1GB, I assumed there would be 15 files, each a single 1GB block (see the sketch after this list).
  • Does Parquet change the block size of each file as it is created?
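
(Regarding the sketch mentioned in the second question: from what I've read, these query options are the knobs that govern output file size and count, but I haven't verified their behavior on Impala 1.3.0, so treat this as an untested sketch.)

set PARQUET_FILE_SIZE=1073741824;  -- target size in bytes for each Parquet file an INSERT writes (assuming a 1GB default)
set NUM_NODES=1;                   -- force single-node execution so a single fragment writes the output files
-- then re-run the INSERT OVERWRITE ... SELECT from the sketch above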

 

Thanks
Pete Zybrick

++++++++++++++++++++++++++++++++++++
+++ fsck output for testdb.event +++
++++++++++++++++++++++++++++++++++++
[ipcdev@node1 ~]$ hdfs fsck /user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29 -files -blocks
Connecting to namenode via http://node1.ipc-global.com:50070
FSCK started by ipcdev (auth:SIMPLE) from /192.168.1.241 for path /user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29 at Sat May 03 14:58:46 EDT 2014
/user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29 <dir>
/user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29/2749302544c3cc81-476dded44718ef97_27784437_data.0 18562544 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073749671_8863 len=18562544 repl=3

/user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29/2749302544c3cc81-476dded44718ef98_764271366_data.0 38415896 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073749672_8864 len=38415896 repl=3

/user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29/2e44ac7f4828ea58-1c5361f3d889079f_108854480_data.0 15591794 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073749689_8881 len=15591794 repl=3

/user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29/2e44ac7f4828ea58-1c5361f3d88907a0_1363940349_data.0 20338472 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073749690_8882 len=20338472 repl=3

/user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29/3146df097b7dce69-fa2312eab5cb8c0_1266491270_data.0 9809213 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073749695_8887 len=9809213 repl=3

/user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29/3146df097b7dce69-fa2312eab5cb8c1_215817145_data.0 18651333 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073749696_8888 len=18651333 repl=3

/user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29/5e4dff03c860412e-4fcebeddc2a8bc9d_1516344242_data.0 18721656 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073749667_8859 len=18721656 repl=3

/user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29/5e4dff03c860412e-4fcebeddc2a8bc9e_1844720457_data.0 34591376 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073749668_8860 len=34591376 repl=3

/user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29/744e067ee663c51b-1be1c91e6f805987_1553519263_data.0 13473301 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073749693_8885 len=13473301 repl=3

/user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29/744e067ee663c51b-1be1c91e6f805988_759468439_data.0 18963742 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073749694_8886 len=18963742 repl=3

/user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29/864a57eb2cb34df8-5491c5f2b5f781b1_1041955503_data.0 38011674 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073749677_8869 len=38011674 repl=3

/user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29/864a57eb2cb34df8-5491c5f2b5f781b2_1547752638_data.0 21918497 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073749676_8868 len=21918497 repl=3

/user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29/8a4952d97d354baf-b7f4175d894db8b0_2077568951_data.0 21583907 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073749683_8875 len=21583907 repl=3

/user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29/8a4952d97d354baf-b7f4175d894db8b1_22983394_data.0 21358035 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073749682_8874 len=21358035 repl=3

/user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29/9b415271a265a51f-cd0ef6b8edfef79f_1975128969_data.0 37749053 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073749670_8862 len=37749053 repl=3

/user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29/9b415271a265a51f-cd0ef6b8edfef7a0_904643073_data.0 18921391 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073749669_8861 len=18921391 repl=3

/user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29/c64c8dd9eed1a13e-ce368c38b46783a3_429980002_data.0 21334339 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073749680_8872 len=21334339 repl=3

/user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29/c64c8dd9eed1a13e-ce368c38b46783a4_1347346617_data.0 29050702 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073749681_8873 len=29050702 repl=3

/user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29/c84263763f16053a-1ff679360bed5abd_681533929_data.0 20807361 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073749684_8876 len=20807361 repl=3

/user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29/c84263763f16053a-1ff679360bed5abe_685537178_data.0 21699313 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073749685_8877 len=21699313 repl=3

/user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29/df4fddd5a4747562-b9355d3752cb8098_1289445892_data.0 38632291 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073749674_8866 len=38632291 repl=3

/user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29/df4fddd5a4747562-b9355d3752cb8099_1295450792_data.0 19267145 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073749673_8865 len=19267145 repl=3

/user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29/e547b7211807fd01-4e5105ccff1e5e83_1813281128_data.0 19632386 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073749661_8853 len=19632386 repl=3

/user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29/e547b7211807fd01-4e5105ccff1e5e84_1287696553_data.0 37818967 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073749662_8854 len=37818967 repl=3

/user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29/f1472fdec18dab1d-49f3a85efd23fb85_542044747_data.0 17296655 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073749687_8879 len=17296655 repl=3

/user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29/f1472fdec18dab1d-49f3a85efd23fb86_1606899886_data.0 21362238 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073749688_8880 len=21362238 repl=3

/user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29/f945169204587db2-b8e7eeee071080a0_479376400_data.0 19253474 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073749678_8870 len=19253474 repl=3

/user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29/f945169204587db2-b8e7eeee071080a1_1568644014_data.0 38664451 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073749679_8871 len=38664451 repl=3

/user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29/fb4bfe8f54bbf34d-24841c6433b9bdb0_679637527_data.0 14209231 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073749691_8883 len=14209231 repl=3

/user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29/fb4bfe8f54bbf34d-24841c6433b9bdb1_1581016085_data.0 18802031 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073749692_8884 len=18802031 repl=3

Status: HEALTHY
Total size: 704492468 B
Total dirs: 1
Total files: 30
Total symlinks: 0
Total blocks (validated): 30 (avg. block size 23483082 B)
Minimally replicated blocks: 30 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 3.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 3
Number of racks: 1
FSCK ended at Sat May 03 14:58:46 EDT 2014 in 2 milliseconds


The filesystem under path '/user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29' is HEALTHY

++++++++++++++++++++++++++++++++++++++++++
+++ fsck output for testdb.event_accum +++
++++++++++++++++++++++++++++++++++++++++++
[ipcdev@node1 ~]$ hdfs fsck /user/hive/warehouse/testdb.db/event_accum/udc_target_date=2014-03-29 -files -blocks
Connecting to namenode via http://node1.ipc-global.com:50070
FSCK started by ipcdev (auth:SIMPLE) from /192.168.1.241 for path /user/hive/warehouse/testdb.db/event_accum/udc_target_date=2014-03-29 at Sat May 03 15:02:50 EDT 2014
/user/hive/warehouse/testdb.db/event_accum/udc_target_date=2014-03-29 <dir>
/user/hive/warehouse/testdb.db/event_accum/udc_target_date=2014-03-29/2e4409edc5a499df-4d100b8be8f330ba_285908234_data.0 136310840 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073753175_12375 len=136310840 repl=3

/user/hive/warehouse/testdb.db/event_accum/udc_target_date=2014-03-29/2e4409edc5a499df-4d100b8be8f330ba_285908234_data.1 137597016 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073753177_12377 len=137597016 repl=3

/user/hive/warehouse/testdb.db/event_accum/udc_target_date=2014-03-29/2e4409edc5a499df-4d100b8be8f330ba_285908234_data.2 137114257 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073753179_12379 len=137114257 repl=3

/user/hive/warehouse/testdb.db/event_accum/udc_target_date=2014-03-29/2e4409edc5a499df-4d100b8be8f330ba_285908234_data.3 136838616 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073753182_12382 len=136838616 repl=3

/user/hive/warehouse/testdb.db/event_accum/udc_target_date=2014-03-29/2e4409edc5a499df-4d100b8be8f330ba_285908234_data.4 36494878 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073753184_12384 len=36494878 repl=3

/user/hive/warehouse/testdb.db/event_accum/udc_target_date=2014-03-29/2e4409edc5a499df-4d100b8be8f330bb_417893698_data.0 137093507 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073753176_12376 len=137093507 repl=3

/user/hive/warehouse/testdb.db/event_accum/udc_target_date=2014-03-29/2e4409edc5a499df-4d100b8be8f330bb_417893698_data.1 137524781 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073753178_12378 len=137524781 repl=3

/user/hive/warehouse/testdb.db/event_accum/udc_target_date=2014-03-29/2e4409edc5a499df-4d100b8be8f330bb_417893698_data.2 138676793 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073753180_12380 len=138676793 repl=3

/user/hive/warehouse/testdb.db/event_accum/udc_target_date=2014-03-29/2e4409edc5a499df-4d100b8be8f330bb_417893698_data.3 136688653 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073753183_12383 len=136688653 repl=3

/user/hive/warehouse/testdb.db/event_accum/udc_target_date=2014-03-29/2e4409edc5a499df-4d100b8be8f330bb_417893698_data.4 76307973 bytes, 1 block(s): OK
0. BP-1358556399-192.168.1.241-1398819377649:blk_1073753185_12385 len=76307973 repl=3

Status: HEALTHY
Total size: 1210647314 B
Total dirs: 1
Total files: 10
Total symlinks: 0
Total blocks (validated): 10 (avg. block size 121064731 B)
Minimally replicated blocks: 10 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 3.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 3
Number of racks: 1
FSCK ended at Sat May 03 15:02:50 EDT 2014 in 1 milliseconds


The filesystem under path '/user/hive/warehouse/testdb.db/event_accum/udc_target_date=2014-03-29' is HEALTHY

5 REPLIES

Re: Impala/Parquet Disk Space - Reorg increases usage 50%?

Contributor

I'm a little confused by what your setup looks like. In the tsv files, how many partitions do you have, and how much data is in each partition?

 

Take a look at our docs here and see if that helps you. http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Im...

Re: Impala/Parquet Disk Space - Reorg increases usage 50%?

New Contributor

1. All of the data is already in a Parquet table named "testdb.event" with a single partition - udc_target_date=2014-03-29

2. I'm creating another Parquet table, "testdb.event_accum", same format as "testdb.event"

3. Run: INSERT OVERWRITE testdb.event_accum SELECT ... FROM testdb.event

4. testdb.event_accum is using roughly 50% more disk space than testdb.event

5. Regarding "how much data in each partition": the number of rows in testdb.event and testdb.event_accum is exactly the same. The amount of disk space used is in the fsck output at the bottom of the original post.
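
For reference, a quicker way to compare the two tables than fsck would be something like this (a sketch - I believe SHOW TABLE STATS reports per-partition #Rows, #Files, and Size in this Impala version, but the fsck output is what I actually captured):

SHOW TABLE STATS testdb.event;
SHOW TABLE STATS testdb.event_accum;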

Thanks

Pete

Re: Impala/Parquet Disk Space - Reorg increases usage 50%?

Contributor

I see. Can you run this tool to print the footer metadata? It will print the size of each column.

 

https://github.com/Parquet/parquet-mr/blob/master/parquet-hadoop/src/main/java/parquet/hadoop/PrintF...
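
The invocation is something like this (a sketch - the jar name depends on how you build parquet-mr, and the poster's actual shaded-jar invocation appears in the next reply):

hadoop jar parquet-hadoop-<version>.jar parquet.hadoop.PrintFooter /user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29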

 

 


Re: Impala/Parquet Disk Space - Reorg increases usage 50%?

New Contributor

Results below.

Thanks

Pete

 

[ipcdev@node1 ~]$ hadoop jar mrjobjars/catalyst-1.0.0-shaded.jar com.ipcglobal.catalyst.utils.PrintFooter /user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29
listing files in hdfs://node1.ipc-global.com:8020/user/hive/warehouse/testdb.db/event/udc_target_date=2014-03-29
opening 30 files
0% [SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".0%
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
************************************************************
read all footers in 985 ms
[udc_in_file_name] BINARY 0.0% of all space [PLAIN, RLE, PLAIN_DICTIONARY] min: 390 max: 1.138K average: 750 total: 22.512K (raw data: 21.216K saving -6%)
values: min: 4.836M max: 16.358M average: 10.384M total: 311.52M
uncompressed: min: 368 max: 1.072K average: 707 total: 21.216K
[udc_in_file_offset] INT64 22.4% of all space [PLAIN, RLE, PLAIN_DICTIONARY] min: 2.127M max: 8.619M average: 5.273M total: 158.212M (raw data: 2.005G saving 92%)
values: min: 4.836M max: 16.358M average: 10.384M total: 311.52M
uncompressed: min: 22.427M max: 114.186M average: 66.834M total: 2.005G
[date_time] INT96 4.9% of all space [PLAIN, RLE, PLAIN_DICTIONARY] min: 530.522K max: 1.781M average: 1.158M total: 34.747M (raw data: 35.526M saving 2%)
values: min: 4.836M max: 16.358M average: 10.384M total: 311.52M
uncompressed: min: 549.345K max: 1.817M average: 1.184M total: 35.526M
[event_type] BINARY 1.5% of all space [PLAIN, RLE, PLAIN_DICTIONARY] min: 155.184K max: 527.673K average: 352.787K total: 10.583M (raw data: 32.746M saving 67%)
values: min: 4.836M max: 16.358M average: 10.384M total: 311.52M
uncompressed: min: 513.939K max: 1.777M average: 1.091M total: 32.746M
[event_seqn] INT32 4.9% of all space [PLAIN, RLE, PLAIN_DICTIONARY] min: 528.486K max: 1.817M average: 1.165M total: 34.964M (raw data: 196.236M saving 82%)
values: min: 4.836M max: 16.358M average: 10.384M total: 311.52M
uncompressed: min: 3.034M max: 10.262M average: 6.541M total: 196.236M
[event_id] BINARY 10.5% of all space [PLAIN, RLE, PLAIN_DICTIONARY] min: 1.098M max: 3.771M average: 2.482M total: 74.471M (raw data: 375.767M saving 80%)
values: min: 4.836M max: 16.358M average: 10.384M total: 311.52M
uncompressed: min: 5.49M max: 19.993M average: 12.525M total: 375.767M
[event_value] BINARY 53.4% of all space [PLAIN, RLE, PLAIN_DICTIONARY] min: 5.147M max: 24.339M average: 12.544M total: 376.341M (raw data: 1.555G saving 75%)
values: min: 4.836M max: 16.358M average: 10.384M total: 311.52M
uncompressed: min: 11.336M max: 129.761M average: 51.842M total: 1.555G
[is_secondary_lookup] INT32 2.1% of all space [PLAIN, RLE, PLAIN_DICTIONARY] min: 220.195K max: 739.523K average: 503.788K total: 15.113M (raw data: 58.915M saving 74%)
values: min: 4.836M max: 16.358M average: 10.384M total: 311.52M
uncompressed: min: 919.783K max: 3.068M average: 1.963M total: 58.915M
number of blocks: 30
total data size: 704.457M (raw 4.259G)
total record: 311.52M
average block size: 23.481M (raw 141.983M)
average record count: 10.384M
[ipcdev@node1 ~]$ hadoop jar mrjobjars/catalyst-1.0.0-shaded.jar com.ipcglobal.catalyst.utils.PrintFooter /user/hive/warehouse/testdb.db/event_accum/udc_target_date=2014-03-29
listing files in hdfs://node1.ipc-global.com:8020/user/hive/warehouse/testdb.db/event_accum/udc_target_date=2014-03-29
opening 10 files
0% [SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".0%
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
************************************************************
read all footers in 724 ms
[udc_in_file_name] BINARY 0.0% of all space [PLAIN, RLE, PLAIN_DICTIONARY] min: 3.862K max: 16.706K average: 11.94K total: 119.409K (raw data: 2.609M saving 95%)
values: min: 13.072M max: 35.794M average: 31.152M total: 311.52M
uncompressed: min: 60.876K max: 417.862K average: 260.922K total: 2.609M
[udc_in_file_offset] INT64 14.2% of all space [PLAIN, RLE, PLAIN_DICTIONARY] min: 6.894M max: 19.954M average: 17.255M total: 172.552M (raw data: 7.607G saving 97%)
values: min: 13.072M max: 35.794M average: 31.152M total: 311.52M
uncompressed: min: 263.001M max: 1.203G average: 760.751M total: 7.607G
[date_time] INT96 3.1% of all space [PLAIN, RLE, PLAIN_DICTIONARY] min: 1.599M max: 4.374M average: 3.848M total: 38.481M (raw data: 127.714M saving 69%)
values: min: 13.072M max: 35.794M average: 31.152M total: 311.52M
uncompressed: min: 4.44M max: 20.305M average: 12.771M total: 127.714M
[event_type] BINARY 0.9% of all space [PLAIN, RLE, PLAIN_DICTIONARY] min: 477.92K max: 1.343M average: 1.17M total: 11.707M (raw data: 107.203M saving 89%)
values: min: 13.072M max: 35.794M average: 31.152M total: 311.52M
uncompressed: min: 3.695M max: 16.975M average: 10.72M total: 107.203M
[event_seqn] INT32 3.1% of all space [PLAIN, RLE, PLAIN_DICTIONARY] min: 1.557M max: 4.37M average: 3.799M total: 37.999M (raw data: 644.963M saving 94%)
values: min: 13.072M max: 35.794M average: 31.152M total: 311.52M
uncompressed: min: 21.886M max: 101.658M average: 64.496M total: 644.963M
[event_id] BINARY 7.1% of all space [PLAIN, RLE, PLAIN_DICTIONARY] min: 3.423M max: 10.185M average: 8.632M total: 86.322M (raw data: 1.34G saving 93%)
values: min: 13.072M max: 35.794M average: 31.152M total: 311.52M
uncompressed: min: 45.663M max: 214.925M average: 134.022M total: 1.34G
[event_value] BINARY 69.9% of all space [PLAIN, RLE, PLAIN_DICTIONARY] min: 21.865M max: 98.556M average: 84.707M total: 847.07M (raw data: 17.0G saving 95%)
values: min: 13.072M max: 35.794M average: 31.152M total: 311.52M
uncompressed: min: 582.655M max: 2.67G average: 1.7G total: 17.0G
[is_secondary_lookup] INT32 1.3% of all space [PLAIN, RLE, PLAIN_DICTIONARY] min: 670.874K max: 1.869M average: 1.638M total: 16.381M (raw data: 190.516M saving 91%)
values: min: 13.072M max: 35.794M average: 31.152M total: 311.52M
uncompressed: min: 6.598M max: 30.212M average: 19.051M total: 190.516M
number of blocks: 10
total data size: 1.21G (raw 27.021G)
total record: 311.52M
average block size: 121.063M (raw 2.702G)
average record count: 31.152M
[ipcdev@node1 ~]$

Re: Impala/Parquet Disk Space - Reorg increases usage 50%?

Contributor

There's something weird going on.

 

Before:

total data size: 704.457M (raw 4.259G)
total record: 311.52M

 

After:

total data size: 1.21G (raw 27.021G)
total record: 311.52M

 

 

The compression ratio has improved quite a bit but for some reason the raw data seems to have gotten much bigger.
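
Putting numbers on that, from the two PrintFooter summaries:

704.457M / 4.259G ~ 16.5%  (event: stored size as a fraction of raw)
1.21G / 27.021G   ~  4.5%  (event_accum: same ratio)

So the new table compresses about 3.7x better relative to raw, but the raw number itself grew about 6.3x for the same 311.52M records.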

 

I'd look at the event_value column, which went from 1.5GB to 17GB (unencoded size).
