I am running the TPC-DS queries in Impala. The Impala version is 2.10.0 on CDH 5.13.0, and there are 4 impalad nodes along with 4 HDFS data nodes. Each data node has 7 HDDs attached. The dataset is 7.5 TB in Parquet format (the original text format is 11 TB). I ran query q42.sql from the TPC-DS queries. From the query profile, I learned that the data size read for the store_sales table averages 10.04 GB per node across the 4 nodes. While the query was running, I used NMON to collect disk metrics. The weird thing is that the data size I calculated from the NMON collection is 14.4 GB (on the same Impala node). So why is there a gap between these two numbers (10.04 GB in the profile and 14.4 GB in NMON)?
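To put a number on the discrepancy, here is a minimal sketch of the arithmetic using only the two figures quoted above. The variable names are my own, and it assumes both tools report in the same unit, which I have not verified.

```python
# Figures taken from the question; both are assumed to be in the same unit (GB).
profile_gb = 10.04  # data size read for store_sales per node, from the Impala query profile
nmon_gb = 14.4      # data size calculated from the NMON capture on the same node

gap_gb = nmon_gb - profile_gb   # absolute difference between the two measurements
ratio = nmon_gb / profile_gb    # how much larger the NMON figure is

print(f"gap: {gap_gb:.2f} GB, NMON/profile ratio: {ratio:.2f}")
```

So NMON shows roughly 4.36 GB more than the profile on that node, about 43% higher.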