Created on 08-15-2017 08:45 AM - edited 09-16-2022 05:05 AM
Hi,
I noticed that when doing an Impala INSERT ... SELECT, the resulting data size is smaller than when I did the same in Hive. Then from the profile I saw:
HDFS_SCAN_NODE (id=0):(Total: 1m12s, non-child: 1m12s, % non-child: 100.00%)
  Hdfs split stats (<volume id>:<# splits>/<split lengths>): 0:22/19.78 GB
  ExecOption: PARQUET Codegen Enabled, Codegen enabled: 22 out of 22
  Hdfs Read Thread Concurrency Bucket: 0:48.12% 1:51.88% 2:0% 3:0% 4:0%
  File Formats: PARQUET/SNAPPY:704
I did not set compression; is this the default?
Also, I read this, which says Snappy can be slow:
----------------------------------------------------------------------------------------------------------------------------------------------------
At the same time, the less aggressive the compression, the faster the data can be decompressed. In this case using a table with a billion rows, a query that evaluates all the values for a particular column runs faster with no compression than with Snappy compression, and faster with Snappy compression than with Gzip compression. Query performance depends on several other factors, so as always, run your own benchmarks with your own data to determine the ideal tradeoff between data size, CPU efficiency, and speed of insert and query operations.
-----------------------------------------------------------------------------------------------------------------------------------------------------
Should we avoid using Snappy?
Thanks
Shannon
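For reference, here is a minimal sketch of the kind of benchmark the quoted documentation suggests. The table and column names (my_table, my_table_uncomp, some_col) are hypothetical; COMPRESSION_CODEC is the Impala query option that controls the codec used when writing Parquet.

  -- Run in impala-shell. Write an uncompressed copy of the table to compare against.
  SET COMPRESSION_CODEC=NONE;
  CREATE TABLE my_table_uncomp STORED AS PARQUET AS SELECT * FROM my_table;

  -- Restore the default codec for later writes.
  SET COMPRESSION_CODEC=SNAPPY;

  -- Time the same column scan against both copies, then inspect each
  -- query with PROFILE; in impala-shell.
  SELECT count(some_col) FROM my_table;
  SELECT count(some_col) FROM my_table_uncomp;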
Created 08-15-2017 08:47 AM
From the profile:
- DecompressionTime: 3m31s
Is this due to Snappy, or to Parquet decompression in general?
Thanks
Shannon
Created 08-15-2017 09:04 AM
Saw this
----------------------------------------------------------------------------------------------------------------
By default, the underlying data files for a Parquet table are compressed with Snappy.
Created 08-15-2017 09:14 AM
DecompressionTime should only include the Snappy/Gzip/whatever general-purpose compression, not the decoding of Parquet's native encodings.
Generally Snappy works well but there are probably cases where uncompressed would be faster. Compression often improves performance if your workload is bottlenecked on disk I/O since it reduces the amount of data that needs to be read from disk.
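A quick way to gauge how much disk I/O the compression is saving is to compare on-disk sizes (a sketch; my_table and my_table_uncomp are the hypothetical names from the earlier example). SHOW TABLE STATS reports a Size column per table.

  -- Compare the Size column for the two copies; the difference is roughly
  -- the amount of data the scan no longer has to read from disk.
  SHOW TABLE STATS my_table;
  SHOW TABLE STATS my_table_uncomp;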
Created 08-15-2017 09:17 AM
For reference, Snappy can decompress in the range of 1.5GB/s per core, so it is pretty fast!
https://www.percona.com/blog/2016/04/13/evaluating-database-compression-methods-update/
Created 08-15-2017 09:18 AM
Thanks Tim,
So in this case I see
- DecompressionTime: 3m31s
taking over 3 minutes. Should I avoid compression for this table?
Thanks
Shannon
Created 08-15-2017 09:35 AM
It's hard to answer in isolation. You could compare it to MaterializeTupleTime, which is the amount of time spent materializing rows from the table. If DecompressionTime is small in comparison, then decompression is not eating up a significant amount of CPU time.
I suspect if it's taking that long to decompress then you're also saving significant I/O. You could look at StorageWaitTime to get an idea of how much time is spent waiting for I/O.
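If it helps, those counters can be pulled straight from the query profile in impala-shell right after the query finishes (a sketch; the counter names appear under the HDFS_SCAN_NODE section, as in the profile excerpt above).

  -- Per-operator timing overview.
  SUMMARY;
  -- Full profile; look for MaterializeTupleTime, TotalStorageWaitTime
  -- and DecompressionTime under HDFS_SCAN_NODE.
  PROFILE;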
I haven't seen many cases in practice where snappy decompression is the performance bottleneck for a query.
Created 08-15-2017 10:42 AM
I searched and saw:
- MaterializeTupleTime(*): 13m22s
I saw 6 of these, all roughly the same.
And below that, TotalStorageWaitTime ranges from 30s to 1m12s:
- TotalStorageWaitTime: 1m12s
Created 08-15-2017 11:22 AM
Based on that, the query is likely bottlenecked more on the number of rows it's scanning than on decompression.
Created 08-15-2017 11:29 AM
I see. Yes, this particular table is huge. Is there anything else we can do to improve it?
Thanks
Shannon
Created 08-15-2017 11:58 AM
Created 08-15-2017 12:29 PM
Thanks, we will look at the option to upgrade.
We will also check the sorted table option. I can put in multiple columns, right? Does the order of the sorted columns make any difference?
Thanks
Shannon
Created 08-15-2017 01:04 PM
It will sort by the first column first, then the second column, etc. So filters on the first column will likely be most effective.
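For reference, a minimal sketch of what that looks like with the SORT BY clause (available in Impala 2.9 and later); the table and column names here are made up.

  -- Rows in each data file are sorted by customer_id first, then sale_date,
  -- so filters on customer_id are the ones most likely to let Impala skip
  -- data using the Parquet min/max statistics.
  CREATE TABLE sales_sorted (
    customer_id BIGINT,
    sale_date   TIMESTAMP,
    amount      DECIMAL(12,2)
  )
  SORT BY (customer_id, sale_date)
  STORED AS PARQUET;

  INSERT INTO sales_sorted SELECT customer_id, sale_date, amount FROM sales;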