
Impala insert with snappy

Contributor

Hi,

 

I noticed that when doing an Impala INSERT ... SELECT, the resulting size is smaller than when I did the same in Hive. Then from the profile I saw:

 

HDFS_SCAN_NODE (id=0):(Total: 1m12s, non-child: 1m12s, % non-child: 100.00%)
          Hdfs split stats (<volume id>:<# splits>/<split lengths>): 0:22/19.78 GB 
          ExecOption: PARQUET Codegen Enabled, Codegen enabled: 22 out of 22
          Hdfs Read Thread Concurrency Bucket: 0:48.12% 1:51.88% 2:0% 3:0% 4:0% 
          File Formats: PARQUET/SNAPPY:704 

I did not set compression; is this the default?

 

Also, I read this, saying that Snappy can be slow:

----------------------------------------------------------------------------------------------------------------------------------------------------

At the same time, the less aggressive the compression, the faster the data can be decompressed. In this case using a table with a billion rows, a query that evaluates all the values for a particular column runs faster with no compression than with Snappy compression, and faster with Snappy compression than with Gzip compression. Query performance depends on several other factors, so as always, run your own benchmarks with your own data to determine the ideal tradeoff between data size, CPU efficiency, and speed of insert and query operations.

-----------------------------------------------------------------------------------------------------------------------------------------------------

 

Should we avoid using Snappy?

 

Thanks

Shannon

12 REPLIES

Contributor

From the profile:

 

 

- DecompressionTime: 3m31s

 

Is this due to Snappy, or to Parquet decompression in general?

 

Thanks

Shannon 

Contributor

Saw this:

 

----------------------------------------------------------------------------------------------------------------

By default, the underlying data files for a Parquet table are compressed with Snappy.

----------------------------------------------------------------------------------------------------------------

DecompressionTime should only include Snappy/Gzip/whatever compression, not decoding of Parquet's native encodings.


Generally Snappy works well but there are probably cases where uncompressed would be faster. Compression often improves performance if your workload is bottlenecked on disk I/O since it reduces the amount of data that needs to be read from disk.

For reference, Snappy can decompress in the range of 1.5GB/s per core, so it is pretty fast!

 

https://www.percona.com/blog/2016/04/13/evaluating-database-compression-methods-update/
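
If you want to benchmark this on your own data, a minimal sketch (COMPRESSION_CODEC is a standard Impala query option; big_table and some_col are placeholders for your own table and column):

-- Build one copy of the data per codec.
SET COMPRESSION_CODEC=snappy;
CREATE TABLE t_snappy STORED AS PARQUET AS SELECT * FROM big_table;

SET COMPRESSION_CODEC=none;
CREATE TABLE t_none STORED AS PARQUET AS SELECT * FROM big_table;

-- Compare on-disk sizes ...
SHOW TABLE STATS t_snappy;
SHOW TABLE STATS t_none;

-- ... then run the same scan against each and compare the query profiles.
SELECT count(*) FROM t_snappy WHERE some_col > 0;
SELECT count(*) FROM t_none WHERE some_col > 0;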

Contributor

Thanks Tim,

 

So in this case, I see:

 

 

- DecompressionTime: 3m31s

 

It is taking over 3 minutes; should I avoid compression for this table?

 

Thanks

Shannon 

It's hard to answer in isolation. You could compare it to MaterializeTupleTime, which is the amount of time spent materializing rows from the table. If DecompressionTime is small in comparison, then it's not eating up a significant amount of CPU time.

 

I suspect if it's taking that long to decompress then you're also saving significant I/O. You could look at StorageWaitTime to get an idea of how much time is spent waiting for I/O.

 

I haven't seen many cases in practice where snappy decompression is the performance bottleneck for a query.

Contributor

I searched and saw:

 

 - MaterializeTupleTime(*): 13m22s

I saw 6 of these, all roughly the same.

 

and below that, TotalStorageWaitTime ranging from 30s to 1m12s:

 

 

TotalStorageWaitTime: 1m12s

 

 

Based on that, the query is likely bottlenecked more on the number of rows it's scanning than on decompression.
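
(Rough arithmetic on those numbers, for what it's worth: DecompressionTime of 3m31s is about 211s against roughly 802s of MaterializeTupleTime, i.e. about a quarter, so removing compression entirely could not speed the scan up by more than roughly that fraction.)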

Contributor

I see. Yes, this particular table is huge. Is there anything else we can do to improve performance?

 

Thanks

Shannon

CDH 5.12 had some Parquet perf improvements, particularly for selective scans (e.g. when you have a condition in your WHERE clause that returns a small fraction of rows).

If the queries frequently filter on a specific non-partition column, you might be able to take advantage of Impala's min/max statistics by creating a sorted table: https://www.cloudera.com/documentation/enterprise/latest/topics/impala_create_table.html. If the data is sorted then Impala can often skip over whole files that don't contain relevant data.
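
For example, a sketch of what that could look like (syntax per the CREATE TABLE docs linked above; the table and column names here are hypothetical):

-- SORT BY makes Impala sort rows on these columns within each written file,
-- which tightens the per-file min/max statistics that scans use for skipping.
CREATE TABLE events_sorted (
  event_time TIMESTAMP,
  user_id BIGINT,
  payload STRING
)
SORT BY (event_time)
STORED AS PARQUET;

-- Repopulate from the existing table so the new files come out sorted.
INSERT INTO events_sorted SELECT event_time, user_id, payload FROM events;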

Contributor

Thanks, we will look into the upgrade option.

 

Will also check sorted tables. I can put in multiple columns, right? Does the order of the sort columns make any difference?

 

Thanks

Shannon

It will sort by the first column first, then the second column, etc. So filters on the first column will likely be most effective.
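
A hypothetical illustration of that ordering effect:

-- Files are clustered by store_id first, so filters on store_id prune well;
-- a filter on sale_day alone helps less, since each file can still span a
-- wide range of days.
CREATE TABLE sales_sorted (
  store_id INT,
  sale_day TIMESTAMP,
  amount DECIMAL(10,2)
)
SORT BY (store_id, sale_day)
STORED AS PARQUET;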
