Created 04-30-2020 06:24 AM
Hello Team,
We are using Impala to query data stored as Parquet on S3, and it has been an awesome feature. Amazon recently announced a new S3 feature called S3 Select, which speeds up column projection when querying data stored on S3. As of now, Hive and Presto support S3 Select pushdown.
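For context, this is roughly what S3 Select pushdown looks like from a client. A minimal boto3 sketch (the bucket, key, and column names here are made up for illustration): the SQL expression is evaluated inside S3 itself, so only the projected, filtered bytes travel over the network.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket, key, and columns, purely for illustration.
response = s3.select_object_content(
    Bucket="my-bucket",
    Key="warehouse/events/part-00000.parquet",
    ExpressionType="SQL",
    # S3 evaluates this expression server-side and streams back
    # only the projected column for the matching rows.
    Expression="SELECT s.user_id FROM s3object s WHERE s.status = 'ERROR'",
    InputSerialization={"Parquet": {}},
    OutputSerialization={"CSV": {}},  # S3 Select emits CSV or JSON, not Parquet
)

# The response payload is an event stream; 'Records' events carry the data.
for event in response["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"), end="")
```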
Is Impala going to support S3 Select pushdown?
Parth
Created 05-11-2020 01:31 AM
Hi @parthk,
This is a tough question, because multiple components come into the picture when discussing S3 access:
I hope this 10,000-foot view helps; it is hard to answer questions about the future.
Created 05-11-2020 09:38 AM
We don't support it or have near-term plans to support it. We have an alternate strategy for S3 performance in Impala, based on our high-performance native Parquet implementation and our remote data cache. The idea is to cache all frequently-accessed parts of Parquet files (footers, columns referenced by queries, etc.) on local storage on the compute nodes. This avoids going across the network entirely once data is cached, and it can give speedups for all queries, not just highly selective ones. Like everything in databases, there are pros and cons to this design, but it's a really good fit for the kinds of interactive analytic workloads we see a lot of (BI, dashboards, ad-hoc querying, etc.).
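To make the caching idea concrete, here is a toy Python sketch of the concept (an illustration of the design only, not Impala's actual implementation): byte ranges of remote files are kept on local storage keyed by file and offset, and reused on later reads, with least-recently-used eviction.

```python
from collections import OrderedDict

class RemoteRangeCache:
    """Toy illustration of a remote data cache: byte ranges of remote
    files (e.g. Parquet footers and column chunks) are kept locally,
    keyed by (path, offset, length), with LRU eviction."""

    def __init__(self, capacity_bytes, fetch_fn):
        self.capacity = capacity_bytes
        self.used = 0
        self.fetch_fn = fetch_fn      # reads (path, offset, length) from S3
        self.entries = OrderedDict()  # (path, offset, length) -> bytes

    def read(self, path, offset, length):
        key = (path, offset, length)
        if key in self.entries:       # cache hit: no network round trip
            self.entries.move_to_end(key)
            return self.entries[key]
        data = self.fetch_fn(path, offset, length)  # cache miss: go to S3
        self.entries[key] = data
        self.used += len(data)
        while self.used > self.capacity:  # evict least-recently-used ranges
            _, evicted = self.entries.popitem(last=False)
            self.used -= len(evicted)
        return data
```

A scanner would call read() for each footer or column chunk it needs: the first query pays the S3 round trips, and repeat queries over hot data are served entirely from local storage.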
As far as I'm aware, benchmarks that report impressive speedups for S3 Select on Parquet are not comparing against a system with any kind of data caching, or necessarily against an implementation of Parquet that is as optimized as Impala's. Comparing cost is also tricky in benchmarks, because S3 Select adds another way in which you're indirectly paying for compute. Whether it's beneficial for cost/performance depends on a lot of variables, including the workload intensity, the compute engine, and the structure of the files.
We've also been tackling selective query performance in different ways. We invested in Parquet page indexes as a targeted enhancement for selective queries (https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/), and we have been involved in adding bloom filters to the Parquet standard to help optimize selective queries across all engines and storage systems.
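As a small illustration of the statistics-based skipping these features enable, here is a pyarrow sketch (file and column names are made up, and the write_page_index option is assumed to be available in your pyarrow version):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Write a file with column statistics and (in recent pyarrow versions)
# Parquet page indexes, which readers can consult to skip data.
table = pa.table({
    "id": list(range(100_000)),
    "value": [i % 97 for i in range(100_000)],
})
pq.write_table(table, "events.parquet", write_page_index=True)

# A selective scan: the filter is pushed into the reader, so row groups
# (and pages, where the reader supports page indexes) whose min/max
# statistics cannot match are skipped rather than read and decoded.
selective = pq.read_table("events.parquet", filters=[("value", "=", 42)])
print(selective.num_rows)
```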
We also care a lot about open file formats, open implementations and interoperability between the engines we ship - if we were to add support, we'd want to make sure it was 100% compatible. It's hard to ensure that with an opaque implementation of a fairly complex file format that can change underneath us. Our team has battle scars from ironing out a lot of the small details of the Parquet file format, particularly getting predicate evaluation/push-down right, so we don't take adding another layer/dimension lightly - getting all the details right and testing that it all works is a lot of work.
Anyway, it's a cool technology that's on our radar, but so far it hasn't been the right solution for any of the problems we've tackled for Impala query performance on Parquet.
Created 05-13-2020 04:53 AM
@Tim Armstrong, thanks for the detailed insights; this will be very helpful.
Created 05-14-2020 09:44 AM
Hi @parthk ,
I'm happy to see you have found the resolution to your issue. Can you kindly mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future?
Thanks,
Vidya
Created 05-13-2020 02:00 AM
Thanks for the insights, @tmater. This is very helpful.