Support Questions

Find answers, ask questions, and share your expertise

Does Impala support S3 select push down?

avatar
Contributor

Hello Team,

We are using Impala to query data stored as parquet on s3. This has been an awesome feature. Recently Amazon S3 has announced a new feature called S3 Select  which helps in speeding up the column projection when querying data stored on S3. As of now Hive and Presto support  S3 select push down.

Is Impala going to support S3 select push downs?

 

Parth

2 ACCEPTED SOLUTIONS

avatar
Expert Contributor

Hi @parthk,

 

This is a tough question, because when discussing S3 access multiple components come into picture:

  1. First and foremost S3, S3 Select only supports CSV and JSON format at the moment, while Impala/Hive favors columnar storage formats Parquet/ORC in general. With just a couple of fields to filter on a partition strategy should possibly have the similar results with Parquet/ORC, I have not tested this, it would need some perf test on the datasets.
  2. Secondly Impala/Hive connects to S3 with the Hadoop S3A client, which is in the hadoop-aws module. An experimental S3 Select feature doc can be found here.
  3. Lastly the third party component has to support it as well. I spent some time on the AWS Hive S3 Select support and it seem to be a closed source INPUTFORMAT solution, I could not find 'com.amazonaws.emr.s3select.hive.S3SelectableTextInputFormat' anywhere. Digging a bit more I have found that upstream Hive does not support S3 Select either, the upstream Jira is HIVE-21112.

I hope this 10,000 foot view helps, it is hard to answer questions about the future.

View solution in original post

avatar

We don't support it or have near-term plans to support it. We have an alternate strategy for S3 perf in Impala, based on our high-performance native Parquet implementation and our remote data cache. The idea is to cache all frequently-accessed parts of Parquet files (footers, columns referenced by queries, etc) on local storage on compute nodes. This avoids going across the network entirely once data is cached and can give speedups for all queries, not just highly selective queries. Like everything in databases, there are pros and cons to this design, but it's a really good fit for the kinda of interactive analytic workloads we see a lot of (BI, dashboards, ad-hoc querying, etc).

As far as I'm aware, benchmarks that report impressive speedups for S3 select on Parquet are not comparing to a system with any kind of data caching, or, necessarily, to an implementation of Parquet that is as optimized as Impala's. Comparing cost is also tricky in benchmarks, because it adds an extra way that you're indirectly paying for compute. Whether it's beneficial for cost/performance depends on a lot of variables, including the workload intensity, compute engine and structure of the files.

We've also been tackling selective query perf in different ways. We invested in Parquet page indices as a targeted enhancement for selective queries: https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/ and have been involved in adding bloom filters to the Parquet standard to help optimize selective queries across all engines and storage systems.

We also care a lot about open file formats, open implementations and interoperability between the engines we ship - if we were to add support we'd want to make sure it was 100% compatible. It's hard to ensure that with an opaque implementation of a fairly complex file format that can change underneath us. Our team has got battle-scars from ironing out a lot of the small details of the Parquet file format, and particularly getting predicate evaluation/push-down right so we don't take adding another layer/dimension lightly - getting all the details right and testing that it all works is a lot of work.

Anyway, it's a cool technology that's on our radar, but so far it hasn't been the right solution for any of the problems we've tackled for Impala query performance on Parquet.

View solution in original post

5 REPLIES 5

avatar
Expert Contributor

Hi @parthk,

 

This is a tough question, because when discussing S3 access multiple components come into picture:

  1. First and foremost S3, S3 Select only supports CSV and JSON format at the moment, while Impala/Hive favors columnar storage formats Parquet/ORC in general. With just a couple of fields to filter on a partition strategy should possibly have the similar results with Parquet/ORC, I have not tested this, it would need some perf test on the datasets.
  2. Secondly Impala/Hive connects to S3 with the Hadoop S3A client, which is in the hadoop-aws module. An experimental S3 Select feature doc can be found here.
  3. Lastly the third party component has to support it as well. I spent some time on the AWS Hive S3 Select support and it seem to be a closed source INPUTFORMAT solution, I could not find 'com.amazonaws.emr.s3select.hive.S3SelectableTextInputFormat' anywhere. Digging a bit more I have found that upstream Hive does not support S3 Select either, the upstream Jira is HIVE-21112.

I hope this 10,000 foot view helps, it is hard to answer questions about the future.

avatar

We don't support it or have near-term plans to support it. We have an alternate strategy for S3 perf in Impala, based on our high-performance native Parquet implementation and our remote data cache. The idea is to cache all frequently-accessed parts of Parquet files (footers, columns referenced by queries, etc) on local storage on compute nodes. This avoids going across the network entirely once data is cached and can give speedups for all queries, not just highly selective queries. Like everything in databases, there are pros and cons to this design, but it's a really good fit for the kinda of interactive analytic workloads we see a lot of (BI, dashboards, ad-hoc querying, etc).

As far as I'm aware, benchmarks that report impressive speedups for S3 select on Parquet are not comparing to a system with any kind of data caching, or, necessarily, to an implementation of Parquet that is as optimized as Impala's. Comparing cost is also tricky in benchmarks, because it adds an extra way that you're indirectly paying for compute. Whether it's beneficial for cost/performance depends on a lot of variables, including the workload intensity, compute engine and structure of the files.

We've also been tackling selective query perf in different ways. We invested in Parquet page indices as a targeted enhancement for selective queries: https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/ and have been involved in adding bloom filters to the Parquet standard to help optimize selective queries across all engines and storage systems.

We also care a lot about open file formats, open implementations and interoperability between the engines we ship - if we were to add support we'd want to make sure it was 100% compatible. It's hard to ensure that with an opaque implementation of a fairly complex file format that can change underneath us. Our team has got battle-scars from ironing out a lot of the small details of the Parquet file format, and particularly getting predicate evaluation/push-down right so we don't take adding another layer/dimension lightly - getting all the details right and testing that it all works is a lot of work.

Anyway, it's a cool technology that's on our radar, but so far it hasn't been the right solution for any of the problems we've tackled for Impala query performance on Parquet.

avatar
Contributor

@Tim Armstrong  thanks for detailed insights this will be very helpful.

avatar
Community Manager

Hi @parthk , 

 

I'm happy to see you have found the resolution to your issue. Can you kindly mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future?

 

Thanks,

Vidya 



Regards,

Vidya Sargur,
Community Manager


Was your question answered? Make sure to mark the answer as the accepted solution.
If you find a reply useful, say thanks by clicking on the thumbs up button.
Learn more about the Cloudera Community:

avatar
Contributor

Thanks for the insights @tmater . This is very helpful