
Pushdown predicates using Amazon, Cloudera, Spark and Parquet
I'm using Cloudera 5.13 on Amazon (EC2 and S3) with Spark 2.3.0.


I want to ask whether this combination of tools applies predicate pushdown when processing Parquet files.

For example, if my job selects only a few columns from a Parquet file, does Spark transfer only those columns from S3, or the entire file?
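One way to check this without relying on runtime metrics is to inspect the physical plan. The sketch below (the S3 path and column names are hypothetical) reads a Parquet file, selects two columns, and applies a filter; in the plan output, a narrowed `ReadSchema` indicates column pruning, and `PushedFilters` lists the predicates handed down to the Parquet reader:

```python
# Sketch: inspecting column pruning and predicate pushdown in Spark's plan.
# The S3 path and column names below are placeholders, not from the original post.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pushdown-check").getOrCreate()

df = (spark.read.parquet("s3a://my-bucket/my-data.parquet")
      .select("col_a", "col_b")      # column pruning: only these columns should be read
      .filter("col_a > 100"))        # candidate for predicate pushdown

# Print the physical plan; look for a narrowed "ReadSchema" (only col_a, col_b)
# and a "PushedFilters" entry such as GreaterThan(col_a,100).
df.explain(True)
```

Note that `PushedFilters` in the plan only shows what Spark *offers* to the data source; whether row groups are actually skipped also depends on the Parquet statistics in the file, so the plan is a necessary but not sufficient check.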


I could look at the Spark job's input size metric, but it shows me seemingly random values during job execution. By random I mean that the reported input size increases and decreases unpredictably; I don't have a better description of this behavior:



It would be helpful if anyone who has run into this kind of issue could share their experience 🙂