
Pushdown predicates using Amazon, Cloudera, Spark and Parquet

Hi,

 

I'm using Cloudera 5.13 in amazon, Spark 2.3.0, EC2 and S3.

 

I want to ask whether this combination of tools applies pushdown predicates when processing Parquet files.

For example, if my job selects only a few columns from the Parquet file, does it transfer only those columns from S3, or the entire Parquet file?
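
For context, here is a simplified sketch of the kind of read I mean (Scala, spark-shell; the s3a path and column names are just placeholders, not my real data). As far as I understand, the FileScan Parquet node printed by explain() should show ReadSchema and PushedFilters, which is how I would expect to verify column pruning and filter pushdown:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parquet-pushdown-check")
  .getOrCreate()

// Placeholder bucket/path and column names.
val df = spark.read.parquet("s3a://some-bucket/events/")
  .select("user_id", "event_time")
  .filter("event_time >= '2018-07-01'")

// The FileScan node in the physical plan lists ReadSchema (the columns
// actually read) and PushedFilters (predicates handed to the Parquet reader).
df.explain(true)

But even if the plan looks right, I am not sure whether the S3 transfer itself is limited to those columns, which is what I really want to confirm.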

 

I could look at the Spark job input size metric, but it behaves very strangely: it shows what look like random values during job execution. By random I mean that the job input size increases and decreases unpredictably. I have given a better description of this behaviour here: https://stackoverflow.com/questions/51279115/running-spark-jobs-from-s3-produce-random-input-size-va...

 

 

It would be helpful if you have run into this kind of issue and want to share your experience :)

Thanks,

Tudor
