07-12-2018 06:51 AM
I'm using Cloudera 5.13 on Amazon (EC2 and S3) with Spark 2.3.0.
I'd like to know whether this combination of tools applies predicate pushdown when processing Parquet files.
For example, if my job selects only a few columns from a Parquet file, does it transfer only those columns from S3, or the entire Parquet file?
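A rough way to check what Spark plans to do is to look at the physical plan: for a Parquet scan, `ReadSchema` lists the columns Spark intends to read and `PushedFilters` lists the filters handed to the Parquet reader. Below is a minimal PySpark sketch of that check. The helper names (`explain_text`, `scan_summary`, `check`) are made up for illustration, any path and column names passed in are placeholders, and the regular expressions assume the `PushedFilters: [...], ReadSchema: struct<...>` layout that Spark 2.x prints for a file scan. The plan only shows what Spark intends to read; it does not prove how many bytes were actually fetched from S3.

```python
import contextlib
import io
import re


def explain_text(df):
    """Capture what df.explain() prints (PySpark prints the plan to stdout)."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        df.explain()
    return buf.getvalue()


def scan_summary(plan_text):
    """Extract the PushedFilters and ReadSchema entries from physical-plan text.

    Assumption: the scan node is printed as
    "..., PushedFilters: [...], ReadSchema: struct<...>".
    A missing entry is reported as None.
    """
    pushed = re.search(r"PushedFilters: \[(.*?)\], ReadSchema", plan_text)
    schema = re.search(r"ReadSchema: (\S+)", plan_text)
    return {
        "pushed_filters": pushed.group(1) if pushed else None,
        "read_schema": schema.group(1) if schema else None,
    }


def check(spark, path, columns, condition):
    """Build a select + filter over a Parquet path and summarize its scan node.

    `spark` is an existing SparkSession; `path`, `columns` and `condition`
    are supplied by the caller (placeholders, not values from this post).
    """
    df = spark.read.parquet(path).select(*columns).filter(condition)
    return scan_summary(explain_text(df))
```

If `read_schema` contains only the selected columns, column pruning is at least planned; an empty `pushed_filters` would suggest the filter is not being pushed down to the Parquet reader.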
I could look at the Spark job input size metric, but it behaves very strangely: it shows seemingly random values during job execution. By random I mean that the job input size increases and decreases randomly. I have a better description of this behavior here: https://stackoverflow.com/questions/51279115/running-spark-jobs-from-s3-produce-random-input-size-va...
It would be helpful if anyone who has run into this kind of issue could share their experience :)