
Pushdown predicates using Amazon, Cloudera, Spark and Parquet




Hi,

 

I'm using Cloudera 5.13 on Amazon (EC2 and S3) with Spark 2.3.0.

 

I want to ask whether this combination of tools applies predicate pushdown when processing Parquet files.

For example, if my job selects only a few columns from a Parquet file, does it transfer only those columns from S3, or the entire Parquet file?
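
To make the question concrete, here is a minimal sketch of the kind of job I mean (the bucket, path and column names are made up). I can see ReadSchema and PushedFilters in the plan that explain() prints for the Parquet scan, but I'm not sure whether that also tells me what actually gets transferred from S3:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parquet-pushdown-check")
  .config("spark.sql.parquet.filterPushdown", "true") // this is already the default
  .getOrCreate()

// Made-up S3 path and column names, just for illustration
val df = spark.read.parquet("s3a://my-bucket/events/")
  .select("user_id", "event_time")
  .filter("event_time >= '2018-07-01'")

// The physical plan for the Parquet scan lists ReadSchema (the pruned
// columns that will actually be read) and PushedFilters (the predicates
// handed down to the Parquet reader)
df.explain(true)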

 

I could look at the Spark job input size metric, but it behaves very strangely: it shows what look like random values during job execution. By random I mean that the job input size increases and decreases randomly. I gave a better description of this behavior here: https://stackoverflow.com/questions/51279115/running-spark-jobs-from-s3-produce-random-input-size-va...

 

 

It would be helpful if you have run into this kind of issue and want to share your experience :)

Thanks,

Tudor
