Created on 10-29-2016 06:04 AM - edited 10-29-2016 06:15 AM
Hi,
on Spark 1.6 I am trying to read a partitioned Parquet table created by Impala into a DataFrame. Running the code in spark-shell with these options:
--conf spark.executor.memory=4G
--conf spark.executor.cores=4
--conf spark.sql.hive.metastorePartitionPruning=true
--conf spark.sql.parquet.filterPushdown=true
does not prune the old partitions.
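(For completeness, the two SQL settings can also be applied from inside the shell at runtime; the executor memory/cores flags cannot. A minimal sketch, assuming the Hive-enabled sqlContext that spark-shell creates:)
// Sketch: set the same SQL options from inside spark-shell
sqlContext.setConf("spark.sql.hive.metastorePartitionPruning", "true")
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")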
The spark code:
val q = "select day_id, co_id, dir_id from table where day_id >= 20161001 and day_id < 20161101"
val df = sqlContext.sql(q)
df.cache
df.count
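(As a side check, the physical plan can be printed to see whether the day_id predicate reaches the Parquet scan; a minimal sketch:)
// Sketch: print parsed/analyzed/optimized/physical plans for the query
df.explain(true)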
Spark reads ALL the partitions; in the Spark UI, under the Storage section, I see the RDD with all partitions:
RDD Name (Storage tab; other columns: Storage Level, Cached Partitions, Fraction Cached, Size in Memory, Size in ExternalBlockStore, Size on Disk):
Project [co_id#15,dir_id#16]
: +- TungstenExchange hashpartitioning(co_id#15,200), None
:    +- Filter isnotnull(co_id#15)
:       +- Scan ParquetRelation: default.table[co_id#15,day_id#3,dir_id#16]
         InputPaths: hdfs://nameservice1/warehouse/table/day_id=20070403,
                     hdfs://nameservice1/warehouse/table/day_id=20070404,
                     hdfs://nameservice1/warehouse/table/day_id=20070405, ...
I also tried:
val q = "select day_id, co_id, dir_id from table where day_id >= 20161001 and day_id < 20161101"
val df = sqlContext.sql(q).filter('day_id > 20161001)
df.cache
df.count
But that did not prune the partitions either.
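(For comparison, the same filter can be expressed entirely through the DataFrame API before any action is run; a sketch, assuming the same table and column names:)
// Sketch: filter on the partition column via the DataFrame API
import org.apache.spark.sql.functions.col
val df2 = sqlContext.table("table")
  .where(col("day_id") >= 20161001 && col("day_id") < 20161101)
  .select("day_id", "co_id", "dir_id")
df2.count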
Any ideas?
Thanks