Spark SQL not pruning directories


Hi,

On Spark 1.6 I am trying to read a Parquet-partitioned table created by Impala into a DataFrame. Running the code in spark-shell with these options:

--conf spark.executor.memory=4G
--conf spark.executor.cores=4
--conf spark.sql.hive.metastorePartitionPruning=true
--conf spark.sql.parquet.filterPushdown=true

does not prune the old partitions.
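For completeness, the two SQL settings can also be checked from inside the shell; a minimal sketch that just prints the current session values:

// Confirm the pruning/pushdown settings are set in this spark-shell session
println(sqlContext.getConf("spark.sql.hive.metastorePartitionPruning"))
println(sqlContext.getConf("spark.sql.parquet.filterPushdown"))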

 

The Spark code:

 

// Query restricted to one month of day_id partitions (October 2016)
val q = "select day_id, co_id, dir_id from table where day_id >= 20161001 and day_id < 20161101"
val df = sqlContext.sql(q)
df.cache
df.count

Spark reads ALL the partitions; in the Spark UI, under the Storage section, I see the RDD with all partitions:

 

RDD Name column from the Storage table (Storage Level, Cached Partitions, Fraction Cached, Size in Memory, Size in ExternalBlockStore, and Size on Disk columns omitted):

Project [co_id#15,dir_id#16]
+- TungstenExchange hashpartitioning(co_id#15,200), None
   +- Filter isnotnull(co_id#15)
      +- Scan ParquetRelation: default.table[co_id#15,day_id#3,dir_id#16]
         InputPaths:
         hdfs://nameservice1/warehouse/table/day_id=20070403,
         hdfs://nameservice1/warehouse/table/day_id=20070404,
         hdfs://nameservice1/warehouse/table/day_id=20070405,
         ....
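The same partition list can also be checked directly in the shell, without going through the Storage tab; a minimal sketch using the query string q defined above:

// Re-create the DataFrame without caching and print its physical plan;
// the Scan ParquetRelation node lists the InputPaths (partition
// directories) that will actually be read.
sqlContext.sql(q).explain(true)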

 

 

 

I also tried:

val q = "select day_id, co_id, dir_id from table where day_id >= 20161001 and day_id < 20161101"
// Add an explicit DataFrame filter on the partition column on top of the SQL predicate
val df = sqlContext.sql(q).filter('day_id > 20161001)
df.cache
df.count

but that did not prune the partitions either.
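For what it's worth, the DataFrame-only variant I would also expect to prune is sketched below; table and column names are the same as above, and I have not verified that it behaves differently (it relies on the filter on the day_id partition column being applied before any cache):

// Sketch only: read the table through the DataFrame API and filter on the
// partition column before caching or counting.
val df2 = sqlContext.table("table")
  .filter('day_id >= 20161001 && 'day_id < 20161101)
  .select("day_id", "co_id", "dir_id")
df2.count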

Any ideas?

Thanks