Spark SQL not pruning directories


Hi,

On Spark 1.6 I am trying to read a Parquet partitioned table created by Impala into a DataFrame. Running the code in spark-shell with these options:

 

--conf spark.executor.memory=4G
--conf spark.executor.cores=4
--conf spark.sql.hive.metastorePartitionPruning=true
--conf spark.sql.parquet.filterPushdown=true

 

does not prune old partitions.
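As a sanity check, the flags can be confirmed from inside spark-shell (same property names as the --conf options above), e.g.:

// Sanity check in spark-shell: confirm the settings reached the SQLContext
sqlContext.getConf("spark.sql.hive.metastorePartitionPruning")   // expect "true"
sqlContext.getConf("spark.sql.parquet.filterPushdown")           // expect "true"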

 

The Spark code:

 

val q= "select day_id, co_id, dir_id from table where day_id >= 20161001 and day_id < 20161101"
val df = sqlContext.sql( q )
df.cache
df.count 

Spark reads ALL the partitions; in the Spark UI Storage section I see the RDD with all the partitions listed:

 

RDD Name (from the Storage tab; the Storage Level / Cached Partitions / Fraction Cached / Size columns are omitted here):

Project [co_id#15,dir_id#16]
+- TungstenExchange hashpartitioning(co_id#15,200), None
   +- Filter isnotnull(co_id#15)
      +- Scan ParquetRelation: default.table[co_id#15,day_id#3,dir_id#16] InputPaths:
         hdfs://nameservice1/warehouse/table/day_id=20070403,
         hdfs://nameservice1/warehouse/table/day_id=20070404,
         hdfs://nameservice1/warehouse/table/day_id=20070405,
         ....
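The plan (with the full InputPaths list) can also be checked without caching, by printing it directly in spark-shell, e.g.:

// Print the logical and physical plans for the same DataFrame; if pruning worked,
// the Scan ParquetRelation node should only list the day_id >= 20161001 directories
df.explain(true)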

 

 

 

I also tried:

val q= "select day_id, co_id, dir_id from table where day_id >= 20161001 and day_id < 20161101"
val df = sqlContext.sql( q ).filter( 'day_id > 20161001 )
df.cache
df.count

But that did not prune the partitions either.
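For completeness, a sketch of the same filter applied while reading the Parquet directories directly rather than through the Hive metastore (base path taken from the InputPaths above; just a sketch, values as in the query):

// Sketch: read the partitioned directory tree directly, bypassing the Hive metastore;
// day_id is discovered as a partition column from the day_id=YYYYMMDD directory names
import sqlContext.implicits._   // already in scope automatically in spark-shell
val dfDirect = sqlContext.read.parquet("hdfs://nameservice1/warehouse/table")
  .filter('day_id >= 20161001 && 'day_id < 20161101)
dfDirect.count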

Any ideas?

Thanks

 

 
