Created on 10-29-2016 06:04 AM - edited 10-29-2016 06:15 AM
Hi,
on Spark 1.6 I am trying to read a partitioned Parquet table created by Impala into a DataFrame. Running the code in spark-shell with these options:
--conf spark.executor.memory=4G
--conf spark.executor.cores=4
--conf spark.sql.hive.metastorePartitionPruning=true
--conf spark.sql.parquet.filterPushdown=true
does not prune the old partitions.
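(For completeness, the two SQL settings can also be applied from inside the shell at runtime; the executor memory/cores flags cannot. A minimal sketch, assuming the Hive-enabled sqlContext that spark-shell creates:)
// Sketch: set the same SQL options from inside spark-shell
sqlContext.setConf("spark.sql.hive.metastorePartitionPruning", "true")
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")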
The spark code:
val q = "select day_id, co_id, dir_id from table where day_id >= 20161001 and day_id < 20161101"
val df = sqlContext.sql(q)
df.cache
df.count
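(As a side check, the physical plan can be printed to see whether the day_id predicate reaches the Parquet scan; a minimal sketch:)
// Sketch: print parsed/analyzed/optimized/physical plans for the query
df.explain(true)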
Spark reads ALL the partitions; in the Spark UI, under the Storage section, I see the RDD with all partitions:
RDD Name (Storage tab; other columns: Storage Level, Cached Partitions, Fraction Cached, Size in Memory, Size in ExternalBlockStore, Size on Disk):
Project [co_id#15,dir_id#16]
: +- TungstenExchange hashpartitioning(co_id#15,200), None
:    +- Filter isnotnull(co_id#15)
:       +- Scan ParquetRelation: default.table[co_id#15,day_id#3,dir_id#16]
         InputPaths: hdfs://nameservice1/warehouse/table/day_id=20070403,
                     hdfs://nameservice1/warehouse/table/day_id=20070404,
                     hdfs://nameservice1/warehouse/table/day_id=20070405, ...
I also tried:
val q = "select day_id, co_id, dir_id from table where day_id >= 20161001 and day_id < 20161101"
val df = sqlContext.sql(q).filter('day_id > 20161001)
df.cache
df.count
But that did not prune the partitions either.
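(For comparison, the same filter can be expressed entirely through the DataFrame API before any action is run; a sketch, assuming the same table and column names:)
// Sketch: filter on the partition column via the DataFrame API
import org.apache.spark.sql.functions.col
val df2 = sqlContext.table("table")
  .where(col("day_id") >= 20161001 && col("day_id") < 20161101)
  .select("day_id", "co_id", "dir_id")
df2.count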
Any ideas?
Thanks