
sparkSQL not pruning directories

Master Collaborator

Hi,

On Spark 1.6 I am trying to read a partitioned Parquet table, created by Impala, into a DataFrame. Running the code in spark-shell with these options:

 

--conf spark.executor.memory=4G
--conf spark.executor.cores=4
--conf spark.sql.hive.metastorePartitionPruning=true
--conf spark.sql.parquet.filterPushdown=true

does not prune old partitions.
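To rule out the flags simply not taking effect, they can also be checked and set from inside the shell (a minimal sketch, assuming the default sqlContext of a Spark 1.6 spark-shell):

// Set the pruning-related options at runtime
sqlContext.setConf("spark.sql.hive.metastorePartitionPruning", "true")
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")
// Verify the values actually in effect
println(sqlContext.getConf("spark.sql.hive.metastorePartitionPruning"))
println(sqlContext.getConf("spark.sql.parquet.filterPushdown"))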

 

The Spark code:

val q = "select day_id, co_id, dir_id from table where day_id >= 20161001 and day_id < 20161101"
val df = sqlContext.sql(q)
df.cache
df.count
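To see what the planner produced before caching, the plan can also be printed directly (a quick check with the same q as above; the exact plan text varies by build):

// Print the logical and physical plans, including the Parquet
// scan's InputPaths, without triggering a job
val planned = sqlContext.sql(q)
planned.explain(true)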

Spark reads ALL the partitions. In the Spark UI Storage section I see the RDD with all partitions (the RDD Name cell contains the full plan, listing every partition as an InputPath):

 

RDD Name	Storage Level	Cached Partitions	Fraction Cached	Size in Memory	Size in ExternalBlockStore	Size on Disk

Project [co_id#15,dir_id#16]
+- TungstenExchange hashpartitioning(co_id#15,200), None
   +- Filter isnotnull(co_id#15)
      +- Scan ParquetRelation: default.table[co_id#15,day_id#3,dir_id#16] InputPaths:
         hdfs://nameservice1/warehouse/table/day_id=20070403,
         hdfs://nameservice1/warehouse/table/day_id=20070404,
         hdfs://nameservice1/warehouse/table/day_id=20070405,
         ....

I also tried:

val q = "select day_id, co_id, dir_id from table where day_id >= 20161001 and day_id < 20161101"
val df = sqlContext.sql(q).filter('day_id > 20161001)
df.cache
df.count

But that did not prune the partitions either.
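A variant worth trying is expressing the same predicate purely through the DataFrame API against the metastore table (just a sketch, same table and columns as above):

// $-syntax comes from the implicits (auto-imported in spark-shell)
import sqlContext.implicits._
// Filter on the partition column before any action, so the
// predicate is visible to the planner for pruning
val df2 = sqlContext.table("table")
  .filter($"day_id" >= 20161001 && $"day_id" < 20161101)
  .select("day_id", "co_id", "dir_id")
df2.count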

Any ideas?

Thanks

 

 

4 REPLIES

Re: sparkSQL not pruning directories

New Contributor

I'm facing the same issue. Is there any update on this?


Re: sparkSQL not pruning directories

Master Collaborator
Nothing yet.

Re: sparkSQL not pruning directories

New Contributor

It seems the issue is due to a mismatch between the Hive and Parquet schemas:

http://spark.apache.org/docs/1.6.0/sql-programming-guide.html#hive-metastore-parquet-table-conversio...

 

Basically, "Hive is case insensitive, while Parquet is not"

 

In our case, just executing the select statement in lower case solved the issue.

But I'm not sure it applies in your case, because your statement is already in lower case :-S
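For anyone who wants to check for this mismatch, comparing the two schemas side by side is a quick test (a sketch; the partition path is one example directory taken from the listing above):

// Schema as the Hive metastore reports it (Hive lower-cases names)
sqlContext.table("table").printSchema()
// Schema as written inside the Parquet files (case-sensitive)
sqlContext.read.parquet("hdfs://nameservice1/warehouse/table/day_id=20070403").printSchema()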

Re: sparkSQL not pruning directories

Contributor

Spark SQL in Spark 1.6 prints a physical plan that lists all the available directories (InputPaths) rather than the actual directories it will scan while fetching the results for the query.

 

But while reading the data, it reads only the relevant partition files, and this can be seen from the "INFO" log messages. You can turn on "INFO"-level logging in spark-shell by running:

sc.setLogLevel("INFO")
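Putting that together (a minimal sketch; watch the INFO output for the partition files that are actually opened):

sc.setLogLevel("INFO")
// Re-run the query; the INFO messages show which files are read,
// regardless of the full InputPaths list printed in the plan
val df = sqlContext.sql("select day_id, co_id, dir_id from table " +
  "where day_id >= 20161001 and day_id < 20161101")
df.count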