I have 2 tables.
One is bucketed, the other is not - other than that they are identical.
Both stored as orc and partitioned.
Querying the bucketed table on a given partition is much slower than querying its twin on the same partition.
Is it expected?
If not what can be the reasons for it?
If yes then should I conclude my motivation to hold a bucketed table is to improve joins/improve sampling/allow streaming, and in this case hold 2 copies of the table - one bucketed and the other not (sounds wasteful)?
Bucketing itself was introduced for performance optimization, so that only a portion of data is used especially during map join operations. Thus buckets create smaller chunks of data which can be kept in memory in form of hash tables and then can be used by joins.
What is the query you are executing ?
"select colA,count(*) from my_table where ds="2015-09-18" group by colA"
the "2015-09-18" is a partition on both tables so you would expect similar performance.