Support Questions
Find answers, ask questions, and share your expertise

Hive TABLESAMPLE on clustered table

Explorer

Hello.

I want to ask the correct bucketing and tablesample way.

There is a table X which I created by

CREATE TABLE `X`(`action_id` string,`classifier` string)
CLUSTERED BY (action_id,classifier) INTO 256 BUCKETS
STORED AS ORC

Then I inserted 500M of rows into X by

set hive.enforce.bucketing=true;
INSERT OVERWRITE INTO X SELECT * FROM X_RAW

Then I want to count or search some rows with condition. roughly,

SELECT COUNT(*) FROM X WHERE action_id='aaa' AND classifier='bbb'

But I'd better to USE tablesample as I clustered X (action_id, classifier). So, the better query will be

SELECT COUNT(*) FROM X 
TABLESAMPLE(BUCKET 1 OUT OF 256 ON  action_id, classifier)
WHERE action_id='aaa' AND classifier='bbb'

Is there any wrong above? But I can't not find any performance gain between these two query.

query1 and RESULT( with no tablesample.)

SELECT COUNT(*)) from X 
WHERE action_id='aaa' and classifier='bbb'

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED    256        256        0        0       0       0
Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 15.35 s    
--------------------------------------------------------------------------------
It scans full data.

query 2 and RESULT

SELECT COUNT(*)) from X 
TABLESAMPLE(BUCKET 1 OUT OF 256 ON  action_id, classifier)
WHERE action_id='aaa' and classifier='bbb'

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED    256        256        0        0       0       0
Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 15.82     s    
--------------------------------------------------------------------------------
It ALSO scans full data.

query 2 RESULT WHAT I EXPECTED.

Result what I expected is something like...
(use 1 map and relatively faster than without tabmesample)
--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      1          1        0        0       0       0
Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 3.xx     s    
--------------------------------------------------------------------------------

Values of action_id and classifier are well distributed and there is no skewed data.

So I want to ask you what will be a correct query to target Only 1 Bucket and Use 1 map??

Ambari 2 and Hive 1.2 Using.

1 REPLY 1

Re: Hive TABLESAMPLE on clustered table

So, I'm not 100% sure, but I'm doubtful that bucketing is going to help you eliminate doing a scan of all the data (i.e. 256 mappers since you created 256 files with the # of buckets you declared). I don't know the unique number of combinations you have of action_id and classifier, but if not "insane" (yes, a technical term 😉 then maybe you could try partitioning so that when you add your WHERE clause with these two columns you will only read the contents of the folders that get created and get the 1 mapper you are looking for. Good luck!