Two years later, it appears Impala still does not support bucketing? I could really use it for some performance improvements on a join across three multi-billion-row tables.
There is now a TABLESAMPLE clause in CDH 5.12. It uses DBMS-like syntax rather than the Hive bucketing notion. The syntax looks like this:
select c1, c2 from t1 tablesample system([percentage 0-100]) [repeatable(integer)] <rest of query>
The TABLESAMPLE clause can go immediately after any table name in the query, although not after a view name or a subquery.
The percentage is measured against the volume of data in the table, in bytes, not against the row count. Impala brings in files in arbitrary order until it reaches the specified percentage.
If the query involves partition pruning, the percentage applies to the data volume of just the relevant partitions. For example:
-- Only sample 10% of the data for year 2017, if the table is partitioned by year
select c1, c2 from t1 tablesample system(10) where year = 2017;
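The optional REPEATABLE clause from the syntax above is not shown in that example, so here is a minimal sketch of it. It reuses the same table t1 and year partition column; the seed value 1234 is an arbitrary choice for illustration. With a fixed seed, Impala should pick the same set of files on each run, as long as the underlying data files have not changed.

```sql
-- Sample roughly 10% of the 2017 data by byte volume.
-- The seed (1234, arbitrary) keeps the chosen files stable across runs.
select count(*) from t1 tablesample system(10) repeatable(1234) where year = 2017;

-- Without REPEATABLE, each run may read a different set of files,
-- so the count can vary from one run to the next.
select count(*) from t1 tablesample system(10) where year = 2017;
```

Because sampling works on whole files, the number of rows returned is only approximately 10% of the total, so treat results as estimates rather than exact figures.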