Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

Hive 3.1 ignores bucketing on INSERT (most of the time)

Highlighted

Hive 3.1 ignores bucketing on INSERT (most of the time)

New Contributor

I have Hive 3.1.0 installed on Centos 7 from HDP 3.1.0 RPM packages (http://public-repo-1.hortonworks.com/HDP/centos7/3.x/updates/3.1.0.0/hdp.repo).


# hive --versionHive 3.1.0.3.1.0.0-78
Git git://ctr-e138-1518143905142-586755-01-000015.hwx.site/grid/0/jenkins/workspace/HDP-parallel-centos7/SOURCES/hive -r 56673b027117d8cb3400675b1680a4d992360808
Compiled by jenkins on Thu Dec 6 12:27:21 UTC 2018
From source with checksum 97cc61f6acbe68b1fa988aa9f76b34cc


I created a bucketed table:

beeline> create table bucketed_table(a bigint) clustered by (a) into 32 buckets stored as orc;

beeline> show create table bucketed_table;
+----------------------------------------------------+
|                   createtab_stmt                   |
+----------------------------------------------------+
| CREATE TABLE `bucketed_table`(                     |
|   `a` bigint)                                      |
| CLUSTERED BY (                                     |
|   a)                                               |
| INTO 32 BUCKETS                                    |
| ROW FORMAT SERDE                                   |
|   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'      |
| STORED AS INPUTFORMAT                              |
|   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'  |
| OUTPUTFORMAT                                       |
|   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' |
| LOCATION                                           |
|   'hdfs://hadoop-master:9000/user/hive/warehouse/bucketed_table' |
| TBLPROPERTIES (                                    |
|   'bucketing_version'='2',                         |
|   'transient_lastDdlTime'='1565118381')            |
+----------------------------------------------------

Now, insert data:

beeline> insert into bucketed_table values (7), (13), (19), (1), (34324);


However, the data isn't bucketed:

# hdfs dfs -ls -R /user/hive/warehouse/bucketed_table/
-rw-r--r--   3 hive supergroup        213 2019-08-07 01:28 /user/hive/warehouse/bucketed_table/000000_0


What's surprising, the behavior seems to be determined by some unknown factor, because if I try this in a loop, I eventually get a bucketed table (usually within 3-10 attempts):

while test $(hdfs dfs -ls -R /user/hive/warehouse/bucketed_table/ | wc -l) -le 1; do
    beeline -n hive -e "insert overwrite table bucketed_table values (7), (13), (19), (1), (34324);"
done


# hdfs dfs -ls -R /user/hive/warehouse/bucketed_table/
-rw-r--r--   3 hive supergroup        187 2019-08-07 01:32 /user/hive/warehouse/bucketed_table/000023_0
-rw-r--r--   3 hive supergroup        204 2019-08-07 01:32 /user/hive/warehouse/bucketed_table/000024_0
-rw-r--r--   3 hive supergroup        187 2019-08-07 01:32 /user/hive/warehouse/bucketed_table/000027_0
-rw-r--r--   3 hive supergroup        187 2019-08-07 01:32 /user/hive/warehouse/bucketed_table/000028_0



How to fix it so that data is bucketed always?