Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

Hive bucketing is not working as expected in case of multiple inserts.

avatar
Contributor

Ideally the number of files should be equal to number of buckets declared in a table DDL. It is working fine whenever an initial insert or every insert overwrite is performed. But, insert into hive bucketed table is creating extra files from next insert on wards.

ex: # of Buckets = 4 No. of files after Initial insert --> 4

No. of files after 2nd insert --> 8

No. of files after 3rd insert --> 12

No. of files after n insert --> n* # of Buckets.

First insert :

-rwxrwxrwx 3 hvallur hdfs 49 2016-08-25 12:42 hdfs://dshdp-dev-cluster/apps/hive/warehouse/upsert_testing.db/test3/lname=vr/000000_0

-rwxrwxrwx 3 hvallur hdfs 49 2016-08-25 12:42 hdfs://dshdp-dev-cluster/apps/hive/warehouse/upsert_testing.db/test3/lname=vr/000001_0

-rwxrwxrwx 3 hvallur hdfs 308 2016-08-25 12:42 hdfs://dshdp-dev-cluster/apps/hive/warehouse/upsert_testing.db/test3/lname=vr/000002_0

-rwxrwxrwx 3 hvallur hdfs 49 2016-08-25 12:42 hdfs://dshdp-dev-cluster/apps/hive/warehouse/upsert_testing.db/test3/lname=vr/000003_0

2nd Insert: -rwxrwxrwx 3 hvallur hdfs 49 2016-08-25 12:42 hdfs://dshdp-dev-cluster/apps/hive/warehouse/upsert_testing.db/test3/lname=vr/000000_0

-rwxrwxrwx 3 hvallur hdfs 49 2016-08-25 12:47 hdfs://dshdp-dev-cluster/apps/hive/warehouse/upsert_testing.db/test3/lname=vr/000000_0_copy_1

-rwxrwxrwx 3 hvallur hdfs 49 2016-08-25 12:42 hdfs://dshdp-dev-cluster/apps/hive/warehouse/upsert_testing.db/test3/lname=vr/000001_0

-rwxrwxrwx 3 hvallur hdfs 49 2016-08-25 12:47 hdfs://dshdp-dev-cluster/apps/hive/warehouse/upsert_testing.db/test3/lname=vr/000001_0_copy_1

-rwxrwxrwx 3 hvallur hdfs 308 2016-08-25 12:42 hdfs://dshdp-dev-cluster/apps/hive/warehouse/upsert_testing.db/test3/lname=vr/000002_0

-rwxrwxrwx 3 hvallur hdfs 302 2016-08-25 12:47 hdfs://dshdp-dev-cluster/apps/hive/warehouse/upsert_testing.db/test3/lname=vr/000002_0_copy_1

-rwxrwxrwx 3 hvallur hdfs 49 2016-08-25 12:42 hdfs://dshdp-dev-cluster/apps/hive/warehouse/upsert_testing.db/test3/lname=vr/000003_0

-rwxrwxrwx 3 hvallur hdfs 49 2016-08-25 12:47 hdfs://dshdp-dev-cluster/apps/hive/warehouse/upsert_testing.db/test3/lname=vr/000003_0_copy_1

Does any one know how to resolve this issue ?

I thought of writing a program to consolidate all the copy files to related bucket file (i.e before _copy).

But, I doubt it might effect Hive meta store mapping because Hive stores the file name related to each bucket#. below query gives the details.

SELECT input__file__name,block__offset__inside__file,id from test2 where id%4==3 order by block__offset__inside__file; (Trying to display all the files related to 3rd bucket)

hdfs://dshdp-dev-cluster/apps/hive/warehouse/upsert_testing.db/test3/lname=v/000003_0_copy_1 163 3

hdfs://dshdp-dev-cluster/apps/hive/warehouse/upsert_testing.db/test3/lname=v/000003_0 166 3

Thank you!

1 ACCEPTED SOLUTION

avatar

It sounds like things are working as expected. Please consider a few things that may not be clear to you with regards to the number of underlying files.

First, when you do a subsequent insert (or load) into a (non bucketed) table with existing data you will NOT merge the contents together into a single file. You can test this out by loading the same simple file of 10 or so rows multiple times. You'll see that on the 2nd and 3rd insert/load you will then have an identical 2nd and then 3rd file in the underlying hive table's hdfs folder.

Second, for a new bucketed table that you add data to there is not really a guarantee that you will the number of files aligned to the number of buckets. With bucket hashing occurring on the clustered by field it is possible to have less files if the data doesn't align well. To see that in practice, create a table with 32 buckets and load a file with only 10 records into it. At most, you'll have 10 files (again, possibly fewer). Additionally, if the amount of data being added ends up having more data for a particular bucket that causes it to exceed the file block size, you'll actually get more than one file for that bucket.

So... what is happening on subsequent inserts/loads is that you are just creating new files that align to how the new data is bucketed and they sit alongside the additional files that are already there. At this time, Hive can still benefit from bucketing by lining up more than one file per bucket to the joining table's bucketed data (yes, it may have multiple files, too, for that same bucket).

If you want to get as few files as possible (just one for each bucket if all of a particular bucket's data fits within the block size) then you're right; you'll need to load the contents of this table into another table -- possibly using an ITAS or CTAS strategy.

View solution in original post

1 REPLY 1

avatar

It sounds like things are working as expected. Please consider a few things that may not be clear to you with regards to the number of underlying files.

First, when you do a subsequent insert (or load) into a (non bucketed) table with existing data you will NOT merge the contents together into a single file. You can test this out by loading the same simple file of 10 or so rows multiple times. You'll see that on the 2nd and 3rd insert/load you will then have an identical 2nd and then 3rd file in the underlying hive table's hdfs folder.

Second, for a new bucketed table that you add data to there is not really a guarantee that you will the number of files aligned to the number of buckets. With bucket hashing occurring on the clustered by field it is possible to have less files if the data doesn't align well. To see that in practice, create a table with 32 buckets and load a file with only 10 records into it. At most, you'll have 10 files (again, possibly fewer). Additionally, if the amount of data being added ends up having more data for a particular bucket that causes it to exceed the file block size, you'll actually get more than one file for that bucket.

So... what is happening on subsequent inserts/loads is that you are just creating new files that align to how the new data is bucketed and they sit alongside the additional files that are already there. At this time, Hive can still benefit from bucketing by lining up more than one file per bucket to the joining table's bucketed data (yes, it may have multiple files, too, for that same bucket).

If you want to get as few files as possible (just one for each bucket if all of a particular bucket's data fits within the block size) then you're right; you'll need to load the contents of this table into another table -- possibly using an ITAS or CTAS strategy.