Hive bucketed table producing less files than expected

New Contributor

I am having trouble creating a bucketed table and getting Hive (using Spark SQL in PySpark) to recognize that table as bucketed during a join. I have created a simplified table of 10 records whose bucketing key is of type integer. I insert the 10 records with values 1-10 for the key column and expect to see 10 files, but I see only 2 files.

Scass_0-1694036112588.png

Table with the values I will insert into the bucketed table above:

Scass_1-1694036199399.png

Scass_2-1694036272055.png

When I look at the number of files created, I see only 2. I was expecting to see 5 files if a mod is done on the integer value.

Scass_3-1694036595631.png
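The expectation described above can be sketched in plain Python. This is only a model of the "mod on an integer value" reasoning, not the hash that Hive or Spark actually applies: it assumes 5 buckets (inferred from the expected 5 files, since the table definition is only visible in the screenshot) and assumes the hash of an integer key is the key itself. The function names are hypothetical. Spark's own bucketing, for example, uses a Murmur3-based hash, so the real key-to-bucket mapping can differ.

```python
# Sketch of the expectation only: treats an integer key's hash as the key
# itself and takes it modulo the bucket count. Real engines may hash differently.
NUM_BUCKETS = 5  # assumption: inferred from the expected 5 files


def expected_bucket(key: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Mask the sign bit, then take the key modulo the bucket count."""
    return (key & 0x7FFFFFFF) % num_buckets


def group_by_bucket(keys, num_buckets: int = NUM_BUCKETS) -> dict:
    """Group keys by the bucket they would land in under plain modulo."""
    buckets = {}
    for key in keys:
        buckets.setdefault(expected_bucket(key, num_buckets), []).append(key)
    return buckets


buckets = group_by_bucket(range(1, 11))
for bucket_id in sorted(buckets):
    print(bucket_id, buckets[bucket_id])
```

Under these assumptions every one of the 5 buckets receives two keys, which is why 5 files (one per non-empty bucket) would be the natural expectation.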

In my real problem the key is a string, and I am running into memory issues. It appears Hive does not believe the files are bucketed and sorted, so it spills quite a bit of data during the join of the 2 bucketed tables. The number of underlying files is also much greater than the number of buckets.
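One possible explanation for seeing many more files than buckets, offered only as an assumption about how the tables were written: when a writer runs several parallel tasks and each task emits its own file for every bucket it holds rows for, the file count is bounded by tasks × buckets rather than by the bucket count alone. The plain-Python simulation below illustrates that arithmetic. `count_files` is a hypothetical helper, and `key % num_buckets` stands in for whatever hash the engine really uses.

```python
def count_files(rows_per_task, num_buckets):
    """Count output files if each task writes one file per bucket it sees.

    rows_per_task: list of lists of integer keys, one inner list per write task.
    key % num_buckets is a stand-in for the engine's real bucket hash.
    """
    files = set()
    for task_id, keys in enumerate(rows_per_task):
        for key in keys:
            files.add((task_id, key % num_buckets))
    return len(files)


# Keys 1-10 split arbitrarily across 2 tasks: every task sees every bucket.
unclustered = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]

# Same keys, but each task holds exactly one bucket's keys.
clustered = [[5, 10], [1, 6], [2, 7], [3, 8], [4, 9]]

print(count_files(unclustered, 5))
print(count_files(clustered, 5))
```

In this model the unclustered layout yields 10 files for 5 buckets, while grouping rows by bucket before the write yields exactly 5. Whether this applies to the tables in question depends on how they were actually created and populated.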

 

2 REPLIES

Community Manager

@Scass Welcome to the Cloudera Community!

To help you get the best possible solution, I have tagged our Hive experts @Shmoo and @JoseManuel, who may be able to assist you further.

Please keep us updated on your post, and we hope you find a satisfactory solution to your query.


Regards,

Diana Torres,
Community Moderator



New Contributor

Sorry, I meant to say above that I expected 5 files, or at least a multiple of 5, instead of 2.