Hive bucketed table producing fewer files than expected

New Contributor

I am having trouble creating a bucketed table and getting Hive (using Spark SQL in PySpark) to recognize that table as bucketed during a join. I have created a simplified table of 10 records whose bucketing key is of type integer. I insert the 10 records with values 1-10 for the key column and expect to see 10 files, but I see only 2 files.

[Screenshot: Scass_0-1694036112588.png]

Table with the values I will insert into the bucketed table above:

[Screenshot: Scass_1-1694036199399.png]

[Screenshot: Scass_2-1694036272055.png]

When I look at the number of files created, I see only 2. I was expecting to see 5 files if a mod is done on the integer value.

[Screenshot: Scass_3-1694036595631.png]
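To make the "mod on an integer value" expectation concrete, here is a minimal plain-Python sketch of modulo-based bucket assignment. It assumes 5 buckets (inferred from the expected file count) and assumes the hash of an integer key is the integer itself, as in Hive's original bucketing scheme; Murmur3-based schemes (newer Hive bucketing versions, Spark's own bucketing) would place keys differently. The names `NUM_BUCKETS`, `bucket_for`, and `assign_buckets` are illustrative only.

```python
# Sketch of modulo-based bucket assignment for integer keys.
# Assumptions: 5 buckets, and hash(int key) == the key itself.
from collections import defaultdict

NUM_BUCKETS = 5  # assumed from the expected file count in the post


def bucket_for(key: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Return the bucket index as (hash & MAX_INT) % num_buckets."""
    return (key & 0x7FFFFFFF) % num_buckets


def assign_buckets(keys, num_buckets: int = NUM_BUCKETS):
    """Group keys by the bucket index they would fall into."""
    buckets = defaultdict(list)
    for key in keys:
        buckets[bucket_for(key, num_buckets)].append(key)
    return dict(buckets)


layout = assign_buckets(range(1, 11))
for bucket_id in sorted(layout):
    print(f"bucket {bucket_id}: keys {layout[bucket_id]}")
```

Under these assumptions every one of the 5 buckets receives two of the ten keys, so a writer that honors the bucketing would be expected to produce at least one file per bucket.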

In my real problem I have a string key and am running into memory issues. It appears Hive does not recognize the files as bucketed and sorted, so it spills quite a bit of data during the join of the 2 bucketed tables. The number of underlying files is also much greater than the number of buckets.
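One possible way a file count can exceed the bucket count is sketched below in plain Python. It assumes a writer in which every write task emits its own file for each bucket it holds rows for, which is how Spark's native bucketed writer is commonly described; this is an assumption for illustration, not a trace of the actual job, and it uses integer keys with a plain modulo rather than the string key from the real problem. The function `count_bucket_files` is hypothetical.

```python
# Sketch: count output files when each write task emits one file per bucket
# it holds rows for. Assumed behavior, not taken from the poster's job.

def count_bucket_files(task_partitions, num_buckets):
    """Return the number of distinct (task, bucket) pairs, i.e. files."""
    files = set()
    for task_id, keys in enumerate(task_partitions):
        for key in keys:
            files.add((task_id, key % num_buckets))
    return len(files)


keys = list(range(1, 11))

# Rows spread round-robin over 4 tasks, not aligned with the bucket key.
unaligned = [keys[i::4] for i in range(4)]

# Rows grouped so that each task holds exactly one bucket's keys.
aligned = [[k for k in keys if k % 5 == b] for b in range(5)]

print("unaligned tasks:", count_bucket_files(unaligned, 5), "files")
print("aligned tasks:  ", count_bucket_files(aligned, 5), "files")
```

In this toy setup the unaligned layout yields 10 files for 5 buckets, while grouping rows by bucket before writing yields exactly 5. Whether this matches the real tables depends on how they were written.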

 

2 REPLIES

Community Manager

@Scass Welcome to the Cloudera Community!

To help you get the best possible solution, I have tagged our Hive experts @Shmoo and @JoseManuel, who may be able to assist you further.

Please keep us updated on your post, and we hope you find a satisfactory solution to your query.


Regards,

Diana Torres,
Community Moderator



New Contributor

Sorry, I meant to say 5 files above, or at least a multiple of 5, instead of 2.