Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

bucket or partitiones in hive

bucket or partitiones in hive

Super Collaborator

Hi:

For year, month and day, do you recomend me buckets or partitions???

For example I have entity, year , month , day, and date, so, what is better????

Thanks

9 REPLIES 9

Re: bucket or partitiones in hive

Super Collaborator

Hi:

Also is recommende to create partitions like this??

/log/2014-03-01/
/log/2014-03-02/
/log/2014-03-03/
/log/2014-03-04/
/log/2014-03-05/
Highlighted

Re: bucket or partitiones in hive

Expert Contributor

Could you please remove this answer and instead make it part of your question. You can still make changes to your questions. It would also benefit the quality of your question if you remove things like '????'

Re: bucket or partitiones in hive

It's better to create partition like 20140301 then query your data on the basis of day example. where coldate > 20140301 and < 20140310

Re: bucket or partitiones in hive

Expert Contributor

Partitioning is useful for chronological columns that don’t have a very high number of possible values, as you don’t want to end up with millions of partitions. So Partitions are something that you typically find in a WHERE or GROUP filter to optimize the amount of data be read by Hive. Time is a good candidate for that.

Partitions like /log/yyyy-mm-dd or /log/yyyy/mm/dd are very commonly used. Some prefer to create the partition value as STRING type to also support LIKE operations.

Bucketing is most useful for fact tables that are “most often” joined together on the same key. For example: joins by a patient-ID or customer-ID.

Tip: with partitioning and bucketing together, you can easily end up with very many small files and overwhelm result in bad performance.
Tip: make sure the number of partitions is not too large (tousands, millions, ..), or you’ll suffer in performance.

Re: bucket or partitiones in hive

Here’s some info about Bucketing specifically (since it's less understood)…

Partitioning is rather universal - you will use it all the time.

Bucketing fulfills a very specific need - think of it as a last resort…

If you're joining a bigtable to a bigtable often - use bucketing.

So use the example of a 1 billion row table joined to a 1 billion row table. This is a nasty join. If you bucket the tables (with 30 buckets) then your join suddenly becomes 30 smaller 33 million * 33 million row joins. Much more negotiable, and it reduces the data explosion you’d get with 2 much larger tables being joined together. Only corresponding buckets of data are processed together for the join.

But you have to have the following:

  • the same key bucketed in both tables
  • both should be very large tables
  • you must define the same # of buckets on each table
  • and the bucketing key should be rather smoothly distributed (not skewed).

So bucketing can be very useful, but you need to prove it out for your specific circumstance. It is never a “default” option for table design.

Re: bucket or partitiones in hive

Super Collaborator

hi:

If i have this query , i have buckets with canal and partition with codnrbeenf and index with fechaoprcnf is correct??

 SELECT canal, frecuencia as freq FROM canal_partitioned_v2 WHERE codnrbeenf in ("123,543,765") and (fechaoprcnf >= '2016-02-01' and fechaoprcnf <= '2016-02-07')

Re: bucket or partitiones in hive

Expert Contributor

If you have a question, ask a question. This is for answers of your original question. Was your question answered?

Re: bucket or partitiones in hive

Super Collaborator

Thanks, but what happend if in the where clausure there are more things like (entidad=9998 and oficina=0001)???? also I need to create a partitions????

Thanks

Re: bucket or partitiones in hive

Yes, it is possible to have multi column partition. So based on your requirement you can define "(entidad=9998 and oficina=0001)" as a partition, If you think that will optimize your queries. It completely depends on how frequently and extensively you want to query that partition :)

Don't have an account?
Coming from Hortonworks? Activate your account here