For year, month and day, do you recomend me buckets or partitions???
For example I have entity, year , month , day, and date, so, what is better????
Also is recommende to create partitions like this??
/log/2014-03-01/ /log/2014-03-02/ /log/2014-03-03/ /log/2014-03-04/ /log/2014-03-05/
Could you please remove this answer and instead make it part of your question. You can still make changes to your questions. It would also benefit the quality of your question if you remove things like '????'
Partitioning is useful for chronological columns that don’t have a very high number of possible values, as you don’t want to end up with millions of partitions. So Partitions are something that you typically find in a WHERE or GROUP filter to optimize the amount of data be read by Hive. Time is a good candidate for that.
Partitions like /log/yyyy-mm-dd or /log/yyyy/mm/dd are very commonly used. Some prefer to create the partition value as STRING type to also support LIKE operations.
Bucketing is most useful for fact tables that are “most often” joined together on the same key. For example: joins by a patient-ID or customer-ID.
Tip: with partitioning and bucketing together, you can easily end up with very many small files and overwhelm result in bad performance.
Tip: make sure the number of partitions is not too large (tousands, millions, ..), or you’ll suffer in performance.
Here’s some info about Bucketing specifically (since it's less understood)…
Partitioning is rather universal - you will use it all the time.
Bucketing fulfills a very specific need - think of it as a last resort…
If you're joining a bigtable to a bigtable often - use bucketing.
So use the example of a 1 billion row table joined to a 1 billion row table. This is a nasty join. If you bucket the tables (with 30 buckets) then your join suddenly becomes 30 smaller 33 million * 33 million row joins. Much more negotiable, and it reduces the data explosion you’d get with 2 much larger tables being joined together. Only corresponding buckets of data are processed together for the join.
But you have to have the following:
So bucketing can be very useful, but you need to prove it out for your specific circumstance. It is never a “default” option for table design.
If i have this query , i have buckets with canal and partition with codnrbeenf and index with fechaoprcnf is correct??
SELECT canal, frecuencia as freq FROM canal_partitioned_v2 WHERE codnrbeenf in ("123,543,765") and (fechaoprcnf >= '2016-02-01' and fechaoprcnf <= '2016-02-07')
Thanks, but what happend if in the where clausure there are more things like (entidad=9998 and oficina=0001)???? also I need to create a partitions????
Yes, it is possible to have multi column partition. So based on your requirement you can define "(entidad=9998 and oficina=0001)" as a partition, If you think that will optimize your queries. It completely depends on how frequently and extensively you want to query that partition :)