Created on 06-12-2017 10:18 AM - edited 09-16-2022 04:44 AM
I partition by year/month/day. Is there a difference if I query/aggregate using those partition columns or the timestamp column (the one used for partitioning)?
Thanks
Shannon
Created 06-12-2017 10:49 AM
I'm inclined to say that yes there will be a difference. One or two example queries to show the alternatives would be helpful for me to give you a more accurate response.
Created 06-12-2017 11:58 AM
I was asking in general whether there is any difference, and which one you would recommend.
For me, one query would be an aggregate by year/month/day.
Thanks
Shannon
Created 06-13-2017 01:03 PM
Created 06-13-2017 02:39 PM
Thanks.
Sorry, I was not clear when I said difference; I meant, is there any performance difference?
Created 06-14-2017 06:04 PM
Yes, very likely there will be a performance difference, but it's hard to say which one will be better without concrete examples.
Created 06-19-2017 07:52 AM
Thanks.
I have a related question: how does HDFS/Impala know that one of the fields/columns is used as the partition?
Shannon
Created 06-19-2017 09:20 AM
Created 06-19-2017 09:22 AM
HDFS does not know about partitions. That information is stored in the Hive Metastore along with the rest of the table metadata.
A partition of an Impala/Hive table points to a directory in HDFS. The values of partition columns are not stored in the data files; they are "stored" in the HDFS directory structure, e.g.
hdfs://warehouse/mytable/year=2017/month=6
might be a directory of a partitioned table "mytable" with partition columns year and month.
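To make the directory convention concrete, here is a rough Python sketch of how the partition values can be read back out of such a path. The helper name partition_values is made up for illustration; this is not how Impala or the Hive Metastore do it internally, only a demonstration that the key=value directory names carry the partition column values.

```python
def partition_values(path):
    """Collect key=value directory names from a Hive-style partition path.

    Hypothetical helper for illustration only; Impala/Hive resolve
    partitions through the Metastore, not by parsing paths like this.
    """
    values = {}
    for segment in path.split("/"):
        # Only directory names of the form key=value describe a partition column.
        key, sep, value = segment.partition("=")
        if sep and key:
            values[key] = value
    return values


print(partition_values("hdfs://warehouse/mytable/year=2017/month=6"))
# prints {'year': '2017', 'month': '6'}
```

Note that the values come back as strings here; the column types (e.g. INT for year and month) are part of the table metadata, not the path.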
Created 06-20-2017 07:24 AM
Thanks