- Subscribe to RSS Feed
- Mark Question as New
- Mark Question as Read
- Float this Question for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page
query on partition question
- Labels:
-
Apache Impala
Created on 06-12-2017 10:18 AM - edited 09-16-2022 04:44 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I partition with year/month/day, is therre a difference if i query/aggregate using that or the timestamp (used for partition) column?
Thanks
Shannon
Created 06-19-2017 09:20 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Created 06-12-2017 10:49 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I'm inclined to say that yes there will be a difference. One or two example queries to show the alternatives would be helpful for me to give you a more accurate response.
Created 06-12-2017 11:58 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I was asking just in general if there is any difference, and which one you would recommand.
For me, one query wouold be aggregate by year/month/day.
Thanks
Shannon
Created 06-13-2017 01:03 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
If I understand the question correctly you are asking this:
If there is a timestamp column that you use to create the partition columns, is there a difference in querying on each.
This goes back to partition columns being virtual columns. If you set a partitions column based on an actual column and just change the name, then the physical column (timestamp) remains and the virtual columns (YMD) exist in the form of the directory structure in HDFS. When you query on the partitions columns it will perform partition pruning, on the other side it will no. But in effect the results will be the same for the aggregation. This is the same if you partition by subsets, i.e. year/month/day.
Created 06-13-2017 02:39 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thanks.
Sorry i was not clear when i said diffference, i meant is there any performance difference?
Created 06-14-2017 06:04 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Yes, very likely there will be a performance difference, but it's hard to say which one will be better without concrete examples.
Created 06-19-2017 07:52 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thanks.
I have a related question, how does hdfs/impala know that one of the fields/columns is used as the partition?
Shannon
Created 06-19-2017 09:20 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Created 06-19-2017 09:22 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hdfs does not know about partitions. That information is stored in the Hive Metastore as part of the other table metadata.
A partition of a Impala/Hive table points to a directory in Hdfs. The values of partition columns are not stored in data files, they are "stored" in the Hdfs directory structure, e.g.
hdfs://warehouse/mytable/year=2017/month=6
might be a directory of a partitioned table "mytable" with partition columns year and month.
Created 06-20-2017 07:24 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thanks
