Support Questions

Find answers, ask questions, and share your expertise

query on partition question

avatar
Contributor

I partition with year/month/day, is therre a difference if i query/aggregate using that or the timestamp (used for partition) column?

 

Thanks

Shannon

1 ACCEPTED SOLUTION

avatar
Champion
The table definition defines the virtual/partition column and in HDFS it is created as directories and subdirectories. So it checks the table definition and then searches for a directory under the table directory that matches the partition column name, and then prunes by the value.

View solution in original post

9 REPLIES 9

avatar

I'm inclined to say that yes there will be a difference. One or two example queries to show the alternatives would be helpful for me to give you a more accurate response.

 

avatar
Contributor

I was asking just in general if there is any difference, and which one you would recommand.

 

For me, one query wouold be aggregate by year/month/day.

 

Thanks

Shannon 

avatar
Champion
TL:DR The results will be identical if used in the same manner but the runtime and resource requirements will be different.

If I understand the question correctly you are asking this:

If there is a timestamp column that you use to create the partition columns, is there a difference in querying on each.

This goes back to partition columns being virtual columns. If you set a partitions column based on an actual column and just change the name, then the physical column (timestamp) remains and the virtual columns (YMD) exist in the form of the directory structure in HDFS. When you query on the partitions columns it will perform partition pruning, on the other side it will no. But in effect the results will be the same for the aggregation. This is the same if you partition by subsets, i.e. year/month/day.

avatar
Contributor

Thanks.

 

Sorry i was not clear when i said diffference, i meant is there any performance difference?

avatar

Yes, very likely there will be a performance difference, but it's hard to say which one will be better without concrete examples.

avatar
Contributor

Thanks.

 

I have a related question, how does hdfs/impala know that one of the fields/columns is used as the partition?

 

Shannon

avatar
Champion
The table definition defines the virtual/partition column and in HDFS it is created as directories and subdirectories. So it checks the table definition and then searches for a directory under the table directory that matches the partition column name, and then prunes by the value.

avatar

Hdfs does not know about partitions. That information is stored in the Hive Metastore as part of the other table metadata.

 

A partition of a Impala/Hive table points to a directory in Hdfs. The values of partition columns are not stored in data files, they are "stored" in the Hdfs directory structure, e.g.

 

hdfs://warehouse/mytable/year=2017/month=6

 

might be a directory of a partitioned table "mytable" with partition columns year and month.

avatar
Contributor

Thanks