Support Questions

ponypony · ‎06-12-2017

I partition with year/month/day, is therre a difference if i query/aggregate using that or the timestamp (used for partition) column?

Thanks

Shannon

mbigelow · ‎06-19-2017

The table definition defines the virtual/partition column and in HDFS it is created as directories and subdirectories. So it checks the table definition and then searches for a directory under the table directory that matches the partition column name, and then prunes by the value.

View solution in original post

alex.behm · ‎06-12-2017

I'm inclined to say that yes there will be a difference. One or two example queries to show the alternatives would be helpful for me to give you a more accurate response.

ponypony · ‎06-12-2017

I was asking just in general if there is any difference, and which one you would recommand.

For me, one query wouold be aggregate by year/month/day.

Thanks

Shannon

mbigelow · ‎06-13-2017

TL:DR The results will be identical if used in the same manner but the runtime and resource requirements will be different.

If I understand the question correctly you are asking this:

If there is a timestamp column that you use to create the partition columns, is there a difference in querying on each.

This goes back to partition columns being virtual columns. If you set a partitions column based on an actual column and just change the name, then the physical column (timestamp) remains and the virtual columns (YMD) exist in the form of the directory structure in HDFS. When you query on the partitions columns it will perform partition pruning, on the other side it will no. But in effect the results will be the same for the aggregation. This is the same if you partition by subsets, i.e. year/month/day.

ponypony · ‎06-13-2017

Thanks.

Sorry i was not clear when i said diffference, i meant is there any performance difference?

alex.behm · ‎06-14-2017

Yes, very likely there will be a performance difference, but it's hard to say which one will be better without concrete examples.

ponypony · ‎06-19-2017

Thanks.

I have a related question, how does hdfs/impala know that one of the fields/columns is used as the partition?

Shannon

mbigelow · ‎06-19-2017

The table definition defines the virtual/partition column and in HDFS it is created as directories and subdirectories. So it checks the table definition and then searches for a directory under the table directory that matches the partition column name, and then prunes by the value.

alex.behm · ‎06-19-2017

Hdfs does not know about partitions. That information is stored in the Hive Metastore as part of the other table metadata.

A partition of a Impala/Hive table points to a directory in Hdfs. The values of partition columns are not stored in data files, they are "stored" in the Hdfs directory structure, e.g.

hdfs://warehouse/mytable/year=2017/month=6

might be a directory of a partitioned table "mytable" with partition columns year and month.

ponypony · ‎06-20-2017

Thanks

Cloudera Community

Support Questions

query on partition question