Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

How to use a UDF to specific a partition where clause?

Highlighted

How to use a UDF to specific a partition where clause?

Rising Star

We have lots of partitioned tables and need to write queries that have partition clauses that include year, month, day values that are not as simple as:

where year=2016 and month=8 and day between 7 and 14

Often, they require non contiguous ranges of days over different months and years, etc. So, I am trying to come up with a way to help users craft those partitions clauses more easily and programatically (to the greatest extent possible). I don't want users to have to write wrapper scripts around queries or have to write queries into files first as they can do something like this:

beeline 'connect-string' --hivevar part_string=$(./make_part.py 2015-12-25 2016-01-07) -f some.hql

Where make_part.py might be a Python script that takes two dates and forms the full partition clause string that can then be simply referenced in some.hql, like this, which I know would work:

select * from table where ${hivevar:part_string};

What I would like to do is something "functionally" like this but from within beeline so users can work more interactively in the beeline shell. For example, I wish you could do this from within beeline:

set hivevar:part_string=!sh ./make_part.py 2015-12-25 2016-01-07

And have the output of the !sh command become the value of the hive variable. That does not work, of course. So, I was wondering, is it possible to create a UDF that could be used in a WHERE clause that could return the partition clause string which would get evaluated by Hive properly and work as expected - meaning, proper partition pruning. Something like this:

select * from table where udf_make_part("2015-12-25",  "2016-01-07");

Where udf_make_part would do the same thing as make_part.py - take some date arguments and return some generated partition string.

I've not worked with UDFs so far but just wondering if they could be used in this context in the WHERE clause. Or, does anyone have another useful approach for dealing with long, complicated partition clauses?

Don't have an account?
Coming from Hortonworks? Activate your account here