A colleague recently asked me how to create a custom function for Hive using Python. You can create a function in pretty much any language and plug it into your Hive query using the Hive TRANSFORM clause. TRANSFORM lets you add your own mappers and/or reducers to process the data. The example in this article is working code that I wrote a few years ago using an early version of Hive to demonstrate how to add a custom function.
In earlier versions of Hive we had to implement our own functions to hash sensitive data for PII compliance. Beginning with Hive 1.3, the SHA2 UDF was added to calculate a hash using SHA-224, SHA-256, SHA-384, or SHA-512. In my example below I create a custom UDF using Python to calculate the SHA-256 hash of a social security number. Keep in mind that when I did this there were no out-of-the-box Hive UDFs available. This example only demonstrates how to write your own custom functions for Hive using Python.
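For reference, the value the custom UDF below computes is just a plain SHA-256 hex digest, the same thing Hive's built-in SHA2(col, 256) now returns. A minimal sketch in Python (the sample SSN is made up):

```python
import hashlib

## Compute the SHA-256 hex digest of a string -- the same value
## Hive's built-in SHA2(col, 256) would return for that column.
def sha256_hex(value):
    return hashlib.sha256(value.encode('utf-8')).hexdigest()

## The sample SSN here is made up for illustration.
digest = sha256_hex('123-45-6789')
print(digest)       ## a 64-character hex string
print(len(digest))  ## 64
```

SHA-256 is deterministic, so the same SSN always hashes to the same digest, which is what lets hashed columns still be joined or grouped on.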
First, we need to write some Python code that will read each record passed in from Hive and process the data. Save this to a file, e.g. my_python_code.py:
import sys
import hashlib

## we are receiving each record passed in from Hive via standard input
## By default, columns will be transformed to STRING and delimited by TAB
## Also, by default, NULL values will be converted to literal string \N to differentiate from empty strings
for line in sys.stdin:
    line = line.strip()
    (customer_no, ssn, plan, join_date, status, balance, region) = line.split('\t')
    ## hash the social security number and emit all the fields to standard out
    ssn = hashlib.sha256(ssn.encode('utf-8')).hexdigest()
    print('\t'.join([customer_no, ssn, plan, join_date, status, balance, region]))
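Before wiring the script into Hive, it helps to test the per-record logic locally the same way Hive will drive it: one TAB-delimited line in, one TAB-delimited line out. A quick sketch (the field values are made up for illustration):

```python
import hashlib

## Simulate one TAB-delimited record as Hive would stream it on stdin
## (these sample values are made up).
record = '\t'.join(['1001', '123-45-6789', 'GOLD',
                    '2015-01-15', 'ACTIVE', '250.00', 'WEST'])

## The same per-record logic as the transform script above.
def transform(line):
    (customer_no, ssn, plan, join_date,
     status, balance, region) = line.strip().split('\t')
    ssn = hashlib.sha256(ssn.encode('utf-8')).hexdigest()
    return '\t'.join([customer_no, ssn, plan, join_date,
                      status, balance, region])

out = transform(record)
print(out)  ## same record, with the SSN replaced by its SHA-256 digest
```

From a shell, the equivalent end-to-end check is piping a sample file through the script, e.g. cat sample.txt | python my_python_code.py.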
Now you can call the above Python code from your HiveQL:
ADD FILE /path-to-my-script/my_python_code.py;
CREATE VIEW customer_data_mart_view.v_customer_balance