Community Articles

Find and share helpful community-sourced technical articles.
Labels (1)
avatar

A colleague recently asked me how to create a custom function for Hive using Python. You can pretty much create a function in any language and plug it into your Hive query using the Hive TRANSFORM clause. TRANSFORM lets you add your own mappers and/or reducers to process the data. The example in this article is working code that I wrote a few years ago using an early version of Hive to demonstrate how to add a custom function.

In earlier versions of Hive we had to implement our own functions to hash sensitive data for PII compliance. Beginning with Hive 1.3 the SHA2 UDF was added to calculate a hash using SHA-224, SHA-256, SHA-384, or SHA-512. In my example below I create a custom UDF using Python to calculate the SHA-256 hash for social security number. Keep in mind that when I did this there were no out of the box Hive UDF’s available. This example is to only demonstrate how to write your own custom functions for Hive using Python.

First, we need to write some python code that will read each record passed in from Hive and process the data. Save this to a file:

#!/usr/local/bin/python
import hashlib
import sys

## we are receiving each record passed in from Hive via standard input 
## By default, columns will be transformed to STRING and delimited by TAB 
## Also, by default, NULL values will be converted to literal string \N to differentiate from empty strings
for line in sys.stdin:
  line = line.strip()
  (customer_no,ssn,plan,join_date,status,balance,region) = line.split('\t')

  ## hash social security number and emit all the fields to standard out
  x = hashlib.sha256(str(ssn))
  ssn = x.hexdigest()
  print '\t'.join([str(customer_no),str(ssn),plan,str(join_date),status,str(balance),region])

Now you can call the above python code from your HiveQL:

ADD FILE /path-to-my-script/my_python_code.py;
CREATE VIEW customer_data_mart_view.v_customer_balance
SELECT
TRANSFORM (
customer_no   
,ssn
,plan
,join_date
,status
,balance
,region)
USING '/path-to-my-script/my_python_code.py'
 AS customer_no   
,ssn
,plan
,join_date
,status
,balance
,region
FROM customer_data_mart.customer_details;
28,021 Views
Comments
avatar
Expert Contributor

One thing I wish I had known when starting with python UDF's is that you can write to stderr to assist in debugging. Then look in the Yarn RM for the logs.

import sys
sys.stderr.write('>>>> Read a line \n' + line + '\n')