
Sharing the steps to make Hive UDFs/UDAFs/UDTFs work natively with SparkSQL

1- Open spark-shell with the Hive UDF jar as a parameter:

spark-shell --jars path-to-your-hive-udf.jar

2- From spark-shell, declare a Hive context and create the functions:

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc);

sqlContext.sql("""create temporary function balance as 'com.github.gbraccialli.hive.udf.BalanceFromRechargesAndOrders'""");


3- From spark-shell, use your UDFs directly in SparkSQL:

sqlContext.sql("""
create table recharges_with_balance_array as 
select 
  reseller_id,
  phone_number,
  phone_credit_id,
  date_recharge,
  phone_credit_value,
  balance(orders,'date_order', 'order_value', reseller_id, date_recharge, phone_credit_value) as balance
from orders
""");

PS: I found some issues using UDTFs with Spark 1.3, which were fixed in Spark 1.4.1. I tested UDF, UDAF and UDTF; all of them worked properly, with the same SQL statements and the same results.
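
For completeness, a UDTF is registered the same way and is typically invoked through LATERAL VIEW. Here is a minimal sketch; the function name explode_pairs, the class com.example.hive.udtf.ExplodePairs and the orders_map column are hypothetical, for illustration only:

// Register a Hive UDTF and call it with LATERAL VIEW (hypothetical class and column names)
sqlContext.sql("""create temporary function explode_pairs as 'com.example.hive.udtf.ExplodePairs'""");

sqlContext.sql("""
select
  o.reseller_id,
  t.key,
  t.value
from orders o
lateral view explode_pairs(o.orders_map) t as key, value
""");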

Comments

@Guilherme Braccialli thanks for the write-up. Would love to see a sample implementation of the actual UDF, and a basic shell of creating a Scala project for the UDF.


@azeltov@hortonworks.com see the GitHub project below for a Hive UDF; exactly the same jar built for the Hive project works fine with Spark. The Scala code to use it is above:

https://github.com/gbraccialli/HiveUtils
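
For a quick idea of the shape, here is a minimal sketch of a simple Hive UDF written in Scala. It is an illustrative example only (the package, class name and behaviour are made up here), not the actual HiveUtils code:

package com.example.hive.udf

import org.apache.hadoop.hive.ql.exec.UDF
import org.apache.hadoop.io.Text

// Simple Hive UDF: Hive (and SparkSQL's HiveContext) find evaluate() by reflection.
// Build this into a jar with hive-exec as a provided dependency, then register it with
// "create temporary function", exactly as shown in the article above.
class UpperCase extends UDF {
  def evaluate(input: Text): Text =
    if (input == null) null else new Text(input.toString.toUpperCase)
}

Once the jar is on the classpath, something like create temporary function my_upper as 'com.example.hive.udf.UpperCase' makes it callable from both Hive and SparkSQL.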


Awesome! Will check it out. Another question: I had problems with Zeppelin registering a custom jar; I had to use a Maven dependency and could not load it using the path. It works fine using the CLI and passing the --jars path-to-your-hive-udf.jar argument. Curious if you tried it in Zeppelin?


@azeltov@hortonworks.com I tried it now and I am getting a ClassNotFoundException with z.load("/path/to/jar") and also with a local Maven repo dependency.

Could you make it work with Zeppelin?
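
For reference, the path-based attempt looked something like this in a %dep paragraph (the jar path is a placeholder):

%dep
z.reset()
// load the UDF jar from a local path; this is the variant that ended in ClassNotFoundException
z.load("/path/to/your-hive-udf.jar")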

@Guilherme Braccialli

Hopefully that JIRA will be fixed soon. Did you register the custom jar in the local Maven repository of your sandbox? And that still did not work?


@azeltov@hortonworks.com

Here is what I tried.

First, with spark-shell:

1- Download the jar and register it in the local Maven repository:

su - zeppelin

wget https://raw.githubusercontent.com/gbraccialli/HiveUtils/master/target/HiveUtils-1.0-SNAPSHOT-jar-with-dependencies.jar -O /tmp/HiveUtils-1.0-SNAPSHOT-jar-with-dependencies.jar

mvn org.apache.maven.plugins:maven-install-plugin:2.5.2:install-file \
 -Dfile=/tmp/HiveUtils-1.0-SNAPSHOT-jar-with-dependencies.jar \
 -DgroupId=com.github.gbraccialli \
 -DartifactId=HiveUtils \
 -Dversion=1.0-SNAPSHOT \
 -Dpackaging=jar 

2- Run spark-shell with the dependency:

spark-shell --master yarn-client --packages "com.github.gbraccialli:HiveUtils:1.0-SNAPSHOT"

3- Run the Spark code:

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc);
sqlContext.sql("""create temporary function geohash_encode as 'com.github.gbraccialli.hive.udf.UDFGeohashEncode'""");
sqlContext.sql("""select geohash_encode(1.11,1.11,3) from sample_07 limit 10""").collect().foreach(println);

spark-shell worked fine!


Second, the same with Zeppelin:

1- Restart the interpreter

2- Load the dependencies:

%dep 

z.reset()
z.load("com.github.gbraccialli:HiveUtils:1.0-SNAPSHOT")

3- Execute SQL, using the same Scala code:

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc);
sqlContext.sql("""create temporary function geohash_encode as 'com.github.gbraccialli.hive.udf.UDFGeohashEncode'""");
sqlContext.sql("""select geohash_encode(1.11,1.11,3) from sample_07 limit 10""").collect().foreach(println);

It worked with Scala code + Zeppelin!


4- Execute SQL, using the %sql interpreter:

%sql
select geohash_encode(1.11,1.11,3) from sample_07 limit 10

It fails with the %sql interpreter + Zeppelin:

java.lang.ClassNotFoundException: com.github.gbraccialli.hive.udf.UDFGeohashEncode
	at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
...