
Sharing the steps to make Hive UDF/UDAF/UDTF work natively with SparkSQL

1- Open spark-shell with your Hive UDF jar as a parameter:

spark-shell --jars path-to-your-hive-udf.jar

2- From spark-shell, declare a Hive context and create the functions:

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc);

sqlContext.sql("""create temporary function balance as 'com.github.gbraccialli.hive.udf.BalanceFromRechargesAndOrders'""");
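The class named in create temporary function is an ordinary Hive UDF. As a rough illustration only (the real BalanceFromRechargesAndOrders lives in the linked project; the class name and logic below are hypothetical), the shape is a class with a public evaluate method. In a real project the class extends org.apache.hadoop.hive.ql.exec.UDF from the hive-exec dependency; the parent class is commented out here so the sketch stays self-contained:

```scala
// Hypothetical sketch of a Hive UDF in Scala. In a real project the class
// extends org.apache.hadoop.hive.ql.exec.UDF (hive-exec dependency), and
// Hive dispatches to the public evaluate() method by reflection.
class SimpleBalance /* extends UDF */ {
  // Toy logic only: balance left after an order is paid from a recharge.
  def evaluate(rechargeValue: Double, orderValue: Double): Double =
    rechargeValue - orderValue
}
```

Packaged into a jar, a class like this is what you would name in create temporary function ... as '...'.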


3- From spark-shell, use your UDFs directly in SparkSQL:

sqlContext.sql("""
create table recharges_with_balance_array as 
select 
  reseller_id,
  phone_number,
  phone_credit_id,
  date_recharge,
  phone_credit_value,
  balance(orders,'date_order', 'order_value', reseller_id, date_recharge, phone_credit_value) as balance
from orders
""");

PS: I found some issues using UDTFs with Spark 1.3 that were fixed in Spark 1.4.1. I tested UDF, UDAF, and UDTF; all of them worked properly, with the same SQL statements returning the same results.

Comments

@Guilherme Braccialli thanks for the write-up. Would love to see a sample implementation of the actual UDF, and a basic shell of creating a Scala project for the UDF.

@azeltov@hortonworks.com see the GitHub project below for the Hive UDF; exactly the same jar built for the Hive project works fine with Spark. The Scala code to use it is in the article above:

https://github.com/gbraccialli/HiveUtils

Awesome! Will check it out. Another question: I had problems registering a custom jar with Zeppelin; I had to use a Maven dependency and could not load it using the path. It works fine using the CLI and passing the --jars path-to-your-hive-udf.jar argument. Curious if you tried it in Zeppelin?

@azeltov@hortonworks.com I tried it now and I am getting a ClassNotFoundException with z.load("/path/to/jar") and also with a Maven local repo dependency.

Were you able to make it work with Zeppelin?

@Guilherme Braccialli

Hopefully that JIRA will be fixed soon. Did you register the custom jar in the local maven of your sandbox? And that still did not work?

@azeltov@hortonworks.com

Here is what I tried.

First, with spark-shell

1- Download the jar and register it in the local Maven repo

su - zeppelin

wget -O /tmp/HiveUtils-1.0-SNAPSHOT-jar-with-dependencies.jar https://raw.githubusercontent.com/gbraccialli/HiveUtils/master/target/HiveUtils-1.0-SNAPSHOT-jar-wit...

mvn org.apache.maven.plugins:maven-install-plugin:2.5.2:install-file \
 -Dfile=/tmp/HiveUtils-1.0-SNAPSHOT-jar-with-dependencies.jar \
 -DgroupId=com.github.gbraccialli \
 -DartifactId=HiveUtils \
 -Dversion=1.0-SNAPSHOT \
 -Dpackaging=jar 

2- Run spark-shell with dependency

spark-shell --master yarn-client --packages "com.github.gbraccialli:HiveUtils:1.0-SNAPSHOT"

3- Run spark code

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc);
sqlContext.sql("""create temporary function geohash_encode as 'com.github.gbraccialli.hive.udf.UDFGeohashEncode'""");
sqlContext.sql("""select geohash_encode(1.11,1.11,3) from sample_07 limit 10""").collect().foreach(println);

spark-shell worked fine!
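For reference, geohash encoding itself (what a UDF like geohash_encode computes) can be sketched in plain Scala. This is the standard bit-interleaving base-32 algorithm, not the actual code from the UDFGeohashEncode class, whose output formatting may differ:

```scala
// Plain-Scala sketch of standard geohash encoding: interleave longitude
// and latitude range-halving bits (longitude first) and emit each 5-bit
// group as a character of the geohash base-32 alphabet.
object Geohash {
  private val base32 = "0123456789bcdefghjkmnpqrstuvwxyz"

  def encode(lat: Double, lon: Double, precision: Int): String = {
    var latMin = -90.0;  var latMax = 90.0
    var lonMin = -180.0; var lonMax = 180.0
    val sb = new StringBuilder
    var isLonBit = true // bits alternate, starting with longitude
    var ch = 0          // current 5-bit group
    var bits = 0
    while (sb.length < precision) {
      if (isLonBit) {
        val mid = (lonMin + lonMax) / 2
        if (lon >= mid) { ch = (ch << 1) | 1; lonMin = mid }
        else            { ch = ch << 1;       lonMax = mid }
      } else {
        val mid = (latMin + latMax) / 2
        if (lat >= mid) { ch = (ch << 1) | 1; latMin = mid }
        else            { ch = ch << 1;       latMax = mid }
      }
      isLonBit = !isLonBit
      bits += 1
      if (bits == 5) { sb.append(base32(ch)); ch = 0; bits = 0 }
    }
    sb.toString
  }
}
```

With this scheme, Geohash.encode(1.11, 1.11, 3) yields "s00".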

Second, the same with Zeppelin:

1- Restart interpreter

2- Load dependencies

%dep 

z.reset()
z.load("com.github.gbraccialli:HiveUtils:1.0-SNAPSHOT")

3- Execute SQL, using the same Scala code

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc);
sqlContext.sql("""create temporary function geohash_encode as 'com.github.gbraccialli.hive.udf.UDFGeohashEncode'""");
sqlContext.sql("""select geohash_encode(1.11,1.11,3) from sample_07 limit 10""").collect().foreach(println);

It worked with Scala code + Zeppelin!

4- Execute SQL, using the %sql interpreter

%sql
select geohash_encode(1.11,1.11,3) from sample_07 limit 10

It fails with the %sql interpreter + Zeppelin:

java.lang.ClassNotFoundException: com.github.gbraccialli.hive.udf.UDFGeohashEncode
	at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
...

@Guilherme Braccialli thanks for trying this. I am just starting my Spark journey, but it seems that any time I try this in Zeppelin or Jupyter I keep hitting different issues; I guess I should just stick with the CLI for now. I will give demos in Zeppelin at customer sites, but I now know the limitations of the product.

@Guilherme Braccialli Just tried your posted steps; everything worked great. I had problems doing the mvn build using your repo, but that's not an issue for me. I now have the template for how to interact with Hive and Spark. Thanks for the post, very useful!


Well, I tried CREATE TEMPORARY FUNCTION in Beeline and it failed. With CREATE FUNCTION, the function was created and shows up with DESC FUNCTION, but when I access it, it says it cannot find the function. Do you have the same problem? I'm trying to let everyone connected to my Thrift server have access to the UDFs that I deployed. Do you have any suggestions?


Hi @Guilherme Braccialli, so you did not run into this issue?

https://issues.apache.org/jira/browse/SPARK-20033

Thank you.

Version history
Revision #: 1 of 1
Last update: 10-02-2015 04:47 PM