Created on 10-02-2015 04:47 PM
Sharing the steps to make Hive UDF/UDAF/UDTF work natively with SparkSQL.
1- Open spark-shell with your Hive UDF jar as a parameter:
spark-shell --jars path-to-your-hive-udf.jar
2- From spark-shell, declare a Hive context and create the functions:
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("""create temporary function balance as 'com.github.gbraccialli.hive.udf.BalanceFromRechargesAndOrders'""")
3- From spark-shell, use your UDFs directly in SparkSQL:
sqlContext.sql(""" create table recharges_with_balance_array as select reseller_id, phone_number, phone_credit_id, date_recharge, phone_credit_value, balance(orders,'date_order', 'order_value', reseller_id, date_recharge, phone_credit_value) as balance from orders """);
PS: I found some issues using UDTFs with Spark 1.3, which were fixed in Spark 1.4.1. I tested UDF, UDAF, and UDTF; all of them worked properly, with the same SQL statements and the same results.
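For reference, a Hive UDTF registered the same way is typically invoked through a LATERAL VIEW. The sketch below is illustrative only: the class name com.example.hive.udtf.ExplodeItems, the function name explode_items, and the order_items array column are placeholders, not part of this post's project.
// Hypothetical UDTF registration and call; class, function, and column names are placeholders.
sqlContext.sql("""create temporary function explode_items as 'com.example.hive.udtf.ExplodeItems'""")
sqlContext.sql("""
  select reseller_id, item
  from orders
  lateral view explode_items(order_items) t as item
""").show()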
Created on 11-10-2015 09:36 PM
@Guilherme Braccialli thanks for the write-up. Would love to see a sample implementation of the actual UDF, and a basic shell of a Scala project for the UDF.
Created on 11-10-2015 09:41 PM
@azeltov@hortonworks.com see the GitHub project below for the Hive UDF; the exact same jar built for the Hive project works fine with Spark. The Scala code to use it is above:
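To give a rough idea of what such a UDF looks like, here is a minimal sketch of an old-style Hive UDF written in Scala. It is illustrative only, not the actual HiveUtils code, and assumes hive-exec is available as a provided dependency when building the jar.
// Minimal Hive UDF sketch (illustrative, not the actual HiveUtils implementation).
// Assumes hive-exec is on the compile classpath.
import org.apache.hadoop.hive.ql.exec.UDF

class UpperCaseUDF extends UDF {
  // Hive finds evaluate() by reflection; add one evaluate per supported signature.
  def evaluate(input: String): String =
    if (input == null) null else input.toUpperCase
}
Once packaged, it can be registered exactly like the functions above, e.g. create temporary function upper_case as 'your.package.UpperCaseUDF' (class name hypothetical).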
Created on 11-10-2015 09:44 PM
Awesome! Will check it out. Another question: I had problems registering a custom jar in Zeppelin; I had to use a Maven dependency and could not load it using the path. It works fine using the CLI and passing the --jars path-to-your-hive-udf.jar argument. Curious if you tried it in Zeppelin?
Created on 11-11-2015 02:57 AM
@azeltov@hortonworks.com I tried it now and I am getting a ClassNotFoundException with z.load("/path/to/jar") and also with a local Maven repo dependency.
Were you able to make it work with Zeppelin?
Created on 11-11-2015 02:59 AM
see this jira: https://issues.apache.org/jira/browse/ZEPPELIN-150
Created on 11-11-2015 03:11 AM
Hopefully that JIRA will be fixed soon. Did you register the custom jar in the local Maven repo of your sandbox? And that still did not work?
Created on 11-11-2015 02:39 PM
See what I tried.
First, with spark-shell
1- Download the jar and register it in the local Maven repo
su - zeppelin
wget https://raw.githubusercontent.com/gbraccialli/HiveUtils/master/target/HiveUtils-1.0-SNAPSHOT-jar-wit... \
  -O /tmp/HiveUtils-1.0-SNAPSHOT-jar-with-dependencies.jar
mvn org.apache.maven.plugins:maven-install-plugin:2.5.2:install-file \
  -Dfile=/tmp/HiveUtils-1.0-SNAPSHOT-jar-with-dependencies.jar \
  -DgroupId=com.github.gbraccialli \
  -DartifactId=HiveUtils \
  -Dversion=1.0-SNAPSHOT \
  -Dpackaging=jar
Created on 11-11-2015 02:39 PM
2- Run spark-shell with dependency
spark-shell --master yarn-client --packages "com.github.gbraccialli:HiveUtils:1.0-SNAPSHOT"
3- Run spark code
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("""create temporary function geohash_encode as 'com.github.gbraccialli.hive.udf.UDFGeohashEncode'""")
sqlContext.sql("""select geohash_encode(1.11,1.11,3) from sample_07 limit 10""").collect().foreach(println)
spark-shell worked fine!
Created on 11-11-2015 02:40 PM
Second, the same with zeppelin:
1- Restart interpreter
2- Load dependencies
%dep
z.reset()
z.load("com.github.gbraccialli:HiveUtils:1.0-SNAPSHOT")
3- Execute the SQL, using the same Scala code
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("""create temporary function geohash_encode as 'com.github.gbraccialli.hive.udf.UDFGeohashEncode'""")
sqlContext.sql("""select geohash_encode(1.11,1.11,3) from sample_07 limit 10""").collect().foreach(println)
It worked with Scala code + Zeppelin!
Created on 11-11-2015 02:40 PM
4- Execute the SQL, using the %sql interpreter
%sql select geohash_encode(1.11,1.11,3) from sample_07 limit 10
It fails with the %sql interpreter + Zeppelin:
java.lang.ClassNotFoundException: com.github.gbraccialli.hive.udf.UDFGeohashEncode
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    ...