Created on 10-02-2015 04:47 PM
Sharing the steps to make Hive UDFs/UDAFs/UDTFs work natively with SparkSQL.
1- Open spark-shell with your Hive UDF jar as a parameter:
spark-shell --jars path-to-your-hive-udf.jar
2- From spark-shell, declare a Hive context and create your function:
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("""create temporary function balance as 'com.github.gbraccialli.hive.udf.BalanceFromRechargesAndOrders'""")
3- From spark-shell, use your UDFs directly in SparkSQL:
sqlContext.sql(""" create table recharges_with_balance_array as select reseller_id, phone_number, phone_credit_id, date_recharge, phone_credit_value, balance(orders,'date_order', 'order_value', reseller_id, date_recharge, phone_credit_value) as balance from orders """);
PS: I found some issues using UDTFs with Spark 1.3, which were fixed in Spark 1.4.1. I tested UDFs, UDAFs, and UDTFs; all of them worked properly, with the same SQL statements and the same results.
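If you don't have a UDF jar at hand, here is a minimal sketch of a Hive UDF written in Scala that can be registered the same way. The package and class name are hypothetical, for illustration only; any class extending org.apache.hadoop.hive.ql.exec.UDF with an evaluate method works:

package com.example.hive.udf  // hypothetical package, not part of the project above

import org.apache.hadoop.hive.ql.exec.UDF
import org.apache.hadoop.io.Text

// Minimal Hive UDF: upper-cases a string.
// Hive finds the evaluate() method by reflection, so extending UDF is all that is required.
class ToUpper extends UDF {
  def evaluate(s: Text): Text =
    if (s == null) null else new Text(s.toString.toUpperCase)
}

Package it into a jar, pass it with --jars as in step 1, and register it the same way:

sqlContext.sql("""create temporary function to_upper as 'com.example.hive.udf.ToUpper'""")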
Created on 11-10-2015 09:36 PM
@Guilherme Braccialli thanks for the write-up. Would love to see a sample implementation of the actual UDF, and a basic shell of a Scala project for it.
Created on 11-10-2015 09:41 PM
@azeltov@hortonworks.com see the GitHub project below for the Hive UDF; the exact same jar built for the Hive project works fine with Spark. The Scala code is above.
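(The HiveUtils repo itself is a Maven project; as a sketch of what a minimal Scala project shell for a UDF could look like with sbt instead, a build.sbt along these lines should be enough to compile against the Hive API. The project name and version numbers are assumptions; match them to your cluster.)

name := "hive-udf-example"  // hypothetical project name

version := "1.0-SNAPSHOT"

scalaVersion := "2.10.4"  // Spark 1.x runs on Scala 2.10

// "provided": the cluster already ships the Hive classes,
// so keep them out of the jar you pass to --jars.
libraryDependencies += "org.apache.hive" % "hive-exec" % "0.14.0" % "provided"

Build with sbt package and pass the resulting jar to spark-shell --jars as in the article.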
Created on 11-10-2015 09:44 PM
Awesome! Will check it out. Another question: I had problems with Zeppelin registering a custom jar; I had to use a Maven dependency and could not load it using the path. It works fine using the CLI and passing the --jars path-to-your-hive-udf.jar argument. Curious if you tried it in Zeppelin?
Created on 11-11-2015 02:57 AM
@azeltov@hortonworks.com I tried it now and I get a ClassNotFoundException with z.load("/path/to/jar") and also with a local Maven repo dependency.
Were you able to make it work with Zeppelin?
Created on 11-11-2015 02:59 AM
See this JIRA: https://issues.apache.org/jira/browse/ZEPPELIN-150
Created on 11-11-2015 03:11 AM
Hopefully that JIRA will be fixed soon. Did you register the custom jar in the local Maven repository of your sandbox? And that still did not work?
Created on 11-11-2015 02:39 PM
Here is what I tried.
First, with spark-shell:
1- Download the jar and register it in the local Maven repository:
su - zeppelin
wget https://raw.githubusercontent.com/gbraccialli/HiveUtils/master/target/HiveUtils-1.0-SNAPSHOT-jar-wit... \
  -O /tmp/HiveUtils-1.0-SNAPSHOT-jar-with-dependencies.jar
mvn org.apache.maven.plugins:maven-install-plugin:2.5.2:install-file \
  -Dfile=/tmp/HiveUtils-1.0-SNAPSHOT-jar-with-dependencies.jar \
  -DgroupId=com.github.gbraccialli \
  -DartifactId=HiveUtils \
  -Dversion=1.0-SNAPSHOT \
  -Dpackaging=jar
Created on 11-11-2015 02:39 PM
2- Run spark-shell with the dependency (resolved from the local Maven repository populated in step 1):
spark-shell --master yarn-client --packages "com.github.gbraccialli:HiveUtils:1.0-SNAPSHOT"
3- Run the Spark code:
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("""create temporary function geohash_encode as 'com.github.gbraccialli.hive.udf.UDFGeohashEncode'""")
sqlContext.sql("""select geohash_encode(1.11,1.11,3) from sample_07 limit 10""").collect().foreach(println)
spark-shell worked fine!
Created on 11-11-2015 02:40 PM
Second, the same with Zeppelin:
1- Restart the interpreter
2- Load the dependencies:
%dep
z.reset()
z.load("com.github.gbraccialli:HiveUtils:1.0-SNAPSHOT")
3- Execute the SQL, using the same Scala code:
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("""create temporary function geohash_encode as 'com.github.gbraccialli.hive.udf.UDFGeohashEncode'""")
sqlContext.sql("""select geohash_encode(1.11,1.11,3) from sample_07 limit 10""").collect().foreach(println)
It worked with Scala code + Zeppelin!
Created on 11-11-2015 02:40 PM
4- Execute the SQL, using the %sql interpreter:
%sql
select geohash_encode(1.11,1.11,3) from sample_07 limit 10
It fails with the %sql interpreter + Zeppelin:
java.lang.ClassNotFoundException: com.github.gbraccialli.hive.udf.UDFGeohashEncode
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    ...
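This matches the JIRA above: the class loaded by %dep is visible to the Scala interpreter but not to the %sql one. A workaround to try until it is fixed (an assumption on my side, not verified in this thread) is to put the jar on the Spark interpreter's own classpath instead of loading it with %dep; newer Zeppelin builds expose SPARK_SUBMIT_OPTIONS in conf/zeppelin-env.sh for this:

export SPARK_SUBMIT_OPTIONS="--jars /tmp/HiveUtils-1.0-SNAPSHOT-jar-with-dependencies.jar"

Then restart the Spark interpreter, so every interpreter sharing that Spark context (including %sql) sees the jar.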