Created on 10-02-2015 04:47 PM
Sharing the steps to make Hive UDFs/UDAFs/UDTFs work natively with SparkSQL.
1- Open spark-shell with your Hive UDF jar as a parameter:
spark-shell --jars path-to-your-hive-udf.jar
2- From spark-shell, declare a Hive context and create your function:
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("""create temporary function balance as 'com.github.gbraccialli.hive.udf.BalanceFromRechargesAndOrders'""")
3- From spark-shell, use your UDFs directly in SparkSQL:
sqlContext.sql(""" create table recharges_with_balance_array as select reseller_id, phone_number, phone_credit_id, date_recharge, phone_credit_value, balance(orders,'date_order', 'order_value', reseller_id, date_recharge, phone_credit_value) as balance from orders """);
PS: I found some issues using UDTFs with Spark 1.3, which were fixed in Spark 1.4.1. I tested UDFs, UDAFs, and UDTFs; all of them worked properly, with the same SQL statements and the same results.
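If you don't have a UDF jar at hand, here is a minimal sketch of a Hive UDF written in Scala that can be registered the same way. The package and class name are hypothetical, for illustration only; any class extending org.apache.hadoop.hive.ql.exec.UDF with an evaluate method works:

package com.example.hive.udf  // hypothetical package, not part of the project above

import org.apache.hadoop.hive.ql.exec.UDF
import org.apache.hadoop.io.Text

// Minimal Hive UDF: upper-cases a string.
// Hive finds the evaluate() method by reflection, so extending UDF is all that is required.
class ToUpper extends UDF {
  def evaluate(s: Text): Text =
    if (s == null) null else new Text(s.toString.toUpperCase)
}

Package it into a jar, pass it with --jars as in step 1, and register it the same way:

sqlContext.sql("""create temporary function to_upper as 'com.example.hive.udf.ToUpper'""")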
Created on 11-10-2015 09:36 PM
@Guilherme Braccialli thanks for the write-up. Would love to see a sample implementation of the actual UDF, and a basic shell of a Scala project for it.
Created on 11-10-2015 09:41 PM
@azeltov@hortonworks.com see the GitHub project below for the Hive UDF; the exact same jar built for the Hive project works fine with Spark. The Scala code is above.
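(The HiveUtils repo itself is a Maven project; as a sketch of what a minimal Scala project shell for a UDF could look like with sbt instead, a build.sbt along these lines should be enough to compile against the Hive API. The project name and version numbers are assumptions; match them to your cluster.)

name := "hive-udf-example"  // hypothetical project name

version := "1.0-SNAPSHOT"

scalaVersion := "2.10.4"  // Spark 1.x runs on Scala 2.10

// "provided": the cluster already ships the Hive classes,
// so keep them out of the jar you pass to --jars.
libraryDependencies += "org.apache.hive" % "hive-exec" % "0.14.0" % "provided"

Build with sbt package and pass the resulting jar to spark-shell --jars as in the article.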
Created on 11-10-2015 09:44 PM
Awesome! Will check it out. Another question: I had problems with Zeppelin registering a custom jar; I had to use a Maven dependency and could not load it using the path. It works fine using the CLI and passing the --jars path-to-your-hive-udf.jar argument. Curious if you tried it in Zeppelin?
Created on 11-11-2015 02:57 AM
@azeltov@hortonworks.com I tried it now and I get a ClassNotFoundException with z.load("/path/to/jar") and also with a local Maven repo dependency.
Were you able to make it work with Zeppelin?
Created on 11-11-2015 02:59 AM
See this JIRA: https://issues.apache.org/jira/browse/ZEPPELIN-150
Created on 11-11-2015 03:11 AM
Hopefully that JIRA will be fixed soon. Did you register the custom jar in the local Maven repository of your sandbox? And that still did not work?
Created on 11-11-2015 02:39 PM
Here is what I tried.
First, with spark-shell:
1- Download the jar and register it in the local Maven repository:
su - zeppelin
wget https://raw.githubusercontent.com/gbraccialli/HiveUtils/master/target/HiveUtils-1.0-SNAPSHOT-jar-wit... \
  -O /tmp/HiveUtils-1.0-SNAPSHOT-jar-with-dependencies.jar
mvn org.apache.maven.plugins:maven-install-plugin:2.5.2:install-file \
  -Dfile=/tmp/HiveUtils-1.0-SNAPSHOT-jar-with-dependencies.jar \
  -DgroupId=com.github.gbraccialli \
  -DartifactId=HiveUtils \
  -Dversion=1.0-SNAPSHOT \
  -Dpackaging=jar
Created on 11-11-2015 02:39 PM
2- Run spark-shell with the dependency (resolved from the local Maven repository populated in step 1):
spark-shell --master yarn-client --packages "com.github.gbraccialli:HiveUtils:1.0-SNAPSHOT"
3- Run the Spark code:
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("""create temporary function geohash_encode as 'com.github.gbraccialli.hive.udf.UDFGeohashEncode'""")
sqlContext.sql("""select geohash_encode(1.11,1.11,3) from sample_07 limit 10""").collect().foreach(println)
spark-shell worked fine!
Created on 11-11-2015 02:40 PM
Second, the same with Zeppelin:
1- Restart the interpreter
2- Load the dependencies:
%dep
z.reset()
z.load("com.github.gbraccialli:HiveUtils:1.0-SNAPSHOT")
3- Execute the SQL, using the same Scala code:
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("""create temporary function geohash_encode as 'com.github.gbraccialli.hive.udf.UDFGeohashEncode'""")
sqlContext.sql("""select geohash_encode(1.11,1.11,3) from sample_07 limit 10""").collect().foreach(println)
It worked with Scala code + Zeppelin!
Created on 11-11-2015 02:40 PM
4- Execute the SQL, using the %sql interpreter:
%sql
select geohash_encode(1.11,1.11,3) from sample_07 limit 10
It fails with the %sql interpreter + Zeppelin:
java.lang.ClassNotFoundException: com.github.gbraccialli.hive.udf.UDFGeohashEncode
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    ...
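This matches the JIRA above: the class loaded by %dep is visible to the Scala interpreter but not to the %sql one. A workaround to try until it is fixed (an assumption on my side, not verified in this thread) is to put the jar on the Spark interpreter's own classpath instead of loading it with %dep; newer Zeppelin builds expose SPARK_SUBMIT_OPTIONS in conf/zeppelin-env.sh for this:

export SPARK_SUBMIT_OPTIONS="--jars /tmp/HiveUtils-1.0-SNAPSHOT-jar-with-dependencies.jar"

Then restart the Spark interpreter, so every interpreter sharing that Spark context (including %sql) sees the jar.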