Created 07-05-2017 12:22 PM
Steps:
1. Created a java class extending hive GenericUDTF, created SparkSession in that:
Public Class sparkUDTF extends GenericHiveUDTF { ... static Long sparkJob(String tableName) { SparkSession spark = SparkSession.builder().enableHiveSupport().master("yarn-client").appName("SampleSparkUDTF_yarnV1").getOrCreate(); Dataset inputData = spark.read().table(tableName); //input to the function “text”, “hive table” Long countRows = inputData.count(); //access hive table return countRows; } }
2. Copied this custom UDTF jar into hdfs and also into auxlib
3. Copied /usr/hdp/<2.6.x>/spark2/jars/*.jar into /usr/hdp/<2.6.x>/hive/auxlib/
4. Connecting to HS2 using beeline to run this Spark UDT:
beeline -u jdbc:hive2://localhost:10000 -d org.apache.hive.jdbc.HiveDriver CREATE TABLE TestTable (i int);INSERT INTO TestTable VALUES (1); CREATE FUNCTION SparkUDT AS 'SparkHiveUDTF' using jar 'hdfs:///tmp/sparkHiveGenericUDTF-1.0.jar' ; SELECT SparkUDT('tbl','TestTable');
On HDP 2.6 cluster, Spark 2.1 - causes this error:
Caused by: java.lang.IllegalStateException: Library directory '/hadoop/yarn/local/usercache/hive/appcache/application_1499162780176_0014/container_e03_1 499162780176_0014_01_000005/assembly/target/scala-2.11/jars' does not exist; make sure Spark is built. at org.apache.spark.launcher.CommandBuilderUtils.checkState(CommandBuilderUtils.java:260) at org.apache.spark.launcher.CommandBuilderUtils.findJarsDir(CommandBuilderUtils.java:380) at org.apache.spark.launcher.YarnCommandBuilderUtils$.findJarsDir(YarnCommandBuilderUtils.scala:38) at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:570) at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:895) at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:171) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56) at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:156) at org.apache.spark.SparkContext.<init>(SparkContext.scala:509) at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2320) at org.apache.spark.sql.SparkSession$Builder$anonfun$6.apply(SparkSession.scala:868) at org.apache.spark.sql.SparkSession$Builder$anonfun$6.apply(SparkSession.scala:860) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:860) at SparkHiveUDTF.sparkJob(SparkHiveUDTF.java:97) at SparkHiveUDTF.process(SparkHiveUDTF.java:78) at org.apache.hadoop.hive.ql.exec.UDTFOperator.process(UDTFOperator.java:109) at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:841) at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:88) at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:841) at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:133) at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:170) at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:555) ... 18 more
Created 07-05-2017 01:35 PM
After setting the config in SparkSession source code:
SparkSession spark = SparkSession
.builder()
.enableHiveSupport()
.master("yarn-client")
.appName("SampleSparkUDTF_yarnV1")
.config("spark.yarn.jars","hdfs:///hdp/apps/2.6.1.0-129/spark2")
.config("spark.yarn.am.extraJavaOptions","-Dhdp.version=2.6.1.0-129")
.config("spark.driver.extra.JavaOptions","-Dhdp.version=2.6.1.0-129")
.config("spark.executor.memory","4g")
.getOrCreate();
While testing via HS2 & this is the error:
beeline -u jdbc:hive2://localhost:10000 -d org.apache.hive.jdbc.HiveDriver
0: jdbc:hive2://localhost:10000>
……
], TaskAttempt 3 failed, info=[Error: Failure while running task:java.lang.RuntimeException: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing writable (null) at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:173) at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:139) at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:347) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:194) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:185) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:185) at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:181) at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing writable (null) at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:91) at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:68) at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:325) at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:150) ... 14 more Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing writable (null) at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:563) at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:83) ... 17 more Caused by: org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master. at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:85) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:62) at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:156) at org.apache.spark.SparkContext.<init>(SparkContext.scala:509) at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2320) at org.apache.spark.sql.SparkSession$Builder$anonfun$6.apply(SparkSession.scala:868) at org.apache.spark.sql.SparkSession$Builder$anonfun$6.apply(SparkSession.scala:860) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:860) at SparkHiveUDTF.sparkJob(SparkHiveUDTF.java:102) at SparkHiveUDTF.process(SparkHiveUDTF.java:78) at org.apache.hadoop.hive.ql.exec.UDTFOperator.process(UDTFOperator.java:109) at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:841) at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:88) at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:841) at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:133) at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:170) at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:555) ... 18 more
Created 07-05-2017 05:54 PM
@sudha Can you try the same with Spark thrift server?
Created 07-06-2017 02:58 AM
On spark thrift server, after create function, when function is called (SELECT SparkUDTF('txt','table')) gives error that function is not recognized.
Created 07-06-2017 05:30 AM
While testing like this, it does not read hive-site.xml, spark-env.sh of the cluster.
Is there a way to make it read spark config present in the cluster?