Reply
Explorer
Posts: 15
Registered: ‎06-06-2017

Spark & Impala Exception in Yarn Cluster

[ Edited ]

Hi, Could anyone experience below exception from spark and impala, Kindly help. This program is working fine from the local IDE, but receiving the below error in the YARN cluster environment.(used kerb & ssl authenticaton for impala connection)

 

Test Driver Program:

 

package com.spark.sample;

import java.util.Properties;

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

import com.spark.conf.ConfigUtil;

public class SparkSQLImpalaExecute{


public static void main(String args[]) throws Exception {
String conn = "jdbc:impala://host:21050/test;AuthMech=1;KrbRealm=xxxx;KrbHostFQDN=host;KrbServiceName=testSn;SSL=1;SSLTrustStore=test.jks;SSLTrustStorePwd=yyyyyy";

System.setProperty("java.security.auth.login.config",args[0]); //"kerb5.conf"
System.setProperty("javax.security.auth.useSubjectCredsOnly","false");

String isLocal = args[1];

JavaSparkContext sc = new JavaSparkContext("Y".equalsIgnoreCase(isLocal) ? ConfigUtil.getConf().setMaster("local") : ConfigUtil.getConf());
SQLContext sqlContext = new SQLContext(sc);

Properties props = ConfigUtil.getImpalaConfig();

DataFrame right = sqlContext.read().jdbc(conn, "db.tab1", props);
DataFrame joined = sqlContext.read().jdbc(conn, "db.tab2", props).join(right, "id");

joined.show();

}
}

 

 

 

 

 

kerb5.conf

 

Client{
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
useTicketCache=false
keyTab="test.keytab"
doNotPrompt=true
principal="testPrincipal"
debug=true;
};

 

Spark Submit Command: (Received same exception in cluster mode as well)

 

spark-submit --master yarn --deploy-mode client --driver-memory 2G --executor-memory 2G --conf spark.yarn.access.namenodes=hdfs://nameNodeHost:8020 --jars $(echo /tmp/sparkpoc/dep-jars/*.jar | tr ' ' ',') --files test.jks,test.keytab,kerb5conf.conf --class com.spark.sample.SparkSQLImpalaExecute spark-poc-1.0-SNAPSHOT.jar kerb5.conf N

 

Using com.cloudera.impala.jdbc41.Driver (ImpalaJDBC41.jar and its dependant set of jars).

 

Exception Received is below

 

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 4, gbl20054571.eu.hedani.net): java.sql.SQLException: [Simba][ImpalaJDBCDriver](500164) Error initialized or created transport for authentication: [Simba][ImpalaJDBCDriver](500169) Unable to connect to server: GSS initiate failed.
at com.cloudera.hivecommon.api.HiveServer2ClientFactory.createTransport(Unknown Source)
at com.cloudera.hivecommon.api.HiveServer2ClientFactory.createClient(Unknown Source)
at com.cloudera.hivecommon.core.HiveJDBCCommonConnection.connect(Unknown Source)
at com.cloudera.impala.core.ImpalaJDBCConnection.connect(Unknown Source)
at com.cloudera.jdbc.common.BaseConnectionFactory.doConnect(Unknown Source)
at com.cloudera.jdbc.common.AbstractDriver.connect(Unknown Source)
at org.apache.spark.sql.execution.datasources.jdbc.DriverWrapper.connect(DriverWrapper.scala:45)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$2.apply(JdbcUtils.scala:61)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$2.apply(JdbcUtils.scala:52)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.<init>(JDBCRDD.scala:347)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:339)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
Caused by: com.cloudera.support.exceptions.GeneralException: [Simba][ImpalaJDBCDriver](500164) Error initialized or created transport for authentication: [Simba][ImpalaJDBCDriver](500169) Unable to connect to server: GSS initiate failed.
... 25 more
Caused by: java.lang.RuntimeException: [Simba][ImpalaJDBCDriver](500169) Unable to connect to server: GSS initiate failed
at com.cloudera.hivecommon.api.HiveServerPrivilegedAction.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:360)
at com.cloudera.hivecommon.api.HiveServer2ClientFactory.createTransport(Unknown Source)
at com.cloudera.hivecommon.api.HiveServer2ClientFactory.createClient(Unknown Source)
at com.cloudera.hivecommon.core.HiveJDBCCommonConnection.connect(Unknown Source)
at com.cloudera.impala.core.ImpalaJDBCConnection.connect(Unknown Source)
at com.cloudera.jdbc.common.BaseConnectionFactory.doConnect(Unknown Source)
at com.cloudera.jdbc.common.AbstractDriver.connect(Unknown Source)
at org.apache.spark.sql.execution.datasources.jdbc.DriverWrapper.connect(DriverWrapper.scala:45)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$2.apply(JdbcUtils.scala:61)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$2.apply(JdbcUtils.scala:52)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.<init>(JDBCRDD.scala:347)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:339)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.thrift.transport.TTransportException: GSS initiate failed
at org.apache.thrift.transport.TSaslTransport.sendAndThrowMessage(TSaslTransport.java:232)
at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:316)
at org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37)
... 29 more

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1433)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1421)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1420)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1420)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1642)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1601)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1590)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1843)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1856)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1869)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:212)
at org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165)
at org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174)
at org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
at org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:53)
at org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086)
at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1498)
at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1505)
at org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375)
at org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1374)
at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099)
at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1374)
at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1456)
at org.apache.spark.sql.DataFrame.showString(DataFrame.scala:170)
at org.apache.spark.sql.DataFrame.show(DataFrame.scala:350)
at org.apache.spark.sql.DataFrame.show(DataFrame.scala:311)
at org.apache.spark.sql.DataFrame.show(DataFrame.scala:319)
at com.spark.sample.SparkSQLImpalaExecute.main(SparkSQLImpalaExecute.java:45)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.sql.SQLException: [Simba][ImpalaJDBCDriver](500164) Error initialized or created transport for authentication: [Simba][ImpalaJDBCDriver](500169) Unable to connect to server: GSS initiate failed.
at com.cloudera.hivecommon.api.HiveServer2ClientFactory.createTransport(Unknown Source)
at com.cloudera.hivecommon.api.HiveServer2ClientFactory.createClient(Unknown Source)
at com.cloudera.hivecommon.core.HiveJDBCCommonConnection.connect(Unknown Source)
at com.cloudera.impala.core.ImpalaJDBCConnection.connect(Unknown Source)
at com.cloudera.jdbc.common.BaseConnectionFactory.doConnect(Unknown Source)
at com.cloudera.jdbc.common.AbstractDriver.connect(Unknown Source)
at org.apache.spark.sql.execution.datasources.jdbc.DriverWrapper.connect(DriverWrapper.scala:45)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$2.apply(JdbcUtils.scala:61)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$2.apply(JdbcUtils.scala:52)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.<init>(JDBCRDD.scala:347)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:339)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
Caused by: com.cloudera.support.exceptions.GeneralException: [Simba][ImpalaJDBCDriver](500164) Error initialized or created transport for authentication: [Simba][ImpalaJDBCDriver](500169) Unable to connect to server: GSS initiate failed.
... 25 more
Caused by: java.lang.RuntimeException: [Simba][ImpalaJDBCDriver](500169) Unable to connect to server: GSS initiate failed
at com.cloudera.hivecommon.api.HiveServerPrivilegedAction.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:360)
at com.cloudera.hivecommon.api.HiveServer2ClientFactory.createTransport(Unknown Source)
at com.cloudera.hivecommon.api.HiveServer2ClientFactory.createClient(Unknown Source)
at com.cloudera.hivecommon.core.HiveJDBCCommonConnection.connect(Unknown Source)
at com.cloudera.impala.core.ImpalaJDBCConnection.connect(Unknown Source)
at com.cloudera.jdbc.common.BaseConnectionFactory.doConnect(Unknown Source)
at com.cloudera.jdbc.common.AbstractDriver.connect(Unknown Source)
at org.apache.spark.sql.execution.datasources.jdbc.DriverWrapper.connect(DriverWrapper.scala:45)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$2.apply(JdbcUtils.scala:61)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$2.apply(JdbcUtils.scala:52)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.<init>(JDBCRDD.scala:347)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:339)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.thrift.transport.TTransportException: GSS initiate failed
at org.apache.thrift.transport.TSaslTransport.sendAndThrowMessage(TSaslTransport.java:232)
at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:316)
at org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37)
... 29 more

Posts: 365
Topics: 11
Kudos: 57
Solutions: 30
Registered: ‎09-02-2016

Re: Spark & Impala Exception in Yarn Cluster

@Msdhan

 

Is your JDBC connection working from command prompt (using beeline for impala) ? Try this and make sure you have required kerberos ticket and there is no issue with connectivity

 

beeline -u "jdbc:impala://hostname:21050/DBname;principal=impala/hostname@REALM"

 

Have you tried your connectivity with Hive? are you getting similar issue

 

 

Posts: 630
Topics: 3
Kudos: 102
Solutions: 66
Registered: ‎08-16-2016

Re: Spark & Impala Exception in Yarn Cluster

First, you are launching in the job with yarn as the master but in your code you are setting it to local. The precedence is that it will use what is set in the configuration object over the command line.

Second, this shouldn't affect anything but I recommend changing the name from kerb5.conf to jaas.conf. That will make it clear as to its purpose and not to be confused with the krb5.conf.

The error is a generic failed to authenticate error. I believe it is because you are passing kerb5.conf as the argument for the JAAS auth code but the file you passed with -files is kerb5conf.conf.
Explorer
Posts: 15
Registered: ‎06-06-2017

Re: Spark

Hi Mbigelow , Thank you very much for looking into the issue.

1- Actually passing argument to driver as "N" while spark submit, to run in
cluster without setting master,
2- Thanks for this suggestion will change accordingly.
3- My apologies, its typo while adding in here as have changed few names -
actually passing right file name, and keytab commit showing as success.

Kind Regards
Msdhan
Posts: 630
Topics: 3
Kudos: 102
Solutions: 66
Registered: ‎08-16-2016

Re: Spark

"jdbc:impala://host:21050/test;AuthMech=1;KrbRealm=xxxx;KrbHostFQDN=host;KrbServiceName=testSn;SSL=1;SSLTrustStore=test.jks;SSLTrustStorePwd=yyyyyy""

Is the service name for the Impala deamon really 'testSn'?

Have you had a chance to run the test that @saranvisa recommended? There isn't much else to go off of. Do you have access to the Impala daemon logs?
Explorer
Posts: 15
Registered: ‎06-06-2017

Re: Spark & Impala Exception in Yarn Cluster

Hi  S

Explorer
Posts: 15
Registered: ‎06-06-2017

Re: Spark

Its not the real name, the connection string is just a sample.
Posts: 630
Topics: 3
Kudos: 102
Solutions: 66
Registered: ‎08-16-2016

Re: Spark

Try to provide as accurate as possible information. I know there is a need to mask some information, but the service name is not one that is typically masked.

Ok, now we know that JDBC to Impala with Kerberos works.

Are you able to login and get a ticket using that keytab file?

kinit -kt test.keytab testPrincipal
klist -ef
klist -kt test.keytab
Explorer
Posts: 15
Registered: ‎06-06-2017

Re: Spark

mbigelow , sorry that i have masked the data including the service name. Actually i can login with keytab and below commands gives back the results required.

Explorer
Posts: 23
Registered: ‎01-14-2015

Re: Spark

Could you share what your exact configs are and sample code?

Running into same issue.

Announcements