Contributor
Posts: 57
Registered: ‎07-05-2018

Loading impala table into spark throws error

Hello Team,

 

We have a Kerberos-enabled CDH 5.15 cluster.

 

We are trying to load an Impala table into Spark and performed the steps below, but when displaying the results it throws a "GSS initiate failed" error.

 

Kindly suggest a fix.

 

-bash-4.2$ spark2-shell --master yarn --deploy-mode client --driver-class-path ImpalaJDBC41.jar --jars ImpalaJDBC41.jar

 

scala> val jdbcURL = s"jdbc:impala://host1:21050/external;AuthMech=1;KrbRealm=XYZ;KrbHostFQDN=host1;KrbServiceName=impala"

 

scala> val connectionProperties = new java.util.Properties()
connectionProperties: java.util.Properties = {}

scala> val hbaseDF = spark.sqlContext.read.jdbc(jdbcURL, "external.Names_text", connectionProperties)
hbaseDF: org.apache.spark.sql.DataFrame = [employeeid: int, firstname: string ... 3 more fields]

 

scala> hbaseDF.show
[Stage 0:> (0 + 1) / 1]19/03/08 07:11:46 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, a301-8530-3309.ldn.swissbank.com, executor 1): java.sql.SQLException: [Cloudera][ImpalaJDBCDriver](500164) Error initialized or created transport for authentication: [Cloudera][ImpalaJDBCDriver](500169) Unable to connect to server: GSS initiate failed.
at com.cloudera.impala.hivecommon.api.HiveServer2ClientFactory.createTransport(Unknown Source)
at com.cloudera.impala.hivecommon.api.HiveServer2ClientFactory.createClient(Unknown Source)
at com.cloudera.impala.hivecommon.core.HiveJDBCCommonConnection.establishConnection(Unknown Source)
at com.cloudera.impala.impala.core.ImpalaJDBCConnection.establishConnection(Unknown Source)
at com.cloudera.impala.jdbc.core.LoginTimeoutConnection.connect(Unknown Source)
at com.cloudera.impala.jdbc.common.BaseConnectionFactory.doConnect(Unknown Source)
at com.cloudera.impala.jdbc.common.AbstractDriver.connect(Unknown Source)
at org.apache.spark.sql.execution.datasources.jdbc.DriverWrapper.connect(DriverWrapper.scala:45)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:63)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:54)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:271)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
Caused by: com.cloudera.impala.support.exceptions.GeneralException: [Cloudera][ImpalaJDBCDriver](500164) Error initialized or created transport for authentication: [Cloudera][ImpalaJDBCDriver](500169) Unable to connect to server: GSS initiate failed.
... 27 more

 

Kindly help or suggest what I did wrong.

Cloudera Employee
Posts: 102
Registered: ‎03-23-2016

Re: Loading impala table into spark throws error

Hello Vijay,

 

Please see [1]. This use case isn't supported.

 

However, the error you shared suggests that the executor isn't able to connect to the Impala daemon due to authentication issues. The executor runs in a separate JVM from the driver, so it has to acquire its own Kerberos TGT as well.

 

To do this, you could make use of a JAAS configuration; see [2] and search for "To set up the JAAS login configuration file" (page 15). Once you have a tested JAAS login configuration and a keytab file, you could pass them to the executors as follows.

 

 

--conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=jaas.conf -Djavax.security.auth.useSubjectCredsOnly=false" \
--conf "spark.yarn.dist.files=<path_to_keytab>.keytab,<path_to_keytab>/jaas.conf"
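For illustration, the jaas.conf named in the options above could be generated along these lines. This is only a sketch: the keytab file name `user.keytab` and the principal `user@XYZ` are placeholders (only the realm XYZ comes from the original post), and the `Client` entry name follows the layout I recall from the driver guide in [2], so please check it against your driver version. The relative keytab path assumes YARN places the distributed files in the executor's working directory.

```shell
#!/bin/sh
# Hypothetical values: replace with your own keytab file name and principal.
KEYTAB_NAME="user.keytab"
PRINCIPAL="user@XYZ"

# Write a JAAS login configuration that logs in from the keytab
# without prompting for a password.
cat > jaas.conf <<EOF
Client {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  keyTab="./${KEYTAB_NAME}"
  principal="${PRINCIPAL}"
  doNotPrompt=true;
};
EOF
```

Both jaas.conf and the keytab would then be listed in spark.yarn.dist.files so that each executor receives a copy.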

 

Alternatively, if your Impala service is configured for LDAP authentication, you could test with that instead, since it does not require a Kerberos ticket on the executors.
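As a rough sketch of the LDAP variant, the connection URL would use AuthMech=3 (user name and password) in place of the Kerberos settings. The host, port and schema below are taken from the original post; the user name and password are placeholders, and in practice you would avoid keeping a real password in shell history or scripts.

```shell
#!/bin/sh
# Placeholder credentials: substitute your own LDAP user and password.
LDAP_USER="ldap_user"
LDAP_PASSWORD="changeit"

# AuthMech=3 selects user name/password authentication in the Impala JDBC driver.
JDBC_URL="jdbc:impala://host1:21050/external;AuthMech=3;UID=${LDAP_USER};PWD=${LDAP_PASSWORD}"
printf '%s\n' "$JDBC_URL"
```

The resulting string would take the place of the jdbcURL value in the Scala session.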

 

Hope this helps!

 

Thanks,

Sudarshan

 

[1] https://www.cloudera.com/documentation/spark2/latest/topics/spark2_known_issues.html#ki_jdbc_datasou...

[2] https://www.cloudera.com/documentation/other/connectors/impala-jdbc/latest/Cloudera-JDBC-Driver-for-...