
How to connect CML to Hive using Python

Solved

New Contributor

Hello everyone,

we set up a Cloudera environment which includes a Data Hub of type "7.1.0 - Data Engineering: Apache Spark, Apache Hive, Apache Oozie". We managed to connect to Hive via a JDBC connection from our local machines, but so far we have not been able to connect from CML to Hive via JDBC.

I use JayDeBeApi as follows:

import jaydebeapi

conn_hive = jaydebeapi.connect(
    'org.apache.hive.jdbc.HiveDriver',
    'jdbc:hive2://dataengineering-master0.......:443/;ssl=1;transportMode=http;httpPath=dataengineering/cdp-proxy-api/hive;AuthMech=3;',
    {'UID': "user_name", 'PWD': "password"},
    jars='/home/cdsw/drivers/HiveJDBC41.jar',
)

 

 

 

The error message is:

TypeError: Class org.apache.hive.jdbc.HiveDriver is not found

 

 

 

I set the environment variable CLASSPATH to

'/home/cdsw/drivers/HiveJDBC41.jar'

which is where the jar actually rests. Hence I wanted to check whether JAVA_HOME is set correctly, and yes, that variable is set to

'/usr/lib/jvm/java-8-openjdk-amd64/'

However, when I run the command !java --version I get an error:

Unrecognized option: --version
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.

Is this normal and can I expect Java to still work as expected, or could this be the source of my problem?
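As an aside (an observation added here, not part of the original post): Java 8's launcher only understands the single-dash -version flag; the double-dash --version form first appeared in Java 9, so this error alone need not mean the JVM is broken. A quick sketch of the check from Python, assuming java may or may not be on the PATH:

```python
import shutil
import subprocess

# Java 8 rejects "--version" (that flag was added in Java 9) but accepts
# "-version", and prints its version banner to stderr rather than stdout.
if shutil.which("java"):
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    banner = (result.stderr or result.stdout).strip().splitlines()[0]
else:
    banner = "java not found on PATH"

print(banner)
```

If the banner reports a 1.8.x version, the JVM itself is fine and the "Unrecognized option" message is expected behavior on Java 8.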

 

 

Since connecting via JDBC did not work, I also tried connecting via a SparkSession, as shown in yesterday's "CDP Private Cloud Partner Edition". The presented code looks as follows:

 

 

 

from pyspark.sql import SparkSession

# Instantiate Spark-on-K8s cluster
spark = SparkSession.builder.appName("Simple Spark Test") \
    .config("spark.executor.memory", "8g") \
    .config("spark.executor.cores", "2") \
    .config("spark.driver.memory", "2g") \
    .config("spark.executor.instances", "2") \
    .getOrCreate()

# Validate Spark Connectivity
spark.sql("SHOW databases").show()
spark.sql("use default")
spark.sql("show tables").show()
spark.sql('create table testcml (abc integer)').show()
spark.sql("insert into table testcml select t.* from (select 1) t").show()
spark.sql("select * from testcml").show()
spark.sql("drop table testcml").show()

# Stop Spark Session
spark.stop()

 

 

 

Listing the databases and the tables of a DB, as well as creating the "testcml" table, works fine. But the insert into testcml fails due to:

 

 

 

Caused by: java.lang.IllegalStateException: Authentication with IDBroker failed.  Please ensure you have a Kerberos token by using kinit.
	at org.apache.knox.gateway.cloud.idbroker.s3a.IDBDelegationTokenBinding.getNewKnoxDelegationTokenSession(IDBDelegationTokenBinding.java:461)
	at org.apache.knox.gateway.cloud.idbroker.s3a.IDBDelegationTokenBinding.requestNewKnoxToken(IDBDelegationTokenBinding.java:406)
	at org.apache.knox.gateway.cloud.idbroker.s3a.IDBDelegationTokenBinding.getNewKnoxToken(IDBDelegationTokenBinding.java:484)
	at org.apache.knox.gateway.cloud.idbroker.s3a.IDBDelegationTokenBinding.maybeRenewAccessToken(IDBDelegationTokenBinding.java:476)
	at org.apache.knox.gateway.cloud.idbroker.s3a.IDBDelegationTokenBinding.deployUnbonded(IDBDelegationTokenBinding.java:335)
	at org.apache.hadoop.fs.s3a.auth.delegation.S3ADelegationTokens.deployUnbonded(S3ADelegationTokens.java:245)
	at org.apache.hadoop.fs.s3a.auth.delegation.S3ADelegationTokens.bindToAnyDelegationToken(S3ADelegationTokens.java:278)
	at org.apache.hadoop.fs.s3a.auth.delegation.S3ADelegationTokens.serviceStart(S3ADelegationTokens.java:199)
	at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.bindAWSClient(S3AFileSystem.java:608)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:388)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3396)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:158)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3456)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3424)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:518)
	at org.apache.hadoop.fs.s3a.commit.AbstractS3ACommitterFactory.getDestinationFileSystem(AbstractS3ACommitterFactory.java:73)
	at org.apache.hadoop.fs.s3a.commit.AbstractS3ACommitterFactory.createOutputCommitter(AbstractS3ACommitterFactory.java:45)
	at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.getOutputCommitter(FileOutputFormat.java:338)
	at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.setupCommitter(HadoopMapReduceCommitProtocol.scala:100)
	at org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol.setupCommitter(SQLHadoopMapReduceCommitProtocol.scala:40)
	at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.setupTask(HadoopMapReduceCommitProtocol.scala:217)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:229)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:123)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1289)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

 

 

 

My problem with this is that I don't know how to pass this token or where to get it. I checked whether any DENY rules were active in Ranger, but I did not see any.

 

I appreciate your help and thank you in advance.

Regards,
Dominic

 

1 ACCEPTED SOLUTION


Re: How to connect CML to Hive using Python

New Contributor

Hello @pvidal !

 

So, as usual, the error was sitting in front of the screen! I hadn't actually checked the driver class name inside the JAR, which is actually "com.cloudera.hive.jdbc41.HS2Driver". After changing it, everything works fine.

 

Sorry for the confusion, and thanks for your support.

View solution in original post
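For reference, the fix boils down to passing the driver class that the Cloudera JDBC jar actually ships. A minimal sketch of the corrected call (the host, credentials, and the open_hive_connection helper below are placeholders, not taken from the thread):

```python
# Corrected driver class from the accepted solution; everything else
# mirrors the connection string used earlier in the thread.
driver_class = "com.cloudera.hive.jdbc41.HS2Driver"
jdbc_url = (
    "jdbc:hive2://your_host:443/;ssl=1;transportMode=http;"
    "httpPath=dataengineering/cdp-proxy-api/hive;AuthMech=3;"
)
jar_path = "/home/cdsw/drivers/hive/HiveJDBC41.jar"

def open_hive_connection(user, password):
    # Deferred import so the sketch can be read without the driver installed
    import jaydebeapi
    return jaydebeapi.connect(
        driver_class, jdbc_url, {"UID": user, "PWD": password}, jars=jar_path
    )
```

With real host and credentials filled in, open_hive_connection(...).cursor() should then behave like the JDBC connection that already worked from the local machines.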


Re: How to connect CML to Hive using Python

Cloudera Employee

Re: How to connect CML to Hive using Python

New Contributor

Hi @pvidal,

thanks for the fast reply.

 

Yes, indeed I saw that particular post; my implementation looks very similar, just with Hive instead of Impala:

 

!pip3 install JayDeBeApi
import jaydebeapi

conn_hive = jaydebeapi.connect(
    'org.apache.hive.jdbc.HiveDriver',
    'jdbc:hive2://our_host:443/;ssl=1;transportMode=http;httpPath=dataengineering/cdp-proxy-api/hive;AuthMech=3;',
    {'UID': "our_user", 'PWD': "our_password"},
    jars='/home/cdsw/drivers/hive/HiveJDBC41.jar',
)

curs_hive = conn_hive.cursor()

 

The env variable CLASSPATH is set to the jar with which the connection via Java or DBeaver works:

'CLASSPATH': '/home/cdsw/drivers/HiveJDBC41.jar'

Still I get the error. Any further ideas?

 


Re: How to connect CML to Hive using Python

Cloudera Employee

Did you actually run the export in a terminal session, as follows?

CLASSPATH=.:/home/cdsw/drivers/HiveJDBC41.jar

export CLASSPATH

 

 


Re: How to connect CML to Hive using Python

New Contributor

Yes I did, but I had to add a "!" for the command to be accepted:

!CLASSPATH=/home/cdsw/drivers/hive/HiveJDBC41.jar

!export CLASSPATH

conn_hive = jaydebeapi.connect(
    'org.apache.hive.jdbc.HiveDriver',
    'jdbc:hive2://host:443/;ssl=1;transportMode=http;httpPath=dataengineering/cdp-proxy-api/hive;AuthMech=3;',
    {'UID': "our_user", 'PWD': "our_pw"},
    jars='/home/cdsw/drivers/hive/HiveJDBC41.jar',
)

TypeError: Class org.apache.hive.jdbc.HiveDriver is not found

TypeError                                 Traceback (most recent call last)
in engine
----> 1 conn_hive = jaydebeapi.connect('org.apache.hive.jdbc.HiveDriver', 'jdbc:hive2://our_host:443/;ssl=1;transportMode=http;httpPath=dataengineering/cdp-proxy-api/hive;AuthMech=3;', 	{'UID': "our_user", 'PWD': "our_pw"}, jars='/home/cdsw/drivers/hive/HiveJDBC41.jar',)

/home/cdsw/.local/lib/python3.6/site-packages/jaydebeapi/__init__.py in connect(jclassname, url, driver_args, jars, libs)
    410     else:
    411         libs = []
--> 412     jconn = _jdbc_connect(jclassname, url, driver_args, jars, libs)
    413     return Connection(jconn, _converters)
    414 

/home/cdsw/.local/lib/python3.6/site-packages/jaydebeapi/__init__.py in _jdbc_connect_jpype(jclassname, url, driver_args, jars, libs)
    219             return jpype.JArray(jpype.JByte, 1)(data)
    220     # register driver for DriverManager
--> 221     jpype.JClass(jclassname)
    222     if isinstance(driver_args, dict):
    223         Properties = jpype.java.util.Properties

/home/cdsw/.local/lib/python3.6/site-packages/jpype/_jclass.py in __new__(cls, jc, loader, initialize)
     97 
     98         # Pass to class factory to create the type
---> 99         return _jpype._getClass(jc)
    100 
    101 

TypeError: Class org.apache.hive.jdbc.HiveDriver is not found

/home/cdsw/drivers/hive

ll

total 11864
-rwx------ 1 cdsw 12146136 Sep 25 07:07 HiveJDBC41.jar*

 

Still the same error.
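As an aside (an observation added here, not from the thread itself): each "!" line in a CML/CDSW session runs in its own short-lived shell, so !export CLASSPATH cannot change the environment of the Python process that later starts the JVM. Setting the variable from Python itself, before jaydebeapi/jpype spin up the JVM, avoids that pitfall (jar path taken from the posts above):

```python
import os

# Must be set before the JVM is started by jpype/jaydebeapi;
# once the JVM is running, changes to CLASSPATH are ignored.
os.environ["CLASSPATH"] = "/home/cdsw/drivers/hive/HiveJDBC41.jar"
print(os.environ["CLASSPATH"])
```

Running this at the top of the script makes the jar visible to the JVM regardless of what the terminal session exported.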


Re: How to connect CML to Hive using Python

Cloudera Employee

Do me a favor and try this:

 

- open a terminal session (do not use !)

- run the following commands:

chmod a+r /home/cdsw/drivers/hive/HiveJDBC41.jar
CLASSPATH=.:/home/cdsw/drivers/hive/HiveJDBC41.jar
export CLASSPATH

- close the session and try to run your python code 


Re: How to connect CML to Hive using Python

New Contributor

So I started a new session within CML, from which I started a terminal session via ">_ Terminal Access" and ran the commands you posted.

I verified that CLASSPATH was set by running

echo "$CLASSPATH"

which resulted in the expected output, i.e.

.:/home/cdsw/drivers/hive/HiveJDBC41.jar

I then closed the terminal session and ran the code within my CML session. However, the error stayed the same.

 

 


Re: How to connect CML to Hive using Python

New Contributor

Hello @pvidal !

 

So, as usual, the error was sitting in front of the screen! I hadn't actually checked the driver class name inside the JAR, which is actually "com.cloudera.hive.jdbc41.HS2Driver". After changing it, everything works fine.

 

Sorry for the confusion, and thanks for your support.



Re: How to connect CML to Hive using Python

Cloudera Employee

Ha! Good catch!
