
How to connect CML to Hive using Python

Solved

New Contributor

Hello everyone,

we set up a Cloudera environment which includes a Data Hub of type "7.1.0 - Data Engineering: Apache Spark, Apache Hive, Apache Oozie". We managed to connect to Hive via a JDBC connection from our local machines, but so far we have not been able to connect from CML to Hive via JDBC.

I use JayDeBeApi as follows:

import jaydebeapi

conn_hive = jaydebeapi.connect(
    'org.apache.hive.jdbc.HiveDriver',
    'jdbc:hive2://dataengineering-master0.......:443/;ssl=1;transportMode=http;httpPath=dataengineering/cdp-proxy-api/hive;AuthMech=3;',
    {'UID': "user_name", 'PWD': "password"},
    jars='/home/cdsw/drivers/HiveJDBC41.jar',
)

 

 

 

The error message is:

TypeError: Class org.apache.hive.jdbc.HiveDriver is not found

 

 

 

I set the environment variable CLASSPATH to

'/home/cdsw/drivers/HiveJDBC41.jar'

which is where the jar actually rests. Hence I wanted to check whether JAVA_HOME is set correctly, and yes, that variable is set to

'/usr/lib/jvm/java-8-openjdk-amd64/'

However, when I run the command !java --version I get an error:

Unrecognized option: --version
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.

Is this normal and can I expect Java to still work as expected, or could this be the source of my problem?
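As an aside (an observation added here, not part of the original post): Java 8's launcher only understands the single-dash -version flag; the double-dash --version form first appeared in Java 9, so this error alone need not mean the JVM is broken. A quick sketch of the check from Python, assuming java may or may not be on the PATH:

```python
import shutil
import subprocess

# Java 8 rejects "--version" (that flag was added in Java 9) but accepts
# "-version", and prints its version banner to stderr rather than stdout.
if shutil.which("java"):
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    banner = (result.stderr or result.stdout).strip().splitlines()[0]
else:
    banner = "java not found on PATH"

print(banner)
```

If the banner reports a 1.8.x version, the JVM itself is fine and the "Unrecognized option" message is expected behavior on Java 8.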

 

 

Since connecting via JDBC did not work, I also tried connecting via a SparkSession, as shown in yesterday's "CDP Private Cloud Partner Edition". The presented code looks as follows:

 

 

 

from pyspark.sql import SparkSession

# Instantiate Spark-on-K8s cluster
spark = SparkSession.builder.appName("Simple Spark Test") \
    .config("spark.executor.memory", "8g") \
    .config("spark.executor.cores", "2") \
    .config("spark.driver.memory", "2g") \
    .config("spark.executor.instances", "2") \
    .getOrCreate()

# Validate Spark Connectivity
spark.sql("SHOW databases").show()
spark.sql("use default")
spark.sql("show tables").show()
spark.sql('create table testcml (abc integer)').show()
spark.sql("insert into table testcml select t.* from (select 1) t").show()
spark.sql("select * from testcml").show()
spark.sql("drop table testcml").show()

# Stop Spark Session
spark.stop()

 

 

 

Listing the databases and the tables of a DB, as well as creating the "testcml" table, works fine. But the insert into testcml fails due to:

 

 

 

Caused by: java.lang.IllegalStateException: Authentication with IDBroker failed.  Please ensure you have a Kerberos token by using kinit.
	at org.apache.knox.gateway.cloud.idbroker.s3a.IDBDelegationTokenBinding.getNewKnoxDelegationTokenSession(IDBDelegationTokenBinding.java:461)
	at org.apache.knox.gateway.cloud.idbroker.s3a.IDBDelegationTokenBinding.requestNewKnoxToken(IDBDelegationTokenBinding.java:406)
	at org.apache.knox.gateway.cloud.idbroker.s3a.IDBDelegationTokenBinding.getNewKnoxToken(IDBDelegationTokenBinding.java:484)
	at org.apache.knox.gateway.cloud.idbroker.s3a.IDBDelegationTokenBinding.maybeRenewAccessToken(IDBDelegationTokenBinding.java:476)
	at org.apache.knox.gateway.cloud.idbroker.s3a.IDBDelegationTokenBinding.deployUnbonded(IDBDelegationTokenBinding.java:335)
	at org.apache.hadoop.fs.s3a.auth.delegation.S3ADelegationTokens.deployUnbonded(S3ADelegationTokens.java:245)
	at org.apache.hadoop.fs.s3a.auth.delegation.S3ADelegationTokens.bindToAnyDelegationToken(S3ADelegationTokens.java:278)
	at org.apache.hadoop.fs.s3a.auth.delegation.S3ADelegationTokens.serviceStart(S3ADelegationTokens.java:199)
	at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.bindAWSClient(S3AFileSystem.java:608)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:388)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3396)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:158)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3456)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3424)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:518)
	at org.apache.hadoop.fs.s3a.commit.AbstractS3ACommitterFactory.getDestinationFileSystem(AbstractS3ACommitterFactory.java:73)
	at org.apache.hadoop.fs.s3a.commit.AbstractS3ACommitterFactory.createOutputCommitter(AbstractS3ACommitterFactory.java:45)
	at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.getOutputCommitter(FileOutputFormat.java:338)
	at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.setupCommitter(HadoopMapReduceCommitProtocol.scala:100)
	at org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol.setupCommitter(SQLHadoopMapReduceCommitProtocol.scala:40)
	at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.setupTask(HadoopMapReduceCommitProtocol.scala:217)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:229)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:123)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1289)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

 

 

 

My problem with this is that I don't know how to pass this token or where to get it. I checked whether any DENY rules were active in Ranger, but I did not see any.

 

I appreciate your help and thank you in advance.

Regards,
Dominic

 

1 ACCEPTED SOLUTION


Re: How to connect CML to Hive using Python

New Contributor

Hello @pvidal !

 

So, as usual, the error was sitting in front of the screen! I hadn't actually checked the driver class name inside the JAR, which is actually "com.cloudera.hive.jdbc41.HS2Driver". After changing it, everything works fine.

 

Sorry for the confusion, and thanks for your support.

View solution in original post
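For reference, the fix boils down to passing the driver class that the Cloudera JDBC jar actually ships. A minimal sketch of the corrected call (the host, credentials, and the open_hive_connection helper below are placeholders, not taken from the thread):

```python
# Corrected driver class from the accepted solution; everything else
# mirrors the connection string used earlier in the thread.
driver_class = "com.cloudera.hive.jdbc41.HS2Driver"
jdbc_url = (
    "jdbc:hive2://your_host:443/;ssl=1;transportMode=http;"
    "httpPath=dataengineering/cdp-proxy-api/hive;AuthMech=3;"
)
jar_path = "/home/cdsw/drivers/hive/HiveJDBC41.jar"

def open_hive_connection(user, password):
    # Deferred import so the sketch can be read without the driver installed
    import jaydebeapi
    return jaydebeapi.connect(
        driver_class, jdbc_url, {"UID": user, "PWD": password}, jars=jar_path
    )
```

With real host and credentials filled in, open_hive_connection(...).cursor() should then behave like the JDBC connection that already worked from the local machines.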


Re: How to connect CML to Hive using Python

Cloudera Employee

Re: How to connect CML to Hive using Python

New Contributor

Hi @pvidal,

thanks for the fast reply.

 

Yes, indeed I saw that particular post; my implementation looks very similar, just with Hive instead of Impala:

 

!pip3 install JayDeBeApi
import jaydebeapi

conn_hive = jaydebeapi.connect(
    'org.apache.hive.jdbc.HiveDriver',
    'jdbc:hive2://our_host:443/;ssl=1;transportMode=http;httpPath=dataengineering/cdp-proxy-api/hive;AuthMech=3;',
    {'UID': "our_user", 'PWD': "our_password"},
    jars='/home/cdsw/drivers/hive/HiveJDBC41.jar',
)

curs_hive = conn_hive.cursor()

 

The env variable CLASSPATH is set to the jar with which the connection via Java or DBeaver works:

'CLASSPATH': '/home/cdsw/drivers/HiveJDBC41.jar'

Still I get the error. Any further ideas?

 


Re: How to connect CML to Hive using Python

Cloudera Employee

Did you actually run the export in a terminal session, as follows?

CLASSPATH=.:/home/cdsw/drivers/HiveJDBC41.jar

export CLASSPATH

 

 


Re: How to connect CML to Hive using Python

New Contributor

Yes I did, but I had to add a "!" for the command to be accepted:

!CLASSPATH=/home/cdsw/drivers/hive/HiveJDBC41.jar

!export CLASSPATH

conn_hive = jaydebeapi.connect(
    'org.apache.hive.jdbc.HiveDriver',
    'jdbc:hive2://host:443/;ssl=1;transportMode=http;httpPath=dataengineering/cdp-proxy-api/hive;AuthMech=3;',
    {'UID': "our_user", 'PWD': "our_pw"},
    jars='/home/cdsw/drivers/hive/HiveJDBC41.jar',
)

TypeError: Class org.apache.hive.jdbc.HiveDriver is not found

TypeError                                 Traceback (most recent call last)
in engine
----> 1 conn_hive = jaydebeapi.connect('org.apache.hive.jdbc.HiveDriver', 'jdbc:hive2://our_host:443/;ssl=1;transportMode=http;httpPath=dataengineering/cdp-proxy-api/hive;AuthMech=3;', 	{'UID': "our_user", 'PWD': "our_pw"}, jars='/home/cdsw/drivers/hive/HiveJDBC41.jar',)

/home/cdsw/.local/lib/python3.6/site-packages/jaydebeapi/__init__.py in connect(jclassname, url, driver_args, jars, libs)
    410     else:
    411         libs = []
--> 412     jconn = _jdbc_connect(jclassname, url, driver_args, jars, libs)
    413     return Connection(jconn, _converters)
    414 

/home/cdsw/.local/lib/python3.6/site-packages/jaydebeapi/__init__.py in _jdbc_connect_jpype(jclassname, url, driver_args, jars, libs)
    219             return jpype.JArray(jpype.JByte, 1)(data)
    220     # register driver for DriverManager
--> 221     jpype.JClass(jclassname)
    222     if isinstance(driver_args, dict):
    223         Properties = jpype.java.util.Properties

/home/cdsw/.local/lib/python3.6/site-packages/jpype/_jclass.py in __new__(cls, jc, loader, initialize)
     97 
     98         # Pass to class factory to create the type
---> 99         return _jpype._getClass(jc)
    100 
    101 

TypeError: Class org.apache.hive.jdbc.HiveDriver is not found

/home/cdsw/drivers/hive

ll

total 11864
-rwx------ 1 cdsw 12146136 Sep 25 07:07 HiveJDBC41.jar*

 

Still the same error.
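As an aside (an observation added here, not from the thread itself): each "!" line in a CML/CDSW session runs in its own short-lived shell, so !export CLASSPATH cannot change the environment of the Python process that later starts the JVM. Setting the variable from Python itself, before jaydebeapi/jpype spin up the JVM, avoids that pitfall (jar path taken from the posts above):

```python
import os

# Must be set before the JVM is started by jpype/jaydebeapi;
# once the JVM is running, changes to CLASSPATH are ignored.
os.environ["CLASSPATH"] = "/home/cdsw/drivers/hive/HiveJDBC41.jar"
print(os.environ["CLASSPATH"])
```

Running this at the top of the script makes the jar visible to the JVM regardless of what the terminal session exported.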


Re: How to connect CML to Hive using Python

Cloudera Employee

Do me a favor and try this:

 

- open a terminal session (do not use !)

- run the following commands:

chmod a+r /home/cdsw/drivers/hive/HiveJDBC41.jar
CLASSPATH=.:/home/cdsw/drivers/hive/HiveJDBC41.jar
export CLASSPATH

- close the session and try to run your python code 


Re: How to connect CML to Hive using Python

New Contributor

So I started a new session within CML, from which I started a terminal session via ">_ Terminal Access" and ran the commands you posted.

I verified that CLASSPATH was set by running

echo "$CLASSPATH"

which resulted in the expected output, i.e.

.:/home/cdsw/drivers/hive/HiveJDBC41.jar

I then closed the terminal session and ran the code within my CML session. However, the error stayed the same.

 

 


Re: How to connect CML to Hive using Python

New Contributor

Hello @pvidal !

 

So, as usual, the error was sitting in front of the screen! I hadn't actually checked the driver class name inside the JAR, which is actually "com.cloudera.hive.jdbc41.HS2Driver". After changing it, everything works fine.

 

Sorry for the confusion, and thanks for your support.



Re: How to connect CML to Hive using Python

Cloudera Employee

Ha! Good catch!
