
Error in reading database through Hive using PySpark

New Contributor

I am using Spark 2.3.2 and I am trying to read tables from a Hive database. I have established the Spark connection.

But I am unable to read the database tables from Hue (Cloudera), and I am unable to query them in PySpark either.

 

Here is my code:


import findspark
findspark.init(r'C:\spark-2.3.2-bin-hadoop2.7\spark-2.3.2-bin-hadoop2.7')
from pyspark.sql import SparkSession

# Hive-enabled session pointed at the metastore
spark = SparkSession.builder \
    .config("hive.metastore.uris", "thrift://10.1.1.70:8888") \
    .enableHiveSupport() \
    .getOrCreate()
#spark.catalog.listTables("tp_policy_operation")
sc = spark.sparkContext

from pyspark import SparkContext
from pyspark.sql import SQLContext
sql_sc = SQLContext(sc)
# note: setting this after the session is already created has no effect
SparkContext.setSystemProperty("hive.metastore.uris", "thrift://10.1.1.70:8888")
spark.sql("SELECT * FROM tp_policy_operation")

 

The error I am getting:

Traceback (most recent call last):
  File "<ipython-input-4-8f0aa5852b01>", line 16, in <module>
    spark.sql("SELECT * FROM tp_policy_operation")  ## Database ?
  File "C:\spark-2.3.2-bin-hadoop2.7\spark-2.3.2-bin-hadoop2.7\python\pyspark\sql\session.py", line 710, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
  File "C:\spark-2.3.2-bin-hadoop2.7\spark-2.3.2-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "C:\spark-2.3.2-bin-hadoop2.7\spark-2.3.2-bin-hadoop2.7\python\pyspark\sql\utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
AnalysisException: 'org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.thrift.transport.TTransportException;'

 

Kindly help me resolve the issue or guide me on the changes needed in the code above.

1 ACCEPTED SOLUTION

Contributor

Hi @Logica,

Please check whether a database is selected before running the query.

Below is sample code for reading a Hive table:

from pyspark.context import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext('local', 'example')
hc = HiveContext(sc)

# (side example: reading a raw CSV file from HDFS as an RDD)
tf1 = sc.textFile("/user/BigData/nooo/SparkTest/train.csv")

# Reading a Hive table from PySpark:
hc.sql("use default")  # select the database here
spf = hc.sql("SELECT * FROM tempaz LIMIT 100")
spf.show(5)  # show() prints the rows itself, so no print() wrapper is needed
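
A side note: HiveContext is deprecated on Spark 2.x in favour of a Hive-enabled SparkSession, so the same read can also be written as below (a minimal sketch; "default" and "tempaz" are the same database and table placeholders as above):

from pyspark.sql import SparkSession

# Hive-enabled session; picks up hive-site.xml from Spark's conf directory if present
spark = SparkSession.builder \
    .appName("example") \
    .enableHiveSupport() \
    .getOrCreate()

spark.sql("USE default")  # select the database first
spark.sql("SELECT * FROM tempaz LIMIT 100").show(5)

# or qualify the table with its database instead of USE:
# spark.sql("SELECT * FROM default.tempaz LIMIT 100").show(5)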

 

Thanks

HadoopHelp

 


3 REPLIES

New Contributor

I changed the port number from 8888 to 9083 and the connection now works fine, but when I tried to show the query result it threw the following:

 

df.show()
Traceback (most recent call last):
  File "<ipython-input-12-1a6ce2362cd4>", line 1, in <module>
    df.show()
  File "C:\spark-2.3.2-bin-hadoop2.7\spark-2.3.2-bin-hadoop2.7\python\pyspark\sql\dataframe.py", line 350, in show
    print(self._jdf.showString(n, 20, vertical))
  File "C:\spark-2.3.2-bin-hadoop2.7\spark-2.3.2-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "C:\spark-2.3.2-bin-hadoop2.7\spark-2.3.2-bin-hadoop2.7\python\pyspark\sql\utils.py", line 79, in deco
    raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
IllegalArgumentException: 'java.net.UnknownHostException: quickstart.cloudera'

 

Can you help me with this, @HadoopHelp?
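
For reference, here is my updated connection code after the port change (assembled from the snippets above):

import findspark
findspark.init(r'C:\spark-2.3.2-bin-hadoop2.7\spark-2.3.2-bin-hadoop2.7')
from pyspark.sql import SparkSession

# 9083 is the Hive metastore's default thrift port (was 8888 before)
spark = SparkSession.builder \
    .config("hive.metastore.uris", "thrift://10.1.1.70:9083") \
    .enableHiveSupport() \
    .getOrCreate()

df = spark.sql("SELECT * FROM tp_policy_operation")
df.show()  # this is the call that fails with the UnknownHostException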

Contributor

Hi @Logica,

I think you need to place the hive-site.xml file into Spark's conf directory.

Please follow the steps below for running Hive queries or accessing Hive tables through PySpark:

https://acadgild.com/blog/how-to-access-hive-tables-to-spark-sql
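
Regarding the UnknownHostException itself: it usually means your Windows client cannot resolve the hostname quickstart.cloudera that the cluster's table metadata points to. A common workaround, sketched below, is a hosts-file entry on the client; I am assuming here that 10.1.1.70 (the metastore address from your config) is the right IP, so adjust it if your cluster differs:

# add to C:\Windows\System32\drivers\etc\hosts on the Windows client
# (or /etc/hosts on Linux)
10.1.1.70    quickstart.cloudera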

 

 

Thanks

HadoopHelp