Support Questions

GrazittiAPI · ‎03-18-2020

I am using Spark 2.3.2 and I am trying to read tables from a database. I have established a Spark connection.

But I am unable to read the database tables from Hue (Cloudera), and I am unable to query them in PySpark as well.

Here is my code:

import findspark
findspark.init('C:\spark-2.3.2-bin-hadoop2.7\spark-2.3.2-bin-hadoop2.7')
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.config("hive.metastore.uris", "thrift://10.1.1.70:8888").enableHiveSupport().getOrCreate()
#spark.catalog.listTables("tp_policy_operation")
import pandas as pd
sc = spark.sparkContext
sc

from pyspark import SparkContext
from pyspark.sql import SQLContext
sql_sc = SQLContext(sc)
SparkContext.setSystemProperty("hive.metastore.uris", "thrift://10.1.1.70:8888")
spark.sql("SELECT * FROM tp_policy_operation")

The error I am getting:

Traceback (most recent call last):

File "<ipython-input-4-8f0aa5852b01>", line 16, in <module>
spark.sql("SELECT * FROM tp_policy_operation")## Database ?

File "C:\spark-2.3.2-bin-hadoop2.7\spark-2.3.2-bin-hadoop2.7\python\pyspark\sql\session.py", line 710, in sql
return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)

File "C:\spark-2.3.2-bin-hadoop2.7\spark-2.3.2-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)

File "C:\spark-2.3.2-bin-hadoop2.7\spark-2.3.2-bin-hadoop2.7\python\pyspark\sql\utils.py", line 69, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)

AnalysisException: 'org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.thrift.transport.TTransportException;'

Kindly help me resolve the issue, or guide me on the changes needed in the code above.

HadoopHelp · ‎03-18-2020

Hi @Logica,

Please check whether a database is selected before running the query.

Below is code for reading a Hive table:

from pyspark.context import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext('local', 'example')
hc = HiveContext(sc)  # deprecated since Spark 2.0, but still available in 2.3.2

# Not needed for the Hive query: loads a CSV file as an RDD of text lines
tf1 = sc.textFile("/user/BigData/nooo/SparkTest/train.csv")
# print(tf1.take(10))  # an RDD has no show(); use take() to inspect it

# Read a Hive table from PySpark
hc.sql("use default")  # select the database here
spf = hc.sql("SELECT * FROM tempaz LIMIT 100")
spf.show(5)  # show() prints the rows itself and returns None
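
The same idea can be written with the SparkSession API used in the question. This is only a sketch under a few assumptions: the helper names `metastore_uri` and `read_hive_table` are made up for illustration, and the port defaults to 9083, which is the usual default for the Hive metastore Thrift service (8888 is normally Hue's web port, not the metastore).

```python
def metastore_uri(host, port=9083):
    """Build a Thrift URI for the Hive metastore (9083 is its usual default port)."""
    return "thrift://{}:{}".format(host, port)


def read_hive_table(host, database, table, limit=100):
    """Open a Hive-enabled SparkSession and return rows from database.table."""
    # Imported lazily so metastore_uri() can be used without PySpark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .config("hive.metastore.uris", metastore_uri(host))
        .enableHiveSupport()
        .getOrCreate()
    )
    spark.sql("USE {}".format(database))  # select the database first
    return spark.sql("SELECT * FROM {} LIMIT {}".format(table, limit))


# Example call (host, database and table names are taken from this thread):
# read_hive_table("10.1.1.70", "default", "tempaz").show(5)
```

Selecting the database explicitly avoids relying on whichever database happens to be current when the query runs.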

Thanks

HadoopHelp

Logica · ‎03-18-2020

I changed the port number from 8888 to 9083 and it is working fine, but when I tried to show the query result I got this:

df.show()
Traceback (most recent call last):

File "<ipython-input-12-1a6ce2362cd4>", line 1, in <module>
df.show()

File "C:\spark-2.3.2-bin-hadoop2.7\spark-2.3.2-bin-hadoop2.7\python\pyspark\sql\dataframe.py", line 350, in show
print(self._jdf.showString(n, 20, vertical))

File "C:\spark-2.3.2-bin-hadoop2.7\spark-2.3.2-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)

File "C:\spark-2.3.2-bin-hadoop2.7\spark-2.3.2-bin-hadoop2.7\python\pyspark\sql\utils.py", line 79, in deco
raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)

IllegalArgumentException: 'java.net.UnknownHostException: quickstart.cloudera'

Can you help me with this, @HadoopHelp?
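
An UnknownHostException like the one above usually means the client machine cannot resolve the name quickstart.cloudera, which the metastore returns as part of the table's storage location. A minimal sketch for checking this from the same Python session follows; the helper name `resolves` is hypothetical.

```python
import socket


def resolves(hostname):
    """Return the IP address for hostname, or None if it cannot be resolved."""
    try:
        return socket.gethostbyname(hostname)
    except OSError:  # socket.gaierror is a subclass of OSError
        return None


# Example: check the name from the error message.
# print(resolves("quickstart.cloudera"))
```

If this returns None, one common fix is to map the name to the cluster's IP address in the client's hosts file (on Windows, C:\Windows\System32\drivers\etc\hosts). Which address to use is an assumption to verify; it may be the 10.1.1.70 address already used for the metastore.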

HadoopHelp · ‎03-18-2020

Hi @Logica,

I think you need to place the hive-site.xml file in Spark's conf directory.

Please follow the steps below for running a Hive query or accessing a Hive table through PySpark:

https://acadgild.com/blog/how-to-access-hive-tables-to-spark-sql
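
Spark picks up Hive settings from a hive-site.xml file in its conf directory. As a rough sketch of that step, the hypothetical helper below copies a hive-site.xml you have already fetched from the cluster into the conf folder of a Spark installation. Both paths are assumptions to adapt to your own setup.

```python
import os
import shutil


def install_hive_site(hive_site_path, spark_home):
    """Copy hive-site.xml into <spark_home>/conf and return the new path."""
    conf_dir = os.path.join(spark_home, "conf")
    os.makedirs(conf_dir, exist_ok=True)  # conf/ normally exists already
    target = os.path.join(conf_dir, "hive-site.xml")
    shutil.copyfile(hive_site_path, target)
    return target


# Example call (the source path is an assumption; the Spark path is from this thread):
# install_hive_site(r"C:\Downloads\hive-site.xml",
#                   r"C:\spark-2.3.2-bin-hadoop2.7\spark-2.3.2-bin-hadoop2.7")
```

Restart the Python session afterwards so that a new SparkSession reads the copied file.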

Thanks

HadoopHelp

Cloudera Community

Error in reading database through hive using pyspark