Created on 03-18-2020 12:11 AM - last edited on 03-18-2020 02:21 AM by VidyaSargur
I am using Spark 2.3.2 and trying to read tables from the database. I established the Spark connection, but I am unable to read database tables from Hue (Cloudera), and I am unable to query them in PySpark as well.
Here is my code:
import findspark
findspark.init(r'C:\spark-2.3.2-bin-hadoop2.7\spark-2.3.2-bin-hadoop2.7')  # raw string for the Windows path

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("hive.metastore.uris", "thrift://10.1.1.70:8888")
         .enableHiveSupport()
         .getOrCreate())
# spark.catalog.listTables("tp_policy_operation")

spark.sql("SELECT * FROM tp_policy_operation")
###### The error I am getting:
Traceback (most recent call last):
File "<ipython-input-4-8f0aa5852b01>", line 16, in <module>
spark.sql("SELECT * FROM tp_policy_operation")## Database ?
File "C:\spark-2.3.2-bin-hadoop2.7\spark-2.3.2-bin-hadoop2.7\python\pyspark\sql\session.py", line 710, in sql
return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
File "C:\spark-2.3.2-bin-hadoop2.7\spark-2.3.2-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "C:\spark-2.3.2-bin-hadoop2.7\spark-2.3.2-bin-hadoop2.7\python\pyspark\sql\utils.py", line 69, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
AnalysisException: 'org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.thrift.transport.TTransportException;'
Kindly help me resolve the issue or guide me on the changes needed in the code above.
Created 03-18-2020 01:42 AM
Hi @Logica,
Please check whether a database is selected before running the query.
Below is sample code for reading a Hive table:
from pyspark.context import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext('local', 'example')
hc = HiveContext(sc)

# Optional: read a raw file from HDFS
tf1 = sc.textFile("/user/BigData/nooo/SparkTest/train.csv")
# print(tf1.top(10))

# Reading a Hive table from PySpark
hc.sql("use default")  # select the database here
spf = hc.sql("SELECT * FROM tempaz LIMIT 100")
spf.show(5)
Thanks
HadoopHelp
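As a side note on the database selection: instead of relying on a prior "use <db>" call, the table can be fully qualified in the query itself. A minimal sketch (the helper name qualified_query is mine, for illustration; tempaz is the table from the snippet above):

```python
def qualified_query(table, db="default", limit=100):
    """Build a SELECT that fully qualifies the table,
    so no prior 'use <db>' statement is needed."""
    return "SELECT * FROM {}.{} LIMIT {}".format(db, table, limit)

print(qualified_query("tempaz"))  # SELECT * FROM default.tempaz LIMIT 100
```

With this, hc.sql(qualified_query("tempaz")) behaves the same as the "use default" plus SELECT pair above.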
Created on 03-18-2020 01:48 AM - edited 03-18-2020 01:48 AM
I changed the port number from 8888 to 9083 and it is working fine now, but when I try to show the query result I get:
df.show()
Traceback (most recent call last):
File "<ipython-input-12-1a6ce2362cd4>", line 1, in <module>
df.show()
File "C:\spark-2.3.2-bin-hadoop2.7\spark-2.3.2-bin-hadoop2.7\python\pyspark\sql\dataframe.py", line 350, in show
print(self._jdf.showString(n, 20, vertical))
File "C:\spark-2.3.2-bin-hadoop2.7\spark-2.3.2-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "C:\spark-2.3.2-bin-hadoop2.7\spark-2.3.2-bin-hadoop2.7\python\pyspark\sql\utils.py", line 79, in deco
raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
IllegalArgumentException: 'java.net.UnknownHostException: quickstart.cloudera'
Can you help me with this, @HadoopHelp?
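For context on why the port change helped: 8888 is Hue's web UI port, while the Hive metastore's thrift service listens on 9083 by default. A small helper to build the URI (the function name is hypothetical, for illustration):

```python
def metastore_uri(host, port=9083):
    """Build a Hive metastore thrift URI; 9083 is the default
    metastore port (8888 is Hue's web UI, not the metastore)."""
    return "thrift://{}:{}".format(host, port)

print(metastore_uri("10.1.1.70"))  # thrift://10.1.1.70:9083
```

This value is what goes into .config("hive.metastore.uris", ...) when building the SparkSession.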
Created 03-18-2020 02:39 AM
Hi @Logica,
I think you need to place the hive-site.xml file into Spark's conf directory.
Please follow the steps below for running Hive queries or accessing Hive tables through PySpark:
https://acadgild.com/blog/how-to-access-hive-tables-to-spark-sql
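In addition to hive-site.xml (which generally belongs in SPARK_HOME/conf so Spark SQL can find the metastore and warehouse settings), the UnknownHostException for quickstart.cloudera above means the Windows client cannot resolve the cluster hostname. A quick resolvability check; the usual fix is a hosts-file entry mapping the hostname to the cluster IP (the IP below is my assumption based on the thread, adjust for your network):

```python
import socket

def resolvable(host):
    """Return True if the local machine can resolve `host` to an IP."""
    try:
        socket.gethostbyname(host)
        return True
    except socket.gaierror:
        return False

# If this prints False, add a line such as
#   10.1.1.70   quickstart.cloudera
# to C:\Windows\System32\drivers\etc\hosts (the IP is an example).
print(resolvable("quickstart.cloudera"))
```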
Thanks
HadoopHelp