
pyspark JDBC Py4JJavaError: calling o95.load java.sql.SQLException: Method not supported

Explorer

I am using the Hortonworks Sandbox 2.6 on VMware and SSH into the terminal to start pyspark:

su - hive -c pyspark

Then I ran the code below:

df = sqlContext.read.format("jdbc").options(
    driver="org.apache.hive.jdbc.HiveDriver",
    url="jdbc:hive2://localhost:10016/default",
    dbtable="sample_07",
    user="maria_dev",
    password="maria_dev").load()

I am just following the exact example here: https://spark.apache.org/docs/1.6.3/sql-programming-guide.html#jdbc-to-other-databases

However, it gave me the error below:

17/12/30 19:55:14 INFO HiveConnection: Will try to open client transport with JDBC Uri: jdbc:hive2://localhost:10016/default
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/hdp/current/spark-client/python/pyspark/sql/readwriter.py", line 139, in load
    return self._df(self._jreader.load())
  File "/usr/hdp/current/spark-client/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
  File "/usr/hdp/current/spark-client/python/pyspark/sql/utils.py", line 45, in deco
    return f(*a, **kw)
  File "/usr/hdp/current/spark-client/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o119.load.
: java.sql.SQLException: Method not supported
at org.apache.hive.jdbc.HiveResultSetMetaData.isSigned(HiveResultSetMetaData.java:143)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:136)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:91)
at org.apache.spark.sql.execution.datasources.jdbc.DefaultSource.createRelation(DefaultSource.scala:57)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:158)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:748)
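
Looking at the top of the trace, the failure is inside the Hive driver's HiveResultSetMetaData.isSigned(), which Spark's JDBC reader calls from JDBCRDD.resolveTable while resolving the table schema. As a sanity check, this quick probe through the py4j gateway (just a sketch; it assumes the Hive driver is already on the driver classpath, which the connection log above suggests) reproduces the same exception outside the DataFrame reader:

# probe the same JDBC metadata call that Spark makes, via the JVM gateway
jvm = sc._gateway.jvm
jvm.java.lang.Class.forName("org.apache.hive.jdbc.HiveDriver")
conn = jvm.java.sql.DriverManager.getConnection(
    "jdbc:hive2://localhost:10016/default", "maria_dev", "maria_dev")
rs = conn.createStatement().executeQuery("SELECT * FROM sample_07 LIMIT 1")
md = rs.getMetaData()
md.isSigned(1)   # raises Py4JJavaError: java.sql.SQLException: Method not supported
conn.close()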

I believe this is the correlated error in the Hive log:

2017-12-30 19:57:06,030 ERROR [Thread-13]: util.PolicyRefresher (PolicyRefresher.java:loadPolicyfromPolicyAdmin(288)) - PolicyRefresher(serviceName=Sandbox_hive): failed to refresh policies. Will continue to use last known version of policies (24)
com.sun.jersey.api.client.ClientHandlerException: java.net.ConnectException: Connection refused (Connection refused)
        at com.sun.jersey.client.urlconnection.URLConnectionClientHandler.handle(URLConnectionClientHandler.java:149)
        at com.sun.jersey.api.client.Client.handle(Client.java:648)
        at com.sun.jersey.api.client.WebResource.handle(WebResource.java:670)
        at com.sun.jersey.api.client.WebResource.access$200(WebResource.java:74)
        at com.sun.jersey.api.client.WebResource$Builder.get(WebResource.java:503)
        at org.apache.ranger.admin.client.RangerAdminRESTClient.getServicePoliciesIfUpdated(RangerAdminRESTClient.java:127)
        at org.apache.ranger.plugin.util.PolicyRefresher.loadPolicyfromPolicyAdmin(PolicyRefresher.java:264)
        at org.apache.ranger.plugin.util.PolicyRefresher.loadPolicy(PolicyRefresher.java:202)
        at org.apache.ranger.plugin.util.PolicyRefresher.run(PolicyRefresher.java:171)
Caused by: java.net.ConnectException: Connection refused (Connection refused)
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
        at java.net.Socket.connect(Socket.java:589)
        at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
        at sun.net.www.http.HttpClient.openServer(HttpClient.java:463)
        at sun.net.www.http.HttpClient.openServer(HttpClient.java:558)
        at sun.net.www.http.HttpClient.<init>(HttpClient.java:242)
        at sun.net.www.http.HttpClient.New(HttpClient.java:339)
        at sun.net.www.http.HttpClient.New(HttpClient.java:357)
        at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1202)
        at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1138)
        at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1032)
        at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:966)
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1546)
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1474)
        at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
        at com.sun.jersey.client.urlconnection.URLConnectionClientHandler._invoke(URLConnectionClientHandler.java:240)
        at com.sun.jersey.client.urlconnection.URLConnectionClientHandler.handle(URLConnectionClientHandler.java:147)
        ... 8 more

However, when I verified with beeline, the STS port 10016 works fine:

beeline> !connect jdbc:hive2://localhost:10016/default maria_dev maria_dev
Connecting to jdbc:hive2://localhost:10016/default
Connected to: Spark SQL (version 2.1.1.2.6.1.0-129)
Driver: Hive JDBC (version 1.2.1000.2.6.1.0-129)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://localhost:10016/default> select * from sample_07 limit 2;
+----------+-------------------------+------------+---------+--+
|   code   |       description       | total_emp  | salary  |
+----------+-------------------------+------------+---------+--+
| 00-0000  | All Occupations         | 134354250  | 40690   |
| 11-0000  | Management occupations  | 6003930    | 96150   |
+----------+-------------------------+------------+---------+--+

So what did I do wrong in the pyspark code here? My goal is to authenticate over JDBC. I know I could read the data files or go through the Metastore directly, but I'd like to use the Spark Thrift Server so I can add fine-grained authorization later.

Please help.

3 REPLIES

Explorer

The strange thing is that if I use Postgres, the exact same kind of read works fine, which makes me think the problem is specific to the Hive JDBC driver rather than to Spark's JDBC reader:

df = sqlContext.read.format("jdbc").options(
    driver="org.postgresql.Driver",
    url="jdbc:postgresql://localhost:15436/mydb",
    dbtable="test1",
    user="bob",
    password="password").load()

Explorer

Can anyone help me make sense of this?

I just tried some Scala code below and it worked fine via spark-shell:

import java.sql.{Connection, DriverManager, ResultSet}
import scala.collection.mutable.MutableList

case class StatsRec(
  code: String,
  description: String
)

val url = "jdbc:hive2://localhost:10016/default"
val driver = "org.apache.hive.jdbc.HiveDriver"  // registered automatically; no Class.forName needed here
val user = "maria_dev"
val password = "maria_dev"

// open a plain JDBC connection to the Spark Thrift Server and fetch rows by hand
val conn: Connection = DriverManager.getConnection(url, user, password)
val res: ResultSet = conn.createStatement.executeQuery("SELECT * FROM sample_07 LIMIT 5")
val fetchedRes = MutableList[StatsRec]()
while (res.next()) {
  fetchedRes += StatsRec(res.getString("code"), res.getString("description"))
}
conn.close()

// hand the fetched rows to Spark as an RDD
val rddStatsDelta = sc.parallelize(fetchedRes)
rddStatsDelta.cache()
println(rddStatsDelta.count)
rddStatsDelta.collect.take(10).foreach(println)

Output:

StatsRec(00-0000,All Occupations)
StatsRec(11-0000,Management occupations)
StatsRec(11-1011,Chief executives)
StatsRec(11-1021,General and operations managers)
StatsRec(11-1031,Legislators)
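
I suppose the same manual-fetch idea could be done from pyspark too. A sketch with the third-party pyhive package (an assumption on my part: pyhive is not part of the sandbox, and its auth mode may need adjusting for STS) would look roughly like:

# fetch rows over the Thrift protocol with pyhive, then hand them to Spark
from pyhive import hive

conn = hive.connect(host="localhost", port=10016,
                    username="maria_dev", database="default")
cur = conn.cursor()
cur.execute("SELECT code, description FROM sample_07 LIMIT 5")
rows = cur.fetchall()   # list of (code, description) tuples
conn.close()

df = sqlContext.createDataFrame(rows, ["code", "description"])
df.show()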

According to this post, it looks like it's possible to get pyspark to work locally (not over remote JDBC), but it doesn't say how to fix this error: http://belablotski.blogspot.in/2016/01/access-hive-tables-from-spark-using.html
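
For reference, the "local" approach from that post is essentially the sketch below (it goes through the Hive metastore with a HiveContext rather than over JDBC, so it would bypass STS and the authorization I'm after):

# sqlContext in the HDP pyspark shell is a HiveContext, so tables
# registered in the metastore can be queried directly
df = sqlContext.sql("SELECT * FROM sample_07 LIMIT 5")
df.show()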

Help!!

Explorer

Hi all,

This thread has been open for over a week now. I just want to see if anyone has an answer.