Created 09-27-2016 10:20 AM
I have a fresh, small HDP 2.5.0.0 cluster with Spark2 (tech preview) set up. It resides on two virtual machines: one master and one worker/slave.
Using 'spark-submit', I can deploy a standalone Python Spark application that runs and finishes normally. But when I try to connect via JDBC to a PostgreSQL database host running on a third VM, Spark is unable to reach the host.
My Python code to access the PostgreSQL database from within the (correctly set up) Spark context looks like this:
probe = spark.read.format('jdbc').options(
    url='jdbc:postgresql://10.255.1.2:5432/gis?user=<theuser>&password=<thepassword>',
    driver='org.postgresql.Driver',
    dbtable='(SELECT * FROM my_db_function({}, {})) AS my_db_function_alias'.format(123, 456)
).load()
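As an aside, the credentials can also be passed as separate JDBC options instead of being embedded in the URL. A minimal sketch of building those options (the `jdbc_options` helper is my own illustration, not part of the original code; host, database, and credentials are the placeholders from above):

```python
# Sketch: build the options dict for spark.read.format('jdbc').options(**opts).load(),
# keeping user/password out of the JDBC URL.
def jdbc_options(host, port, db, user, password, query):
    """Return a JDBC options dict for the PostgreSQL driver."""
    return {
        'url': 'jdbc:postgresql://{}:{}/{}'.format(host, port, db),
        'driver': 'org.postgresql.Driver',
        'user': user,
        'password': password,
        'dbtable': query,
    }

# Subquery must be wrapped in parentheses and aliased, as in the post.
query = '(SELECT * FROM my_db_function({}, {})) AS my_db_function_alias'.format(123, 456)
opts = jdbc_options('10.255.1.2', 5432, 'gis', '<theuser>', '<thepassword>', query)
# probe = spark.read.format('jdbc').options(**opts).load()
```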
The driver is provided via the --jars option of spark-submit and is loaded correctly (otherwise an error would be raised earlier), but the host cannot be reached from within the Spark context:
File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 153, in load
File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__
File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 312, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o68.load.
: org.postgresql.util.PSQLException: Connection attempt failed.
    at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:275)
    at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:54)
    [...]
Caused by: java.net.NoRouteToHostException: No route to host
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
I assume this is a NAT/config error on one of the virtual machines, but if I log on to either the master or the worker VM, nslookup correctly queries the DNS server and returns the correct private IP of my PostgreSQL database host:
# nslookup my.database.host
Server:    10.255.0.1
Address:   10.255.0.1#53

Name:      my.database.host
Address:   10.255.1.2
So from both virtual machines that are part of the HDP installation, the target host is actually resolvable. Even standalone (non-cluster, non-Spark) Python scripts can connect to the database host when I start them on one of the cluster VMs.
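For reference, such a standalone reachability check can be as small as a plain TCP connect, which tests the route and port independently of DNS, Spark, and the JDBC driver (the host/port below are the values from the post; `can_connect` is my own helper, not from the original):

```python
# Minimal TCP reachability probe, no Spark or JDBC involved.
import socket

def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused, timeout, and no-route errors
        return False

# e.g. can_connect('10.255.1.2', 5432)
```

If this returns False on a cluster VM while nslookup succeeds, the problem is routing or a firewall, not name resolution.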
Are there any settings in Ambari or wherever to enable this? Does HDP re-configure networking somehow?
Created 09-27-2016 01:57 PM
Alright, this is quite dumb. Sorry, my bad:
It was all about iptables on the third VM (the database host) -- I didn't have that in mind, since iptables is always off on HDP hosts.
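For anyone hitting the same "No route to host": a sketch of opening the PostgreSQL port on the database VM's firewall (the source subnet is an assumption based on the addresses above; adjust to your network, and note the persistence command is for RHEL/CentOS 6-style init systems):

```shell
# Allow inbound PostgreSQL traffic from the cluster subnet (assumed 10.255.0.0/16).
iptables -I INPUT -p tcp -s 10.255.0.0/16 --dport 5432 -j ACCEPT

# Persist the rule across reboots (RHEL/CentOS 6 style).
service iptables save
```

Alternatively, disabling iptables entirely on that host (as HDP requires on cluster nodes anyway) also resolves it, at the cost of leaving the VM unfirewalled.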