So I'm using the sandbox and trying to connect to HDFS from my local machine:
val shakespeare = spark.sparkContext.textFile("hdfs://sandbox.HWX.com:9000/user/spark/all-shakespeare.txt")
shakespeare.take(1).foreach(println)
I think the error shows that I can't reach the (Docker) DataNode from my local machine:
18/01/22 15:13:13 WARN BlockReaderFactory: I/O error constructing remote block reader.
org.apache.hadoop.net.ConnectTimeoutException: 60000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=/172.17.0.2:50010]
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:534)
at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:3436)
at org.apache.hadoop.hdfs.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:777)
at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:694)
at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:355)
at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:673)
at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:882)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:934)
at java.io.DataInputStream.read(DataInputStream.java:149)
....
I already added port 50010 to my Docker port mappings (following the instructions here; thanks @Roger Young, great walkthrough).
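For reference, this is roughly what my port mapping looks like (the service name and image are placeholders; I'm assuming 9000 is the NameNode RPC port and 50010 the DataNode transfer port):

```yaml
# Sketch of my docker-compose port mapping (names are placeholders).
services:
  sandbox:
    image: hortonworks/sandbox   # placeholder image name
    ports:
      - "9000:9000"    # NameNode RPC (the port in my hdfs:// URL)
      - "50010:50010"  # DataNode data transfer (the port in the timeout)
```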
What else do I need to do to be able to access the data? I'm getting the feeling that I need to add network routing to reach the Docker instance directly; in Ambari it reports that its address is 172.17.0.2.
Should I be setting the FQDN of the Docker instance?
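One thing I was wondering about: does the client need to be told to connect to DataNodes by hostname rather than by the Docker-internal IP the NameNode hands back? Something like this on the Spark side (just a guess on my part, not a confirmed fix):

```scala
// Guess: dfs.client.use.datanode.hostname makes the HDFS client connect
// to DataNodes by hostname (which I could map in /etc/hosts) instead of
// the Docker-internal 172.17.0.2 that my machine can't route to.
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("dfs.client.use.datanode.hostname", "true")

val shakespeare = spark.sparkContext.textFile(
  "hdfs://sandbox.HWX.com:9000/user/spark/all-shakespeare.txt")
shakespeare.take(1).foreach(println)
```

If that's right, I'd presumably also need an /etc/hosts entry mapping the sandbox hostname to 127.0.0.1 so the hostname resolves to the published ports.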