
Cannot copy from local machine to VM datanode via Java

Explorer

Hello,

I have an application that copies data to HDFS, but is failing due to the datanode being excluded. See snippet:

private void copyFileToHdfs(FileSystem hdfs, Path localFilePath, Path hdfsFilePath) throws IOException, InterruptedException {
    log.info("Copying " + localFilePath + " to " + hdfsFilePath);
    hdfs.copyFromLocalFile(localFilePath, hdfsFilePath);
}

However, when I try to execute, I get:

org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /user/dev/workflows/test.jar could only be replicated to 0 nodes instead of minReplication (=1).  There are 1 datanode(s) running and 1 node(s) are excluded in this operation.
	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1583)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getNewBlockTargets(FSNamesystem.java:3109)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3033)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:725)

HDFS commands work fine: I can create, modify, and delete files. Copying data is the only problem. I've also checked the ports with telnet: I can reach 8020, but not 50010. I assume this is the root of the issue and why the single datanode is being excluded. I tried adding an iptables firewall rule, but I'm still running into the same issue.

Any help is appreciated.
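For anyone hitting the same error, the telnet checks described above can also be scripted. A minimal Java sketch (the Sandbox hostname and ports here are taken from this thread; adjust for your environment):

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class PortCheck {

    // Attempt a TCP connection to host:port; true means the port is reachable.
    static boolean isReachable(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String host = "sandbox.hortonworks.com"; // assumed Sandbox hostname
        System.out.println("8020 (NameNode RPC):       " + isReachable(host, 8020, 2000));
        System.out.println("50010 (DataNode transfer): " + isReachable(host, 50010, 2000));
    }
}
```

If 8020 is reachable but 50010 is not, metadata operations (create, delete, list) will succeed while actual block writes fail with the "excluded datanode" error above.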

1 ACCEPTED SOLUTION

Explorer

So, the problem was actually two issues. First, the VM does not have port 50010 open by default, so all the datanodes were excluded, leading to the error above. Second, I needed to set "dfs.client.use.datanode.hostname" to true to stop the client from resolving the datanodes to the VM's internal IP, which I did set. However, after stepping through the configuration, I found it was still coming back as "false", and that turned out to be a bug in my own code: my FileSystem object was being created with a new Configuration() rather than the one I had loaded with the hdfs-site and core-site configs pulled from Ambari. Whoops!

Anyway, thanks for the help, all.
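The fix described above can be sketched in Java. The site-file paths, the class name, and the helper are assumptions for a Sandbox-style setup, not the poster's exact code; the key point is passing the loaded Configuration to FileSystem.get instead of a bare new Configuration():

```java
import java.io.File;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientConfig {

    // Build the client Configuration from the cluster's site files instead of
    // relying on a bare new Configuration(), which silently falls back to defaults.
    static Configuration buildClientConf() {
        Configuration conf = new Configuration();
        // Assumed locations; point these at the hdfs-site.xml / core-site.xml
        // you pulled from Ambari.
        File hdfsSite = new File("/etc/hadoop/conf/hdfs-site.xml");
        File coreSite = new File("/etc/hadoop/conf/core-site.xml");
        if (hdfsSite.exists()) conf.addResource(new Path(hdfsSite.toURI()));
        if (coreSite.exists()) conf.addResource(new Path(coreSite.toURI()));
        // Connect to datanodes by hostname rather than the VM-internal IP
        // (e.g. 10.0.2.15) that the NameNode reports.
        conf.setBoolean("dfs.client.use.datanode.hostname", true);
        return conf;
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = buildClientConf();
        // Pass the loaded conf here; FileSystem.get(new Configuration()) was the bug.
        try (FileSystem hdfs = FileSystem.get(conf)) {
            System.out.println("fs.defaultFS = " + conf.get("fs.defaultFS"));
            System.out.println("dfs.client.use.datanode.hostname = "
                    + conf.getBoolean("dfs.client.use.datanode.hostname", false));
        }
    }
}
```

With this in place, the copyFromLocalFile call from the original snippet goes through the correctly configured FileSystem instance.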


18 REPLIES

Master Mentor

@Jim Fratzke

Check the firewall. In the VM settings, make sure "Allow All" is enabled for the network adapter, that your VM's FQDN is in /etc/hosts, and that the hdfs scheme in the URL in your code is properly identified and not resolving to the local filesystem. Also note that the HDFS port on the Sandbox is 8020, not 50010.

Explorer

The VM settings were part of it. As soon as I opened port 50010, I made some progress. Now there's another issue where dataQueue.wait() gets stuck in DFSOutputStream. Also, the namenode listens on 8020, which is working fine as stated above; the problem was connecting to the datanodes, which I can now reach.

Master Mentor

@Jim Fratzke That's good progress; please post logs. According to the source, it's waiting for acks, so it may still be a networking issue.

http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hadoop/hadoop-hdfs/0.23.1/org/apache/hado...

Explorer

OK, a bit more progress. Digging into DFSOutputStream.createSocketForPipeline, the IP address for the datanode seems to resolve to the internal IP 10.0.2.15. However, I do have dfs.client.use.datanode.hostname set to true, so I'm not sure why it's attempting to connect to that address. My host machine points the hostname sandbox.hortonworks.com at 127.0.0.1, while on the VM that hostname points at 10.0.2.15.
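For reference, the property in question is a client-side setting; in hdfs-site.xml it looks like the fragment below (it can also be set programmatically on the Configuration object before creating the FileSystem):

```xml
<property>
  <name>dfs.client.use.datanode.hostname</name>
  <value>true</value>
</property>
```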

Master Mentor
@Jim Fratzke

I'd be curious whether you ran the dhclient command after you ssh in as root. Do you have a second adapter added to the VM? If not, add it in the VM network settings and make sure "Allow All" is set for that adapter as well; then, once it boots up, run dhclient.

Explorer

So, the problem was actually two issues. First, the VM does not have port 50010 open by default, so all the datanodes were excluded, leading to the error above. Second, I needed to set "dfs.client.use.datanode.hostname" to true to stop the client from resolving the datanodes to the VM's internal IP, which I did set. However, after stepping through the configuration, I found it was still coming back as "false", and that turned out to be a bug in my own code: my FileSystem object was being created with a new Configuration() rather than the one I had loaded with the hdfs-site and core-site configs pulled from Ambari. Whoops!

Anyway, thanks for the help, all.

New Contributor

I had to do one more thing on OS X El Capitan: add this entry to /etc/hosts:

127.0.0.1    sandbox.hortonworks.com

Without it, you will see a java.nio.channels.UnresolvedAddressException.

New Contributor

I understand this was a while ago, but can you remember how you opened port 50010 on the VM? Hortonworks has a tutorial on opening ports in the HDP Sandbox, but it does not actually work: after stopping the Docker container, I cannot remove it because the connection is refused.