HBase exportSnapshot failing - need help with debugging

Explorer

This is related to the post - distcp post  

 

We are trying to export a snapshot from one cluster to another cluster using the below command

hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot mysnapshot -copy-from hdfs://namenode1:8020/hbase -copy-to hdfs://namenode2:8020/hbase

We are running HBase 1.0.0 cdh5.5.1+274-1.cdh5.5.1.p0.15.e17.  Ports 8020 and 50010 are open between the two clusters; I can telnet to both.  When the command runs, an empty file, /hbase/.hbase-snapshot/.tmp/mysnapshot/.snapshotinfo, is created on namenode2.  The error received is:

INFO [main] snapshot.ExportSnapshot: Copy Snapshot Manifest
WARN [Thread-6] hdfs.DFSClient: DataStreamer Exception java.nio.channels.UnresolvedAddressException
at sun.nio.ch.Net.checkAddress(Net.java:101)
.....
.....
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:668)
Exception in thread "main" org.apache.hadoop.hbase.snapshot.ExportSnapshotException:  Failed to copy the snapshot directory:  from=hdfs://namenode1:8020/hbase/.hbase-snapshot/contentSnapshot to=hdfs://namenode2:8020/hbase/.hbase-snapshot/.tmp/contentSnapshot
at org.apache.hadoop.hbase.snapshot.ExportSnapshot.run(ExportSnapshot.java:932)
.....
and more and more
...

I am trying to debug this, but I cannot find the ExportSnapshot source code for the specific version we are running.  Some connection between the two clusters is clearly being made, because the empty file is created.  It seems to fail when copying data.manifest (maybe?).

 

So one question is: where can the source code for HBase 1.0.0 on CDH 5.5.1 be found?

 

1 ACCEPTED SOLUTION

Explorer

This problem was finally resolved.

 

For anyone else hitting similar, seemingly quirky problems with ExportSnapshot, here is how we resolved it.  FYI - finding the source code for the version of ExportSnapshot we were running helped pinpoint exactly where the error occurred and what had already run successfully up to that point.  I also ran the export in different ways from each cluster and found an unusual alias being used when trying hftp://server:50070.

So the bottom line: on each cluster, all NameNodes and DataNodes had to be able to resolve (via entries added to /etc/hosts) EVERY alias in use, including all internal-IP aliases mapped to the external IPs, whether or not you think Hadoop or HBase is explicitly using them somewhere.
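
One way to check this on each node is a small script that reports which names fail to resolve locally. This is only a sketch: it assumes `getent` is available (glibc-based Linux) and that it consults the same sources (/etc/hosts, DNS) as the JVM running Hadoop. The function name and the hostnames in the usage comment are hypothetical.

```shell
#!/usr/bin/env bash
# Print every hostname or alias passed as an argument that this node
# cannot resolve. Prints nothing when all names resolve.
unresolved_hosts() {
  local host
  for host in "$@"; do
    # getent exits non-zero when the name has no entry in /etc/hosts or DNS
    if ! getent hosts "$host" >/dev/null 2>&1; then
      echo "$host"
    fi
  done
}

# Usage (hypothetical alias list; replace with every name either cluster uses):
# unresolved_hosts namenode1 namenode2 datanode1-internal datanode2-internal
```

Running it on every NameNode and DataNode of both clusters, with the full alias list, should print nothing once /etc/hosts is complete.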

 

Thanks to all.  Better error messages and/or the ability to debug the code would have been helpful.


2 REPLIES

Expert Contributor

Thanks for sharing the steps to resolve the issue. Yes, indeed: every NN/DN in each cluster should be able to reach the other cluster's nodes, and vice versa, because ExportSnapshot is essentially an HDFS distcp-style operation. Most of the work is copying the HFiles associated with the snapshot from the source to the target in a distributed fashion.
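
A rough way to confirm that reachability from a given node is to try a TCP connection to each host and port, for example the two ports mentioned in the question (8020 and 50010). The sketch below assumes bash with /dev/tcp support and the coreutils `timeout` command; the function names and hostnames are hypothetical. Note that an open port does not prove the export will work, since name resolution also has to succeed for every alias, as the solution above found.

```shell
#!/usr/bin/env bash
# Return 0 if a TCP connection to host:port succeeds within 3 seconds.
# Relies on bash's /dev/tcp redirection, so bash must be installed.
port_open() {
  local host="$1" port="$2"
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Print each host:port pair that cannot be reached from this node.
# First argument: space-separated list of ports; remaining arguments: hosts.
unreachable_endpoints() {
  local ports="$1" host port
  shift
  for host in "$@"; do
    for port in $ports; do
      port_open "$host" "$port" || echo "$host:$port"
    done
  done
}

# Usage (hypothetical hostnames of the remote cluster):
# unreachable_endpoints "8020 50010" namenode2 datanode2a datanode2b
```

Since the copy runs in a distributed fashion, the check is more meaningful when run from every node rather than only from the host where the command is launched.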

 

It would be helpful if you could share the complete stack trace of the exception, which would also help in understanding the flow during the failure.

 

Again thanks for taking the time to post the solution.