New Contributor
Posts: 1
Registered: 12-18-2014

Can we export HBase data from CDH3 and import it into CDH5 HBase?

We have an old CDH3-based cluster with data stored in HBase. We also have a brand-new cluster running CDH 5.2.

We want to move the HBase data from the CDH3 cluster to the CDH5 cluster.

I would like to know whether it is possible to migrate data across these different versions.

Is it as straightforward as using the distcp command?

What precautions do I need to take before and during the migration?

New Contributor
Posts: 2
Registered: 12-22-2014

Re: Can we export HBase data from CDH3 and import it into CDH5 HBase?

I am attempting to do the same task.

 

We are migrating from our old CDH3u5 cluster to a new CDH 5.2 cluster. I am currently struggling to copy data from the old cluster to the new one using distcp, but I am confident I can work out a solution for that.

 

The big question I have is: once the data has been moved to the CDH5 cluster, what steps are required to import it?

 

I have found http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cdh_rn_upgrade.html which states:

 

"No upgrade directly from CDH 3 to CDH 5

You must upgrade to CDH 4, then to CDH 5"

 

Does this mean I will need to do an in-place upgrade, either on the old cluster (not preferred, as we don't have a full backup of all the data) or on the new cluster after copying and importing all critical data? Again, this is not preferred, since the timing would be incredibly difficult in a production environment.

 

Any information that could be provided would be greatly appreciated!

 

I am currently running: HBase 0.90.6-cdh3u5

Upgrading to: HBase 0.98.6-cdh5.2.0

 

Additionally, we have a fully built-out development environment in which to test and finalize a plan.

 

 

 

Cloudera Employee
Posts: 578
Registered: 01-20-2014

Re: Can we export HBase data from CDH3 and import it into CDH5 HBase?


The following steps work, but I have only tested them on a very small table "t1" with one column family "cf1". Let me know if you have more questions.

- Export to sequence file
ref: http://hbase.apache.org/book/ops_mgt.html#export
$ sudo -u hdfs hbase org.apache.hadoop.hbase.mapreduce.Export t1 /export
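If you need more than just the latest version of each cell, Export also accepts an optional maximum number of versions (and optional start/end timestamps) after the output directory; e.g., to keep all versions:
$ sudo -u hdfs hbase org.apache.hadoop.hbase.mapreduce.Export t1 /export 2147483647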

- Copy contents of /export to the CDH5 cluster using distcp or through a filesystem accessible from nodes on both clusters (run this on the CDH5 cluster)
ref: http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cdh_admin_distcp_data_c...
$ sudo -u hdfs hadoop distcp -update -skipcrccheck hftp://cdh3-namenode:port/export hdfs://cdh5-namenode/import

- Create the table on CDH5, e.g. with the HBase shell as shown below. Column families must be identical.
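A minimal example for the table above:
$ sudo -u hdfs hbase shell
hbase> create 't1', 'cf1'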

- Import the sequence file
$ sudo -u hdfs hbase -Dhbase.import.version=0.94 org.apache.hadoop.hbase.mapreduce.Import t1 /import

- Verify contents have been imported correctly.
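For example, from the HBase shell (the row count should match the source table):
hbase> count 't1'
hbase> scan 't1', {LIMIT => 10}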


Regards,
Gautam Gopalakrishnan
New Contributor
Posts: 2
Registered: 12-22-2014

Re: Can we export HBase data from CDH3 and import it into CDH5 HBase?

Thank you so much for your response. hbase.import.version appeared to do exactly what I needed to import the data from our older version of HBase.
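For reference, the import command I used (per the steps above, with t1 standing in for our table name):
$ sudo -u hdfs hbase -Dhbase.import.version=0.94 org.apache.hadoop.hbase.mapreduce.Import t1 /import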

I am still having trouble with distcp between the two versions, though. If you could help me with that issue I would greatly appreciate it. In small-scale testing I can copy the data to local disk and then rsync it over to the new cluster, but when it comes time for the production run I won't have the space or time to do that and will need distcp working.

I have tried a few different things, but none of them work as hoped. When running the distcp command on the CDH5 cluster to pull data from CDH3, I get the following error on a few files. Most of the data transfers fine, but even after rerunning the job multiple times, some of the data never gets moved:


15/01/05 20:13:31 INFO mapreduce.Job: Task Id : attempt_1417804428787_0016_m_000009_0, Status : FAILED
Error: java.io.IOException: File copy failed: hftp://cdh3-namenode:50070/export/5minute_ConsumerSiteProfile/part-m-02630 --> hdfs://cdh5-namenode/export/5minute_ConsumerSiteProfile_distcp/part-m-02630
        at org.apache.hadoop.tools.mapred.CopyMapper.copyFileWithRetry(CopyMapper.java:284)
        at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:252)
        at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:50)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:784)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: java.io.IOException: Couldn't run retriable-command: Copying hftp://cdh3-namenode:50070/export/5minute_ConsumerSiteProfile/part-m-02630 to hdfs://cdh5-namenode/export/5minute_ConsumerSiteProfile_distcp/part-m-02630
        at org.apache.hadoop.tools.util.RetriableCommand.execute(RetriableCommand.java:101)
        at org.apache.hadoop.tools.mapred.CopyMapper.copyFileWithRetry(CopyMapper.java:280)
        ... 10 more
Caused by: org.apache.hadoop.tools.mapred.RetriableFileCopyCommand$CopyReadException: java.net.SocketTimeoutException: Read timed out
        at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.getInputStream(RetriableFileCopyCommand.java:303)
        at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.copyBytes(RetriableFileCopyCommand.java:248)
        at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.copyToFile(RetriableFileCopyCommand.java:184)
        at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doCopy(RetriableFileCopyCommand.java:124)
        at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doExecute(RetriableFileCopyCommand.java:100)
        at org.apache.hadoop.tools.util.RetriableCommand.execute(RetriableCommand.java:87)
        ... 11 more
Caused by: java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(SocketInputStream.java:152)
        at java.net.SocketInputStream.read(SocketInputStream.java:122)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
        at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:687)
        at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:633)
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1323)
        at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:468)
        at org.apache.hadoop.hdfs.web.HftpFileSystem$RangeHeaderUrlOpener.connect(HftpFileSystem.java:370)
        at org.apache.hadoop.hdfs.web.ByteRangeInputStream.openInputStream(ByteRangeInputStream.java:120)
        at org.apache.hadoop.hdfs.web.ByteRangeInputStream.getInputStream(ByteRangeInputStream.java:104)
        at org.apache.hadoop.hdfs.web.ByteRangeInputStream.<init>(ByteRangeInputStream.java:89)
        at org.apache.hadoop.hdfs.web.HftpFileSystem$RangeHeaderInputStream.<init>(HftpFileSystem.java:383)
        at org.apache.hadoop.hdfs.web.HftpFileSystem$RangeHeaderInputStream.<init>(HftpFileSystem.java:388)
        at org.apache.hadoop.hdfs.web.HftpFileSystem.open(HftpFileSystem.java:404)
        at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:766)
        at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.getInputStream(RetriableFileCopyCommand.java:299)
        ... 16 more

I have tried the following:

hadoop distcp -update hftp://cdh3-namenode:50070/export/table  hdfs://cdh5-namenode/export/table


hadoop distcp -skipcrccheck -update hftp://cdh3-namenode:50070/export/table  hdfs://cdh5-namenode/export/table


hadoop distcp -pb -skipcrccheck -update hftp://cdh3-namenode:50070/export/table  hdfs://cdh5-namenode/export/table

hadoop distcp -Ddfs.checksum.type=CRC32 -update hftp://cdh3-namenode:50070/export/table  hdfs://cdh5-namenode/export/table

I have also increased the timeout value for CDH5 in hdfs-site.xml:

  <property>
    <name>dfs.client.socket.timeout</name>
    <value>90000</value>
  </property>

Lastly, I have attempted to change dfs.checksum.type to CRC32 (CDH3's default checksum type, which I found in this Hadoop mailing list thread: https://groups.google.com/a/cloudera.org/forum/#!topic/cdh-user/FSkuVQisSOE) in hdfs-site.xml:

  <property>
    <name>dfs.checksum.type</name>
    <value>CRC32</value>
  </property>

 

 

Sadly, none of these tests have worked for me. I am currently working in our development environment, so I am able to make any config changes at any point.

 

The test export I am trying to move is only 40 MB, so the size of the data shouldn't be an issue.
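For what it's worth, comparing the file listings on both sides after a run shows which parts are missing (hostnames and paths as in the distcp commands above):

$ hadoop fs -ls -R hftp://cdh3-namenode:50070/export/table
$ hadoop fs -ls -R hdfs://cdh5-namenode/export/table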

 

Thanks in advance.

 

Adam

 

New Contributor
Posts: 1
Registered: 06-08-2015

Re: Can we export HBase data from CDH3 and import it into CDH5 HBase?

Hi Adam,

 

Were you able to resolve the distcp issue you saw? I am seeing a similar issue and would appreciate any pointers. I have tried the things you had tried, but no luck yet.

 

Thanks in advance!
