Created 04-19-2017 11:04 AM
I'm trying to copy a transactional (ACID) table from a production cluster (HDP 2.5) to a dev cluster (HDP 2.6).
I set these ACID settings on the dev cluster:
hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager
hive.support.concurrency=true
hive.enforce.bucketing=true
hive.exec.dynamic.partition.mode=nonstrict
hive.compactor.initiator.on=true
hive.compactor.worker.threads=3
Then I export the table from prod, copy it with distcp, and import it into dev:
hive> export table hana.easy_check to 'export/easy_check';
hadoop distcp -prbugp hdfs://hdp-nn1:8020/user/hive/export/easy_check/ hdfs://dev-nn2:8020/user/hive/export/
hive> import from 'export/easy_check';
However, when I run any SQL query on this table in the dev cluster, I get an error:
2017-04-19 11:08:33,879 [ERROR] [Dispatcher thread {Central}] |impl.VertexImpl|: Vertex Input: easy_check initializer failed, vertex=vertex_1492584180580_0005_1_00 [Map 1]
org.apache.tez.dag.app.dag.impl.AMUserCodeException: java.lang.RuntimeException: serious problem
    at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallback.onFailure(RootInputInitializerManager.java:319)
    at com.google.common.util.concurrent.Futures$4.run(Futures.java:1140)
    at com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
    at com.google.common.util.concurrent.ExecutionList$RunnableExecutorPair.execute(ExecutionList.java:150)
    at com.google.common.util.concurrent.ExecutionList.execute(ExecutionList.java:135)
    at com.google.common.util.concurrent.ListenableFutureTask.done(ListenableFutureTask.java:91)
    at java.util.concurrent.FutureTask.finishCompletion(FutureTask.java:384)
    at java.util.concurrent.FutureTask.setException(FutureTask.java:251)
    at java.util.concurrent.FutureTask.run(FutureTask.java:271)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: serious problem
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1258)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1285)
    at org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:307)
    at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:409)
    at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:155)
    at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:273)
    at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:266)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
    at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:266)
    at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:253)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    ... 3 more
Caused by: java.util.concurrent.ExecutionException: java.io.IOException: Not enough history available for (0,x). Oldest available base: hdfs://development/apps/hive/warehouse/hana.db/easy_check/ym=2017-01/base_0001497
    at java.util.concurrent.FutureTask.report(FutureTask.java:122)
    at java.util.concurrent.FutureTask.get(FutureTask.java:192)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1235)
    ... 15 more
Caused by: java.io.IOException: Not enough history available for (0,x). Oldest available base: hdfs://development/apps/hive/warehouse/hana.db/easy_check/ym=2017-01/base_0001497
    at org.apache.hadoop.hive.ql.io.AcidUtils.getAcidState(AcidUtils.java:594)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator.callInternal(OrcInputFormat.java:773)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator.access$600(OrcInputFormat.java:738)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator$1.run(OrcInputFormat.java:763)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator$1.run(OrcInputFormat.java:760)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator.call(OrcInputFormat.java:760)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator.call(OrcInputFormat.java:738)
    ... 4 more
What is wrong?
Both clusters run Hive 1.2.1.
# Detailed Table Information
Database:            hana
Owner:               hive
CreateTime:          Wed Apr 19 13:27:00 MSK 2017
LastAccessTime:      UNKNOWN
Protect Mode:        None
Retention:           0
Location:            hdfs://development/apps/hive/warehouse/hana.db/easy_check
Table Type:          MANAGED_TABLE
Table Parameters:
    NO_AUTO_COMPACTION                                      false
    compactor.mapreduce.map.memory.mb                       2048
    compactorthreshold.hive.compactor.delta.num.threshold   4
    compactorthreshold.hive.compactor.delta.pct.threshold   0.3
    last_modified_by                                        hive
    last_modified_time                                      1489647024
    orc.bloom.filter.columns                                calday, request, material
    orc.compress                                            ZLIB
    orc.compress.size                                       262144
    orc.create.index                                        true
    orc.row.index.stride                                    5000
    orc.stripe.size                                         67108864
    transactional                                           true
    transient_lastDdlTime                                   1492597620

# Storage Information
SerDe Library:       org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat:         org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat:        org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Compressed:          No
Num Buckets:         1
Bucket Columns:      [material]
Sort Columns:        []
Storage Desc Params:
    serialization.format    1
Created 07-10-2017 05:21 PM
This is not supported. Transactional table data cannot simply be copied from cluster to cluster. Each cluster maintains a global transaction ID sequence that is embedded in the data files and file names of transactional tables, so copying the data files confuses the target system. The only way to do this right now is to copy the data into a non-ACID table on the source cluster using "INSERT ... SELECT ...", and then use export/import to transfer that table to the target side.
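For illustration, a minimal sketch of that workaround, following the commands from the question (the staging table name easy_check_staging is hypothetical, and it assumes the table is partitioned by ym, as the error path suggests):

-- On the source (prod) cluster: stage the ACID data into a plain, non-transactional ORC table
hive> CREATE TABLE hana.easy_check_staging STORED AS ORC AS SELECT * FROM hana.easy_check;
hive> EXPORT TABLE hana.easy_check_staging TO 'export/easy_check_staging';

# Copy the export to the dev cluster (same distcp flags as in the question)
hadoop distcp -prbugp hdfs://hdp-nn1:8020/user/hive/export/easy_check_staging/ hdfs://dev-nn2:8020/user/hive/export/

-- On the target (dev) cluster: import the staging table, then reload the ACID table from it
hive> IMPORT FROM 'export/easy_check_staging';
hive> INSERT INTO TABLE hana.easy_check PARTITION (ym) SELECT * FROM hana.easy_check_staging;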
Created 07-11-2017 12:11 AM
Thanks for the clarification!
Created 09-14-2017 04:03 AM
@Eugene Koifman, is there any other possible workaround that could cut down the time it takes to replicate an ACID table to a secondary cluster? What is the recommendation for DR on ACID tables?
Created 09-14-2017 04:10 PM
There isn't. Perhaps @thejas has a recommendation.
Created 11-08-2017 09:43 AM
The import/export can be time consuming. You could try distcp'ing the non-transactional partitions over to a non-transactional table on the DR cluster and using MSCK REPAIR TABLE to pick them up. You'd still need to run the copy from non-transactional to transactional again.
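A rough sketch of that approach (the staging table hana.easy_check_staging, its ym partitioning, and the warehouse paths are illustrative assumptions, not from the original thread):

# Copy a new partition directory of the non-transactional table straight to the DR cluster
hadoop distcp -prbugp hdfs://hdp-nn1:8020/apps/hive/warehouse/hana.db/easy_check_staging/ym=2017-01 hdfs://dev-nn2:8020/apps/hive/warehouse/hana.db/easy_check_staging/

-- On the DR cluster: register the newly copied partition directories with the metastore
hive> MSCK REPAIR TABLE hana.easy_check_staging;

-- Then run the non-transactional -> transactional copy again for the new data
hive> INSERT INTO TABLE hana.easy_check PARTITION (ym) SELECT * FROM hana.easy_check_staging WHERE ym = '2017-01';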
Created 11-15-2017 07:28 AM
Hi all!
Where can I find information about the limitations of Hadoop DistCp (transactional vs. non-transactional tables, etc.)?