Support Questions

When I Distcp an ORC hive table from one cluster to a different cluster, the new hive table doesn't see/output any of the data

On my Dev cluster, I have a Hive table stored as ORC. When I distcp that table's files to my Prod cluster, the ORC table on Prod doesn't see or output any of the data. If I do a hadoop fs -ls on the HDFS directory on Prod, I can see the files, but the Hive table can't. I tried running an ANALYZE TABLE command to see if that got it to see the files, but that didn't help. Any thoughts?

5 REPLIES

Re: When I Distcp an ORC hive table from one cluster to a different cluster, the new hive table doesn't see/output any of the data

Guru

Try running

MSCK REPAIR TABLE <tablename>

Is distcp writing into separate partition directories under the base table directory? If that is the case, MSCK should help.
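As a sketch of what MSCK does, suppose distcp lands files under partition-style subdirectories (the database, table, and partition column below are hypothetical, not from this thread):

```sql
-- Hypothetical layout on HDFS after distcp:
--   /apps/hive/warehouse/mydb.db/mytable/dt=2016-12-14/part-00000
-- MSCK scans the table location and registers any partition
-- directories that exist on HDFS but not in the metastore.
MSCK REPAIR TABLE mydb.mytable;
SHOW PARTITIONS mydb.mytable;
```

Note that MSCK only discovers partition directories for a partitioned table; it does not register loose files or arbitrary subdirectories.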

Re: When I Distcp an ORC hive table from one cluster to a different cluster, the new hive table doesn't see/output any of the data

I tried MSCK REPAIR TABLE and received an error. See below:

hive> msck repair table svcrpt.predictive_customers;
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask

Here is the table's DDL:

CREATE TABLE `svcrpt.predictive_customers`(
  `raw_customercode` varchar(1024),
  `prod_customercode` varchar(1024),
  `prod_country` varchar(1024),
  `prod_customernbr` varchar(1024),
  `region` varchar(1024),
  `lob` varchar(1024),
  `customer_name` varchar(1024),
  `comment` varchar(1024))
CLUSTERED BY (
  prod_country)
INTO 8 BUCKETS
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\u0001'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'hdfs://PRD00/apps/hive/warehouse/svcrpt.db/predictive_customers'
TBLPROPERTIES (
  'serialization.null.format'='\\N',
  'transactional'='true',
  'transient_lastDdlTime'='1481566924')

Here is the HDFS directory on Prod:

hadoop fs -ls /apps/hive/warehouse/svcrpt.db/predictive_customers
Found 1 items
drwxr-xr-x   - cw185105 svc-grp          0 2016-12-14 10:19 /apps/hive/warehouse/svcrpt.db/predictive_customers/delta_0000196_0000196
[cw185105@rpc3748 predictive_customers]$ hadoop fs -ls /apps/hive/warehouse/svcrpt.db/predictive_customers/delta_0000196_0000196
Found 8 items
-rw-r--r--   3 cw185105 svc-grp       1383 2016-12-14 10:19 /apps/hive/warehouse/svcrpt.db/predictive_customers/delta_0000196_0000196/bucket_00000
-rw-r--r--   3 cw185105 svc-grp       1717 2016-12-14 10:19 /apps/hive/warehouse/svcrpt.db/predictive_customers/delta_0000196_0000196/bucket_00001
-rw-r--r--   3 cw185105 svc-grp       1763 2016-12-14 10:19 /apps/hive/warehouse/svcrpt.db/predictive_customers/delta_0000196_0000196/bucket_00002
-rw-r--r--   3 cw185105 svc-grp       1843 2016-12-14 10:19 /apps/hive/warehouse/svcrpt.db/predictive_customers/delta_0000196_0000196/bucket_00003
-rw-r--r--   3 cw185105 svc-grp       2816 2016-12-14 10:19 /apps/hive/warehouse/svcrpt.db/predictive_customers/delta_0000196_0000196/bucket_00004
-rw-r--r--   3 cw185105 svc-grp       3020 2016-12-14 10:19 /apps/hive/warehouse/svcrpt.db/predictive_customers/delta_0000196_0000196/bucket_00005
-rw-r--r--   3 cw185105 svc-grp       2966 2016-12-14 10:19 /apps/hive/warehouse/svcrpt.db/predictive_customers/delta_0000196_0000196/bucket_00006
-rw-r--r--   3 cw185105 svc-grp       4762 2016-12-14 10:19 /apps/hive/warehouse/svcrpt.db/predictive_customers/delta_0000196_0000196/bucket_00007

Re: When I Distcp an ORC hive table from one cluster to a different cluster, the new hive table doesn't see/output any of the data

Guru

Notice that there is a separate subdirectory (delta_...) within the predictive_customers directory. There are 2 ways to work around this.

  1. Make sure distcp puts the files directly into predictive_customers and not into a subdirectory.
  2. Set the following properties to true in custom hive-site in the Ambari UI:
  • hive.mapred.supports.subdirectories to true
  • mapreduce.input.fileinputformat.input.dir.recursive to true (need to cross-check whether this is required)
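Before changing the cluster-wide config in Ambari, the two properties can be tried at session level in the Hive CLI; this is a sketch, assuming the table name from the thread:

```sql
-- Session-level override (the Ambari hive-site change makes this permanent)
SET hive.mapred.supports.subdirectories=true;
SET mapreduce.input.fileinputformat.input.dir.recursive=true;

-- If the subdirectory was the problem, this should now return rows
SELECT COUNT(*) FROM svcrpt.predictive_customers;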

Re: When I Distcp an ORC hive table from one cluster to a different cluster, the new hive table doesn't see/output any of the data

I definitely hear what you are saying. However, on my Dev cluster the HDFS directory looks like this, and the Hive table can read the data just fine:

[cw185105@rpc3848 ~]$ hadoop fs -ls /apps/hive/warehouse/svcrpt.db/predictive_customers
Found 1 items
drwxr-xr-x   - cw185105 svc-grp          0 2016-12-14 13:14 /apps/hive/warehouse/svcrpt.db/predictive_customers/delta_0000198_0000198
[cw185105@rpc3848 ~]$ hadoop fs -ls /apps/hive/warehouse/svcrpt.db/predictive_customers/delta_0000198_0000198
Found 8 items
-rw-r--r--   2 cw185105 svc-grp       1389 2016-12-14 13:14 /apps/hive/warehouse/svcrpt.db/predictive_customers/delta_0000198_0000198/bucket_00000
-rw-r--r--   2 cw185105 svc-grp       1711 2016-12-14 13:14 /apps/hive/warehouse/svcrpt.db/predictive_customers/delta_0000198_0000198/bucket_00001
-rw-r--r--   2 cw185105 svc-grp       1769 2016-12-14 13:14 /apps/hive/warehouse/svcrpt.db/predictive_customers/delta_0000198_0000198/bucket_00002
-rw-r--r--   2 cw185105 svc-grp       1843 2016-12-14 13:14 /apps/hive/warehouse/svcrpt.db/predictive_customers/delta_0000198_0000198/bucket_00003
-rw-r--r--   2 cw185105 svc-grp       2816 2016-12-14 13:14 /apps/hive/warehouse/svcrpt.db/predictive_customers/delta_0000198_0000198/bucket_00004
-rw-r--r--   2 cw185105 svc-grp       3020 2016-12-14 13:14 /apps/hive/warehouse/svcrpt.db/predictive_customers/delta_0000198_0000198/bucket_00005
-rw-r--r--   2 cw185105 svc-grp       2966 2016-12-14 13:14 /apps/hive/warehouse/svcrpt.db/predictive_customers/delta_0000198_0000198/bucket_00006
-rw-r--r--   2 cw185105 svc-grp       4768 2016-12-14 13:14 /apps/hive/warehouse/svcrpt.db/predictive_customers/delta_0000198_0000198/bucket_00007

I checked the hive settings and here is what we currently have set (on Dev and Prod) without making any changes:

hive> set hive.mapred.supports.subdirectories;
hive.mapred.supports.subdirectories=false
hive> set mapreduce.input.fileinputformat.input.dir.recursive;
mapreduce.input.fileinputformat.input.dir.recursive=false

The table on Dev was populated via SQL (TRUNCATE TABLE svcrpt.predictive_customers; INSERT INTO svcrpt.predictive_customers SELECT * FROM svctmp.predictive_customers_temp2;), while the Prod table was populated by distcp'ing the Dev HDFS directory to the Prod HDFS directory using "hadoop distcp -overwrite -delete".
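For reference, the copy described above would look roughly like this (a sketch; the Dev nameservice name DEV00 is an assumption, only PRD00 appears in the DDL):

```shell
# -overwrite replaces existing target files; -delete removes files on the
# target that no longer exist on the source (-delete requires -update or -overwrite)
hadoop distcp -overwrite -delete \
  hdfs://DEV00/apps/hive/warehouse/svcrpt.db/predictive_customers \
  hdfs://PRD00/apps/hive/warehouse/svcrpt.db/predictive_customers
```

This copies the delta_* subdirectories verbatim, which is consistent with the Prod listing shown earlier in the thread.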

Re: When I Distcp an ORC hive table from one cluster to a different cluster, the new hive table doesn't see/output any of the data

Also, I only receive the MSCK REPAIR TABLE error when I run it on that table; if I run it on another table, it works.