Explorer
Posts: 7
Registered: ‎08-22-2018

HDFS or HIVE Replication

Hi, 

 

The HDFS and Hive replication schedule is failing with the error below.

 

18/08/22 19:40:02 WARN security.UserGroupInformation: PriviledgedActionException as:hdfs (auth:PROXY) via hdfs/<FQDN>@realm (auth:KERBEROS) cause:java.io.IOException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.yarn.exceptions.YarnRuntimeException): org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Could not load history file hdfs://nameservice9:8020/user/history/done_intermediate/hdfs/job_1534964122122_0006-1534978377784-hdfs-HdfsReplication-1534981192465-20-1-SUCCEEDED-root.users.hdfs-1534978383044.jhist
at org.apache.hadoop.mapreduce.v2.hs.CachedHistoryStorage.getFullJob(CachedHistoryStorage.java:199)
at org.apache.hadoop.mapreduce.v2.hs.JobHistory.getJob(JobHistory.java:218)
at org.apache.hadoop.mapreduce.v2.hs.HistoryClientService$HSClientProtocolHandler$1.run(HistoryClientService.java:208)
at org.apache.hadoop.mapreduce.v2.hs.HistoryClientService$HSClientProtocolHandler$1.run(HistoryClientService.java:204)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
at org.apache.hadoop.mapreduce.v2.hs.HistoryClientService$HSClientProtocolHandler.verifyAndGetJob(HistoryClientService.java:204)
at org.apache.hadoop.mapreduce.v2.hs.HistoryClientService$HSClientProtocolHandler.getJobReport(HistoryClientService.java:236)
at org.apache.hadoop.mapreduce.v2.api.impl.pb.service.MRClientProtocolPBServiceImpl.getJobReport(MRClientProtocolPBServiceImpl.java:122)
at org.apache.hadoop.yarn.proto.MRClientProtocol$MRClientProtocolService$2.callBlockingMethod(MRClientProtocol.java:275)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2220)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2216)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2214)
Caused by: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Could not load history file hdfs://nameservice9:8020/user/history/done_intermediate/hdfs/job_1534964122122_0006-1534978377784-hdfs-HdfsReplication-1534981192465-20-1-SUCCEEDED-root.users.hdfs-1534978383044.jhist
at org.apache.hadoop.mapreduce.v2.hs.CompletedJob.loadFullHistoryData(CompletedJob.java:341)
at org.apache.hadoop.mapreduce.v2.hs.CompletedJob.<init>(CompletedJob.java:101)
at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager$HistoryFileInfo.loadJob(HistoryFileManager.java:479)
at org.apache.hadoop.mapreduce.v2.hs.CachedHistoryStorage.loadJob(CachedHistoryStorage.java:180)
at org.apache.hadoop.mapreduce.v2.hs.CachedHistoryStorage.access$000(CachedHistoryStorage.java:52)
at org.apache.hadoop.mapreduce.v2.hs.CachedHistoryStorage$1.load(CachedHistoryStorage.java:103)
at org.apache.hadoop.mapreduce.v2.hs.CachedHistoryStorage$1.load(CachedHistoryStorage.java:100)
at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3568)
at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2350)
at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2313)
at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2228)
at com.google.common.cache.LocalCache.get(LocalCache.java:3965)
at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3969)
at com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4829)
at com.google.common.cache.LocalCache$LocalManualCache.getUnchecked(LocalCache.java:4834)
at org.apache.hadoop.mapreduce.v2.hs.CachedHistoryStorage.getFullJob(CachedHistoryStorage.java:193)
... 18 more
Caused by: org.apache.hadoop.security.AccessControlException: Permission denied: user=mapred, access=READ, inode="/user/history/done_intermediate/hdfs/job_1534964122122_0006-1534978377784-hdfs-HdfsReplication-1534981192465-20-1-SUCCEEDED-root.users.hdfs-1534978383044.jhist":hdfs:supergroup:-rwxrwx---

Master
Posts: 313
Registered: ‎07-01-2015

Re: HDFS or HIVE Replication

This just says that the mapred user doesn't have permission to read that file. Without knowing more details, one possible solution is to add the mapred user to the supergroup group on every worker node.
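
To see who was denied what, it can help to pull the fields out of the last "Caused by" line. The following is only a rough sketch: parse_denied is a hypothetical helper name (not part of any Hadoop tool), and it assumes the message keeps the `user=..., access=..., inode="path":owner:group:mode` layout shown in the log above.

```shell
#!/bin/sh
# Sketch: summarize an HDFS "Permission denied" log line.
# parse_denied is a hypothetical helper name, not a Hadoop command.
parse_denied() {
  # Reads log text on stdin; prints one summary line per matching entry.
  sed -n -E 's/.*Permission denied: user=([^,]+), access=([A-Z_]+), inode="([^"]+)":([^:]+):([^:]+):([^ ]+).*/user=\1 access=\2 path=\3 owner=\4 group=\5 mode=\6/p'
}

# Usage (the log file name is an assumption):
#   parse_denied < jobhistory-server.log
```

For the error in this thread, the mode -rwxrwx--- gives no access to "other", so mapred can only read the file if it is the owner or a member of the file's group.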

 

Expert Contributor
Posts: 133
Registered: ‎01-08-2018

Re: HDFS or HIVE Replication

If ownership of "/user/history" has not been set to "hdfs:supergroup" on purpose, then this is simply a configuration issue. This directory should be owned by the mapred user. Normally, this directory is automatically created and configured by Cloudera Manager. If you are doing a manual installation, you can check the steps in https://www.cloudera.com/documentation/enterprise/5-9-x/topics/cdh_ig_yarn_cluster_deploy.html#topic...
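
A quick way to check the current ownership is sketched below. owner_and_group is a hypothetical helper name; it assumes the usual `hdfs dfs -ls` column layout, where the third and fourth columns are owner and group.

```shell
#!/bin/sh
# Sketch: print "owner:group" for a single HDFS path.
# owner_and_group is a hypothetical helper name.
owner_and_group() {
  # -d lists the directory entry itself instead of its contents.
  hdfs dfs -ls -d "$1" | awk '{ print $3 ":" $4 }'
}

# Usage:
#   owner_and_group /user/history
```

If this prints something other than what you expect for /user/history, the ownership was changed at some point after the directory was created.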
Explorer
Posts: 7
Registered: ‎08-22-2018

Re: HDFS or HIVE Replication

Are you talking about the destination cluster? Do you know how to add the mapred user to supergroup? I don't see a group named supergroup at all in /etc/group, but I do see the mapred user in the hadoop group. I need to add the mapred user to supergroup because I am running BDR jobs as the hdfs user.

Explorer
Posts: 7
Registered: ‎08-22-2018

Re: HDFS or HIVE Replication

Since we are running BDR jobs as the hdfs user, I think that is why '/user/history' was set to 'hdfs:supergroup'. I changed it to mapred:hadoop and ran the job again, but it still fails with the same error. I see that it creates another directory inside '/user/history/done_intermediate', named after whoever runs the MapReduce job (here, 'hdfs'). It writes .jhist, .summary, and .xml files there with the permissions and ownership shown below.

 

-rwxrwx---   1 hdfs supergroup

 

Can we change anything so that the files it creates in '/user/history/done_intermediate/hdfs' have sufficient read/write permissions?

Expert Contributor
Posts: 133
Registered: ‎01-08-2018

Re: HDFS or HIVE Replication

In that case, you can use HDFS ACLs to grant the mapred user read permission on all existing and any future files under this directory. https://www.cloudera.com/documentation/enterprise/5-14-x/topics/cdh_sg_hdfs_ext_acls.html#concept_hd...
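
As a rough sketch of that idea, assuming ACLs are enabled on the NameNode: the access ACL covers what already exists, and the default ACL is inherited by anything created later. grant_read_acl and the RUN variable are hypothetical names for this example; by default the function only prints the commands so you can review them first.

```shell
#!/bin/sh
# Sketch: grant one user read access to a directory tree via HDFS ACLs.
# grant_read_acl and RUN are hypothetical names used for this example.
# Dry run by default (commands are printed); set RUN=1 to execute them.
grant_read_acl() {
  local user="$1" dir="$2"
  local runner="echo"
  if [ "${RUN:-0}" = "1" ]; then runner=""; fi
  # Access ACL: applies to files and directories that already exist.
  $runner hdfs dfs -setfacl -R -m "user:${user}:r-x" "$dir"
  # Default ACL: set on directories, inherited by children created later.
  $runner hdfs dfs -setfacl -R -m "default:user:${user}:r-x" "$dir"
}

# Usage:
#   grant_read_acl mapred /user/history          # print only
#   RUN=1 grant_read_acl mapred /user/history    # apply
```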
Master
Posts: 313
Registered: ‎07-01-2015

Re: HDFS or HIVE Replication

As @GeKas said, enable ACLs on HDFS if you don't have them enabled already (dfs.namenode.acls.enabled should be checked).

 

Then set a default group ACL on the parent directory, so that every subdirectory is accessible to the mapred user (assuming mapred is in the hadoop group):

 

hdfs dfs -setfacl -R -m default:group:hadoop:r-x /user/history
hdfs dfs -setfacl -R -m group:hadoop:r-x /user/history

Then try it again.
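
Before rerunning, it may be worth confirming that the ACLs took effect. This is a sketch only: acls_enabled and show_history_acl are hypothetical helper names, and `hdfs getconf` reads the local client configuration, which is assumed here to match what the NameNode is running with.

```shell
#!/bin/sh
# Sketch: confirm ACL support is on, then show the ACLs on a directory.
# acls_enabled and show_history_acl are hypothetical helper names.
acls_enabled() {
  # Reads the local client configuration, not the live NameNode state.
  [ "$(hdfs getconf -confKey dfs.namenode.acls.enabled 2>/dev/null)" = "true" ]
}

show_history_acl() {
  if acls_enabled; then
    hdfs dfs -getfacl "${1:-/user/history}"
  else
    echo "dfs.namenode.acls.enabled is not true in the local configuration" >&2
    return 1
  fi
}

# Usage:
#   show_history_acl /user/history
```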

Explorer
Posts: 7
Registered: ‎08-22-2018

Re: HDFS or HIVE Replication

I triggered the HDFS BDR job as the hdfs user. It failed with the error message below.

 

INFO distcp.DistCp: map 96% reduce 0% files 99% bytes 80% throughput 24.7 (MB/s) remaining time 25 mins 27 secs running mappers 1
 INFO distcp.DistCp: map 96% reduce 32% files 99% bytes 80% throughput 24.7 (MB/s) remaining time 25 mins 27 secs running mappers 1
 INFO ipc.Client: Retrying connect to server: FQDN/IP:PORT. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
 INFO ipc.Client: Retrying connect to server: FQDN/IP:PORT. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
 INFO ipc.Client: Retrying connect to server: FQDN/IP:PORT. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
WARN security.UserGroupInformation: PriviledgedActionException as:hdfs (auth:PROXY) via <hdfs_principal_name> (auth:KERBEROS) cause:java.io.IOException: Job status not available
 ERROR util.DistCpUtils: Exception encountered
java.io.IOException: Job status not available
at org.apache.hadoop.mapreduce.Job.updateStatus(Job.java:334)
at org.apache.hadoop.mapreduce.Job.isComplete(Job.java:621)
at com.cloudera.enterprise.distcp.DistCp.checkProgress(DistCp.java:471)
at com.cloudera.enterprise.distcp.DistCp.execute(DistCp.java:461)
at com.cloudera.enterprise.distcp.DistCp$1.run(DistCp.java:151)
at com.cloudera.enterprise.distcp.DistCp$1.run(DistCp.java:148)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
at com.cloudera.enterprise.distcp.DistCp.run(DistCp.java:148)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at com.cloudera.enterprise.distcp.DistCp.main(DistCp.java:843)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
18/09/05 16:32:09 INFO distcp.DistCp: Used diff: false
18/09/05 16:32:09 WARN distcp.DistCp: Killing submitted job job_1536157028747_0004

Posts: 1,748
Kudos: 364
Solutions: 277
Registered: ‎07-31-2013

Re: HDFS or HIVE Replication

This is related to the JobHistoryServer log reported earlier.

Please check the following two items so that the JHS and job completion work end to end:

First, ensure that 'mapred' and 'yarn' both belong to the 'hadoop' group:

~> hdfs groups mapred
~> hdfs groups yarn

Both commands must include 'hadoop' in their output. If not, add the users to that group.
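
If you want to script that check, something like the following could work. It is a sketch, assuming `hdfs groups` prints its usual `<user> : <group1> <group2> ...` line; user_in_group is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: succeed only if HDFS resolves <user> as a member of <group>.
# user_in_group is a hypothetical helper name.
user_in_group() {
  # Expected input line: "<user> : <group1> <group2> ..."
  hdfs groups "$1" | awk -F' : ' -v g="$2" '
    { n = split($2, a, " "); for (i = 1; i <= n; i++) if (a[i] == g) found = 1 }
    END { exit !found }'
}

# Usage:
#   for u in mapred yarn; do
#     user_in_group "$u" hadoop || echo "$u is not in hadoop"
#   done
```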

Second, ensure that all files and directories under the HDFS /tmp/logs aggregation dir (or whatever you've reconfigured it to use) and /user/history/* have their group set to 'hadoop' and nothing else:

~> hadoop fs -chgrp -R hadoop /user/history /tmp/logs
~> hadoop fs -chmod -R g+rwx /user/history /tmp/logs
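
To see which entries still carry a different group (before or after running the commands above), a listing can be filtered as sketched here. list_wrong_group is a hypothetical helper name, and the sketch assumes the standard `hdfs dfs -ls -R` column layout and paths without spaces.

```shell
#!/bin/sh
# Sketch: list "<group> <path>" for entries whose group is not the expected one.
# list_wrong_group is a hypothetical helper name.
list_wrong_group() {
  # Columns: perms, replication, owner, group, size, date, time, path.
  hdfs dfs -ls -R "$1" | awk -v g="$2" '$4 != g { print $4, $8 }'
}

# Usage:
#   list_wrong_group /user/history hadoop
#   list_wrong_group /tmp/logs hadoop
```

Empty output means every listed entry already has the expected group.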

Note: ACLs suggested earlier are not required to resolve this problem. The group used on these dirs is what matters in the default state, and the group setup described above is how YARN and JHS daemon users share information and responsibilities with each other. You may remove any ACLs set, or leave them be as they are still permissive.