Member since: 05-21-2021
Posts: 34
Kudos Received: 2
Solutions: 3
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 802 | 06-23-2022 01:06 AM |
| | 1791 | 04-22-2022 02:24 AM |
| | 8947 | 03-29-2022 01:20 AM |
04-21-2022
07:18 AM
The logs from the CM agent on the host running the task are shown below (a quick way to pull the check's own stderr/stdout is sketched after the log).
[21/Apr/2022 15:55:04 +0200] 1697 __run_queue process INFO [2815-hdfs-precopylistingcheck-40444302] Launching process. one-off True, command dr/precopylistingcheck.sh, args [u'-bandwidth', u'100', u'-i', u'-m', u'20', u'-prbugpa', u'-skipAclErr', u'-update', u'-proxyuser', u'hbackup', u'-log', u'/user/PROXY_USER_PLACEHOLDER/.cm/distcp/2022-04-21_9975', u'-sequenceFilePath', u'/user/PROXY_USER_PLACEHOLDER/.cm/distcp-staging/2022-04-21-13-55-02-50a875dd/fileList.seq', u'-diffRenameDeletePath', u'/user/PROXY_USER_PLACEHOLDER/.cm/distcp-staging/2022-04-21-13-55-02-50a875dd/renamesDeletesList.seq', u'-sourceconf', u'source-client-conf', u'-sourceprincipal', u'hdfs/SOURCE_HOSTNAME', u'-sourcetktcache', u'source.tgt', u'-copyListingOnSource', u'-useSnapshots', u'distcp-33--26584462', u'-ignoreSnapshotFailures', u'-diff', u'-useDistCpFileStatus', u'-replaceNameservice', u'-strategy', u'dynamic', u'-filters', u'exclusion-filter.list', u'-scheduleId', u'33', u'-scheduleName', u'test-copy', u'/test-prod2-copy', u'/test-prod2-copy']
[21/Apr/2022 15:55:04 +0200] 1697 __run_queue supervisor WARNING Failed while getting process info. Retrying. (<Fault 10: 'BAD_NAME: 2815-hdfs-precopylistingcheck-40444302'>)
[21/Apr/2022 15:55:04 +0200] 1697 __run_queue supervisor INFO Triggering supervisord update.
[21/Apr/2022 15:55:04 +0200] 1697 __run_queue util INFO Using generic audit plugin for process hdfs-precopylistingcheck-40444302
[21/Apr/2022 15:55:04 +0200] 1697 __run_queue util INFO Creating metadata plugin for process hdfs-precopylistingcheck-40444302
[21/Apr/2022 15:55:04 +0200] 1697 __run_queue util INFO Using specific metadata plugin for process hdfs-precopylistingcheck-40444302
[21/Apr/2022 15:55:04 +0200] 1697 __run_queue util INFO Using generic metadata plugin for process hdfs-precopylistingcheck-40444302
[21/Apr/2022 15:55:04 +0200] 1697 __run_queue process INFO Begin audit plugin refresh
[21/Apr/2022 15:55:04 +0200] 1697 __run_queue throttling_logger INFO (22 skipped) Scheduling a refresh for Audit Plugin for hdfs-precopylistingcheck-40444302 with count 1 pipelines names [''].
[21/Apr/2022 15:55:04 +0200] 1697 __run_queue process INFO Begin metadata plugin refresh
[21/Apr/2022 15:55:04 +0200] 1697 __run_queue process INFO Not creating a monitor for 2815-hdfs-precopylistingcheck-40444302: should_monitor returns false
[21/Apr/2022 15:55:04 +0200] 1697 __run_queue process INFO Daemon refresh complete for process 2815-hdfs-precopylistingcheck-40444302.
[21/Apr/2022 15:55:09 +0200] 1697 Metadata-Plugin navigator_plugin INFO Pipelines updated for Metadata Plugin: []
[21/Apr/2022 15:55:09 +0200] 1697 Metadata-Plugin throttling_logger INFO (22 skipped) Refreshing Metadata Plugin for hdfs-precopylistingcheck-40444302 with count 0 pipelines names [].
[21/Apr/2022 15:55:09 +0200] 1697 Audit-Plugin navigator_plugin INFO Pipelines updated for Audit Plugin: []
[21/Apr/2022 15:55:10 +0200] 1697 MainThread process INFO [2815-hdfs-precopylistingcheck-40444302] Unregistered supervisor process EXITED
[21/Apr/2022 15:55:10 +0200] 1697 MainThread supervisor INFO Triggering supervisord update.
[21/Apr/2022 15:55:10 +0200] 1697 MainThread throttling_logger INFO Removed keytab /var/run/cloudera-scm-agent/process/2815-hdfs-precopylistingcheck-40444302/hdfs.keytab as a candidate to kinit from
[21/Apr/2022 15:55:25 +0200] 1697 __run_queue process INFO [2815-hdfs-precopylistingcheck-40444302] Updating process: False {u'running': (True, False), u'run_generation': (1, 5)}
[21/Apr/2022 15:55:25 +0200] 1697 __run_queue process INFO [2815-hdfs-precopylistingcheck-40444302] Deactivating process (skipped)
[21/Apr/2022 15:55:25 +0200] 1697 __run_queue process INFO [2815-hdfs-precopylistingcheck-40444302] stopping monitors
[21/Apr/2022 15:55:29 +0200] 1697 Metadata-Plugin navigator_plugin INFO stopping Metadata Plugin for hdfs-precopylistingcheck-40444302 with count 0 pipelines names [].
[21/Apr/2022 15:55:29 +0200] 1697 Audit-Plugin navigator_plugin INFO stopping Audit Plugin for hdfs-precopylistingcheck-40444302 with count 0 pipelines names [].
[21/Apr/2022 15:55:40 +0200] 1697 __run_queue process INFO [2815-hdfs-precopylistingcheck-40444302] Updating process: False {u'run_generation': (5, 8)}
[21/Apr/2022 15:55:40 +0200] 1697 __run_queue process INFO [2815-hdfs-precopylistingcheck-40444302] Deactivating process (skipped)
[21/Apr/2022 15:55:40 +0200] 1697 __run_queue process INFO [2815-hdfs-precopylistingcheck-40444302] stopping monitors
[21/Apr/2022 15:55:55 +0200] 1697 __run_queue process INFO [2815-hdfs-precopylistingcheck-40444302] Updating process: False {u'run_generation': (8, 11)}
[21/Apr/2022 15:55:55 +0200] 1697 __run_queue process INFO [2815-hdfs-precopylistingcheck-40444302] Deactivating process (skipped)
[21/Apr/2022 15:55:55 +0200] 1697 __run_queue process INFO [2815-hdfs-precopylistingcheck-40444302] stopping monitors
[21/Apr/2022 15:56:10 +0200] 1697 __run_queue process INFO [2815-hdfs-precopylistingcheck-40444302] Updating process: False {u'run_generation': (11, 15)}
[21/Apr/2022 15:56:10 +0200] 1697 __run_queue process INFO [2815-hdfs-precopylistingcheck-40444302] Deactivating process (skipped)
[21/Apr/2022 15:56:10 +0200] 1697 __run_queue process INFO [2815-hdfs-precopylistingcheck-40444302] stopping monitors
[21/Apr/2022 15:56:25 +0200] 1697 __run_queue process INFO [2815-hdfs-precopylistingcheck-40444302] Updating process: False {u'run_generation': (15, 19)}
[21/Apr/2022 15:56:25 +0200] 1697 __run_queue process INFO [2815-hdfs-precopylistingcheck-40444302] Deactivating process (skipped)
[21/Apr/2022 15:56:25 +0200] 1697 __run_queue process INFO [2815-hdfs-precopylistingcheck-40444302] stopping monitors
[21/Apr/2022 15:56:40 +0200] 1697 __run_queue process INFO [2815-hdfs-precopylistingcheck-40444302] Updating process: False {u'run_generation': (19, 23)}
[21/Apr/2022 15:56:40 +0200] 1697 __run_queue process INFO [2815-hdfs-precopylistingcheck-40444302] Deactivating process (skipped)
[21/Apr/2022 15:56:40 +0200] 1697 __run_queue process INFO [2815-hdfs-precopylistingcheck-40444302] stopping monitors
[21/Apr/2022 15:56:55 +0200] 1697 __run_queue process INFO [2815-hdfs-precopylistingcheck-40444302] Updating process: False {u'run_generation': (23, 27)}
[21/Apr/2022 15:56:55 +0200] 1697 __run_queue process INFO [2815-hdfs-precopylistingcheck-40444302] Deactivating process (skipped)
[21/Apr/2022 15:56:55 +0200] 1697 __run_queue process INFO [2815-hdfs-precopylistingcheck-40444302] stopping monitors
The following log lines keep repeating:
[21/Apr/2022 15:56:55 +0200] 1697 __run_queue process INFO [2815-hdfs-precopylistingcheck-40444302] Updating process: False {u'run_generation': (23, 27)}
[21/Apr/2022 15:56:55 +0200] 1697 __run_queue process INFO [2815-hdfs-precopylistingcheck-40444302] Deactivating process (skipped)
[21/Apr/2022 15:56:55 +0200] 1697 __run_queue process INFO [2815-hdfs-precopylistingcheck-40444302] stopping monitors
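Since the agent launches the pre-copy listing check and then reports it as EXITED, the check's own output under the agent's process directory is usually the most telling place to look. A minimal sketch, reusing the process directory name from the log above and assuming a default CM agent installation:
$ # Inspect the stderr/stdout of the pre-copy listing check on this host
$ ls /var/run/cloudera-scm-agent/process/2815-hdfs-precopylistingcheck-40444302/logs/
$ tail -n 100 /var/run/cloudera-scm-agent/process/2815-hdfs-precopylistingcheck-40444302/logs/stderr.log
$ tail -n 100 /var/run/cloudera-scm-agent/process/2815-hdfs-precopylistingcheck-40444302/logs/stdout.log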
04-21-2022
03:29 AM
Hello Team, In our customer's cluster we are testing HDFS replication through Cloudera Manager. The replication policy looks as follows; all other configuration is left at its defaults. The replication has been hung in the state below for a long time. We looked into the Cloudera Manager logs and can see the error below occurring repeatedly. Can you please help us resolve the issue?
2022-04-21 12:27:57,199 ERROR CommandPusher-1:com.cloudera.cmf.service.AgentResultFetcher: Exception occured while handling tempfile com.cloudera.cmf.service.AgentResultFetcher@618eac09
Best Regards
Sayed Anisul Hoque
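The ERROR line above is only the first line of the entry; to capture the full stack trace we have been pulling it out of the Cloudera Manager server log. A minimal sketch, assuming the default CM server log location:
$ # Grab the full stack trace that follows the AgentResultFetcher error
$ grep -B 2 -A 40 'Exception occured while handling tempfile' /var/log/cloudera-scm-server/cloudera-scm-server.log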
Labels:
- Cloudera Data Platform (CDP)
- HDFS
04-19-2022
04:06 PM
@Bharati Thank you! This worked. However, could you please share which logs showed that it was trying to copy the system database and information_schema?
04-19-2022
08:54 AM
Hello Team, We are setting up Hive replication through Cloudera Manager. The replication policy looks as follows. Note that we also enabled snapshots on the source cluster for the path /warehouse. However, when we press Save Policy we get the notification below. We looked into the Cloudera Manager logs and can see the error below. Can you please help us find the correct configuration to resolve the issue? (A quick way to confirm the snapshottable path on the source side is sketched after the stack trace.)
2022-04-19 15:51:07,848 ERROR scm-web-1686:com.cloudera.server.web.cmf.WebController: getHiveWarehouseSnapshotsEnabled
javax.ws.rs.NotAuthorizedException: HTTP 401 Unauthorized
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.cxf.jaxrs.client.AbstractClient.convertToWebApplicationException(AbstractClient.java:507)
at org.apache.cxf.jaxrs.client.ClientProxyImpl.checkResponse(ClientProxyImpl.java:324)
at org.apache.cxf.jaxrs.client.ClientProxyImpl.handleResponse(ClientProxyImpl.java:878)
at org.apache.cxf.jaxrs.client.ClientProxyImpl.doChainedInvocation(ClientProxyImpl.java:791)
....
....
....
....
at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:683)
at java.lang.Thread.run(Thread.java:750)
2022-04-19 15:51:07,849 ERROR scm-web-1686:com.cloudera.server.web.common.JsonResponse: JsonResponse created with throwable:
javax.ws.rs.NotAuthorizedException: HTTP 401 Unauthorized
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.cxf.jaxrs.client.AbstractClient.convertToWebApplicationException(AbstractClient.java:507)
at org.apache.cxf.jaxrs.client.ClientProxyImpl.checkResponse(ClientProxyImpl.java:324)
at org.apache.cxf.jaxrs.client.ClientProxyImpl.handleResponse(ClientProxyImpl.java:878)
at org.apache.cxf.jaxrs.client.ClientProxyImpl.doChainedInvocation(ClientProxyImpl.java:791)
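A quick way to double-check on the source cluster that /warehouse is actually snapshottable, using the standard HDFS CLI (run as the HDFS superuser):
$ # Directories with snapshots enabled, as seen by the current user
$ hdfs lsSnapshottableDir
$ # The .snapshot pseudo-directory only exists if snapshots are enabled on the path
$ hdfs dfs -ls /warehouse/.snapshot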
03-29-2022
01:20 AM
With the help of @mszurap we could narrow down the issue. There were two issues: the first was caused by an OOM, and the second came from the application itself. Below are some of the logs that we noticed during the Oozie job run.
22/03/23 13:18:54 INFO mapred.SparkHadoopMapRedUtil: attempt_20220323131847_0000_m_000000_0: Committed
22/03/23 13:18:54 INFO executor.Executor: Finished task 0.0 in stage 0.0 (TID 0). 1384 bytes result sent to driver
22/03/23 13:19:55 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
22/03/23 13:19:55 INFO storage.DiskBlockManager: Shutdown hook called
22/03/23 13:19:55 INFO util.ShutdownHookManager: Shutdown hook called
From Miklos: the "executor.Executor" ... "RECEIVED SIGNAL TERM" message is completely normal; it simply means an executor was killed by the AM/Driver. Since the Spark job was succeeding in the lower environments (like Dev/Test), the suggestion was to check whether the application uses the same dependencies in those environments (comparing the Spark event logs for a good run and a bad run), and also to check the driver YARN logs, since there could be an abrupt exit due to an OOM. We then looked in the direction of the OOM and also confirmed there were no System.exit() calls in the Spark code. We updated the driver memory to 2 GB and re-ran the job, and now we can see the actual error (the error from the application itself). Hope this helps someone in the future.
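For completeness, this is roughly the change that surfaced the real error. With a plain spark-submit the flag is --driver-memory; when the job is launched from an Oozie Spark action, the same flags go into the <spark-opts> element of the workflow. The class and jar names below are placeholders:
$ # Raise the driver heap so the driver no longer dies with an OOM before logging the real failure
$ spark-submit --master yarn --deploy-mode cluster --driver-memory 2g --class com.example.MyJob my-application.jar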
03-25-2022
10:02 AM
Hello Team,
A Spark job submitted through Oozie is failing with the exception below in the Prod cluster. Note that the same job passes in the lower clusters (e.g. Dev).
22/03/24 05:05:22 INFO spark.SparkContext: Invoking stop() from shutdown hook
22/03/24 05:05:22 ERROR scheduler.AsyncEventQueue: Listener EventLoggingListener threw an exception
java.io.IOException: Filesystem closed
at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:477)
at org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:627)
at org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:583)
at org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:134)
at org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:145)
at org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:145)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:145)
at org.apache.spark.scheduler.EventLoggingListener.onApplicationEnd(EventLoggingListener.scala:191)
at org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:57)
at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:91)
at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$super$postToAll(AsyncEventQueue.scala:92)
at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply$mcJ$sp(AsyncEventQueue.scala:92)
at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:87)
at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:87)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:87)
at org.apache.spark.scheduler.AsyncEventQueue$$anon$1$$anonfun$run$1.apply$mcV$sp(AsyncEventQueue.scala:83)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1231)
at org.apache.spark.scheduler.AsyncEventQueue$$anon$1.run(AsyncEventQueue.scala:82)
22/03/24 05:05:22 INFO server.AbstractConnector: Stopped Spark@12bcc45b{HTTP/1.1, (http/1.1)}{0.0.0.0:0}
22/03/24 05:05:22 INFO ui.SparkUI: Stopped Spark web UI at http://xxxxxxxxxxxxxxx:33687
22/03/24 05:05:22 INFO cluster.YarnClusterSchedulerBackend: Shutting down all executors
22/03/24 05:05:22 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
22/03/24 05:05:22 INFO cluster.SchedulerExtensionServices: Stopping SchedulerExtensionServices
(serviceOption=None,
services=List(),
started=false)
22/03/24 05:05:22 ERROR util.Utils: Uncaught exception in thread shutdown-hook-0
java.io.IOException: Filesystem closed
at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:477)
at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1685)
at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1745)
at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1742)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1757)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1723)
at org.apache.spark.scheduler.EventLoggingListener.stop(EventLoggingListener.scala:249)
at org.apache.spark.SparkContext$$anonfun$stop$9$$anonfun$apply$mcV$sp$7.apply(SparkContext.scala:1966)
at org.apache.spark.SparkContext$$anonfun$stop$9$$anonfun$apply$mcV$sp$7.apply(SparkContext.scala:1966)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.SparkContext$$anonfun$stop$9.apply$mcV$sp(SparkContext.scala:1966)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1269)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1965)
at org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:578)
at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1874)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
22/03/24 05:05:22 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
22/03/24 05:05:22 INFO memory.MemoryStore: MemoryStore cleared
22/03/24 05:05:22 INFO storage.BlockManager: BlockManager stopped
22/03/24 05:05:22 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
22/03/24 05:05:22 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
22/03/24 05:05:22 INFO spark.SparkContext: Successfully stopped SparkContext
22/03/24 05:05:22 INFO yarn.ApplicationMaster: Unregistering ApplicationMaster with SUCCEEDED
22/03/24 05:05:22 INFO impl.AMRMClientImpl: Waiting for application to be successfully unregistered.
We have already looked into the NameNode logs and couldn't find any ERROR related to this. Please help us resolve the issue.
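If it helps, the complete driver/AM container logs for a failed run can be pulled with the standard YARN CLI; a minimal sketch (the application ID below is a placeholder):
$ # Fetch the aggregated YARN logs for the failed Oozie launcher / Spark application
$ yarn logs -applicationId application_1648000000000_0001 > app_logs.txt
$ grep -iE 'error|exception|killed|outofmemory' app_logs.txt | head -n 50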
Best Regards
Labels:
- Apache Oozie
- Apache Spark
03-24-2022
03:34 AM
@Scharan Can you please give a short explanation (my customer is asking for it) of why the shadow file matters in this case, i.e. what is the relation between Knox and the shadow file? Thank you!
03-24-2022
03:22 AM
Yes, that resolved the issue! I had 000 as the permission on /etc/shadow. Thank you @Scharan, I appreciate the quick reply.
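For anyone who lands here later: as far as I understand, with this PAM setup authentication ultimately goes through pam_unix (via system-auth), and since the gateway does not run as root it needs to be able to read the hashes in /etc/shadow, which a 000 mode prevents. One way to grant only the Knox service account read access (assuming the gateway runs as the knox user; your fix may differ):
$ # Current mode of the shadow file
$ ls -l /etc/shadow
$ # Grant the knox service user read access via an ACL, then verify
$ setfacl -m u:knox:r /etc/shadow
$ getfacl /etc/shadow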
03-24-2022
03:05 AM
Hello Team, I have an issue setting up Knox authentication with PAM. I have the default login configuration in /etc/pam.d/:
$ cat /etc/pam.d/login
#%PAM-1.0
auth [user_unknown=ignore success=ok ignore=ignore default=bad] pam_securetty.so
auth substack system-auth
auth include postlogin
account required pam_nologin.so
account include system-auth
password include system-auth
# pam_selinux.so close should be the first session rule
session required pam_selinux.so close
session required pam_loginuid.so
session optional pam_console.so
# pam_selinux.so open should only be followed by sessions to be executed in the user context
session required pam_selinux.so open
session required pam_namespace.so
session optional pam_keyinit.so force revoke
session include system-auth
session include postlogin
-session optional pam_ck_connector.so
The Knox SSO configuration looks as follows (the default one). I created a user named test with a password. When I try to access the Knox Gateway UI, I get an error. The Knox Gateway log says:
(KnoxPamRealm.java:handleAuthFailure(170)) - Shiro unable to login: null
Note: I am using CDP 7.1.6, and I can log in to the host where the Knox Gateway is installed using the test user. Also, there is no Kerberos setup. Please share if there is something that needs to be adjusted.
Best Regards
Sayed
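For reference, some basic checks on the gateway host that might narrow this down; a rough sketch, assuming the Shiro realm points at the login PAM profile and the gateway runs as the knox service user:
$ # The PAM profile being used and the OS account being tested
$ ls -l /etc/pam.d/login
$ id test
$ # Knox is not root, so verify its service account can actually read the password hashes
$ sudo -u knox cat /etc/shadow > /dev/null && echo 'knox can read /etc/shadow' || echo 'knox cannot read /etc/shadow'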