Member since: 12-19-2017
Posts: 23
Kudos Received: 2
Solutions: 0
07-22-2019
04:10 AM
Last night we had a weird situation.
One of our Spark processes ended 3 minutes after the backup job started.
That backup is just a simple mysqldump of all the metadata, followed by a fetchImage from HDFS, roughly like the sketch below.
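For context, the backup is basically just these two commands (a rough sketch; the user name and destination paths are placeholders, not our real script):

mysqldump --single-transaction --all-databases -u backup_user -p > /backup/cm_metadata_$(date +%F).sql
hdfs dfsadmin -fetchImage /backup/fsimage_$(date +%F)

As far as I know, the fetchImage step only downloads the most recent fsimage from the NameNode to a local directory, i.e. it is a read-only operation.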
My question is: is it possible that a specific Spark job, which had been running correctly for a few hours, was killed because the backup process started?
This Spark job only accesses HDFS (according to the development team), so could it be that the fetchImage is killing something, or signaling something to stop reading from HDFS?
I'm kind of confused at this point, which is why I'm asking the question here.
Our cluster has been very stable and this never happened before. The only weird thing is that the time and day of the backup coincide with the crashing Spark job. Like... 1 + 1 = 2...
Could it be something else?
2019-07-21 21:46:26 INFO ContainerManagementProtocolProxy:260 - Opening proxy : "NODE1 :)":8041
2019-07-22 00:33:35 INFO YarnAllocator:54 - Completed container container_e16_1562587047011_1317_01_000013 on host: "NODE 5 =)" (state: COMPLETE, exit status: 1)
2019-07-22 00:33:35 WARN YarnAllocator:66 - Container marked as failed: container_e16_1562587047011_1317_01_000013 on host: "NODE 5 =)". Exit status: 1. Diagnostics: Exception from container-launch.
Container id: container_e16_1562587047011_1317_01_000013
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:604)
at org.apache.hadoop.util.Shell.run(Shell.java:507)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:789)
at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:399)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Shell output: main : command provided 1
main : run as user is XXXXX
main : requested yarn user is XXXXX
Writing to tmp file /u11/hadoop/yarn/nm/nmPrivate/application_1562587047011_1317/container_e16_1562587047011_1317_01_000013/container_e16_1562587047011_1317_01_000013.pid.tmp
Writing to cgroup task files...
Thank you.
05-02-2019
01:14 AM
Sure, but that is not the objective (local mode on an external machine) when you have a 6-node cluster. In this case we need to run with --master yarn.
04-30-2019
01:02 AM
Hi everyone... once again I come to this community forum in despair. Let me explain. Our customer is trying to run Spark 2.2.0 on an external node that doesn't belong to the Cloudera cluster. This Cloudera cluster, CDH 5.15.1, has Spark on YARN (1.6.0) and Spark 2.6.0. The problem is that running a simple wordcount with Spark 2.2.0 from the external node with the --master yarn property ends up running on Spark 1.6.0 in the cluster... I've made multiple tests with no success... I'm starting to think that the only way to run Spark 2 is from inside a cluster node... Any ideas would be helpful, since I don't know what to do at this point to help the customer work with Spark 2... (the last resort is giving them direct access to the CDH cluster nodes... but we don't want that for security reasons...) Thanks...
EDIT1: Found this: https://www.cloudera.com/documentation/spark2/latest/topics/spark2_admin.html#default_tools It seems Cloudera already has something to force Spark 2 only on the cluster. Has anyone tried this?
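What I'm testing from the external node right now looks roughly like this (a sketch; SPARK_HOME, the client config path and the HDFS jar location are assumptions about our layout, not something from the Cloudera docs):

export HADOOP_CONF_DIR=/etc/hadoop/conf   # YARN/HDFS client configs copied from a cluster gateway
export SPARK_HOME=/opt/spark-2.2.0-bin-hadoop2.6
$SPARK_HOME/bin/spark-submit \
  --master yarn --deploy-mode client \
  --conf spark.yarn.jars="hdfs:///user/spark/spark2-jars/*.jar" \
  --class org.apache.spark.examples.JavaWordCount \
  $SPARK_HOME/examples/jars/spark-examples_2.11-2.2.0.jar /tmp/wordcount-input.txt

The idea is that as long as spark-submit comes from the Spark 2.2.0 install and spark.yarn.jars points at Spark 2 jars uploaded to HDFS, the containers should run Spark 2 even if the cluster's default gateway is Spark 1.6.0.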
12-18-2018
04:01 AM
We basically disabled "Kerberos Authentication for HTTP Web-Consoles" in YARN... so... yeah... let's hope someone finds this and figures out what is happening.
12-17-2018
01:29 AM
My team did the same last week, to test the expected behaviour. We didn't notice anything weird or unusual. Still, this was done on a couple of virtual machines... doing this in production without any reference from Cloudera feels a bit... dangerous. 🙂
12-13-2018
05:23 AM
We have customers asking about this same situation. We would like an official answer, please. Any advice would be good.
12-04-2018
12:11 AM
1 Kudo
I don't fully remember, but I think back then we had to use SSSD (via LDAP; we have another customer using a keytab) to fetch the groups, with hadoop.security.group.mapping = org.apache.hadoop.security.ShellBasedUnixGroupsMapping. That way we are able to fetch the groups of each user on the backend.
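A quick way to check the whole chain on a worker node, if anyone needs it (the user name is just an example):

id someuser          # groups as resolved by the OS / SSSD
hdfs groups someuser # groups as resolved by the NameNode through hadoop.security.group.mapping

If the two lists match, the ShellBasedUnixGroupsMapping is doing its job.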
11-22-2018
01:12 AM
1 Kudo
Hi. I've managed to solve (at least) the ticket cache issue. We were missing the information under the [domain_realm] tag in krb5.conf (sketched at the end of this post). Restarted all agents. (I'm still not 100% sure if this was it...) klist /var/run/cloudera-scm-agent/krb5cc_cm_agent_0
Ticket cache: FILE:/var/run/cloudera-scm-agent/krb5cc_cm_agent_0
Default principal: HTTP/sl000060.domain.stuff@LOCAL.REALM
Valid starting Expires Service principal
11/22/2018 09:00:10 11/23/2018 09:00:10 krbtgt/LOCAL.REALM@LOCAL.REALM
renew until 11/27/2018 09:00:10
11/22/2018 09:01:04 11/23/2018 09:00:10 HTTP/sl000060.domain.stuff@LOCAL.REALM
renew until 11/27/2018 09:00:10
Still, the reported error didn't go away...
Summary notes (the principals are just examples of how things are configured). I remind you that this is only happening with the following:
1. Only the Cloudera Agent with the YARN RM on Standby (High Availability on) presents the error.
2. If we disable the YARN property "Enable Kerberos Authentication for HTTP Web-Consoles", the error goes away.
3. All other services are running perfectly according to Cloudera Manager (no alarms/warnings), except the reported "Bad : The Cloudera Manager Agent is not able to communicate with this role's web server." on the Standby YARN RM.
4. ZooKeeper seems to be performing the correct actions, activating the Standby RM after the previously Active one is disabled (Standby to Active, and vice versa).
5. If we disable High Availability on YARN, therefore having only one ResourceManager, the problem goes away.
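For reference, what we added to krb5.conf was of this shape (anonymised the same way as the example principals above):

[domain_realm]
  .domain.stuff = LOCAL.REALM
  domain.stuff = LOCAL.REALM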
11-21-2018
07:56 AM
Problem: Cloudera Agent HTTP error 401.
Local Kerberos: active.
Version: CDH 6.0.0.
HDFS and YARN, both with HA on, running perfectly. Disabling/enabling a ResourceManager works perfectly; one RM goes active after the other goes down (for example). Still, whenever one RM goes into Standby mode, the cloudera-scm-agent starts showing the following error.
[21/Nov/2018 15:00:31 +0000] 30488 Monitor-GenericMonitor url ERROR Autentication error on attempt 1. Retrying after sleeping 1.000000 seconds.
Traceback (most recent call last):
File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/cmf/util/url.py", line 241, in urlopen_with_retry_on_authentication_errors
return function()
File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/cmf/monitor/generic/metric_collectors.py", line 220, in _open_url
password=self._password_value)
File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/cmf/util/url.py", line 82, in urlopen_with_timeout
return opener.open(url, data, timeout)
File "/usr/lib64/python2.7/urllib2.py", line 437, in open
response = meth(req, response)
File "/usr/lib64/python2.7/urllib2.py", line 550, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib64/python2.7/urllib2.py", line 469, in error
result = self._call_chain(*args)
File "/usr/lib64/python2.7/urllib2.py", line 409, in _call_chain
result = func(*args)
File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/urllib2_kerberos.py", line 203, in http_error_401
retry = self.http_error_auth_reqed(host, req, headers)
File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/urllib2_kerberos.py", line 127, in http_error_auth_reqed
return self.retry_http_kerberos_auth(req, headers, neg_value)
File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/urllib2_kerberos.py", line 143, in retry_http_kerberos_auth
resp = self.parent.open(req)
File "/usr/lib64/python2.7/urllib2.py", line 437, in open
response = meth(req, response)
File "/usr/lib64/python2.7/urllib2.py", line 550, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib64/python2.7/urllib2.py", line 469, in error
result = self._call_chain(*args)
File "/usr/lib64/python2.7/urllib2.py", line 409, in _call_chain
result = func(*args)
File "/usr/lib64/python2.7/urllib2.py", line 656, in http_error_302
return self.parent.open(new, timeout=req.timeout)
File "/usr/lib64/python2.7/urllib2.py", line 437, in open
response = meth(req, response)
File "/usr/lib64/python2.7/urllib2.py", line 550, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib64/python2.7/urllib2.py", line 475, in error
return self._call_chain(*args)
File "/usr/lib64/python2.7/urllib2.py", line 409, in _call_chain
result = func(*args)
File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/cmf/https.py", line 219, in http_error_default
raise e
HTTPError: HTTP Error 401: Authentication required
If we disable the property "Enable Kerberos Authentication for HTTP Web-Consoles", this problem goes away. At the moment this is the only problematic situation we have with Kerberos. We've tried to regenerate the keytabs, but the problem remains. This only happens on the agent that hosts the Standby ResourceManager; on the Active one there is no problem. If we shut down the Active, the Standby goes up (to Active, as it should), and then the error starts appearing on that agent (on the "new" Standby). Also, if we remove HA from YARN (only one RM, "active"), the problem goes away... Any ideas?
[Update]: we found out the following (fake names as examples). klist /var/run/cloudera-scm-agent/krb5cc_cm_agent_0
Ticket cache: FILE:/var/run/cloudera-scm-agent/krb5cc_cm_agent_0
Default principal: HTTP/sl000060.besp.dsp.gbes@CLBGDXD01.BESP.DSP.GBES
Valid starting Expires Service principal
11/21/2018 16:44:55 11/22/2018 16:44:55 krbtgt/LOCAL.REALM@LOCAL.REALM
renew until 11/26/2018 16:44:55
11/21/2018 16:45:25 11/22/2018 16:44:55 HTTP/sl000060.domain.stuff@
renew until 11/26/2018 16:44:55
11/21/2018 16:45:25 11/22/2018 16:44:55 HTTP/sl000060.domain.stuff@LOCAL.REALM
renew until 11/26/2018 16:44:55
Where is this HTTP/sl000060.domain.stuff@ (with an empty realm) coming from, when we only have HTTP principals in kadmin.local with the format HTTP/node.domain.stuff@LOCAL.REALM???
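One thing we've been doing to test the web console authentication by hand, outside the agent (the host is the anonymised one from above; 8088 is the default RM web UI port in our setup):

KRB5CCNAME=/var/run/cloudera-scm-agent/krb5cc_cm_agent_0 curl --negotiate -u : -v "http://sl000060.domain.stuff:8088/cluster"

Interesting detail: the agent traceback above passes through http_error_302 before the 401, so it looks like the standby RM redirects to the active one and it is the redirected request that fails to authenticate.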
10-31-2018
01:11 AM
@GautamG thank you for the explanation! Have a nice day!
10-31-2018
01:05 AM
Hi @GautamG! Thank you for your answer! If I understood correctly, what you are saying is that Cloudera Manager won't block any valid login, even if you have an LDAP user/group filter? And in order to block a login (at the login screen, when the user/password is validated by the AD) we need to create a script with the information you gave me? Thank you for your time, Gautam!
10-30-2018
06:33 AM
Hi. My team and I have been struggling with the setup of CM and external authentication. We would like to block access for users who don't belong to a certain group or pattern. We want to completely block access at the login screen; I don't want them to even enter the CM back office. Is this possible? We have been testing with LDAP parameters with no success. It seems that if the user is authenticated against the AD, they will enter CM no matter what... Any ideas? Suggestions?
10-30-2018
06:27 AM
We managed to find a... sort of... solution... I think... at least it seems to be working. Changed:
LDAP User Filter (user_filter), from empty to (|(memberOf=CN=GBGDATA1,OU=stuff4, OU=stuff5,DC=stuff1,DC=stuff2,DC=stuff3)(memberOf=CN=GBGDATA2,OU=stuff4, OU=stuff5,DC=stuff1,DC=stuff2,DC=stuff3)(memberOf=CN=GBGDATA3,OU=stuff4, OU=stuff5,DC=stuff1,DC=stuff2,DC=stuff3))
LDAP Group Filter (group_filter), from (&(objectClass=group)(cn=GBGDATA*)) to (objectClass=group)
Is there any way of doing this with a wildcard, like GBGDATA*? If we need to add more groups... this is going to become a huge pain in the a...
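For anyone else fighting with these filters: before putting them in CM, we've been validating them directly against the AD with ldapsearch (host, bind DN and base DN are the same placeholders as in my Hue post):

ldapsearch -x -H ldap://stuff1.stuff2.stuff3:389 -W \
  -D "CN=user,OU=stuff4,DC=stuff1,DC=stuff2,DC=stuff3" \
  -b "dc=stuff1,dc=stuff2,dc=stuff3" \
  "(&(sAMAccountName=myuser)(memberOf=CN=GBGDATA1,OU=stuff4,OU=stuff5,DC=stuff1,DC=stuff2,DC=stuff3))" dn memberOf

About the wildcard: as far as I know, AD will not expand wildcards inside DN-valued attributes such as memberOf, which is why cn=GBGDATA* works in a group filter but not inside memberOf, so the OR list seems to be the price to pay.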
10-26-2018
07:25 AM
Hello my dear gods of Big Data! I'm having the following problems:
Problem #1 - all users are logging in as superusers. How is this possible? I have a 5.12 cluster and this isn't happening there. On the new one (CDH 6), Hue is giving this permission to everyone. What am I missing?
Problem #2 - LDAP configuration. Hue isn't using my filters!?
LDAP configuration:
Hue Service Advanced Configuration Snippet (Safety Valve) for hue_safety_valve.ini:
[desktop]
[[ldap]]
sync_groups_on_login=true
debug_level=255
trace_level=9
Authentication Backend - LdapBackend
LDAP URL (ldap_url) - ldap://stuff1.stuff2.stuff3:389
LDAP Username Pattern (ldap_username_pattern) - empty
Use Search Bind Authentication (search_bind_authentication) - True
Create LDAP users on login (create_users_on_login) - True
LDAP Search Base (base_dn) - dc=stuff1,dc=stuff2,dc=stuff3
LDAP Bind User Distinguished Name (bind_dn) - CN=user,OU=stuff4,DC=stuff1,DC=stuff2,DC=stuff3
LDAP Bind Password (bind_password) - •••••••••••••••••••••
LDAP User Filter (user_filter) - empty
LDAP Username Attribute (user_name_attr) - sAMAccountName
LDAP Group Filter (group_filter) - (&(objectClass=group)(cn=GBGDATA*))
LDAP Group Name Attribute (group_name_attr) - cn
LDAP Group Membership Attribute (group_member_attr) - member
The idea behind this configuration is to restrict access to users that belong to groups which start with "GBGDATA". In access.log, debug shows this:
[26/Oct/2018 14:57:52 +0100] DEBUG search_s('dc=stuff1,dc=stuff2,dc=stuff3', 2, '(&(sAMAccountName=%(user)s)(objectclass=*))') returned 1 objects: cn=myuser,ou=stuff5,dc=stuff1,dc=stuff2,dc=stuff3
[26/Oct/2018 14:57:52 +0100] DEBUG Populating Django user myuser
[26/Oct/2018 14:57:53 +0100] WARNING 123.123.123.123 myuser - "POST /hue/accounts/login HTTP/1.1"-- Successful login for user: myuser
Why in the hell is Hue using (&(sAMAccountName=%(user)s)(objectclass=*)) instead of what I've set above??? Thanks everyone!
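[Update] From what I can tell from the Hue LDAP docs and the debug line above, with search-bind authentication the login search is built only from the LDAP Username Attribute plus the LDAP User Filter; the group filter is only applied when importing/syncing groups. Since our user_filter is empty, Hue falls back to (objectclass=*), which matches the search_s() call in the log. So the restriction probably has to go into the user filter itself, something like (a sketch; the group DN is a placeholder in the same style as above):

LDAP User Filter (user_filter) - (memberOf=CN=GBGDATA1,OU=stuff4,OU=stuff5,DC=stuff1,DC=stuff2,DC=stuff3)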
10-23-2018
03:01 AM
Just a few changes I've detected.
In the agent configuration file, don't use "export".
In /etc/default/cloudera-scm-agent just add:
KRB5_CONFIG=/path/krb5.conf
Also, you will need to hammer the Kerberos server and kadmin files:
KDC -> /etc/sysconfig/krb5kdc
Add: KRB5_CONFIG=/path/krb5.conf
And also:
Kadmin -> /etc/sysconfig/kadmin
Add: KRB5_CONFIG=/path/krb5.conf
This will allow you to start both services after you create the database (kdb5_util create -s). If you don't do this, Kerberos will still read /etc/krb5.conf and weird stuff will appear.
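To double-check that the daemons really picked up the alternative file after the restart, you can peek at their environment (Linux-only, run as root; paths as above):

tr '\0' '\n' < /proc/$(pgrep -x krb5kdc)/environ | grep KRB5_CONFIG
tr '\0' '\n' < /proc/$(pgrep -x kadmind)/environ | grep KRB5_CONFIG

Both should print KRB5_CONFIG=/path/krb5.conf; if they print nothing, the processes are still reading /etc/krb5.conf.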
LET'S GO PEOPLE!!! 😄 Hammer on!
10-18-2018
12:56 AM
Dude, with this reply you should really get a raise! Thanks! We will test this solution! Thank you so much!
____________________________________________________
After some tests, the solution is correct! It worked 🙂
10-16-2018
06:49 AM
Hi. I've been able to configure our cluster with a local Kerberos realm using the usual /etc/krb5.conf file. The thing is... we need to change the path that the system will use. Basically the customer is already using /etc/krb5.conf for something else, so we need to set up Cloudera Manager (and the rest of the big data services) to use a different krb5.conf location. Does anyone know how to perform this change in Cloudera Manager (cluster-wide), i.e. make it use a krb5.conf file other than the usual /etc/krb5.conf? Thanks... :'(
10-08-2018
12:50 AM
Hi. One of our customers asked us whether it is possible to change the username of the CDH installation from "cloudera-scm" to another username. Is this possible without using single-user mode? I've searched the internet but did not find anything related to this. Thank you.
01-05-2018
01:47 AM
I'm currently trying to run hdfs dfs -ls / with one Kerberos principal (one that should be in the AD) and we are having some issues... Try enabling Kerberos debug to check whether you can run commands on HDFS with a principal that is in the AD, as in the sketch below. We are doing this in order to test whether the problem is in the HDFS/Kerberos/AD configuration.
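What I mean by running it with debug, more or less (the principal is just an example):

kinit aduser@AD.EXAMPLE.REALM
export HADOOP_OPTS="-Dsun.security.krb5.debug=true"   # JVM-level Kerberos debug
export HADOOP_ROOT_LOGGER=DEBUG,console                # optional, very verbose Hadoop-side logging
hdfs dfs -ls /

The krb5 debug output shows which KDC is contacted and which principal/encryption types are used, which usually tells you whether the problem is on the HDFS side or on the Kerberos/AD side.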
12-27-2017
01:33 AM
I can't actually do that, because the AD belongs to a major company... and it's managed by them. 😞 Thanks for the reply!
12-22-2017
04:32 AM
Hi there. I have the same problem but I didn't understand the solution. What did you do? Sorry but I'm a little desperate here... 😞
12-21-2017
06:28 AM
Hi. My company is running a CDH cluster, with Hue set up with AD, Sentry and Hive. Below all this we also have Kerberos. The main problem right now is that when Hive tries to search for the groups of a user, I get this error:
2017-12-21 14:12:57,687 WARN org.apache.hadoop.security.LdapGroupsMapping: [HiveServer2-Handler-Pool: Thread-108]: Failed to get groups for user ex76196 (retry=0) by javax.naming.AuthenticationException: [LDAP: error code 49 - 80090308: LdapErr: DSID-0C09042A, comment: AcceptSecurityContext error, data 52e, v3839]
2017-12-21 14:12:57,706 WARN org.apache.hadoop.security.LdapGroupsMapping: [HiveServer2-Handler-Pool: Thread-108]: Failed to get groups for user ex76196 (retry=1) by javax.naming.AuthenticationException: [LDAP: error code 49 - 80090308: LdapErr: DSID-0C09042A, comment: AcceptSecurityContext error, data 52e, v3839]
2017-12-21 14:12:57,724 WARN org.apache.hadoop.security.LdapGroupsMapping: [HiveServer2-Handler-Pool: Thread-108]: Failed to get groups for user ex76196 (retry=2) by javax.naming.AuthenticationException: [LDAP: error code 49 - 80090308: LdapErr: DSID-0C09042A, comment: AcceptSecurityContext error, data 52e, v3839]
2017-12-21 14:12:57,724 WARN org.apache.sentry.provider.common.HadoopGroupMappingService: [HiveServer2-Handler-Pool: Thread-108]: Unable to obtain groups for ex76196
java.io.IOException: No groups found for user ex76196
at org.apache.hadoop.security.Groups.noGroupsForUser(Groups.java:190)
at org.apache.hadoop.security.Groups.access$400(Groups.java:69)
at org.apache.hadoop.security.Groups$GroupCacheLoader.load(Groups.java:307)
at org.apache.hadoop.security.Groups$GroupCacheLoader.load(Groups.java:257)
at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3568)
at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2350)
at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2313)
at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2228)
at com.google.common.cache.LocalCache.get(LocalCache.java:3965)
at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3969)
at com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4829)
at org.apache.hadoop.security.Groups.getGroups(Groups.java:215)
at org.apache.sentry.provider.common.HadoopGroupMappingService.getGroups(HadoopGroupMappingService.java:60)
at org.apache.sentry.binding.hive.authz.HiveAuthzBinding.getGroups(HiveAuthzBinding.java:372)
at org.apache.sentry.binding.hive.HiveAuthzBindingHook.postAnalyze(HiveAuthzBindingHook.java:395)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:449)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:312)
at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1201)
at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:1188)
at org.apache.hive.service.cli.operation.SQLOperation.prepare(SQLOperation.java:143)
at org.apache.hive.service.cli.operation.SQLOperation.runInternal(SQLOperation.java:215)
at org.apache.hive.service.cli.operation.Operation.run(Operation.java:326)
at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:425)
at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:402)
at org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:258)
at org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:500)
at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1313)
at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1298)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
at org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge$Server$TUGIAssumingProcessor.process(HadoopThriftAuthBridge.java:746)
at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
When I try to set up Hive with AD/LDAP, it says that only Kerberos or AD/LDAP can be on. Does anyone have any idea how to solve this? The objective is basically to give the AD groups permissions on the Hive tables. Kind of lost right now... any ideas would be very appreciated. Thanks.
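For what it's worth, the "data 52e" part of that AD error usually means invalid credentials for the account doing the bind, so what we are checking first is the bind user/password configured for the group lookup (hadoop.security.group.mapping.ldap.bind.user / hadoop.security.group.mapping.ldap.bind.password), by reproducing the same search by hand (host, bind DN and base DN are placeholders):

ldapsearch -x -H ldap://ad.example.local:389 -W \
  -D "CN=hadoop-bind,OU=ServiceAccounts,DC=example,DC=local" \
  -b "DC=example,DC=local" \
  "(sAMAccountName=ex76196)" memberOf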
12-19-2017
05:38 AM
I'm having the same problem (19/12/2017). Has anyone found a solution to this? I also have AD and Kerberos.