12-23-2016
06:14 PM
3 Kudos
SYMPTOM: YARN timeline logs are growing very fast and the disk is now 100% utilized. Below are the configs set for ATS:
<property>
<name>yarn.timeline-service.ttl-enable</name>
<value>true</value>
</property>
<property>
<name>yarn.timeline-service.ttl-ms</name>
<value>1339200000</value>
</property>
<property>
<name>yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms</name>
<value>150000</value>
</property>
ROOT CAUSE: The last config, yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms, does not affect the semantics of the ATS purging process; it only affects how a LevelDB-based storage implementation (plain LevelDB or rolling LevelDB) performs purging, by setting the time interval between two purge passes. The retention itself is governed by yarn.timeline-service.ttl-ms, which in this case was set to 1339200000 ms: 1339200 seconds, or 372 hours, or 15.5 days. On a normal cluster with a limited disk-space budget this can cause problems (here, roughly 13 MB of timeline data per hour). Reducing the TTL helps alleviate the problem. RESOLUTION: In this case the issue was resolved by lowering the property "yarn.timeline-service.ttl-ms" in the Application Timeline configuration from 1339200000 ms (15.5 days) to 669600000 ms (about 7.75 days, roughly a week):
<property>
<name>yarn.timeline-service.ttl-ms</name>
<value>669600000</value>
</property>
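As a sanity check on the unit conversion, here is a minimal shell sketch for computing a TTL in milliseconds from a retention in days (the 7-day target is illustrative; an exact 7-day TTL would be 604800000 ms):
# compute a TTL in milliseconds for a desired retention in days
DAYS=7
echo $(( DAYS * 24 * 60 * 60 * 1000 ))   # prints 604800000
# conversely: 669600000 / 86400000 (ms per day) = 7.75 days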
12-27-2016
01:26 PM
After deleting WF_ACTIONS/wf_jobs and COORD_ACTIONS/coord_jobs, do we need to check anything on BUNDLE_JOBS/bundle_jobs? Are there any other steps to be performed for removing stale/cached entries? Irshad Ahmed
12-23-2016
05:25 PM
4 Kudos
SYMPTOM: Ambari is showing an alert about a failed connection to the JournalNode service. Below is the alert:
2016-06-30 18:50:39,865 [CRITICAL] [HDFS] [journalnode_process] (JournalNode Process) Connection failed to http://jn1.example.com:8480 (Execution of 'curl -k --negotiate -u : -b /var/lib/ambari-agent/tmp/cookies/f8ed47d4-f63e-482c-be70-36755387ca4b -c /var/lib/ambari-agent/tmp/cookies/f8ed47d4-f63e-482c-be70-36755387ca4b -w '%{http_code}' http://jn.example.com:8480 --connect-timeout 5 --max-time 7 -o /dev/null 1>/tmp/tmpE9v3mg 2>/tmp/tmpKOSncN' returned 28)
ERROR: Below are the JournalNode logs:
2016-07-01 10:21:29,390
WARN namenode.FSImage (EditLogFileInputStream.java:scanEditLog(350)) - Caught
exception after scanning through 0 ops from
/hadoop/hdfs/journal/phadcluster01/current/edits_inprogress_0000000002510372012
while determining its valid length. Position was 712704
java.io.IOException: Can't scan a pre-transactional edit log.
at org.apache.hadoop.hdfs.server.namenode.FSEditLogOp$LegacyReader.scanOp(FSEditLogOp.java:4959)
at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.scanNextOp(EditLogFileInputStream.java:245)
at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.scanEditLog(EditLogFileInputStream.java:346)
at org.apache.hadoop.hdfs.server.namenode.FileJournalManager$EditLogFile.scanLog(FileJournalManager.java:520)
at org.apache.hadoop.hdfs.qjournal.server.Journal.scanStorageForLatestEdits(Journal.java:192)
at org.apache.hadoop.hdfs.qjournal.server.Journal.<init>(Journal.java:152)
at org.apache.hadoop.hdfs.qjournal.server.JournalNode.getOrCreateJournal(JournalNode.java:90)
at org.apache.hadoop.hdfs.qjournal.server.JournalNode.getOrCreateJournal(JournalNode.java:99)
at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.startLogSegment(JournalNodeRpcServer.java:161)
at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.startLogSegment(QJournalProtocolServerSideTranslatorPB.java:186)
at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25425)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)
ROOT CAUSE: From the log below, the JournalNode edits were corrupted:
2016-07-01 10:21:16,007 WARN namenode.FSImage (EditLogFileInputStream.java:scanEditLog(350)) - Caught exception after scanning through 0 ops from /hadoop/hdfs/journal/phadcluster01/current/edits_inprogress_0000000002510372012 while determining its valid length. Position was 712704 java.io.IOException: Can't scan a pre-transactional edit log.
RESOLUTION: Below are the steps taken to resolve the issue (see the sketch after this list):
1. Stopped the JournalNode.
2. Backed up the existing JN directory metadata.
3. Copied the working edits_inprogress file from another JN node.
4. Changed the ownership to hdfs:hadoop.
5. Restarted the JournalNode.
6. The JN started successfully and no more errors were seen in the log.
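A minimal shell sketch of those steps, using the journal directory from the logs above; jn2.example.com is a hypothetical healthy JournalNode, and the stop/start commands are whatever your distribution uses (e.g. the Ambari UI):
# 1. stop the JournalNode (via Ambari, or hadoop-daemon.sh on the node)
# 2. back up the existing journal metadata
JDIR=/hadoop/hdfs/journal/phadcluster01/current
cp -a "$JDIR" "${JDIR}.bak"
# 3. copy the working in-progress edits file from a healthy JN
scp jn2.example.com:"$JDIR"/edits_inprogress_0000000002510372012 "$JDIR"/
# 4. fix ownership
chown hdfs:hadoop "$JDIR"/edits_inprogress_*
# 5. restart the JournalNode via Ambari and tail its log to confirm the error is gone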
12-23-2016
12:06 PM
4 Kudos
SYMPTOM: The Standby NameNode process running on the 2nd of our four management node servers isn't running. Interrogating the log files, I found an exception relating to an Oozie job. ERROR: Below are the error logs:
2016-12-20 09:20:17,286 INFO namenode.EditLogInputStream (RedundantEditLogInputStream.java:nextOp(176)) - Fast-forwarding stream 'http://node1:8480/getJournal?jid=namenodeha&segmentTxId=16740759&storageInfo=-63%3A1400038789%3A0%3ACID-031f35b2-59c9-42f9-8942-550aee3d39e6, http://node1:8480/getJournal?jid=namenodeha&segmentTxId=16740759&storageInfo=-63%3A1400038789%3A0%3ACID-031f35b2-59c9-42f9-8942-550aee3d39e6' to transaction ID 16713078
2016-12-20 09:20:17,287 INFO namenode.EditLogInputStream (RedundantEditLogInputStream.java:nextOp(176)) - Fast-forwarding stream 'http://node1:8480/getJournal?jid=namenodeha&segmentTxId=16740759&storageInfo=-63%3A1400038789%3A0%3ACID-031f35b2-59c9-42f9-8942-550aee3d39e6' to transaction ID 16713078
2016-12-20 09:20:18,287 INFO namenode.FSEditLogLoader (FSEditLogLoader.java:loadEditRecords(266)) - replaying edit log: 48858/805951 transactions completed. (6%)
2016-12-20 09:20:18,485 ERROR namenode.FSEditLogLoader (FSEditLogLoader.java:loadEditRecords(242)) - Encountered exception on operation DeleteSnapshotOp [snapshotRoot=/apps/hive/warehouse, snapshotName=oozie-snapshot-2016_12_16-08_01, RpcClientId=1f566cee-d0eb-4a84-a615-40cdd31bc772, RpcCallId=1]
2016-12-20 09:20:18,599 ERROR namenode.NameNode (NameNode.java:main(1712)) - Failed to start namenode.
2016-12-20 09:20:18,601 INFO util.ExitUtil (ExitUtil.java:terminate(124)) - Exiting with status 1
2016-12-20 09:20:18,602 INFO namenode.NameNode (LogAdapter.java:info(47)) - SHUTDOWN_MSG:
ROOT CAUSE: Suspected that the edit logs were corrupted, which prevented the Standby NameNode from starting up. Replicating the metadata from the primary NameNode to the standby did not work. This is a bug: https://issues.apache.org/jira/browse/HDFS-6908. Affected versions: HDP 2.4.0, Ambari 2.2.1.1.
RESOLUTION: This is resolved in HDP 2.5 and Apache Hadoop 2.6.0. For the current scenario, a patch needs to be requested from the Hortonworks dev team.
07-20-2017
02:06 AM
Works fine but doesn't answer the question. Should I understand that it's impossible to delete a view using the Ambari UI?
12-23-2016
05:58 AM
4 Kudos
SYMPTOM: After upgrading Ambari from 2.1.1 to 2.2.2.2, restarting the Oozie service failed with the error "su: cannot set user id: Resource temporarily unavailable". ERROR: Below are the error logs:
Execution, [[0000002-160227115902137-oozie-oozi-C@4]::CoordActionInputCheck:: Ignoring action. Coordinator job is not in RUNNING/RUNNINGWITHERROR/PAUSED/PAUSEDWITHERROR state, but state=SUSPENDED], Error Code: E1100
2016-07-02 13:04:42,457 WARN CoordActionInputCheckXCommand:523 - SERVER[hdmlup000a.machine.group] USER[-] GROUP[-] TOKEN[-] APP[-] JOB[0000002-160227115902137-oozie-oozi-C] ACTION[0000002-160227115902137-oozie-oozi-C@5] E1100: Command precondition does not hold before execution, [[0000002-160227115902137-oozie-oozi-C@5]::CoordActionInputCheck:: Ignoring action. Coordinator job is not in RUNNING/RUNNINGWITHERROR/PAUSED/PAUSEDWITHERROR state, but state=SUSPENDED], Error Code: E1100
2016-07-02 13:04:42,459 WARN CoordActionInputCheckXCommand:523 - SERVER[hdmlup000a.machine.group] USER[-] GROUP[-] TOKEN[-] APP[-] JOB[0000002-160227115902137-oozie-oozi-C] ACTION[0000002-160227115902137-oozie-oozi-C@6] E1100: Command precondition does not hold before execution, [[0000002-160227115902137-oozie-oozi-C@6]::CoordActionInputCheck:: Ignoring action. Coordinator job is not in RUNNING/RUNNINGWITHERROR/PAUSED/PAUSEDWITHERROR state, but state=SUSPENDED], Error Code: E1100
2016-07-02 13:04:42,460 WARN CoordActionReadyXCommand:523 - SERVER[hdmlup000a.machine.group] USER[falcon] GROUP[-] TOKEN[] APP[FALCON_PROCESS_DEFAULT_Push03to04run03] JOB[0000002-160227115902137-oozie-oozi-C] ACTION[] E1100: Command precondition does not hold before execution, [[0000002-160227115902137-oozie-oozi-C]::CoordActionReady:: Ignoring job. Coordinator job is not in RUNNING state, but state=SUSPENDED], Error Code: E1100
2016-07-02 13:04:53,076 INFO PauseTransitService:520 - SERVER[hdmlup000a.machine.group] USER[-] GROUP[-] TOKEN[-] APP[-] JOB[-] ACTION[-] Acquired lock for [org.apache.oozie.service.PauseTransitService]
2016-07-02 13:04:53,086 INFO PauseTransitService:520 - SERVER[hdmlup000a.machine.group] USER[-] GROUP[-] TOKEN[-] APP[-] JOB[-] ACTION[-] Released lock for [org.apache.oozie.service.PauseTransitService]
ROOT CAUSE: The issue is likely due to the nproc settings; the nproc limit needs to be raised for the affected service user. RESOLUTION: Below are the steps performed for resolution (see the sketch after this list):
1. Checked the output of "ps -u oozie -L | wc -l". The nproc limit for oozie was set to 16000 in the Ambari Oozie config.
2. Modified the nproc limit from 16000 to 32000 via Ambari -> Services -> Oozie -> Configs.
3. Restarted Oozie. The Oozie process showed as down in the Ambari UI but as running in the ps output.
4. The issue was that the process was in a stale state and had been showing as running for many days.
5. Restarting the Oozie server still did not restart the process, as checked from the CLI.
6. Killed the Oozie server process from the CLI and cleared the agent cache using the command below:
mv /var/lib/ambari-agent/data/structured-out-status.json /var/lib/ambari-agent/data/structured-out-status.json.bak
7. Restarted the ambari-agent process.
8. Restarted the Oozie server process, which worked well; the Oozie process now shows the right status in the ps command output.
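A minimal shell sketch of the diagnostic, assuming the oozie service user; the 32000 value mirrors the fix above, and the limits.d file is only for hosts whose limits are not managed by Ambari:
# count the threads/lightweight processes owned by the oozie user
ps -u oozie -L | wc -l
# compare against the user's nproc limit
su - oozie -c 'ulimit -u'
# if the count is near the limit, raise nproc, e.g. to 32000
# (Ambari: Services -> Oozie -> Configs; manual hosts: /etc/security/limits.d/)
echo 'oozie - nproc 32000' > /etc/security/limits.d/oozie.conf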
12-23-2016
05:45 AM
4 Kudos
SYMPTOM: The HDFS service is not able to start, throwing the Python error below in the logs: File "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/ranger_functions.py", line 124, in create_ranger_repository. ERROR: The Ambari operation log shows the message below:
stderr: /var/lib/ambari-agent/data/errors-22280.txt
Traceback (most recent call last):
File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/namenode.py", line 433, in <module>
NameNode().execute()
File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 219, in execute
method(env)
File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 524, in restart
self.start(env, upgrade_type=upgrade_type)
File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/namenode.py", line 102, in start
namenode(action="start", hdfs_binary=hdfs_binary, upgrade_type=upgrade_type, env=env)
File "/usr/lib/python2.6/site-packages/ambari_commons/os_family_impl.py", line 89, in thunk
return fn(*args, **kwargs)
File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py", line 60, in namenode
setup_ranger_hdfs(upgrade_type=upgrade_type)
File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/setup_ranger_hdfs.py", line 61, in setup_ranger_hdfs
hdp_version_override = hdp_version, skip_if_rangeradmin_down= not params.retryAble)
File "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/setup_ranger_plugin_xml.py", line 78, in setup_ranger_plugin
policy_user)
File "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/ranger_functions.py", line 124, in create_ranger_repository
repo = self.get_repository_by_name_urllib2(repo_name, component, 'true', ambari_username_password_for_ranger)
File "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/decorator.py", line 82, in wrapper
return function(*args, **kwargs)
File "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/ranger_functions.py", line 77, in get_repository_by_name_urllib2
response = json.loads(result.read())
File "/usr/lib/python2.6/site-packages/ambari_simplejson/__init__.py", line 307, in loads
return _default_decoder.decode(s)
File "/usr/lib/python2.6/site-packages/ambari_simplejson/decoder.py", line 335, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python2.6/site-packages/ambari_simplejson/decoder.py", line 353, in raw_decode
raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
ROOT CAUSE: This is due to an issue with the "amb_ranger_admin" user password, because of which the Ranger plugin is not able to communicate with Ranger Admin. RESOLUTION: Below are the steps performed for resolution (see the credential check after this list):
1. Disabling the HDFS plugin for Ranger and restarting HDFS worked fine.
2. Removed the HDFS repository policy cache files from both NameNodes.
3. Enabled the HDFS plugin and restarted the standby NameNode, which failed again with the same error.
4. The Ranger UI, under the Audit -> Access/Login tab, was displaying wrong-credential login failures for the ambari admin user.
5. Reset the password for amb_ranger_admin from the Ranger UI and set the same value in Ambari -> Services -> Ranger -> Configs.
6. Restarted Ranger.
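A quick way to verify the amb_ranger_admin credentials from the command line; the endpoint mirrors the repository lookup that ranger_functions.py performs, and the host, repository name, and password placeholder are assumptions to adjust for your cluster:
# a 200 status means the credentials work; 401 means the password is still wrong
curl -s -o /dev/null -w '%{http_code}\n' -u amb_ranger_admin:'<password>' 'http://ranger-host.example.com:6080/service/public/api/repository?name=cluster_hadoop&type=hdfs&status=true'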
12-23-2016
05:34 AM
4 Kudos
SYMPTOM: The Ambari smoke test fails for the HBase service. Below is the current scenario:
- Ranger is installed in the cluster
- The HBase policy has been enabled
- The ambari-qa user has the privileges correctly defined in the HBase policy
ERROR:
ERROR: org.apache.hadoop.hbase.security.AccessDeniedException: Insufficient permissions (user=ambari-qa, scope=default, params=[namespace=default,table=default:ambarismoketest,family=family],action=CREATE)
2015-10-27 09:52:03,342 ERROR [main] client.AsyncProcess: Failed to get region location
org.apache.hadoop.hbase.TableNotFoundException: Table 'ambarismoketest' was not found, got: XXXXX01.
ROOT CAUSE: If the Ranger co-processor is not correctly defined in the HBase configuration, the smoke test from Ambari will fail. Any table creation as a non-hbase user could also fail. RESOLUTION: Verify the Ranger configuration for HBase (see the check below).
Ensure that the following properties are set correctly and that the co-processor lists include the Ranger class:
hbase.coprocessor.master.classes
hbase.coprocessor.region.classes
hbase.coprocessor.regionserver.classes
All of the above should include org.apache.ranger.authorization.hbase.RangerAuthorizationCoprocessor, and hbase.security.authorization should be enabled, i.e. set to true.
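A quick way to inspect these values on an HBase node, assuming the usual config location /etc/hbase/conf/hbase-site.xml (path may differ on your install):
# each matched property's value should list the Ranger co-processor class,
# and hbase.security.authorization should be true
grep -A 1 -E 'hbase\.coprocessor\.(master|region|regionserver)\.classes|hbase\.security\.authorization' /etc/hbase/conf/hbase-site.xml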
12-23-2016
05:21 AM
4 Kudos
SYMPTOM: The Ranger plugin is enabled for Hive. When restarting the Hive service, it is not able to start and gets stuck on the error below.
ERROR:
2015-10-15 13:02:51,683 - u"File['/var/lib/ambari-agent/data/tmp/ojdbc6.jar']" {'content': DownloadSource('http://sjcservicenode04-prod.xxxinternal.com:8080/resources//oracle-jdbc-driver.jar')}
2015-10-15 13:02:51,796 - Not downloading the file from http://sjcservicenode04-prod.xxxinternal.com:8080/resources//oracle-jdbc-driver.jar, because /var/lib/ambari-agent/data/tmp/oracle-jdbc-driver.jar already exists
2015-10-15 13:02:51,996 - call['hdp-select status hadoop-client'] {'timeout': 20}
ROOT CAUSE:
Ranger Hive policy HTTP URL calls were taking forever to return results. Ranger makes a lot of calls to urllib2.urlopen(request) that don't have a timeout in Ambari 2.0. An Ambari bug was opened to add timeout=5 in the ranger_functions.py file:
https://hortonworks.jira.com/browse/BUG-46275
RESOLUTION:
1) Edit /usr/lib/python2.6/site-packages/resource_management/libraries/functions/ranger_functions.py, changing every urllib2.urlopen(request) to urllib2.urlopen(request, timeout=5), and copy it to all hosts to be safe (only the Hive nodes strictly need it); see the one-liner below.
2) Delete duplicate x_group_users rows in MySQL: https://hortonworks.jira.com/browse/BUG-43119
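A minimal sed sketch of step 1, using the stock file path from above (it leaves a .bak backup; verify with the grep afterward):
# add a 5-second timeout to every urlopen call
sed -i.bak 's/urllib2\.urlopen(request)/urllib2.urlopen(request, timeout=5)/g' /usr/lib/python2.6/site-packages/resource_management/libraries/functions/ranger_functions.py
# confirm the change
grep -n 'urlopen' /usr/lib/python2.6/site-packages/resource_management/libraries/functions/ranger_functions.py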
12-23-2016
05:08 AM
4 Kudos
Scenario: Let's say you have 2 Ranger Admin instances configured in your cluster. If you want to enable Ranger HA, you need to delete one of the Ranger Admin instances, since the Ranger HA wizard will create the 2nd instance for you.
In such a case you need to remove the one instance of Ranger Admin that is already installed.
The following steps will guide you through removing a Ranger Admin instance using the Ambari API:
1. Back up the Ambari Server database [https://ambari.apache.org/current/installing-hadoop-using-ambari/content/ambari-chap11-1.html].
2. Stop the Ranger service using Ambari. In case Ranger Admin fails to stop, try stopping the Ranger service as follows using the Ambari API:
curl -u admin:admin -H 'X-Requested-By: ambari' -X PUT -d '{"RequestInfo":{"context":"Stop Service"},"Body":{"ServiceInfo":{"state":"INSTALLED"}}}' \
http://xxx.hostname:8080/api/v1/clusters/TEST/services/RANGER
curl -u admin:admin -H 'X-Requested-By: ambari' -X PUT -d '{"RequestInfo": {"context" :"Stop Service"}, "Body": {"ServiceComponentInfo": {"state": "INSTALLED"}}}' \
http://xxx.hostname:8080/api/v1/clusters/TEST/hosts/xxx.hostname/host_components/RANGER_ADMIN
3. Remove Ranger Admin using the API:
curl -u admin:admin -H "X-Requested-By: ambari" -X DELETE http://xxx.hostname:8080/api/v1/clusters/TEST/hosts/xxx.hostname/host_components/RANGER_ADMIN
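To confirm the removal, a GET on the same resource should now fail with a 404 (same placeholder hostnames and cluster name as above):
curl -u admin:admin -H "X-Requested-By: ambari" http://xxx.hostname:8080/api/v1/clusters/TEST/hosts/xxx.hostname/host_components/RANGER_ADMIN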