Member since: 02-08-2016
Posts: 793
Kudos Received: 669
Solutions: 85
My Accepted Solutions

Views | Posted
---|---
3067 | 06-30-2017 05:30 PM
3988 | 06-30-2017 02:57 PM
3309 | 05-30-2017 07:00 AM
3884 | 01-20-2017 10:18 AM
8403 | 01-11-2017 02:11 PM
12-24-2016 07:01 AM (3 Kudos)

SYMPTOM: Hive jobs are failing on the production aggregation cluster with "java.net.UnknownHostException: Matrix-Aggr". Matrix-Aggr is the nameservice for NameNode HA.

ERROR: The error log is as below:

Caused by: java.lang.IllegalArgumentException: java.net.UnknownHostException: Matrix-Aggr
at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:374)
at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:312)
at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:178)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:665)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:601)
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:148)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2619)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2653)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2635)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.getSchemaFromFS(AvroSerdeUtils.java:149)
at org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.determineSchemaOrThrowException(AvroSerdeUtils.java:110)
at org.apache.hadoop.hive.ql.io.avro.AvroGenericRecordReader.getSchema(AvroGenericRecordReader.java:112)
at org.apache.hadoop.hive.ql.io.avro.AvroGenericRecordReader.<init>(AvroGenericRecordReader.java:70)
at org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat.getRecordReader(AvroContainerInputFormat.java:51)
at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.<init>(CombineHiveRecordReader.java:65)
... 16 more
Caused by: java.net.UnknownHostException: Matrix-Aggr
... 33 more
Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
ROOT CAUSE: HDP 2.2.4 has a bug where AvroSerdeUtils.java resets the client configuration at the line below, discarding the HA settings (including the Matrix-Aggr nameservice mapping), and hence we get the UnknownHostException:

Schema s = getSchemaFromFS(schemaString, new Configuration());

RESOLUTION: This is fixed in recent versions via HIVE-9299. As a workaround, use a file:// URL for avro.schema.url and keep the schema file on all NodeManager machines. Otherwise, either request a patch from Hortonworks or upgrade HDP to the latest version.
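A rough sketch of the workaround (the table name, host, and schema path here are hypothetical examples, not from the original post):

# Copy the Avro schema to the same local path on every NodeManager host
scp events.avsc nm-host:/etc/hive/avro-schemas/events.avsc
# Point the table at the local copy instead of an hdfs:// or nameservice URL
hive -e "ALTER TABLE events SET TBLPROPERTIES ('avro.schema.url'='file:///etc/hive/avro-schemas/events.avsc');"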
12-24-2016 06:38 AM (3 Kudos)

SYMPTOM: During operations such as adding a service or upgrading, the Ambari UI complains that a package was not found for installation. From the log we can see that Ambari is searching for a repo version higher than the cluster's current HDP version.
For example:
-- Current version : Ambari Version 1.7.0 and HDP 2.2.0
-- But Ambari is searching the repo for version 2.2.8
ROOT CAUSE: It seems that the /var/lib/ambari-server/resources/stacks/HDP/<VERSION>/repos/repoinfo.xml file has been updated with the wrong "latest" version info.

RESOLUTION:
1. Comment out the following line in /var/lib/ambari-server/resources/stacks/HDP/<VERSION>/repos/repoinfo.xml:
<latest>http://public-repo-1.hortonworks.com/HDP/hdp_urlinfo.json</latest>
2. Open the Ambari database and check the contents of the metainfo table. For example, if the metainfo_key "repo:/HDP/2.2/redhat6/HDP-<VERSION>:baseurl" is missing, use the following command to add it:
INSERT INTO metainfo VALUES ('repo:/HDP/2.2/redhat6/HDP-<VERSION>:baseurl', 'http://public-repo-1.hortonworks.com/HDP/centos6/2.x/GA/2.2.0.0');
3. Restart the Ambari server and agents.
4. Run yum clean all on the service master hosts.
5. Re-install or re-run the upgrade (steps 2-4 are sketched below).
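A minimal sketch of steps 2-4, assuming the default Ambari metainfo table layout (columns metainfo_key, metainfo_value):

-- Run in the Ambari database to check whether the key is already present
SELECT metainfo_key, metainfo_value FROM metainfo WHERE metainfo_key LIKE 'repo:/HDP/%';

# On the Ambari server host
ambari-server restart
# On each agent host
ambari-agent restart
# On the service master hosts
yum clean all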
12-23-2016 07:17 PM (5 Kudos)

SYMPTOM: Trying to add components using the Ambari UI, but it keeps failing. We are using RHN Satellite repos to download packages. The HDP.repo and HDP_UTILS.repo files were configured with "enabled=0" on all servers, but they always get modified back to "enabled=1". Below are my repo files:

[HDP-2.5]
name=HDP-2.5
baseurl=http://172.26.64.249/hdp/centos6/HDP-2.5.3.0/
path=/
enabled=0
[HDP-UTILS-1.1.0.21]
name=HDP-UTILS-1.1.0.21
baseurl=http://public-repo-1.hortonworks.com/HDP-UTILS-1.1.0.21/repos/centos6
path=/
enabled=0
ROOT CAUSE: As Ambari re-applies its repo template (via its Puppet-style agent scripts), it will always revert the repo files back to the original enabled=1.

RESOLUTION: Modify the template file for your OS; in my case it was /var/lib/ambari-server/resources/stacks/HDP/2.0.6/hooks/before-INSTALL/templates/repo_suse_rhel.j2, replacing enabled=1 with enabled=0.

After restarting the Ambari server, services were able to install using the RHN Satellite repository.
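A sketch of the edit with a backup taken first (path as given above; sed replaces every enabled=1 in the template):

cp /var/lib/ambari-server/resources/stacks/HDP/2.0.6/hooks/before-INSTALL/templates/repo_suse_rhel.j2 /tmp/repo_suse_rhel.j2.bak
sed -i 's/enabled=1/enabled=0/g' /var/lib/ambari-server/resources/stacks/HDP/2.0.6/hooks/before-INSTALL/templates/repo_suse_rhel.j2
ambari-server restart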
12-23-2016 06:51 PM (5 Kudos)

SYMPTOM: We have alerts in Ambari about high DataNode heap usage on a production cluster. The maximum heap size of the DataNode is set to 16 GB.

ROOT CAUSE: DataNode operations are IO-intensive and do not require a 16 GB heap.

RESOLUTION: Tuning the GC parameters resolved the issue. Recommended settings for a 4 GB heap:
-Xms4096m -Xmx4096m -XX:NewSize=800m
-XX:MaxNewSize=800m -XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:+UseCMSInitiatingOccupancyOnly
-XX:CMSInitiatingOccupancyFraction=70
-XX:ParallelGCThreads=8
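These flags would typically be applied through HADOOP_DATANODE_OPTS in hadoop-env.sh (in Ambari: HDFS > Configs > hadoop-env template); a sketch, assuming the usual hadoop-env layout:

export HADOOP_DATANODE_OPTS="-Xms4096m -Xmx4096m -XX:NewSize=800m -XX:MaxNewSize=800m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=70 -XX:ParallelGCThreads=8 ${HADOOP_DATANODE_OPTS}"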
12-23-2016 06:26 PM (4 Kudos)

SYMPTOM: Upon starting the App Timeline Server after an Ambari & HDP upgrade, the following errors were thrown and the service was unable to start.

ERROR:

2015-08-02 22:56:24,311 INFO service.AbstractService (AbstractService.java:noteFailure(272)) - Service org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore failed in state INITED; cause:
org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 116 missing files; e.g.: /tmp/hadoop/yarn/timeline/leveldb-timeline-store.ldb/001052.sst
ROOT CAUSE: Corrupted SST files in the App Timeline Server's leveldb store path.

RESOLUTION: Navigate to /hadoop/yarn/timeline/leveldb-timeline-store.ldb. There you will see a text file named "CURRENT". Back this file up to /tmp and then remove it, as follows:
cp /hadoop/yarn/timeline/leveldb-timeline-store.ldb/CURRENT /tmp
rm /hadoop/yarn/timeline/leveldb-timeline-store.ldb/CURRENT
Restart the service via Ambari
12-23-2016 06:14 PM (3 Kudos)

SYMPTOM: The YARN timeline data is growing very fast and the disk is now 100% utilized. Below are my configs set for the ATS:

<property>
<name>yarn.timeline-service.ttl-enable</name>
<value>true</value>
</property>
<property>
<name>yarn.timeline-service.ttl-ms</name>
<value>1339200000</value>
</property>
<property>
<name>yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms</name>
<value>150000</value>
</property>
ROOT CAUSE: The yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms config does not affect the semantics of the ATS purging process; it only affects the concrete behavior of a leveldb-based storage implementation, deciding the time interval between two purge passes in a leveldb-based ATS store (leveldb storage and rolling leveldb storage). The retention itself is set by yarn.timeline-service.ttl-ms, and here the customer set it to 1339200000 ms, i.e. 1339200 seconds, 372 hours, or 15.5 days. On a normal cluster with a limited disk space budget this may cause problems (about 13 MB of timeline data per hour in this case). Reducing this value helps alleviate the problem.

RESOLUTION: In this case the issue was resolved by reducing the value of "yarn.timeline-service.ttl-ms" in the Application Timeline configuration from 1339200000 ms (15.5 days) to 669600000 ms (about 7.75 days):

<property>
<name>yarn.timeline-service.ttl-ms</name>
<value>669600000</value>
</property>
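As a quick sanity check on the retention arithmetic (one day is 86400000 ms):

awk 'BEGIN { print 1339200000 / 86400000, 669600000 / 86400000 }'   # prints: 15.5 7.75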
12-23-2016 05:55 PM (4 Kudos)

Question: Is it OK to run purge scripts on the WF_JOBS and COORD_JOBS tables in the Oozie database configured in MySQL? Will the purge scripts remove running workflows and coordinators? Below are the scripts we will be running to purge:

DELETE FROM WF_ACTIONS where WF_ID IN (SELECT ID from WF_JOBS where end_time < timestamp('2016-06-01 00:00:00'));
DELETE from wf_jobs where end_time < timestamp('2016-06-01 00:00:00');
DELETE from COORD_ACTIONS where JOB_ID in (select ID from COORD_JOBS where END_TIME < timestamp('2016-06-01 00:00:00'));
DELETE from coord_jobs where END_TIME < timestamp('2016-06-01 00:00:00');

Reply: Oozie has a built-in feature to purge older jobs from the database; by default the retention is 30 days. Actions related to long-running coordinators are not purged until the coordinator completes (for example, if you have a coordinator running for 6 months, all the related workflows stay in the database for 6 months).

Will the purge scripts remove the running workflows and coordinators?
--> No, they will not. Running coordinators/workflows won't be in the DB with an END_TIME, so the purge scripts above should be fine. Make sure you back up the Oozie DB first, and keep a 10-15 day gap before cleaning the DB. Oozie's own purge settings are sketched below as an alternative.
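Instead of (or alongside) manual SQL, Oozie's built-in PurgeService can be tuned in oozie-site.xml. The property names below are Oozie's standard ones; the values are illustrative:

<property>
<name>oozie.service.PurgeService.older.than</name>
<value>30</value> <!-- purge workflow jobs that completed more than 30 days ago -->
</property>
<property>
<name>oozie.service.PurgeService.coord.older.than</name>
<value>7</value> <!-- purge coordinator jobs that completed more than 7 days ago -->
</property>
<property>
<name>oozie.service.PurgeService.purge.interval</name>
<value>3600</value> <!-- run the purge check every 3600 seconds -->
</property>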
12-23-2016 05:25 PM (4 Kudos)

SYMPTOM: Ambari is showing an alert about a failed connection to the JournalNode service. Below is the alert:

2016-06-30 18:50:39,865 [CRITICAL] [HDFS] [journalnode_process] (JournalNode Process) Connection failed to http://jn1.example.com:8480 (Execution of 'curl -k --negotiate -u : -b /var/lib/ambari-agent/tmp/cookies/f8ed47d4-f63e-482c-be70-36755387ca4b -c /var/lib/ambari-agent/tmp/cookies/f8ed47d4-f63e-482c-be70-36755387ca4b -w '%{http_code}' http://jn.example.com:8480 --connect-timeout 5 --max-time 7 -o /dev/null 1>/tmp/tmpE9v3mg 2>/tmp/tmpKOSncN' returned 28. % Total % Received % Xferd Average Speed Time Time Time Current

ERROR: Below are the JournalNode logs:

2016-07-01 10:21:29,390 WARN namenode.FSImage (EditLogFileInputStream.java:scanEditLog(350)) - Caught exception after scanning through 0 ops from /hadoop/hdfs/journal/phadcluster01/current/edits_inprogress_0000000002510372012 while determining its valid length. Position was 712704
java.io.IOException: Can't scan a pre-transactional edit log.
at org.apache.hadoop.hdfs.server.namenode.FSEditLogOp$LegacyReader.scanOp(FSEditLogOp.java:4959)
at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.scanNextOp(EditLogFileInputStream.java:245)
at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.scanEditLog(EditLogFileInputStream.java:346)
at org.apache.hadoop.hdfs.server.namenode.FileJournalManager$EditLogFile.scanLog(FileJournalManager.java:520)
at org.apache.hadoop.hdfs.qjournal.server.Journal.scanStorageForLatestEdits(Journal.java:192)
at org.apache.hadoop.hdfs.qjournal.server.Journal.<init>(Journal.java:152)
at org.apache.hadoop.hdfs.qjournal.server.JournalNode.getOrCreateJournal(JournalNode.java:90)
at org.apache.hadoop.hdfs.qjournal.server.JournalNode.getOrCreateJournal(JournalNode.java:99)
at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.startLogSegment(JournalNodeRpcServer.java:161)
at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.startLogSegment(QJournalProtocolServerSideTranslatorPB.java:186)
at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25425)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)
ROOT CAUSE: From the log below, it appears the JournalNode edits were corrupted:

2016-07-01 10:21:16,007 WARN namenode.FSImage (EditLogFileInputStream.java:scanEditLog(350)) - Caught exception after scanning through 0 ops from /hadoop/hdfs/journal/phadcluster01/current/edits_inprogress_0000000002510372012 while determining its valid length. Position was 712704 java.io.IOException: Can't scan a pre-transactional edit log.

RESOLUTION: Below are the steps taken to resolve the issue (a command sketch follows the list):
1. Stopped the JournalNode.
2. Backed up the existing JournalNode directory metadata.
3. Copied the working edits_inprogress file from another JournalNode.
4. Changed the ownership to hdfs:hadoop.
5. Restarted the JournalNode.
6. The JournalNode started successfully and no more errors were seen in the log.
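A rough sketch of steps 2-4, using the journal directory and segment name from the logs above ("good-jn" is a placeholder for a healthy JournalNode host):

cp -a /hadoop/hdfs/journal/phadcluster01 /hadoop/hdfs/journal/phadcluster01.bak
scp good-jn:/hadoop/hdfs/journal/phadcluster01/current/edits_inprogress_0000000002510372012 /hadoop/hdfs/journal/phadcluster01/current/
chown hdfs:hadoop /hadoop/hdfs/journal/phadcluster01/current/edits_inprogress_0000000002510372012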
12-23-2016 03:49 PM

@Sami Ahmad It seems the krb5-conf is missing or corrupted. Please try manually creating the kerberos-env and krb5-conf configurations by issuing the Ambari REST API call explained below:

PUT /api/v1/clusters/CLUSTER_NAME
[
{
"Clusters": {
"desired_config": {
"type": "krb5-conf",
"tag": "version1234",
"properties": {
"domains":"",
"manage_krb5_conf": "true",
"conf_dir":"/etc",
"content" : "[libdefaults]\n renew_lifetime = 7d\n forwardable= true\n default_realm = {{realm|upper()}}\n ticket_lifetime = 24h\n dns_lookup_realm = false\n dns_lookup_kdc = false\n #default_tgs_enctypes = {{encryption_types}}\n #default_tkt_enctypes ={{encryption_types}}\n\n{% if domains %}\n[domain_realm]\n{% for domain in domains.split(',') %}\n {{domain}} = {{realm|upper()}}\n{% endfor %}\n{%endif %}\n\n[logging]\n default = FILE:/var/log/krb5kdc.log\nadmin_server = FILE:/var/log/kadmind.log\n kdc = FILE:/var/log/krb5kdc.log\n\n[realms]\n {{realm}} = {\n admin_server = {{admin_server_host|default(kdc_host, True)}}\n kdc = {{kdc_host}}\n }\n\n{# Append additional realm declarations below #}\n"
}
}
}
},
{
"Clusters": {
"desired_config": {
"type": "kerberos-env",
"tag": "version1234",
"properties": {
"kdc_type": "mit-kdc",
"manage_identities": "false",
"install_packages": "true",
"encryption_types": "aes des3-cbc-sha1 rc4 des-cbc-md5",
"realm" : "EXAMPLE.COM",
"kdc_host" : "hdc.host",
"admin_server_host" : "kadmin.host",
"executable_search_paths" : "/usr/bin, /usr/kerberos/bin, /usr/sbin, /usr/lib/mit/bin, /usr/lib/mit/sbin",
"password_length": "20",
"password_min_lowercase_letters": "1",
"password_min_uppercase_letters": "1",
"password_min_digits": "1",
"password_min_punctuation": "1",
"password_min_whitespace": "0",
"service_check_principal_name" : "${cluster_name}-${short_date}",
"case_insensitive_username_rules" : "false"
}
}
}
}
]

Note: manage_identities is set to false, indicating that Ambari should not interact with the KDC. This is because the customer did not want Ambari to destroy the principals in the KDC. Since Ambari was not managing the Kerberos identities, there was no need to fill in the correct data about the KDC.
TIP: When issuing the API call mentioned above, place the payload into a file and use curl like:

curl -H "X-Requested-By:ambari" -u admin:admin -i -X PUT -d @./payload.json http://AMBARI_SERVER:8080/api/v1/clusters/CLUSTER_NAME
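To confirm the configs were registered afterwards, a read-back call like the following should work (standard Ambari configurations endpoint; the tag matches the payload above):

curl -H "X-Requested-By:ambari" -u admin:admin "http://AMBARI_SERVER:8080/api/v1/clusters/CLUSTER_NAME/configurations?type=krb5-conf&tag=version1234"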
12-23-2016 12:06 PM (4 Kudos)

SYMPTOM: The Standby NameNode process on the second of our four management node servers isn't running. Interrogating the log files, I found an exception relating to an Oozie job.

ERROR: Below are the error logs:

2016-12-20 09:20:17,286 INFO namenode.EditLogInputStream (RedundantEditLogInputStream.java:nextOp(176)) - Fast-forwarding stream 'http://node1:8480/getJournal?jid=namenodeha&segmentTxId=16740759&storageInfo=-63%3A1400038789%3A0%3ACID-031f35b2-59c9-42f9-8942-550aee3d39e6, http://node1:8480/getJournal?jid=namenodeha&segmentTxId=16740759&storageInfo=-63%3A1400038789%3A0%3ACID-031f35b2-59c9-42f9-8942-550aee3d39e6' to transaction ID 16713078
2016-12-20 09:20:17,287 INFO namenode.EditLogInputStream (RedundantEditLogInputStream.java:nextOp(176)) - Fast-forwarding stream 'http://node1:8480/getJournal?jid=namenodeha&segmentTxId=16740759&storageInfo=-63%3A1400038789%3A0%3ACID-031f35b2-59c9-42f9-8942-550aee3d39e6' to transaction ID 16713078
2016-12-20 09:20:18,287 INFO namenode.FSEditLogLoader (FSEditLogLoader.java:loadEditRecords(266)) - replaying edit log: 48858/805951 transactions completed. (6%)
2016-12-20 09:20:18,485 ERROR namenode.FSEditLogLoader (FSEditLogLoader.java:loadEditRecords(242)) - Encountered exception on operation DeleteSnapshotOp [snapshotRoot=/apps/hive/warehouse, snapshotName=oozie-snapshot-2016_12_16-08_01, RpcClientId=1f566cee-d0eb-4a84-a615-40cdd31bc772, RpcCallId=1]
2016-12-20 09:20:18,599 ERROR namenode.NameNode (NameNode.java:main(1712)) - Failed to start namenode.
2016-12-20 09:20:18,601 INFO util.ExitUtil (ExitUtil.java:terminate(124)) - Exiting with status 1
2016-12-20 09:20:18,602 INFO namenode.NameNode (LogAdapter.java:info(47)) - SHUTDOWN_MSG:
ROOT CAUSE: We suspected that the edit logs were corrupted, which was preventing the Standby NameNode from starting up. Replicating the metadata from the primary NameNode to the standby didn't work. This is a bug: https://issues.apache.org/jira/browse/HDFS-6908. Affected versions: HDP 2.4.0, Ambari 2.2.1.1.

RESOLUTION: This is resolved in HDP 2.5 and Apache Hadoop 2.6.0; for the current scenario we need to request a patch from the Hortonworks dev team.