Member since: 02-08-2016
Posts: 793
Kudos Received: 669
Solutions: 85
My Accepted Solutions

Views | Posted
---|---
3067 | 06-30-2017 05:30 PM
3988 | 06-30-2017 02:57 PM
3309 | 05-30-2017 07:00 AM
3884 | 01-20-2017 10:18 AM
8403 | 01-11-2017 02:11 PM
12-24-2016 07:01 AM (3 Kudos)

SYMPTOM: Hive jobs are failing on the production aggregation cluster with "java.net.UnknownHostException: Matrix-Aggr". Matrix-Aggr is the nameservice for NameNode HA.

ERROR: The error log is as below:

Caused by: java.lang.IllegalArgumentException: java.net.UnknownHostException: Matrix-Aggr
at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:374)
at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:312)
at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:178)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:665)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:601)
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:148)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2619)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2653)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2635)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.getSchemaFromFS(AvroSerdeUtils.java:149)
at org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.determineSchemaOrThrowException(AvroSerdeUtils.java:110)
at org.apache.hadoop.hive.ql.io.avro.AvroGenericRecordReader.getSchema(AvroGenericRecordReader.java:112)
at org.apache.hadoop.hive.ql.io.avro.AvroGenericRecordReader.<init>(AvroGenericRecordReader.java:70)
at org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat.getRecordReader(AvroContainerInputFormat.java:51)
at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.<init>(CombineHiveRecordReader.java:65)
... 16 more
Caused by: java.net.UnknownHostException: Matrix-Aggr
... 33 more
Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
ROOT CAUSE: HDP 2.2.4 has a bug where AvroSerdeUtils.java resets the client configuration at the line below, discarding the HA settings (including the Matrix-Aggr nameservice mapping), and hence we get the UnknownHostException:

Schema s = getSchemaFromFS(schemaString, new Configuration());

RESOLUTION: This is fixed in recent versions via HIVE-9299. As a workaround, use a file:// URL for avro.schema.url and keep the schema file on all NodeManager machines. Otherwise, either request a patch from Hortonworks or upgrade HDP to the latest version.
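A rough sketch of the workaround (the table name, host, and schema path here are hypothetical examples, not from the original post):

# Copy the Avro schema to the same local path on every NodeManager host
scp events.avsc nm-host:/etc/hive/avro-schemas/events.avsc
# Point the table at the local copy instead of an hdfs:// or nameservice URL
hive -e "ALTER TABLE events SET TBLPROPERTIES ('avro.schema.url'='file:///etc/hive/avro-schemas/events.avsc');"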
12-24-2016 06:38 AM (3 Kudos)

SYMPTOM: During operations such as adding a service or upgrading, the Ambari UI complains that a package was not found for installation. From the log we can see that Ambari is searching for a repo version higher than the cluster's current HDP version.
For example:
-- Current version : Ambari Version 1.7.0 and HDP 2.2.0
-- But Ambari is searching the repo for version 2.2.8
ROOT CAUSE: It seems that the /var/lib/ambari-server/resources/stacks/HDP/<VERSION>/repos/repoinfo.xml file has been updated with the wrong "latest" version info.

RESOLUTION:
1. Comment out the following line in /var/lib/ambari-server/resources/stacks/HDP/<VERSION>/repos/repoinfo.xml:
<latest>http://public-repo-1.hortonworks.com/HDP/hdp_urlinfo.json</latest>
2. Open the Ambari database and check the contents of the metainfo table. For example, if the metainfo_key "repo:/HDP/2.2/redhat6/HDP-<VERSION>:baseurl" is missing, use the following command to add it:
INSERT INTO metainfo VALUES ('repo:/HDP/2.2/redhat6/HDP-<VERSION>:baseurl', 'http://public-repo-1.hortonworks.com/HDP/centos6/2.x/GA/2.2.0.0');
3. Restart the Ambari server and agents.
4. Run yum clean all on the service master hosts.
5. Re-install or re-run the upgrade (steps 2-4 are sketched below).
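A minimal sketch of steps 2-4, assuming the default Ambari metainfo table layout (columns metainfo_key, metainfo_value):

-- Run in the Ambari database to check whether the key is already present
SELECT metainfo_key, metainfo_value FROM metainfo WHERE metainfo_key LIKE 'repo:/HDP/%';

# On the Ambari server host
ambari-server restart
# On each agent host
ambari-agent restart
# On the service master hosts
yum clean all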
12-23-2016 07:17 PM (5 Kudos)

SYMPTOM: Trying to add components using the Ambari UI, but it keeps failing. We are using RHN Satellite repos to download packages. The HDP.repo and HDP_UTILS.repo files were configured with "enabled=0" on all servers, but they always get modified back to "enabled=1". Below are my repo files:

[HDP-2.5]
name=HDP-2.5
baseurl=http://172.26.64.249/hdp/centos6/HDP-2.5.3.0/
path=/
enabled=0
[HDP-UTILS-1.1.0.21]
name=HDP-UTILS-1.1.0.21
baseurl=http://public-repo-1.hortonworks.com/HDP-UTILS-1.1.0.21/repos/centos6
path=/
enabled=0
ROOT CAUSE: As Ambari re-applies its repo template (via its Puppet-style agent scripts), it will always revert the repo files back to the original enabled=1.

RESOLUTION: Modify the template file for your OS; in my case it was /var/lib/ambari-server/resources/stacks/HDP/2.0.6/hooks/before-INSTALL/templates/repo_suse_rhel.j2, replacing enabled=1 with enabled=0.

After restarting the Ambari server, services were able to install using the RHN Satellite repository.
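A sketch of the edit with a backup taken first (path as given above; sed replaces every enabled=1 in the template):

cp /var/lib/ambari-server/resources/stacks/HDP/2.0.6/hooks/before-INSTALL/templates/repo_suse_rhel.j2 /tmp/repo_suse_rhel.j2.bak
sed -i 's/enabled=1/enabled=0/g' /var/lib/ambari-server/resources/stacks/HDP/2.0.6/hooks/before-INSTALL/templates/repo_suse_rhel.j2
ambari-server restart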
12-23-2016 06:51 PM (5 Kudos)

SYMPTOM: We have alerts in Ambari about high DataNode heap usage on a production cluster. The maximum heap size of the DataNode is set to 16 GB.

ROOT CAUSE: DataNode operations are IO-intensive and do not require a 16 GB heap.

RESOLUTION: Tuning the GC parameters resolved the issue. Recommended settings for a 4 GB heap:
-Xms4096m -Xmx4096m -XX:NewSize=800m
-XX:MaxNewSize=800m -XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:+UseCMSInitiatingOccupancyOnly
-XX:CMSInitiatingOccupancyFraction=70
-XX:ParallelGCThreads=8
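These flags would typically be applied through HADOOP_DATANODE_OPTS in hadoop-env.sh (in Ambari: HDFS > Configs > hadoop-env template); a sketch, assuming the usual hadoop-env layout:

export HADOOP_DATANODE_OPTS="-Xms4096m -Xmx4096m -XX:NewSize=800m -XX:MaxNewSize=800m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=70 -XX:ParallelGCThreads=8 ${HADOOP_DATANODE_OPTS}"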
12-23-2016 06:26 PM (4 Kudos)

SYMPTOM: Upon starting the App Timeline Server after an Ambari & HDP upgrade, the following errors were thrown and the service was unable to start.

ERROR:

2015-08-02 22:56:24,311 INFO service.AbstractService (AbstractService.java:noteFailure(272)) - Service org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore failed in state INITED; cause:
org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 116 missing files; e.g.: /tmp/hadoop/yarn/timeline/leveldb-timeline-store.ldb/001052.sst
ROOT CAUSE: Corrupted SST files in the App Timeline Server's leveldb store path.

RESOLUTION: Navigate to /hadoop/yarn/timeline/leveldb-timeline-store.ldb. There you will see a text file named "CURRENT". Back this file up to /tmp and then remove it, as follows:
cp /hadoop/yarn/timeline/leveldb-timeline-store.ldb/CURRENT /tmp
rm /hadoop/yarn/timeline/leveldb-timeline-store.ldb/CURRENT
Restart the service via Ambari
12-23-2016 06:14 PM (3 Kudos)

SYMPTOM: The YARN timeline data is growing very fast and the disk is now 100% utilized. Below are my configs set for the ATS:

<property>
<name>yarn.timeline-service.ttl-enable</name>
<value>true</value>
</property>
<property>
<name>yarn.timeline-service.ttl-ms</name>
<value>1339200000</value>
</property>
<property>
<name>yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms</name>
<value>150000</value>
</property>
ROOT CAUSE: The yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms config does not affect the semantics of the ATS purging process; it only affects the concrete behavior of a leveldb-based storage implementation, deciding the time interval between two purge passes in a leveldb-based ATS store (leveldb storage and rolling leveldb storage). The retention itself is set by yarn.timeline-service.ttl-ms, and here the customer set it to 1339200000 ms, i.e. 1339200 seconds, 372 hours, or 15.5 days. On a normal cluster with a limited disk space budget this may cause problems (about 13 MB of timeline data per hour in this case). Reducing this value helps alleviate the problem.

RESOLUTION: In this case the issue was resolved by reducing the value of "yarn.timeline-service.ttl-ms" in the Application Timeline configuration from 1339200000 ms (15.5 days) to 669600000 ms (about 7.75 days):

<property>
<name>yarn.timeline-service.ttl-ms</name>
<value>669600000</value>
</property>
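As a quick sanity check on the retention arithmetic (one day is 86400000 ms):

awk 'BEGIN { print 1339200000 / 86400000, 669600000 / 86400000 }'   # prints: 15.5 7.75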
12-23-2016 05:55 PM (4 Kudos)

Question: Is it OK to run purge scripts on the WF_JOBS and COORD_JOBS tables in the Oozie database configured in MySQL? Will the purge scripts remove running workflows and coordinators? Below are the scripts we will be running to purge:

DELETE FROM WF_ACTIONS where WF_ID IN (SELECT ID from WF_JOBS where end_time < timestamp('2016-06-01 00:00:00'));
DELETE from wf_jobs where end_time < timestamp('2016-06-01 00:00:00');
DELETE from COORD_ACTIONS where JOB_ID in (select ID from COORD_JOBS where END_TIME < timestamp('2016-06-01 00:00:00'));
DELETE from coord_jobs where END_TIME < timestamp('2016-06-01 00:00:00');

Reply: Oozie has a built-in feature to purge older jobs from the database; by default the retention is 30 days. Actions related to long-running coordinators are not purged until the coordinator completes (for example, if you have a coordinator running for 6 months, all the related workflows stay in the database for 6 months).

Will the purge scripts remove the running workflows and coordinators?
--> No, they will not. Running coordinators/workflows won't be in the DB with an END_TIME, so the purge scripts above should be fine. Make sure you back up the Oozie DB first, and keep a 10-15 day gap before cleaning the DB. Oozie's own purge settings are sketched below as an alternative.
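Instead of (or alongside) manual SQL, Oozie's built-in PurgeService can be tuned in oozie-site.xml. The property names below are Oozie's standard ones; the values are illustrative:

<property>
<name>oozie.service.PurgeService.older.than</name>
<value>30</value> <!-- purge workflow jobs that completed more than 30 days ago -->
</property>
<property>
<name>oozie.service.PurgeService.coord.older.than</name>
<value>7</value> <!-- purge coordinator jobs that completed more than 7 days ago -->
</property>
<property>
<name>oozie.service.PurgeService.purge.interval</name>
<value>3600</value> <!-- run the purge check every 3600 seconds -->
</property>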
12-23-2016 05:25 PM (4 Kudos)

SYMPTOM: Ambari is showing an alert about a failed connection to the JournalNode service. Below is the alert:

2016-06-30 18:50:39,865 [CRITICAL] [HDFS] [journalnode_process] (JournalNode Process) Connection failed to http://jn1.example.com:8480 (Execution of 'curl -k --negotiate -u : -b /var/lib/ambari-agent/tmp/cookies/f8ed47d4-f63e-482c-be70-36755387ca4b -c /var/lib/ambari-agent/tmp/cookies/f8ed47d4-f63e-482c-be70-36755387ca4b -w '%{http_code}' http://jn.example.com:8480 --connect-timeout 5 --max-time 7 -o /dev/null 1>/tmp/tmpE9v3mg 2>/tmp/tmpKOSncN' returned 28. % Total % Received % Xferd Average Speed Time Time Time Current

ERROR: Below are the JournalNode logs:

2016-07-01 10:21:29,390 WARN namenode.FSImage (EditLogFileInputStream.java:scanEditLog(350)) - Caught exception after scanning through 0 ops from /hadoop/hdfs/journal/phadcluster01/current/edits_inprogress_0000000002510372012 while determining its valid length. Position was 712704
java.io.IOException: Can't scan a pre-transactional edit log.
at org.apache.hadoop.hdfs.server.namenode.FSEditLogOp$LegacyReader.scanOp(FSEditLogOp.java:4959)
at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.scanNextOp(EditLogFileInputStream.java:245)
at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.scanEditLog(EditLogFileInputStream.java:346)
at org.apache.hadoop.hdfs.server.namenode.FileJournalManager$EditLogFile.scanLog(FileJournalManager.java:520)
at org.apache.hadoop.hdfs.qjournal.server.Journal.scanStorageForLatestEdits(Journal.java:192)
at org.apache.hadoop.hdfs.qjournal.server.Journal.<init>(Journal.java:152)
at org.apache.hadoop.hdfs.qjournal.server.JournalNode.getOrCreateJournal(JournalNode.java:90)
at org.apache.hadoop.hdfs.qjournal.server.JournalNode.getOrCreateJournal(JournalNode.java:99)
at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.startLogSegment(JournalNodeRpcServer.java:161)
at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.startLogSegment(QJournalProtocolServerSideTranslatorPB.java:186)
at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25425)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)
ROOT CAUSE: From the log below, it appears the JournalNode edits were corrupted:

2016-07-01 10:21:16,007 WARN namenode.FSImage (EditLogFileInputStream.java:scanEditLog(350)) - Caught exception after scanning through 0 ops from /hadoop/hdfs/journal/phadcluster01/current/edits_inprogress_0000000002510372012 while determining its valid length. Position was 712704 java.io.IOException: Can't scan a pre-transactional edit log.

RESOLUTION: Below are the steps taken to resolve the issue (a command sketch follows the list):
1. Stopped the JournalNode.
2. Backed up the existing JournalNode directory metadata.
3. Copied the working edits_inprogress file from another JournalNode.
4. Changed the ownership to hdfs:hadoop.
5. Restarted the JournalNode.
6. The JournalNode started successfully and no more errors were seen in the log.
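A rough sketch of steps 2-4, using the journal directory and segment name from the logs above ("good-jn" is a placeholder for a healthy JournalNode host):

cp -a /hadoop/hdfs/journal/phadcluster01 /hadoop/hdfs/journal/phadcluster01.bak
scp good-jn:/hadoop/hdfs/journal/phadcluster01/current/edits_inprogress_0000000002510372012 /hadoop/hdfs/journal/phadcluster01/current/
chown hdfs:hadoop /hadoop/hdfs/journal/phadcluster01/current/edits_inprogress_0000000002510372012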
12-23-2016 03:49 PM

@Sami Ahmad It seems the krb5-conf is missing or corrupted. Please try manually creating the kerberos-env and krb5-conf configurations by issuing the Ambari REST API call explained below:

PUT /api/v1/clusters/CLUSTER_NAME
[
{
"Clusters": {
"desired_config": {
"type": "krb5-conf",
"tag": "version1234",
"properties": {
"domains":"",
"manage_krb5_conf": "true",
"conf_dir":"/etc",
"content" : "[libdefaults]\n renew_lifetime = 7d\n forwardable= true\n default_realm = {{realm|upper()}}\n ticket_lifetime = 24h\n dns_lookup_realm = false\n dns_lookup_kdc = false\n #default_tgs_enctypes = {{encryption_types}}\n #default_tkt_enctypes ={{encryption_types}}\n\n{% if domains %}\n[domain_realm]\n{% for domain in domains.split(',') %}\n {{domain}} = {{realm|upper()}}\n{% endfor %}\n{%endif %}\n\n[logging]\n default = FILE:/var/log/krb5kdc.log\nadmin_server = FILE:/var/log/kadmind.log\n kdc = FILE:/var/log/krb5kdc.log\n\n[realms]\n {{realm}} = {\n admin_server = {{admin_server_host|default(kdc_host, True)}}\n kdc = {{kdc_host}}\n }\n\n{# Append additional realm declarations below #}\n"
}
}
}
},
{
"Clusters": {
"desired_config": {
"type": "kerberos-env",
"tag": "version1234",
"properties": {
"kdc_type": "mit-kdc",
"manage_identities": "false",
"install_packages": "true",
"encryption_types": "aes des3-cbc-sha1 rc4 des-cbc-md5",
"realm" : "EXAMPLE.COM",
"kdc_host" : "hdc.host",
"admin_server_host" : "kadmin.host",
"executable_search_paths" : "/usr/bin, /usr/kerberos/bin, /usr/sbin, /usr/lib/mit/bin, /usr/lib/mit/sbin",
"password_length": "20",
"password_min_lowercase_letters": "1",
"password_min_uppercase_letters": "1",
"password_min_digits": "1",
"password_min_punctuation": "1",
"password_min_whitespace": "0",
"service_check_principal_name" : "${cluster_name}-${short_date}",
"case_insensitive_username_rules" : "false"
}
}
}
}
]

Note: manage_identities is set to false, indicating that Ambari should not interact with the KDC. This is because the customer did not want Ambari to destroy the principals in the KDC. Since Ambari was not managing the Kerberos identities, there was no need to fill in the correct data about the KDC.
TIP: When issuing the API call mentioned above, place the payload into a file and use curl like:

curl -H "X-Requested-By:ambari" -u admin:admin -i -X PUT -d @./payload.json http://AMBARI_SERVER:8080/api/v1/clusters/CLUSTER_NAME
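To confirm the configs were registered afterwards, a read-back call like the following should work (standard Ambari configurations endpoint; the tag matches the payload above):

curl -H "X-Requested-By:ambari" -u admin:admin "http://AMBARI_SERVER:8080/api/v1/clusters/CLUSTER_NAME/configurations?type=krb5-conf&tag=version1234"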
12-23-2016 12:06 PM (4 Kudos)

SYMPTOM: The Standby NameNode process on the second of our four management node servers isn't running. Interrogating the log files, I found an exception relating to an Oozie job.

ERROR: Below are the error logs:

2016-12-20 09:20:17,286 INFO namenode.EditLogInputStream (RedundantEditLogInputStream.java:nextOp(176)) - Fast-forwarding stream 'http://node1:8480/getJournal?jid=namenodeha&segmentTxId=16740759&storageInfo=-63%3A1400038789%3A0%3ACID-031f35b2-59c9-42f9-8942-550aee3d39e6, http://node1:8480/getJournal?jid=namenodeha&segmentTxId=16740759&storageInfo=-63%3A1400038789%3A0%3ACID-031f35b2-59c9-42f9-8942-550aee3d39e6' to transaction ID 16713078
2016-12-20 09:20:17,287 INFO namenode.EditLogInputStream (RedundantEditLogInputStream.java:nextOp(176)) - Fast-forwarding stream 'http://node1:8480/getJournal?jid=namenodeha&segmentTxId=16740759&storageInfo=-63%3A1400038789%3A0%3ACID-031f35b2-59c9-42f9-8942-550aee3d39e6' to transaction ID 16713078
2016-12-20 09:20:18,287 INFO namenode.FSEditLogLoader (FSEditLogLoader.java:loadEditRecords(266)) - replaying edit log: 48858/805951 transactions completed. (6%)
2016-12-20 09:20:18,485 ERROR namenode.FSEditLogLoader (FSEditLogLoader.java:loadEditRecords(242)) - Encountered exception on operation DeleteSnapshotOp [snapshotRoot=/apps/hive/warehouse, snapshotName=oozie-snapshot-2016_12_16-08_01, RpcClientId=1f566cee-d0eb-4a84-a615-40cdd31bc772, RpcCallId=1]
2016-12-20 09:20:18,599 ERROR namenode.NameNode (NameNode.java:main(1712)) - Failed to start namenode.
2016-12-20 09:20:18,601 INFO util.ExitUtil (ExitUtil.java:terminate(124)) - Exiting with status 1
2016-12-20 09:20:18,602 INFO namenode.NameNode (LogAdapter.java:info(47)) - SHUTDOWN_MSG:
ROOT CAUSE: We suspected that the edit logs were corrupted, which was preventing the Standby NameNode from starting up. Replicating the metadata from the primary NameNode to the standby didn't work. This is a bug: https://issues.apache.org/jira/browse/HDFS-6908. Affected versions: HDP 2.4.0, Ambari 2.2.1.1.

RESOLUTION: This is resolved in HDP 2.5 and Apache Hadoop 2.6.0; for the current scenario we need to request a patch from the Hortonworks dev team.