Member since: 03-07-2016
Posts: 9
Kudos Received: 6
Solutions: 2
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1045 | 06-16-2017 01:35 PM
 | 1674 | 05-25-2016 03:31 PM
06-16-2017
01:35 PM
So, after some more digging, I have managed to answer my own question. There is an additional API at the host level that exposes both the actual current state and the desired state of a component, and by comparing the two you can determine when the component has finished a state transition.

First, query Ambari to find out which hosts the component in question is running on:

curl -s -u admin:<PASSWORD> -H "X-Requested-By:ambari" -X GET http://ambari.dv.quasar.local:8080/api/v1/clusters/quasar_dv/services/YARN/components/RESOURCEMANAGER | jq '.host_components'

Which will return:

[
{
"href": "http://ambari.dv.quasar.local:8080/api/v1/clusters/quasar_dv/hosts/nn01.dv.quasar.local/host_components/RESOURCEMANAGER",
"HostRoles": {
"cluster_name": "quasar_dv",
"component_name": "RESOURCEMANAGER",
"host_name": "nn01.dv.quasar.local"
}
},
{
"href": "http://ambari.dv.quasar.local:8080/api/v1/clusters/quasar_dv/hosts/nn02.dv.quasar.local/host_components/RESOURCEMANAGER",
"HostRoles": {
"cluster_name": "quasar_dv",
"component_name": "RESOURCEMANAGER",
"host_name": "nn02.dv.quasar.local"
}
}
]
From here, you can parse the host_name values from this subset of the JSON and then poll Ambari with the following for each host:

curl -s -u admin:<PASSWORD> -H "X-Requested-By:ambari" -X GET http://ambari.dv.quasar.local:8080/api/v1/clusters/quasar_dv/hosts/nn01.dv.quasar.local/host_components/RESOURCEMANAGER | jq '.HostRoles.state, .HostRoles.desired_state'

Once the .state matches the .desired_state, the component has finished its transition.
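To tie the two calls together, here is a minimal polling sketch under the same assumptions as above (admin credentials, jq on the path); the host name, poll interval, and script variables are illustrative placeholders, not part of the Ambari API.

#!/usr/bin/env bash
# Poll Ambari until a host component's actual state matches its desired state.
AMBARI_URL="http://ambari.dv.quasar.local:8080"
CLUSTER="quasar_dv"
HOST="nn01.dv.quasar.local"      # one of the host_name values returned above
COMPONENT="RESOURCEMANAGER"
AUTH="admin:<PASSWORD>"

while true; do
  # Fetch the current and the desired state in a single call
  read -r STATE DESIRED < <(curl -s -u "$AUTH" -H "X-Requested-By:ambari" -X GET \
    "$AMBARI_URL/api/v1/clusters/$CLUSTER/hosts/$HOST/host_components/$COMPONENT" \
    | jq -r '"\(.HostRoles.state) \(.HostRoles.desired_state)"')
  echo "state=$STATE desired_state=$DESIRED"
  # Done once the actual state has caught up with the desired state
  [ "$STATE" = "$DESIRED" ] && break
  sleep 5
done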
05-26-2016
03:21 PM
You are absolutely correct that fixing the infrastructure issues is the right solution; however, doing so requires working with a number of other teams and will take quite some time to sort out. Luckily, this is in QA, so we can live with it in the meantime. Thank you very much for the hint. It turns out there are a number of properties that define how the NameNodes manage their various connections and timeouts to the JournalManagers. The following is from org.apache.hadoop.hdfs.DFSConfigKeys.java:

// Quorum-journal timeouts for various operations. Unlikely to need
// to be tweaked, but configurable just in case.
public static final String DFS_QJOURNAL_START_SEGMENT_TIMEOUT_KEY = "dfs.qjournal.start-segment.timeout.ms";
public static final String DFS_QJOURNAL_PREPARE_RECOVERY_TIMEOUT_KEY = "dfs.qjournal.prepare-recovery.timeout.ms";
public static final String DFS_QJOURNAL_ACCEPT_RECOVERY_TIMEOUT_KEY = "dfs.qjournal.accept-recovery.timeout.ms";
public static final String DFS_QJOURNAL_FINALIZE_SEGMENT_TIMEOUT_KEY = "dfs.qjournal.finalize-segment.timeout.ms";
public static final String DFS_QJOURNAL_SELECT_INPUT_STREAMS_TIMEOUT_KEY = "dfs.qjournal.select-input-streams.timeout.ms";
public static final String DFS_QJOURNAL_GET_JOURNAL_STATE_TIMEOUT_KEY = "dfs.qjournal.get-journal-state.timeout.ms";
public static final String DFS_QJOURNAL_NEW_EPOCH_TIMEOUT_KEY = "dfs.qjournal.new-epoch.timeout.ms";
public static final String DFS_QJOURNAL_WRITE_TXNS_TIMEOUT_KEY = "dfs.qjournal.write-txns.timeout.ms";
public static final int DFS_QJOURNAL_START_SEGMENT_TIMEOUT_DEFAULT = 20000;
public static final int DFS_QJOURNAL_PREPARE_RECOVERY_TIMEOUT_DEFAULT = 120000;
public static final int DFS_QJOURNAL_ACCEPT_RECOVERY_TIMEOUT_DEFAULT = 120000;
public static final int DFS_QJOURNAL_FINALIZE_SEGMENT_TIMEOUT_DEFAULT = 120000;
public static final int DFS_QJOURNAL_SELECT_INPUT_STREAMS_TIMEOUT_DEFAULT = 20000;
public static final int DFS_QJOURNAL_GET_JOURNAL_STATE_TIMEOUT_DEFAULT = 120000;
public static final int DFS_QJOURNAL_NEW_EPOCH_TIMEOUT_DEFAULT = 120000;
public static final int DFS_QJOURNAL_WRITE_TXNS_TIMEOUT_DEFAULT = 20000;
In my case, I added the following custom properties to hdfs-site.xml:

dfs.qjournal.start-segment.timeout.ms = 90000
dfs.qjournal.select-input-streams.timeout.ms = 90000
dfs.qjournal.write-txns.timeout.ms = 90000
I also added the following property to core-site.xml:

ipc.client.connect.timeout = 90000
So far, that seems to have alleviated the problem.
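To confirm that the new values are actually being picked up after the restart, one quick sanity check is to resolve them from the local Hadoop configuration on a NameNode host. This is just a sketch; it assumes shell access to the node and the hdfs client on the path.

# Prints the value each key resolves to in the local Hadoop configuration
for key in dfs.qjournal.start-segment.timeout.ms \
           dfs.qjournal.select-input-streams.timeout.ms \
           dfs.qjournal.write-txns.timeout.ms \
           ipc.client.connect.timeout; do
  echo -n "$key = "
  hdfs getconf -confKey "$key"
done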
05-25-2016
03:31 PM
So at this point, I believe the problem was of my own making, and I'll answer my own question. We had re-configured the cluster for HA; however, I did not update the Knox configuration to match. I updated the topology file as follows, adding HA provider configurations for both WEBHDFS and HIVE and pointing the NAMENODE service at the HA nameservice:

<topology>
<gateway>
<provider>
<role>ha</role>
<name>HaProvider</name>
<enabled>true</enabled>
<param>
<name>WEBHDFS</name>
<value>maxFailoverAttempts=3;failoverSleep=1000;maxRetryAttempts=300;retrySleep=1000;enabled=true</value>
</param>
<param>
<name>HIVE</name>
<value>maxFailoverAttempts=3;failoverSleep=1000;maxRetryAttempts=300;retrySleep=1000;enabled=true</value>
</param>
</provider>
<provider>
<role>authentication</role>
<name>ShiroProvider</name>
<enabled>true</enabled>
<param>
<name>sessionTimeout</name>
<value>30</value>
</param>
<param>
<name>main.ldapRealm</name>
<value>org.apache.hadoop.gateway.shirorealm.KnoxLdapRealm</value>
</param>
<param>
<name>main.ldapRealm.userDnTemplate</name>
<value>CN={0},OU=Network Architecture and Planning,OU=Network Operations Users,DC=qa,DC=hnops,DC=net</value>
</param>
<param>
<name>main.ldapRealm.contextFactory.url</name>
<value>ldap://qa.hnops.net:389</value>
</param>
<param>
<name>main.ldapRealm.contextFactory.authenticationMechanism</name>
<value>simple</value>
</param>
<param>
<name>urls./**</name>
<value>authcBasic</value>
</param>
</provider>
<provider>
<role>identity-assertion</role>
<name>Default</name>
<enabled>true</enabled>
</provider>
<provider>
<role>authorization</role>
<name>AclsAuthz</name>
<enabled>true</enabled>
</provider>
</gateway>
<service>
<role>NAMENODE</role>
<url>hdfs://quasar</url>
</service>
<service>
<role>JOBTRACKER</role>
<url>rpc://nn01.qa.quasar.local:8050</url>
</service>
<service>
<role>WEBHDFS</role>
<url>http://nn02.qa.quasar.local:50070/webhdfs</url>
<url>http://nn01.qa.quasar.local:50070/webhdfs</url>
</service>
<service>
<role>WEBHCAT</role>
<url>http://sn02.qa.quasar.local:50111/templeton</url>
</service>
<service>
<role>OOZIE</role>
<url>http://sn02.qa.quasar.local:11000/oozie</url>
</service>
<service>
<role>WEBHBASE</role>
<url>http://None:8080</url>
</service>
<service>
<role>HIVE</role>
<url>http://sn02.qa.quasar.local:10001/cliservice</url>
<url>http://sn01.qa.quasar.local:10001/cliservice</url>
</service>
<service>
<role>RESOURCEMANAGER</role>
<url>http://nn01.qa.quasar.local:8088/ws</url>
</service>
</topology>
Knox is now properly re-writing the Location header and proxying the requests:

$ curl -s -i -k -H "Authorization: Basic cmNoYXBpbjphYmMxMjMhQCM=" -X GET 'https://api01.qa:8443/quasar/jupstats/webhdfs/v1/user/rchapin/output_directory/000001_0?op=OPEN'

HTTP/1.1 307 Temporary Redirect
Set-Cookie: JSESSIONID=jssiado2ozvrd7q2emics1c2;Path=/quasar/jupstats;Secure;HttpOnly
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Cache-Control: no-cache
Expires: Wed, 25 May 2016 15:31:46 GMT
Date: Wed, 25 May 2016 15:31:46 GMT
Pragma: no-cache
Expires: Wed, 25 May 2016 15:31:46 GMT
Date: Wed, 25 May 2016 15:31:46 GMT
Pragma: no-cache
Location: https://api01.qa:8443/quasar/jupstats/webhdfs/data/v1/webhdfs/v1/user/rchapin/output_directory/000001_0?_=AAAACAAAABAAAABwU3P0-gOzsAEYuzLUjs4huLzVPGcVOmcEKqswrQYjnr8m9Uquuz_uy7jaF2paIqVCwaU7PxyuAysTRCyfHRus2qv5yhxd-3WHOkXI2TO0hR50R8J-GIoIbKhvZuAq4pwLI81177O9XsH0fTsBT45EexjWcyF9_Z0tBJhnvTlDpKcx_n0ZTmf_bw
Server: Jetty(6.1.26.hwx)
Content-Type: application/octet-stream
Content-Length: 0
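As a follow-up, if you want to pull the file contents in one step instead of inspecting the 307 by hand, you can let curl follow the rewritten Location header itself. Same gateway, credentials, and path as above; the output file name is just an example.

# -L follows the redirect Knox issues; -k skips certificate verification as in the original request
$ curl -s -k -L -H "Authorization: Basic cmNoYXBpbjphYmMxMjMhQCM=" -X GET 'https://api01.qa:8443/quasar/jupstats/webhdfs/v1/user/rchapin/output_directory/000001_0?op=OPEN' -o 000001_0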