Member since: 03-07-2016
Posts: 9
Kudos Received: 6
Solutions: 2
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1045 | 06-16-2017 01:35 PM
 | 1674 | 05-25-2016 03:31 PM
06-16-2017
01:35 PM
So, after some more digging, I have managed to answer my own question. There is an additional API at the host level that exposes both the actual current state and the desired state of a component, and by comparing the two you can determine when the component has finished a state transition.

First, query Ambari to find out which hosts the component in question is running on:

curl -s -u admin:<PASSWORD> -H "X-Requested-By:ambari" -X GET http://ambari.dv.quasar.local:8080/api/v1/clusters/quasar_dv/services/YARN/components/RESOURCEMANAGER | jq '.host_components'

Which will return:

[
{
"href": "http://ambari.dv.quasar.local:8080/api/v1/clusters/quasar_dv/hosts/nn01.dv.quasar.local/host_components/RESOURCEMANAGER",
"HostRoles": {
"cluster_name": "quasar_dv",
"component_name": "RESOURCEMANAGER",
"host_name": "nn01.dv.quasar.local"
}
},
{
"href": "http://ambari.dv.quasar.local:8080/api/v1/clusters/quasar_dv/hosts/nn02.dv.quasar.local/host_components/RESOURCEMANAGER",
"HostRoles": {
"cluster_name": "quasar_dv",
"component_name": "RESOURCEMANAGER",
"host_name": "nn02.dv.quasar.local"
}
}
]
From here, you can parse the host_name values from this subset of the JSON and then poll Ambari with the following for each host:

curl -s -u admin:<PASSWORD> -H "X-Requested-By:ambari" -X GET http://ambari.dv.quasar.local:8080/api/v1/clusters/quasar_dv/hosts/nn01.dv.quasar.local/host_components/RESOURCEMANAGER | jq '.HostRoles.state, .HostRoles.desired_state'

Once the .state matches the .desired_state, the component has finished its transition.
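To tie the two calls together, here is a minimal polling sketch under the same assumptions as above (admin credentials, jq on the path); the host name, poll interval, and script variables are illustrative placeholders, not part of the Ambari API.

#!/usr/bin/env bash
# Poll Ambari until a host component's actual state matches its desired state.
AMBARI_URL="http://ambari.dv.quasar.local:8080"
CLUSTER="quasar_dv"
HOST="nn01.dv.quasar.local"      # one of the host_name values returned above
COMPONENT="RESOURCEMANAGER"
AUTH="admin:<PASSWORD>"

while true; do
  # Fetch the current and the desired state in a single call
  read -r STATE DESIRED < <(curl -s -u "$AUTH" -H "X-Requested-By:ambari" -X GET \
    "$AMBARI_URL/api/v1/clusters/$CLUSTER/hosts/$HOST/host_components/$COMPONENT" \
    | jq -r '"\(.HostRoles.state) \(.HostRoles.desired_state)"')
  echo "state=$STATE desired_state=$DESIRED"
  # Done once the actual state has caught up with the desired state
  [ "$STATE" = "$DESIRED" ] && break
  sleep 5
done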
05-26-2016
03:21 PM
You are absolutely correct that fixing the infrastructure issues is the right solution; however, doing so requires working with a number of other teams and will take quite some time to sort out. Luckily, this is in QA, so we can live with it in the meantime. Thank you very much for the hint. It turns out there are a number of properties that define how the NameNodes manage their various connections and timeouts to the JournalManagers. The following is from org.apache.hadoop.hdfs.DFSConfigKeys.java:

// Quorum-journal timeouts for various operations. Unlikely to need
// to be tweaked, but configurable just in case.
public static final String DFS_QJOURNAL_START_SEGMENT_TIMEOUT_KEY = "dfs.qjournal.start-segment.timeout.ms";
public static final String DFS_QJOURNAL_PREPARE_RECOVERY_TIMEOUT_KEY = "dfs.qjournal.prepare-recovery.timeout.ms";
public static final String DFS_QJOURNAL_ACCEPT_RECOVERY_TIMEOUT_KEY = "dfs.qjournal.accept-recovery.timeout.ms";
public static final String DFS_QJOURNAL_FINALIZE_SEGMENT_TIMEOUT_KEY = "dfs.qjournal.finalize-segment.timeout.ms";
public static final String DFS_QJOURNAL_SELECT_INPUT_STREAMS_TIMEOUT_KEY = "dfs.qjournal.select-input-streams.timeout.ms";
public static final String DFS_QJOURNAL_GET_JOURNAL_STATE_TIMEOUT_KEY = "dfs.qjournal.get-journal-state.timeout.ms";
public static final String DFS_QJOURNAL_NEW_EPOCH_TIMEOUT_KEY = "dfs.qjournal.new-epoch.timeout.ms";
public static final String DFS_QJOURNAL_WRITE_TXNS_TIMEOUT_KEY = "dfs.qjournal.write-txns.timeout.ms";
public static final int DFS_QJOURNAL_START_SEGMENT_TIMEOUT_DEFAULT = 20000;
public static final int DFS_QJOURNAL_PREPARE_RECOVERY_TIMEOUT_DEFAULT = 120000;
public static final int DFS_QJOURNAL_ACCEPT_RECOVERY_TIMEOUT_DEFAULT = 120000;
public static final int DFS_QJOURNAL_FINALIZE_SEGMENT_TIMEOUT_DEFAULT = 120000;
public static final int DFS_QJOURNAL_SELECT_INPUT_STREAMS_TIMEOUT_DEFAULT = 20000;
public static final int DFS_QJOURNAL_GET_JOURNAL_STATE_TIMEOUT_DEFAULT = 120000;
public static final int DFS_QJOURNAL_NEW_EPOCH_TIMEOUT_DEFAULT = 120000;
public static final int DFS_QJOURNAL_WRITE_TXNS_TIMEOUT_DEFAULT = 20000;
In my case, I added the following custom properties to hdfs-site.xml:

dfs.qjournal.start-segment.timeout.ms = 90000
dfs.qjournal.select-input-streams.timeout.ms = 90000
dfs.qjournal.write-txns.timeout.ms = 90000
I also added the following property to core-site.xml:

ipc.client.connect.timeout = 90000
So far, that seems to have alleviated the problem.
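To confirm that the new values are actually being picked up after the restart, one quick sanity check is to resolve them from the local Hadoop configuration on a NameNode host. This is just a sketch; it assumes shell access to the node and the hdfs client on the path.

# Prints the value each key resolves to in the local Hadoop configuration
for key in dfs.qjournal.start-segment.timeout.ms \
           dfs.qjournal.select-input-streams.timeout.ms \
           dfs.qjournal.write-txns.timeout.ms \
           ipc.client.connect.timeout; do
  echo -n "$key = "
  hdfs getconf -confKey "$key"
done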
05-25-2016
03:31 PM
So at this point, I believe the problem was of my own making, and I'll answer my own question. We had re-configured the cluster for HA; however, I did not update the Knox configuration to match. I updated the topology file as follows, adding HA provider configurations for both WEBHDFS and HIVE and pointing the NAMENODE service at the HA nameservice:

<topology>
<gateway>
<provider>
<role>ha</role>
<name>HaProvider</name>
<enabled>true</enabled>
<param>
<name>WEBHDFS</name>
<value>maxFailoverAttempts=3;failoverSleep=1000;maxRetryAttempts=300;retrySleep=1000;enabled=true</value>
</param>
<param>
<name>HIVE</name>
<value>maxFailoverAttempts=3;failoverSleep=1000;maxRetryAttempts=300;retrySleep=1000;enabled=true</value>
</param>
</provider>
<provider>
<role>authentication</role>
<name>ShiroProvider</name>
<enabled>true</enabled>
<param>
<name>sessionTimeout</name>
<value>30</value>
</param>
<param>
<name>main.ldapRealm</name>
<value>org.apache.hadoop.gateway.shirorealm.KnoxLdapRealm</value>
</param>
<param>
<name>main.ldapRealm.userDnTemplate</name>
<value>CN={0},OU=Network Architecture and Planning,OU=Network Operations Users,DC=qa,DC=hnops,DC=net</value>
</param>
<param>
<name>main.ldapRealm.contextFactory.url</name>
<value>ldap://qa.hnops.net:389</value>
</param>
<param>
<name>main.ldapRealm.contextFactory.authenticationMechanism</name>
<value>simple</value>
</param>
<param>
<name>urls./**</name>
<value>authcBasic</value>
</param>
</provider>
<provider>
<role>identity-assertion</role>
<name>Default</name>
<enabled>true</enabled>
</provider>
<provider>
<role>authorization</role>
<name>AclsAuthz</name>
<enabled>true</enabled>
</provider>
</gateway>
<service>
<role>NAMENODE</role>
<url>hdfs://quasar</url>
</service>
<service>
<role>JOBTRACKER</role>
<url>rpc://nn01.qa.quasar.local:8050</url>
</service>
<service>
<role>WEBHDFS</role>
<url>http://nn02.qa.quasar.local:50070/webhdfs</url>
<url>http://nn01.qa.quasar.local:50070/webhdfs</url>
</service>
<service>
<role>WEBHCAT</role>
<url>http://sn02.qa.quasar.local:50111/templeton</url>
</service>
<service>
<role>OOZIE</role>
<url>http://sn02.qa.quasar.local:11000/oozie</url>
</service>
<service>
<role>WEBHBASE</role>
<url>http://None:8080</url>
</service>
<service>
<role>HIVE</role>
<url>http://sn02.qa.quasar.local:10001/cliservice</url>
<url>http://sn01.qa.quasar.local:10001/cliservice</url>
</service>
<service>
<role>RESOURCEMANAGER</role>
<url>http://nn01.qa.quasar.local:8088/ws</url>
</service>
</topology>
Knox is now properly re-writing the Location header and proxying the requests:

$ curl -s -i -k -H "Authorization: Basic cmNoYXBpbjphYmMxMjMhQCM=" -X GET 'https://api01.qa:8443/quasar/jupstats/webhdfs/v1/user/rchapin/output_directory/000001_0?op=OPEN'

HTTP/1.1 307 Temporary Redirect
Set-Cookie: JSESSIONID=jssiado2ozvrd7q2emics1c2;Path=/quasar/jupstats;Secure;HttpOnly
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Cache-Control: no-cache
Expires: Wed, 25 May 2016 15:31:46 GMT
Date: Wed, 25 May 2016 15:31:46 GMT
Pragma: no-cache
Expires: Wed, 25 May 2016 15:31:46 GMT
Date: Wed, 25 May 2016 15:31:46 GMT
Pragma: no-cache
Location: https://api01.qa:8443/quasar/jupstats/webhdfs/data/v1/webhdfs/v1/user/rchapin/output_directory/000001_0?_=AAAACAAAABAAAABwU3P0-gOzsAEYuzLUjs4huLzVPGcVOmcEKqswrQYjnr8m9Uquuz_uy7jaF2paIqVCwaU7PxyuAysTRCyfHRus2qv5yhxd-3WHOkXI2TO0hR50R8J-GIoIbKhvZuAq4pwLI81177O9XsH0fTsBT45EexjWcyF9_Z0tBJhnvTlDpKcx_n0ZTmf_bw
Server: Jetty(6.1.26.hwx)
Content-Type: application/octet-stream
Content-Length: 0
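As a follow-up, if you want to pull the file contents in one step instead of inspecting the 307 by hand, you can let curl follow the rewritten Location header itself. Same gateway, credentials, and path as above; the output file name is just an example.

# -L follows the redirect Knox issues; -k skips certificate verification as in the original request
$ curl -s -k -L -H "Authorization: Basic cmNoYXBpbjphYmMxMjMhQCM=" -X GET 'https://api01.qa:8443/quasar/jupstats/webhdfs/v1/user/rchapin/output_directory/000001_0?op=OPEN' -o 000001_0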