Member since: 03-07-2016
Posts: 9
Kudos Received: 6
Solutions: 2
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1045 | 06-16-2017 01:35 PM
 | 1673 | 05-25-2016 03:31 PM
06-16-2017 01:35 PM
So, after some more digging, I have managed to answer my own question. There is an additional API at the host level that returns both the actual current state and the desired state of a component. Comparing the two tells you when the component has finished a state transition.

First, query Ambari to find out which hosts the component in question is running on:

curl -s -u admin:<PASSWORD> -H "X-Requested-By:ambari" -X GET http://ambari.dv.quasar.local:8080/api/v1/clusters/quasar_dv/services/YARN/components/RESOURCEMANAGER | jq '.host_components'

Which will return:

[
{
"href": "http://ambari.dv.quasar.local:8080/api/v1/clusters/quasar_dv/hosts/nn01.dv.quasar.local/host_components/RESOURCEMANAGER",
"HostRoles": {
"cluster_name": "quasar_dv",
"component_name": "RESOURCEMANAGER",
"host_name": "nn01.dv.quasar.local"
}
},
{
"href": "http://ambari.dv.quasar.local:8080/api/v1/clusters/quasar_dv/hosts/nn02.dv.quasar.local/host_components/RESOURCEMANAGER",
"HostRoles": {
"cluster_name": "quasar_dv",
"component_name": "RESOURCEMANAGER",
"host_name": "nn02.dv.quasar.local"
}
}
]
From here, parse the host_name values out of this subset of the JSON and then poll Ambari with the following for each host:

curl -s -u admin:<PASSWORD> -H "X-Requested-By:ambari" -X GET http://ambari.dv.quasar.local:8080/api/v1/clusters/quasar_dv/hosts/nn01.dv.quasar.local/host_components/RESOURCEMANAGER | jq '.HostRoles.state, .HostRoles.desired_state'

Once .state matches .desired_state, the component has finished its transition.
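The two calls above can be combined into a small polling loop. Here is a minimal sketch in Python; the base URL, cluster name, and credentials are the example values from this thread, and the HTTP fetch is factored out into a `get` parameter so the loop can be exercised (or tested) without a live Ambari server:

```python
import json
import time
from base64 import b64encode
from urllib.request import Request, urlopen

# Example values from this thread -- substitute your own cluster details.
AMBARI = "http://ambari.dv.quasar.local:8080/api/v1/clusters/quasar_dv"

def ambari_get(path, auth=("admin", "admin")):
    """GET an Ambari API path and return the parsed JSON body."""
    token = b64encode(("%s:%s" % auth).encode()).decode()
    req = Request(AMBARI + path, headers={
        "X-Requested-By": "ambari",
        "Authorization": "Basic " + token,
    })
    with urlopen(req) as resp:
        return json.load(resp)

def component_hosts(service, component, get=ambari_get):
    """List the hosts a component runs on, via the service-level endpoint."""
    body = get("/services/%s/components/%s" % (service, component))
    return [hc["HostRoles"]["host_name"] for hc in body["host_components"]]

def wait_for_transition(service, component, interval=2, get=ambari_get):
    """Poll each host-level endpoint until state == desired_state everywhere."""
    for host in component_hosts(service, component, get=get):
        while True:
            roles = get("/hosts/%s/host_components/%s"
                        % (host, component))["HostRoles"]
            if roles["state"] == roles["desired_state"]:
                break
            time.sleep(interval)
```

After issuing the stop (or start) PUT, calling `wait_for_transition("YARN", "RESOURCEMANAGER")` blocks until every host reports that the transition has completed.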
06-13-2017 02:33 PM
I am trying to automate restarting a component, specifically the RESOURCEMANAGER. I see that there is an Ambari API for this that enables setting the desired state of a component. However, the API does not show the current state of the component once the call is made; instead it returns the desired state that was just PUT. This is a problem when trying to stop and then start a component programmatically.

Try the following. In one terminal, run the command below, which every two seconds will display the ServiceComponentInfo.state value returned in the JSON (note the quoting, so that the jq filter survives inside the watch command):

watch -n 2 "curl -s -u admin:admin -H 'X-Requested-By:ambari' -X GET http://ambari:8080/api/v1/clusters/rchapindev/services/YARN/components/RESOURCEMANAGER | jq .ServiceComponentInfo.state"

In a second terminal, tail the ambari-server.log. In a third terminal, issue a 'stop' command (asking Ambari to transition the component to the INSTALLED state effectively stops the component):

curl -u admin:admin -H "X-Requested-By:ambari" -iX PUT -d '{"ServiceComponentInfo":{"state":"INSTALLED"}}' http://rchapin-wrkstn:8080/api/v1/clusters/rchapindev/services/YARN/components/RESOURCEMANAGER

Almost instantaneously, the first terminal will show that the state returns 'INSTALLED', while the logs display the following:

13 Jun 2017 14:21:59,462 INFO [qtp-ambari-client-15018] AbstractResourceProvider:622 - Received a updateComponent request, clusterName=rchapindev, serviceName=YARN, componentName=RESOURCEMANAGER, request=org.apache.ambari.server.controller.ServiceComponentRequest@7620bca
13 Jun 2017 14:21:59,466 INFO [qtp-ambari-client-15018] AmbariManagementControllerImpl:2072 - AmbariManagementControllerImpl.createHostAction: created ExecutionCommand for host rchapin-wrkstn, role RESOURCEMANAGER, roleCommand STOP, and command ID 943--1, with cluster-env tags version1
13 Jun 2017 14:21:59,496 INFO [ambari-action-scheduler] ServiceComponentHostImpl:1041 - Host role transitioned to a new state, serviceComponentName=RESOURCEMANAGER, hostName=rchapin-wrkstn, oldState=STARTED, currentState=STOPPING
13 Jun 2017 14:22:13,405 INFO [ambari-heartbeat-processor-0] ServiceComponentHostImpl:1041 - Host role transitioned to a new state, serviceComponentName=RESOURCEMANAGER, hostName=rchapin-wrkstn, oldState=STOPPING, currentState=INSTALLED

Notice that it is about 14 seconds after the command was issued that the component is truly in the INSTALLED state. This makes it impossible to poll this API endpoint for the actual state of the component in order to know when the 'start' command can be issued. Is there another API, or something that I am missing, with which I can query the currentState of a component?
Labels: Cloudera Manager
05-26-2016 03:21 PM
You are absolutely correct that fixing the infrastructure issues is the correct solution; however, doing so requires working with a number of other teams and will take quite some time to get sorted out. Luckily, it is in QA, so we can live with it in the meantime.

Thank you very much for the hint. It seems that there are a number of properties that define how the NameNodes manage their various types of connections and timeouts to the JournalNodes. The following is from org.apache.hadoop.hdfs.DFSConfigKeys.java:

// Quorum-journal timeouts for various operations. Unlikely to need
// to be tweaked, but configurable just in case.
public static final String DFS_QJOURNAL_START_SEGMENT_TIMEOUT_KEY = "dfs.qjournal.start-segment.timeout.ms";
public static final String DFS_QJOURNAL_PREPARE_RECOVERY_TIMEOUT_KEY = "dfs.qjournal.prepare-recovery.timeout.ms";
public static final String DFS_QJOURNAL_ACCEPT_RECOVERY_TIMEOUT_KEY = "dfs.qjournal.accept-recovery.timeout.ms";
public static final String DFS_QJOURNAL_FINALIZE_SEGMENT_TIMEOUT_KEY = "dfs.qjournal.finalize-segment.timeout.ms";
public static final String DFS_QJOURNAL_SELECT_INPUT_STREAMS_TIMEOUT_KEY = "dfs.qjournal.select-input-streams.timeout.ms";
public static final String DFS_QJOURNAL_GET_JOURNAL_STATE_TIMEOUT_KEY = "dfs.qjournal.get-journal-state.timeout.ms";
public static final String DFS_QJOURNAL_NEW_EPOCH_TIMEOUT_KEY = "dfs.qjournal.new-epoch.timeout.ms";
public static final String DFS_QJOURNAL_WRITE_TXNS_TIMEOUT_KEY = "dfs.qjournal.write-txns.timeout.ms";
public static final int DFS_QJOURNAL_START_SEGMENT_TIMEOUT_DEFAULT = 20000;
public static final int DFS_QJOURNAL_PREPARE_RECOVERY_TIMEOUT_DEFAULT = 120000;
public static final int DFS_QJOURNAL_ACCEPT_RECOVERY_TIMEOUT_DEFAULT = 120000;
public static final int DFS_QJOURNAL_FINALIZE_SEGMENT_TIMEOUT_DEFAULT = 120000;
public static final int DFS_QJOURNAL_SELECT_INPUT_STREAMS_TIMEOUT_DEFAULT = 20000;
public static final int DFS_QJOURNAL_GET_JOURNAL_STATE_TIMEOUT_DEFAULT = 120000;
public static final int DFS_QJOURNAL_NEW_EPOCH_TIMEOUT_DEFAULT = 120000;
public static final int DFS_QJOURNAL_WRITE_TXNS_TIMEOUT_DEFAULT = 20000;
In my case, I added the following custom properties to hdfs-site.xml:

dfs.qjournal.start-segment.timeout.ms = 90000
dfs.qjournal.select-input-streams.timeout.ms = 90000
dfs.qjournal.write-txns.timeout.ms = 90000
I also added the following property to core-site.xml:

ipc.client.connect.timeout = 90000
So far, that seems to have alleviated the problem.
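If you maintain hdfs-site.xml by hand rather than through a management UI, the key=value pairs above have to be expressed as Hadoop <property> elements. A small illustrative helper (the property names and values are the ones from this post; the function itself is just a sketch for generating the XML):

```python
from xml.etree import ElementTree as ET

# The timeout overrides from this post, as destined for hdfs-site.xml.
OVERRIDES = {
    "dfs.qjournal.start-segment.timeout.ms": "90000",
    "dfs.qjournal.select-input-streams.timeout.ms": "90000",
    "dfs.qjournal.write-txns.timeout.ms": "90000",
}

def to_hadoop_xml(props):
    """Render a dict of config overrides as a Hadoop <configuration> doc."""
    root = ET.Element("configuration")
    for name, value in sorted(props.items()):
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")

print(to_hadoop_xml(OVERRIDES))
```

Each entry comes out as a standard `<property><name>...</name><value>...</value></property>` block inside `<configuration>`, which is the form hdfs-site.xml and core-site.xml expect.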
05-25-2016 03:31 PM
So at this point I believe the problem was of my own making, and I'll answer my own question. We had re-configured the cluster to be HA; however, I did not update the Knox configuration for HA. The fix was to update the topology file as follows, adding HA configurations for both WEBHDFS and HIVE and updating the NAMENODE service to use the HA nameservice name:

<topology>
<gateway>
<provider>
<role>ha</role>
<name>HaProvider</name>
<enabled>true</enabled>
<param>
<name>WEBHDFS</name>
<value>maxFailoverAttempts=3;failoverSleep=1000;maxRetryAttempts=300;retrySleep=1000;enabled=true</value>
</param>
<param>
<name>HIVE</name>
<value>maxFailoverAttempts=3;failoverSleep=1000;maxRetryAttempts=300;retrySleep=1000;enabled=true</value>
</param>
</provider>
<provider>
<role>authentication</role>
<name>ShiroProvider</name>
<enabled>true</enabled>
<param>
<name>sessionTimeout</name>
<value>30</value>
</param>
<param>
<name>main.ldapRealm</name>
<value>org.apache.hadoop.gateway.shirorealm.KnoxLdapRealm</value>
</param>
<param>
<name>main.ldapRealm.userDnTemplate</name>
<value>CN={0},OU=Network Architecture and Planning,OU=Network Operations Users,DC=qa,DC=hnops,DC=net</value>
</param>
<param>
<name>main.ldapRealm.contextFactory.url</name>
<value>ldap://qa.hnops.net:389</value>
</param>
<param>
<name>main.ldapRealm.contextFactory.authenticationMechanism</name>
<value>simple</value>
</param>
<param>
<name>urls./**</name>
<value>authcBasic</value>
</param>
</provider>
<provider>
<role>identity-assertion</role>
<name>Default</name>
<enabled>true</enabled>
</provider>
<provider>
<role>authorization</role>
<name>AclsAuthz</name>
<enabled>true</enabled>
</provider>
</gateway>
<service>
<role>NAMENODE</role>
<url>hdfs://quasar</url>
</service>
<service>
<role>JOBTRACKER</role>
<url>rpc://nn01.qa.quasar.local:8050</url>
</service>
<service>
<role>WEBHDFS</role>
<url>http://nn02.qa.quasar.local:50070/webhdfs</url>
<url>http://nn01.qa.quasar.local:50070/webhdfs</url>
</service>
<service>
<role>WEBHCAT</role>
<url>http://sn02.qa.quasar.local:50111/templeton</url>
</service>
<service>
<role>OOZIE</role>
<url>http://sn02.qa.quasar.local:11000/oozie</url>
</service>
<service>
<role>WEBHBASE</role>
<url>http://None:8080</url>
</service>
<service>
<role>HIVE</role>
<url>http://sn02.qa.quasar.local:10001/cliservice</url>
<url>http://sn01.qa.quasar.local:10001/cliservice</url>
</service>
<service>
<role>RESOURCEMANAGER</role>
<url>http://nn01.qa.quasar.local:8088/ws</url>
</service>
</topology>
Knox is now properly rewriting the Location header and proxying the requests:

$ curl -s -i -k -H "Authorization: Basic cmNoYXBpbjphYmMxMjMhQCM=" -X GET 'https://api01.qa:8443/quasar/jupstats/webhdfs/v1/user/rchapin/output_directory/000001_0?op=OPEN'

HTTP/1.1 307 Temporary Redirect
Set-Cookie: JSESSIONID=jssiado2ozvrd7q2emics1c2;Path=/quasar/jupstats;Secure;HttpOnly
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Cache-Control: no-cache
Expires: Wed, 25 May 2016 15:31:46 GMT
Date: Wed, 25 May 2016 15:31:46 GMT
Pragma: no-cache
Expires: Wed, 25 May 2016 15:31:46 GMT
Date: Wed, 25 May 2016 15:31:46 GMT
Pragma: no-cache
Location: https://api01.qa:8443/quasar/jupstats/webhdfs/data/v1/webhdfs/v1/user/rchapin/output_directory/000001_0?_=AAAACAAAABAAAABwU3P0-gOzsAEYuzLUjs4huLzVPGcVOmcEKqswrQYjnr8m9Uquuz_uy7jaF2paIqVCwaU7PxyuAysTRCyfHRus2qv5yhxd-3WHOkXI2TO0hR50R8J-GIoIbKhvZuAq4pwLI81177O9XsH0fTsBT45EexjWcyF9_Z0tBJhnvTlDpKcx_n0ZTmf_bw
Server: Jetty(6.1.26.hwx)
Content-Type: application/octet-stream
Content-Length: 0
05-25-2016 01:36 PM
@Mark Petronic and I are building out QA and Production HA HDP 2.3.4.7 clusters. Our QA cluster is entirely on VMware virtual machines. We are having some problems with the underlying infrastructure that cause hosts to freeze for, at times, up to 30-45 seconds. Yes, this is a totally separate problem and beyond the scope of the Hortonworks Community. However, what I am trying to do is increase the NameNode timeout from the 20000 ms default to see if we can alleviate this problem for the time being. What ends up happening is that once the NameNode times out attempting to connect to a quorum of JournalNode processes, it just shuts down:

2016-05-25 01:46:16,480 INFO client.QuorumJournalManager (QuorumCall.java:waitFor(136)) - Waited 6001 ms (timeout=20000 ms) for a response for startLogSegment(416426). No responses yet.
2016-05-25 01:46:26,577 WARN client.QuorumJournalManager (QuorumCall.java:waitFor(134)) - Waited 16098 ms (timeout=20000 ms) for a response for startLogSegment(416426). No responses yet.
2016-05-25 01:46:27,578 WARN client.QuorumJournalManager (QuorumCall.java:waitFor(134)) - Waited 17099 ms (timeout=20000 ms) for a response for startLogSegment(416426). No responses yet.
2016-05-25 01:46:28,580 WARN client.QuorumJournalManager (QuorumCall.java:waitFor(134)) - Waited 18100 ms (timeout=20000 ms) for a response for startLogSegment(416426). No responses yet.
2016-05-25 01:46:29,580 WARN client.QuorumJournalManager (QuorumCall.java:waitFor(134)) - Waited 19101 ms (timeout=20000 ms) for a response for startLogSegment(416426). No responses yet.
2016-05-25 01:46:30,480 FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(398)) - Error: starting log segment 416426 failed for required journal (JournalAndStream(mgr=QJM to [172.19.64.30:8485, 172.19.64.31:8485, 172.19.64.32:8485], stream=null))
java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.startLogSegment(QuorumJournalManager.java:403)
at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalAndStream.startLogSegment(JournalSet.java:107)
at org.apache.hadoop.hdfs.server.namenode.JournalSet$3.apply(JournalSet.java:222)
at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
at org.apache.hadoop.hdfs.server.namenode.JournalSet.startLogSegment(JournalSet.java:219)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.startLogSegment(FSEditLog.java:1237)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.rollEditLog(FSEditLog.java:1206)
at org.apache.hadoop.hdfs.server.namenode.FSImage.rollEditLog(FSImage.java:1297)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.rollEditLog(FSNamesystem.java:5939)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.rollEditLog(NameNodeRpcServer.java:1186)
at org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolServerSideTranslatorPB.rollEditLog(NamenodeProtocolServerSideTranslatorPB.java:142)
at org.apache.hadoop.hdfs.protocol.proto.NamenodeProtocolProtos$NamenodeProtocolService$2.callBlockingMethod(NamenodeProtocolProtos.java:12025)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)
2016-05-25 01:46:30,483 INFO util.ExitUtil (ExitUtil.java:terminate(124)) - Exiting with status 1
2016-05-25 01:46:30,487 INFO provider.AuditProviderFactory (AuditProviderFactory.java:run(454)) - ==> JVMShutdownHook.run()
2016-05-25 01:46:30,487 INFO provider.AuditProviderFactory (AuditProviderFactory.java:run(459)) - <== JVMShutdownHook.run()
2016-05-25 01:46:30,492 INFO namenode.NameNode (LogAdapter.java:info(47)) - SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at nn01.qa.quasar.local/172.19.64.30
************************************************************/
Digging through the documentation, I thought that the relevant setting was ipc.client.connect.timeout in core-site.xml, but that does not seem to be the case. Does anyone know which configuration parameter, and in which config file, I need to update from the 20000 ms default?
Labels: Apache Hadoop
05-24-2016 02:05 AM
We have added Knox to our cluster and are proxying WebHDFS calls through it. We would like Knox to rewrite the WebHDFS URLs so that all subsequent calls to WebHDFS can also be proxied through Knox, but it is unclear how to enable the URL rewriting. Currently, I can send an https request, via curl, to Knox with a request to 'OPEN' a file, and it will return the Location header, which I can then use to download the file from HDFS:

$ curl -s -i -k -H "Authorization: Basic c3NpdGFyYW06elNtM0JvVyE=" -X GET 'https://api01.qa:8443/quasar/jupstats/webhdfs/v1/user/rchapin/output_directory/000001_0?op=OPEN'
HTTP/1.1 307 Temporary Redirect
Set-Cookie: JSESSIONID=1qbldz84z20s9li4l0nz4hdkw;Path=/quasar/jupstats;Secure;HttpOnly
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Cache-Control: no-cache
Expires: Tue, 24 May 2016 01:53:57 GMT
Date: Tue, 24 May 2016 01:53:57 GMT
Pragma: no-cache
Expires: Tue, 24 May 2016 01:53:57 GMT
Date: Tue, 24 May 2016 01:53:57 GMT
Pragma: no-cache
Location: http://dn04.qa.quasar.local:50075/webhdfs/v1/user/rchapin/output_directory/000001_0?op=OPEN&user.name=rchapin&namenoderpcaddress=quasar&offset=0
Server: Jetty(6.1.26.hwx)
Content-Type: application/octet-stream
Content-Length: 0
The problem is that the URL returned in the Location header is a direct link to one of the data nodes, not a URL to the Knox server. Based on the Knox documentation here, Knox should be rewriting the Location header to proxy that request through itself, and it should be encrypting the original query parameters. In my attempts to figure out how to enable rewriting, I read the section regarding Provider Configuration; however, I was unable to find any further information about how to configure the rewrite provider, or an example of what a provider configuration block for rewrites looks like. Any assistance on how to configure Knox to enable URL rewriting would be greatly appreciated. The Knox topology file is as follows:

<topology>
<gateway>
<provider>
<role>authentication</role>
<name>ShiroProvider</name>
<enabled>true</enabled>
<param>
<name>sessionTimeout</name>
<value>30</value>
</param>
<param>
<name>main.ldapRealm</name>
<value>org.apache.hadoop.gateway.shirorealm.KnoxLdapRealm</value>
</param>
<param>
<name>main.ldapRealm.userDnTemplate</name>
<value>uid={0},ou=people,dc=hadoop,dc=apache,dc=org</value>
</param>
<param>
<name>main.ldapRealm.contextFactory.url</name>
<value>ldap://{{knox_host_name}}:33389</value>
</param>
<param>
<name>main.ldapRealm.contextFactory.authenticationMechanism</name>
<value>simple</value>
</param>
<param>
<name>urls./**</name>
<value>authcBasic</value>
</param>
</provider>
<provider>
<role>identity-assertion</role>
<name>Default</name>
<enabled>true</enabled>
</provider>
<provider>
<role>authorization</role>
<name>XASecurePDPKnox</name>
<enabled>true</enabled>
</provider>
</gateway>
<service>
<role>NAMENODE</role>
<url>hdfs://{{namenode_host}}:{{namenode_rpc_port}}</url>
</service>
<service>
<role>JOBTRACKER</role>
<url>rpc://{{rm_host}}:{{jt_rpc_port}}</url>
</service>
<service>
<role>WEBHDFS</role>
<url>http://{{namenode_host}}:{{namenode_http_port}}/webhdfs</url>
</service>
<service>
<role>WEBHCAT</role>
<url>http://{{webhcat_server_host}}:{{templeton_port}}/templeton</url>
</service>
<service>
<role>OOZIE</role>
<url>http://{{oozie_server_host}}:{{oozie_server_port}}/oozie</url>
</service>
<service>
<role>WEBHBASE</role>
<url>http://{{hbase_master_host}}:{{hbase_master_port}}</url>
</service>
<service>
<role>HIVE</role>
<url>http://{{hive_server_host}}:{{hive_http_port}}/{{hive_http_path}}</url>
</service>
<service>
<role>RESOURCEMANAGER</role>
<url>http://{{rm_host}}:{{rm_port}}/ws</url>
</service>
</topology>
Labels: Apache Hadoop, Apache Knox
05-04-2016 08:24 PM
2 Kudos
I know that this question already has an answer and I do not mean to troll or demean anyone's answer. I came across this post while searching for information about this very same thing and came up with a similar solution, but one that does not distribute the Knox server trustStore or the master secret key.
To achieve the same thing, do the following:

1. Export a server certificate from the Knox self-signed cert that you will distribute to users/clients. On the Knox server:

# cd /usr/hdp/current/knox-server/data/security/keystores
# keytool -exportcert -file knox.crt -keystore ./gateway.jks -storepass <master-secret-password>

2. On the client machines (from which you will be connecting to Hive through beeline), import the Knox cert into a user-specific trustStore. If the .jks file into which you are importing this cert already exists, you will need to enter the password that you used when you created it. If it does not yet exist, you will be asked for a new password. DO NOT LOSE THIS PASSWORD: you will need it when including the trustStore in the beeline connection string.

$ keytool -import -keystore myLocalTrustStore.jks -file knox.crt

Now you can connect with beeline as follows, and it will prompt you for the username and password for the authentication implementation that you configured in Knox:

$ beeline -u 'jdbc:hive2://knox-server-hostname:8443/database-name/;ssl=true;sslTrustStore=/path/to/myLocalTrustStore.jks;trustStorePassword=<your-trust-store-passwd>;transportMode=http;httpPath=gateway/default/hive'
03-20-2016 05:12 PM
1 Kudo
I should have been more specific in my last comment; that was the result of trying to do too many things at once and not taking the time to properly craft a comment. It seems obvious that this is an issue with the ORC SerDe code, but specifically it seems to be related to the code that reads each of the records in a given column. It /seems/ that the metadata for the stripes is valid. With only the 'corrupt' file in place, running

SELECT COUNT(1) FROM vsat_lmtd WHERE year=2016 AND month=3 AND day=8;

returns 1810465 records. Dumping the metadata for the same file with

hive --orcfiledump <path-to-file>

indicates the same number of records for the file:

File Statistics:
Column 0: count: 1810465 hasNull: false

Grepping through the output of the aforementioned command indicates that the column for which we are having the problem /seems/ to have the same number of records, per stripe, as every other column in each stripe. Also, looking at the overall average number of bytes per record across the files in this same partition shows only a few percentage points of difference between the files, so I am assuming that the number of records reflected in the stripe metadata is an accurate account of what is actually in the file. Does anyone here know how to parse an ORC file to separate out the data in each stripe into its own file? Doing so might help us isolate the problem to a specific record or records.
03-19-2016 10:22 PM
3 Kudos
Just to add some additional information: we also ran the same Hive query using MR as the execution engine, and it behaved the same way. Perhaps it is a problem with the ORC-related serialization classes?