Member since
07-31-2013
1924
Posts
462
Kudos Received
311
Solutions
My Accepted Solutions
Views | Posted |
---|---|
1543 | 07-09-2019 12:53 AM |
9311 | 06-23-2019 08:37 PM |
8056 | 06-18-2019 11:28 PM |
8681 | 05-23-2019 08:46 PM |
3477 | 05-20-2019 01:14 AM |
08-23-2018
07:38 PM
1 Kudo
Running the HDFS balancer on a cluster that also runs HBase will not cause operational problems such as crashes or errors, but it can affect performance, depending on which blocks the balancer decides to move based on its space thresholds. The impact comes from loss of locality: blocks of the HFiles a RegionServer serves may end up remote, so you may observe somewhat higher network usage until the next major compaction rewrites the data with a local block replica. If you'd like to narrow the window of impact, run the HDFS balancer with the desired balancing threshold and, once it completes, immediately follow up with a major compaction on your latency-sensitive HBase tables.
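As a rough illustration of that sequence, here is a minimal Python sketch that runs the balancer and then requests major compactions through the HBase shell. The function names, the threshold value and the table names are hypothetical; it assumes `hdfs` and `hbase` are on the PATH and that your HBase shell supports the non-interactive `-n` flag. Note that `major_compact` only requests the compaction, and that the balancer can exit non-zero in cases that are not real failures, so the sketch reports the exit code instead of raising on it.

```python
import subprocess


def balancer_command(threshold_percent):
    """Build the HDFS balancer invocation for a given utilization threshold."""
    return ["hdfs", "balancer", "-threshold", str(threshold_percent)]


def major_compact_script(tables):
    """Build HBase shell input that requests a major compaction per table."""
    return "".join("major_compact '{}'\n".format(t) for t in tables)


def rebalance_then_compact(threshold_percent, tables, dry_run=True):
    """Run the balancer to completion, then request major compactions.

    With dry_run=True nothing is executed; the planned commands are returned.
    """
    balancer = balancer_command(threshold_percent)
    script = major_compact_script(tables)
    if dry_run:
        return balancer, script
    # Blocks until the balancer finishes; inspect the exit code yourself,
    # since a non-zero code does not always indicate a hard failure.
    result = subprocess.run(balancer)
    print("balancer exit code:", result.returncode)
    # Feed the compaction requests to a non-interactive HBase shell.
    subprocess.run(["hbase", "shell", "-n"], input=script,
                   universal_newlines=True, check=True)
    return balancer, script
```

Calling `rebalance_then_compact(10, ["my_table"])` with the default `dry_run=True` only returns the commands, which is a safe way to review them before running against a real cluster.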
08-22-2018
05:44 PM
HBase authorization does not consult the NameNode for groups; instead, each RegionServer handling the request resolves them locally. Ensure your Linux users and groups are consistent across _all_ cluster hosts for a predictable result with any authorization feature.
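One way to spot inconsistencies is to collect the output of `id -Gn <user>` from every host (for example over ssh) and compare the resulting group sets. The sketch below covers only the comparison step; the function names, host names and group names are hypothetical.

```python
from collections import Counter


def parse_id_groups(output):
    """Parse the space-separated group names printed by `id -Gn <user>`."""
    return frozenset(output.split())


def find_group_mismatches(groups_by_host):
    """Return {host: groups} for hosts that differ from the majority group set."""
    if not groups_by_host:
        return {}
    majority, _ = Counter(groups_by_host.values()).most_common(1)[0]
    return {host: groups for host, groups in groups_by_host.items()
            if groups != majority}


# Hypothetical outputs of `id -Gn someuser` gathered from three hosts.
observed = {
    "rs1.example.com": parse_id_groups("hbase hadoop analysts"),
    "rs2.example.com": parse_id_groups("analysts hadoop hbase"),
    "rs3.example.com": parse_id_groups("hbase hadoop"),
}
mismatches = find_group_mismatches(observed)
```

Here `rs3.example.com` would be flagged because it is missing a group the other hosts report; group order does not matter.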
08-22-2018
06:37 AM
Yes, but (a) is your client able to resolve the hostnames of the DN/NN (you seem to be using an IP in your code), and (b) does it have permission (firewall, etc.) to connect to the DN web port?
08-22-2018
04:12 AM
It appears that your remote (client) machine has network access and/or DNS resolution only for the NameNode host, but not for the DataNode hosts.

When you use the WebHDFS protocol against the NameNode, a CREATE or READ call will typically result in the NameNode sending back a 30x code (typically 307) that redirects your client to a chosen target DataNode service, which handles the rest of the data-oriented work. The NameNode only handles metadata requests and avoids the overhead of actual data streaming, so it redirects clients to one of the 'worker' WebHDFS servlet hosts (i.e. DataNodes). This is documented at http://archive.cloudera.com/cdh5/cdh/5/hadoop/hadoop-project-dist/hadoop-hdfs/WebHDFS.html and you should be able to verify it in your error: the HOST_NAME that you've masked away for port 50075 is a DataNode service host/port.

For the WebHDFS client to work, ensure your client can name-resolve and connect to all DataNode hostnames and ports, not just the NameNode.

If you need a more one-stop-gateway solution, run an HTTPFS service and point your client code at just that web host:port instead of the NameNode web address. The HTTPFS service's WebHDFS API does not require redirection, as it acts as a 'proxy' and handles all calls for you from one location.
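To check this from the client side, you can take the Location header the NameNode returns (for instance from `curl -i` output) and test whether the host it names can be resolved and reached. The following Python sketch uses only the standard library; the function names and the example hostnames are hypothetical.

```python
import socket
from urllib.parse import urlsplit


def redirect_target(location):
    """Extract (host, port) of the DataNode from a WebHDFS redirect Location header."""
    parts = urlsplit(location)
    return parts.hostname, parts.port


def can_resolve(host):
    """Return True if the client can resolve the hostname."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Hypothetical Location header value from a 307 response.
location = ("http://dn1.example.com:50075/webhdfs/v1/tmp/file.txt"
            "?op=CREATE&namenoderpcaddress=nn.example.com:8020")
host, port = redirect_target(location)
```

Running `can_resolve(host)` and `can_connect(host, port)` on the client machine for each DataNode would show whether the failure is a DNS problem or a firewall/connectivity problem.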
08-18-2018
02:30 AM
With the id command failing, this is really a problem at a lower level than CDH, and it requires further troubleshooting at the OS and its group configuration layers. CDH components rely on a successful run of id, but the exit code of 1 indicates that's not the case, at least not for this user. I'd recommend taking this up with a Linux support team if the command prints nothing useful on its stderr that could help trace the problem for this specific account. You could also try to find which underlying subsystem is failing by running the command under strace and debugging further, and/or by looking at the sssd or other logs to catch the failure right after you run it.
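To capture exactly what the command returns for the affected account, a small wrapper like the one below can help. It is only a sketch: the function name is hypothetical, and the `command` parameter exists so the same wrapper can run a different lookup binary if needed.

```python
import subprocess


def run_id(user, command=("id",)):
    """Run `id <user>` and return (exit_code, stdout, stderr)."""
    result = subprocess.run(
        list(command) + [user],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return result.returncode, result.stdout.strip(), result.stderr.strip()
```

A non-zero exit code together with the stderr text is the information worth handing to whoever maintains the OS user and group configuration.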
08-11-2018
01:10 AM
1 Kudo
The documentation of CM metrics for HBase Master states the below:

"""
The number of times the balancer was invoked outside a balance cluster operation. The balancer might be invoked when a new table is created, for example, to place the newly created regions.
"""
- https://www.cloudera.com/documentation/enterprise/latest/topics/cm_metrics_master.html

If you want more granularity on when this metric is incremented, look for the method calls named 'incrMiscInvocations' inside the balancer base class within HBase Master: https://github.com/cloudera/hbase/blob/cdh5.15.0-release/hbase-server/src/main/java/org/apache/hadoop/hbase/master/balancer/BaseLoadBalancer.java

--

The _across_masters suffix just averages the value across all HBase Masters running in the cluster. The _across_roletype suffix generally applies the same way to all other role type metrics in CM. You can also query the metrics without the across suffixes.

The total_ prefix sums the value, instead of averaging it, across all relevant role types for your chosen metric.

The rate part of the name appears for all counter-style metrics. It represents the change over time (dx/dt). You can apply integral(…) to any rate-named metric to get the actual growth in value.

These are tsquery specifics documented further at https://www.cloudera.com/documentation/enterprise/latest/topics/cm_dg_metric_aggregation.html#cmug_topic_11_8
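The relationship between a counter, its rate and the integral of that rate can be illustrated with a little arithmetic. The Python below is not how CM computes anything; it is a sketch with made-up sample values and hypothetical function names showing why integrating a rate recovers the counter's growth, and how averaging differs from summing across roles.

```python
def rate(samples):
    """Per-second change (dx/dt) between consecutive (timestamp, counter) samples."""
    return [
        (t1, (v1 - v0) / (t1 - t0))
        for (t0, v0), (t1, v1) in zip(samples, samples[1:])
    ]


def integral(rates, start_time):
    """Sum rate * elapsed time, recovering the counter's growth since start_time."""
    total, prev_t = 0.0, start_time
    for t, r in rates:
        total += r * (t - prev_t)
        prev_t = t
    return total


# Made-up counter readings: (seconds, cumulative invocation count).
samples = [(0, 100), (60, 106), (120, 118)]
rates = rate(samples)
growth = integral(rates, start_time=0)

# Made-up per-master rates: averaging versus summing across roles.
per_master = {"master1": 0.2, "master2": 0.0}
across = sum(per_master.values()) / len(per_master)
total = sum(per_master.values())
```

With these values the integral comes back to the counter's increase over the window (118 minus 100), while the averaged and summed views of the per-master rates differ by a factor of the number of masters.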
07-30-2018
07:48 PM
Have you followed the solution suggested above? Depending on where you are writing to your cluster from, you will face this error unless your client can communicate with all your DataNode hosts and their ports.
07-30-2018
07:33 PM
The 1-factor should work. Setting it higher slows the job initialization phase a bit, but gives better task startup time due to quicker localization of its files. Interesting that you observe a "Replica not found" message for files needed during localization. Do you actively or frequently run the HDFS balancer, or were you running the balancer when you experienced this error? It's likely that the block changed locations between the point of write and the localizer downloading it when the job tasks begin. That would cause the WARN you see, which forces the client to re-fetch new locations from the NameNode and proceed normally after that.
07-29-2018
08:02 PM
1 Kudo
What version(s) of JDK/JRE are installed on the host that runs your NFS Gateway? Is it consistent with the other hosts? CDH/CM requires a recent release of Oracle JDK 1.7 or 1.8 to run: https://www.cloudera.com/documentation/enterprise/release-notes/topics/rn_consolidated_pcm.html#pcm_jdk and it is recommended not to keep multiple different versions of the Java JRE/JDK installed.
07-29-2018
07:54 PM
Your OS seems to be running out of free port numbers in the ephemeral range. On Linux this range typically runs from about 32k to 61k by default (see net.ipv4.ip_local_port_range), which is quite a lot of ports. A common reason is misuse of software clients (excessive connections created without shared connection pools, or connections leaked because the code never closes them), or lower-level problems with socket closure (such as the FIN stage of TCP not being processed correctly, causing the OS to hold the port open for an extended period while waiting for the final close to complete).

Are you perhaps executing a lot of concurrent programs on your cluster, or using a multi-threaded app that builds a new network client (for HDFS, etc.) under each thread?

When you experience this, you could run an lsof check on the host of the failing task to find which PID(s) are occupying most of the ephemeral ports and whether there is a pattern to their destination(s). This can help figure out where the problem specifically lies, and which of the above categories it belongs to.
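The lsof check can be summarized with a short script. The sketch below parses the text printed by `lsof -nP -iTCP` and counts connections per (PID, destination) whose local port falls in the ephemeral range. The function name, the sample lines and the assumed default range are illustrative only; adjust the range to whatever your host reports.

```python
import re
from collections import Counter

# Assumed default range; confirm with net.ipv4.ip_local_port_range on the host.
EPHEMERAL_RANGE = (32768, 60999)

# Matches the NAME column of an established connection: local:port->remote:port
_CONN = re.compile(r"^(?P<local>\S+):(?P<lport>\d+)->(?P<remote>\S+):(?P<rport>\d+)$")


def count_ephemeral_connections(lsof_output, port_range=EPHEMERAL_RANGE):
    """Count connections per (pid, destination) that use an ephemeral local port."""
    counts = Counter()
    for line in lsof_output.splitlines():
        fields = line.split()
        if len(fields) < 9 or not fields[1].isdigit():
            continue  # header or unexpected line
        match = _CONN.match(fields[8])
        if not match:
            continue  # e.g. listening sockets have no "->" part
        if port_range[0] <= int(match.group("lport")) <= port_range[1]:
            dest = "{}:{}".format(match.group("remote"), match.group("rport"))
            counts[(int(fields[1]), dest)] += 1
    return counts


# Made-up sample in the shape of `lsof -nP -iTCP` output.
SAMPLE = """\
COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
java    12345 yarn  245u  IPv4 111111      0t0  TCP 10.0.0.5:45678->10.0.0.9:8020 (ESTABLISHED)
java    12345 yarn  246u  IPv4 111112      0t0  TCP 10.0.0.5:45679->10.0.0.9:8020 (CLOSE_WAIT)
java    12345 yarn  100u  IPv4 111113      0t0  TCP *:8042 (LISTEN)
sshd      900 root    3u  IPv4 111114      0t0  TCP 10.0.0.5:22->10.0.0.7:51515 (ESTABLISHED)
"""
counts = count_ephemeral_connections(SAMPLE)
```

Sorting the result with `counts.most_common()` puts the heaviest (PID, destination) pairs first, which is usually enough to tell a connection leak in one process apart from many short-lived clients.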