About Harsh J

Harsh J · ‎08-23-2018

Running the HDFS balancer on a cluster that also runs HBase will not cause operational problems such as crashes or errors, but it can affect performance, depending on what the balancer decides to move based on its space thresholds. The impact comes from loss of locality: blocks of the HFiles a RegionServer needs may end up remote, so somewhat higher network usage can be observed until the next major compaction rewrites the data with a local block replica. If you'd like to narrow down the window of impact, run the HDFS balancer with the desired balancing threshold and, once it completes, immediately follow up with a major compaction of your latency-sensitive HBase tables.
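The balancer-then-compact sequence could be scripted roughly as below. This is only a sketch: the function names are made up for illustration, the threshold and table names are placeholders you would choose yourself, and it assumes the hdfs and hbase commands are on the PATH of a host with client configuration for the cluster.

```python
import subprocess


def balancer_command(threshold_percent):
    """Build the argv for an HDFS balancer run with the given threshold."""
    return ["hdfs", "balancer", "-threshold", str(threshold_percent)]


def major_compact_script(tables):
    """Build HBase shell input requesting a major compaction of each table."""
    return "".join("major_compact '{}'\n".format(table) for table in tables)


def rebalance_then_compact(threshold_percent, tables):
    """Run the balancer to completion, then request the major compactions."""
    # The balancer command blocks until balancing finishes (or fails).
    subprocess.run(balancer_command(threshold_percent), check=True)
    # Feed the compaction requests to a non-interactive HBase shell.
    subprocess.run(
        ["hbase", "shell", "-n"],
        input=major_compact_script(tables),
        text=True,
        check=True,
    )
```

A call such as rebalance_then_compact(10, ["my_table"]) would run both steps in order. Note that major_compact only submits the request; the compaction itself proceeds in the background on the RegionServers.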

Harsh J · ‎08-22-2018

HBase authorization does not consult the NameNode for groups; it resolves them locally on each RegionServer that handles the request. Ensure your Linux users and groups are consistent across _all_ cluster hosts for a predictable result with any authorization feature.
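One way to spot inconsistencies is to collect the output of id -Gn for the user on every host and compare the results. The sketch below covers only the comparison; how you gather the output (ssh, a configuration management tool, etc.) is up to you, and the function names and host names are hypothetical.

```python
from collections import Counter


def parse_groups(id_output):
    """Turn the space-separated output of `id -Gn <user>` into a set."""
    return frozenset(id_output.split())


def find_group_mismatches(id_output_by_host):
    """Return hosts whose group set differs from the most common one."""
    parsed = {host: parse_groups(out) for host, out in id_output_by_host.items()}
    if not parsed:
        return {}
    # Treat the group set seen on the most hosts as the reference.
    reference, _ = Counter(parsed.values()).most_common(1)[0]
    mismatches = {}
    for host, groups in parsed.items():
        if groups != reference:
            mismatches[host] = {
                "missing": sorted(reference - groups),
                "extra": sorted(groups - reference),
            }
    return mismatches
```

This assumes group names contain no spaces, which does not always hold for directory-backed groups.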

Harsh J · ‎08-22-2018

Yes, but (a) is your client able to resolve the hostnames of the DN/NN (you seem to be using an IP in your code), and (b) does it have permission (firewall, etc.) to connect to the DN web port?

Harsh J · ‎08-22-2018

It appears as though your remote (client) machine has network access and/or DNS resolution only for the NameNode host, but not for the DataNode hosts.

When using the WebHDFS protocol at the NameNode, a CREATE call or a READ call will typically result in the NameNode sending back a 30x code (typically 307) to redirect your client to a chosen target DataNode service that will handle the rest of the data-oriented work. The NameNode only handles metadata requests and is not meant to be burdened with actual data streaming overheads, so it redirects clients to one of the 'worker' WebHDFS servlet hosts (i.e. DataNodes). This is documented at http://archive.cloudera.com/cdh5/cdh/5/hadoop/hadoop-project-dist/hadoop-hdfs/WebHDFS.html and you should be able to verify this in your error: the HOST_NAME that you've masked away for port 50075 is a DataNode service host/port.

For the WebHDFS client to work, ensure your client can name-resolve and connect to all DataNode hostnames/ports, not just the NameNode.

If you need more of a one-stop gateway solution, run an HTTPFS service and point your client code at just that web host:port instead of the NameNode web address. The HTTPFS service's WebHDFS API will not require redirection, as it acts as a 'proxy' and handles all calls for you from one location.
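To confirm which DataNode the client is being sent to and whether it can reach it, you could take the Location header from the 307 response and test it directly. A rough sketch, using only the Python standard library; the function names are made up for illustration.

```python
import socket
from urllib.parse import urlsplit


def redirect_target(location_header):
    """Extract (host, port) from the Location header of a WebHDFS redirect."""
    parts = urlsplit(location_header)
    return parts.hostname, parts.port


def check_reachable(host, port, timeout=5.0):
    """Return None if the host resolves and accepts a TCP connection,
    otherwise a short description of the failure."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return None
    except OSError as exc:
        # Covers DNS failures, refused connections and timeouts alike.
        return "{}:{} unreachable: {}".format(host, port, exc)
```

Running check_reachable on the host and port returned by redirect_target, from the same machine as your client code, should tell you whether the problem is name resolution or connectivity to the DataNode web port.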

Harsh J · ‎08-18-2018

With the id command failing, this is really a problem at a lower level than CDH, and it requires further troubleshooting at the OS and its group configuration layers. CDH components rely on a successful run of id, but the exit code of 1 indicates that's not the case, at least not for this user. I'd recommend taking this up with a Linux support team if the command prints nothing useful on its stderr that could help trace the problem for this specific account. You could also try to see which underlying subsystem is failing by running it under strace and debugging further, and/or look at the sssd/other logs to catch the failure after you run it.
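To capture exactly what id reports for the account, something like the following could be run on the affected host. It is a small sketch around the standard id command; the function name is hypothetical.

```python
import subprocess


def run_id(username):
    """Run `id <username>` and return (exit_code, stdout, stderr)."""
    proc = subprocess.run(
        ["id", username],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
    )
    return proc.returncode, proc.stdout.strip(), proc.stderr.strip()
```

A non-zero exit code together with the stderr text is the information worth handing to whoever looks after the OS and its user/group lookup configuration.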

Harsh J · ‎08-11-2018

The documentation of CM metrics for HBase Master states the following:

"The number of times the balancer was invoked outside a balance cluster operation. The balancer might be invoked when a new table is created, for example, to place the newly created regions."
- https://www.cloudera.com/documentation/enterprise/latest/topics/cm_metrics_master.html

If you want more granularity on when this metric receives an increment, look for the method calls named 'incrMiscInvocations' inside the balancer base class within HBase Master: https://github.com/cloudera/hbase/blob/cdh5.15.0-release/hbase-server/src/main/java/org/apache/hadoop/hbase/master/balancer/BaseLoadBalancer.java

The _across_masters suffix just averages the value across all HBase Masters running in the cluster. The _across_roletype suffix generally applies the same way to all other role type metrics in CM. You can also query them without the across suffixes.

The total_ prefix sums the value, instead of averaging it, across all relevant role types for your chosen metric.

The rate part appears for basically all counter-style metrics. It represents the change over time (dx/dt). You can perform an integral(…) over any rate-named metric to access the actual value growth.

These are tsquery specifics documented further at https://www.cloudera.com/documentation/enterprise/latest/topics/cm_dg_metric_aggregation.html#cmug_topic_11_8
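The relationship between a counter, its rate, and the integral of that rate can be illustrated with a little arithmetic. The sketch below only demonstrates the idea; it is not how CM computes its metrics, and the sample values and function names are made up.

```python
def counter_to_rates(samples):
    """Convert (timestamp_seconds, counter_value) samples into
    (start, end, rate_per_second) intervals, i.e. dx/dt per interval."""
    return [
        (t0, t1, (v1 - v0) / (t1 - t0))
        for (t0, v0), (t1, v1) in zip(samples, samples[1:])
    ]


def integrate_rates(rate_intervals):
    """Integrate the rate over time to recover the counter's growth."""
    return sum(rate * (t1 - t0) for t0, t1, rate in rate_intervals)


# Hypothetical counter readings taken one minute apart.
samples = [(0, 100), (60, 103), (120, 103), (180, 109)]
growth = integrate_rates(counter_to_rates(samples))
```

Here growth matches the difference between the last and first counter readings (up to floating point rounding), which is what integrating a rate-named metric gives you back.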

Harsh J · ‎07-30-2018

Have you followed the solution given above? Depending on where you are writing into your cluster from, you will face this error unless that client can communicate with all your DataNode hosts and their ports.

Harsh J · ‎07-30-2018

The 1-factor should work. Setting it higher slows the job initialization phase a bit, but gives better task startup time due to quicker localization of its files. Interesting that you observe a "Replica not found" message for files needed during localization. Do you actively/frequently run the HDFS balancer, or were you running the balancer when you experienced this error? It's likely that the block changed locations between the point of write and the localizer downloading it when the job tasks begin. That would cause the WARN you see, which forces the client to re-fetch new locations from the NameNode and proceed normally after that.

Harsh J · ‎07-29-2018

What version(s) of JDK/JRE are installed on the host that runs your NFS Gateway? Is it consistent with the other hosts? CDH/CM requires recent version(s) of Oracle JDK version 1.7 or version 1.8 to run: https://www.cloudera.com/documentation/enterprise/release-notes/topics/rn_consolidated_pcm.html#pcm_jdk and it is recommended to not keep multiple different version(s) of Java JRE/JDK installed.

Harsh J · ‎07-29-2018

Your OS seems to be running out of free port numbers in the ephemeral range. On Linux this range typically spans roughly 32k to 61k by default (see net.ipv4.ip_local_port_range), which is quite a lot of ports. A common reason is abuse by software clients (excessive connections being created without use of shared connection pools, or a leak of connections due to non-closure in the code), or lower level problems with socket closure (such as the FIN stage of TCP not being correctly processed, causing the OS to hold the port open for an extended period of time waiting for the final close to complete). Are you perhaps executing a lot of concurrent programs on your cluster, or using a multi-threaded app that builds a new network client (for HDFS, etc.) under each thread? When you experience this, you could run an lsof check on the host of the failing task to find which PID(s) are occupying most of the network client ephemeral ports and whether there is a pattern to their destination(s). This can help figure out where the problem specifically lies, and which of the above categories it may belong to.
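The lsof check can be summarized with a short script. The sketch below assumes the default column layout of lsof -nP -iTCP output, where the connection appears in the ninth column as local->remote; the function names are made up for illustration.

```python
from collections import Counter


def read_ephemeral_range(path="/proc/sys/net/ipv4/ip_local_port_range"):
    """Read the (low, high) ephemeral port range on a Linux host."""
    with open(path) as f:
        low, high = f.read().split()
    return int(low), int(high)


def count_ephemeral_connections(lsof_output, port_range):
    """Count connections per (pid, command, destination) whose local
    port falls within the given ephemeral range."""
    low, high = port_range
    counts = Counter()
    for line in lsof_output.splitlines():
        fields = line.split()
        if len(fields) < 9 or "->" not in fields[8]:
            continue  # header line, listening sockets, unrelated lines
        local, remote = fields[8].split("->", 1)
        try:
            local_port = int(local.rsplit(":", 1)[1])
        except (IndexError, ValueError):
            continue  # could not parse a numeric local port
        if low <= local_port <= high:
            counts[(fields[1], fields[0], remote)] += 1
    return counts
```

Feeding it the captured lsof output along with read_ephemeral_range(), then looking at counts.most_common(10), should show which PIDs hold the most ephemeral ports and which destinations they point at.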

Member Since	‎07-31-2013 07:21 AM
Posts	1,924
Kudos received	461

Cloudera Community

Re: S3Guard Suggested to help fix Consistency

Re: Failed to start namenode. java.io.FileNotFound...

Re: sqoop import issue

Re: Efficient ways to store many images files

Re: S3 loading into HDFS

Re: Can I run the balancer for hdfs

Re: HBase simple authorization group permissions n...

Re: Unable to access HDFS Namenode from Python lib...

Re: Unable to access HDFS Namenode from Python lib...

Re: User not returning any groups for hdfs groups ...

Re: total_balancer_misc_invocations_rate_across_ma...

Re: only be replicated to 0 nodes instead of minRe...

Re: YARN - occasional Error message

Re: NFS Gateway failed to start Caused by: java.la...

Re: java.net.BindException: Cannot assign requeste...