09-27-2015
07:13 AM
> How can I have only 68 blocks?

That depends on how much data your HDFS is carrying. Is the number much lower than expected, and does it not match the output of a 'hadoop fs -ls -R /' listing of all files? The space report says only about 23 MB is used by HDFS, so the block count looks OK to me.

> Also, when I run hive job, it does not go beyond "Running job: job_1443147339086_0002". Could it be related?

That would be unrelated, but to resolve it, consider raising the values under YARN -> Configuration -> Container Memory (NodeManager) and Container Virtual CPUs (NodeManager).
09-24-2015
09:56 PM
(1) While the CDH codebase does carry the initial 2.6 node-label implementation, many more node-labelling changes and enhancements made it upstream only in 2.8, so it is a feature still under some development. You can certainly utilise the 2.6 features in CDH 5.4.x, but only via the CapacityScheduler (following the upstream docs), because the code support does exist in the sources: https://github.com/cloudera/hadoop-common/tree/cdh5.4.7-release/

(2) FairScheduler support is not upstream yet. We do have node-labelling for the FairScheduler on our roadmap for a future release, but I don't have a shareable ETA for it yet.
09-24-2015
03:58 AM
The TTL values are stored as Cell-level tags [1]. To retrieve them, fetch the Cell via a Get (or similar), then use the Tags-relevant APIs on the Cell object: http://archive.cloudera.com/cdh5/cdh/5/hbase/apidocs/org/apache/hadoop/hbase/Cell.html#getTagsArray(), and deserialise the array of tags via http://archive.cloudera.com/cdh5/cdh/5/hbase/apidocs/org/apache/hadoop/hbase/CellUtil.html#tagsIterator(byte[],%20int,%20int)

[1] - https://github.com/cloudera/hbase/blob/cdh5.4.5-release/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java#L3483-L3486
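Roughly, a minimal sketch of that flow (assuming an HBase 1.0 / CDH 5.4 client, with placeholder table 't1' and row 'row1'; note also that tags only reach the client if the RPC codec carries them, e.g. KeyValueCodecWithTags, so verify that in your setup):

```java
import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.Tag;
import org.apache.hadoop.hbase.TagType;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ReadTtlTags {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("t1"))) {  // placeholder table
      Result result = table.get(new Get(Bytes.toBytes("row1")));  // placeholder row
      for (Cell cell : result.rawCells()) {
        // Walk the serialised tags block of each Cell
        Iterator<Tag> tags = CellUtil.tagsIterator(
            cell.getTagsArray(), cell.getTagsOffset(), cell.getTagsLength());
        while (tags.hasNext()) {
          Tag tag = tags.next();
          if (tag.getType() == TagType.TTL_TAG_TYPE) {
            // The TTL tag payload is the TTL in milliseconds, as a long
            System.out.println("Cell TTL (ms): " + Bytes.toLong(tag.getValue()));
          }
        }
      }
    }
  }
}
```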
09-24-2015
03:11 AM
You can edit this via the API just as you would with the Pools API, by making JSON edits to the same property the UI itself writes into. See https://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cm_props_cdh540_yarn_mr2included_.html#concept_5v9_49n_yk_unique_1__table_kqj_eb1_wk_unique_1, specifically the entry named "Fair Scheduler Allocations", described as "JSON representation of all the configurations that the Fair Scheduler can take on across all schedules. Typically edited using the Pools configuration UI.". Use http://cloudera.github.io/cm_api/apidocs/v10/path__clusters_-clusterName-_services_-serviceName-_config.html to update it like any other config, then call http://cloudera.github.io/cm_api/apidocs/v10/path__clusters_-clusterName-_commands_poolsRefresh.html to push the updated pool configs out (if you use Dynamic Resource Pools).
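As a rough sketch of those two calls (the host, cluster/service names and credentials are placeholders, and I'm assuming the property's API name is "yarn_fs_scheduled_allocations" - please verify that against the config reference linked above):

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class UpdateFairSchedulerAllocations {
  // Minimal helper: issue a CM API request with basic auth and an optional JSON body
  static void call(String method, String url, String jsonBody) throws Exception {
    HttpURLConnection c = (HttpURLConnection) new URL(url).openConnection();
    c.setRequestMethod(method);
    c.setRequestProperty("Authorization", "Basic " + Base64.getEncoder()
        .encodeToString("admin:admin".getBytes(StandardCharsets.UTF_8)));
    if (jsonBody != null) {
      c.setRequestProperty("Content-Type", "application/json");
      c.setDoOutput(true);
      try (OutputStream os = c.getOutputStream()) {
        os.write(jsonBody.getBytes(StandardCharsets.UTF_8));
      }
    }
    System.out.println(method + " " + url + " -> HTTP " + c.getResponseCode());
  }

  // Escape the allocations JSON so it can be embedded as a JSON string value
  static String quote(String s) {
    return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
  }

  public static void main(String[] args) throws Exception {
    String api = "http://cm-host.example.com:7180/api/v10";  // placeholder CM host
    String allocations = "{}";  // placeholder: your edited allocations JSON
    // 1. Update the "Fair Scheduler Allocations" property on the YARN service
    call("PUT", api + "/clusters/cluster1/services/yarn1/config",
        "{\"items\":[{\"name\":\"yarn_fs_scheduled_allocations\",\"value\":"
            + quote(allocations) + "}]}");
    // 2. Refresh the pools so the new configuration takes effect
    call("POST", api + "/clusters/cluster1/commands/poolsRefresh", null);
  }
}
```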
09-24-2015
02:34 AM
Does your WebSphere app load a custom set of configs to talk to the remote cluster? If so, are the JHS configs part of that config set? The properties below are all necessary for the MR2 job to register itself with the JHS for post-job persistence - get their values to precisely match the working 'hadoop jar' command host's /etc/hadoop/conf/mapred-site.xml (a sketch of setting them programmatically follows):

mapreduce.jobhistory.address
mapreduce.jobhistory.webapp.address (OR) mapreduce.jobhistory.webapp.https.address
yarn.app.mapreduce.am.staging-dir
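If the app builds its own Configuration object rather than loading a mapred-site.xml from its classpath, it could look roughly like this (all host names and values here are placeholders - copy the real ones from the working host):

```java
import org.apache.hadoop.conf.Configuration;

public class JhsClientConf {
  public static Configuration withJhs() {
    Configuration conf = new Configuration();
    // Placeholders: copy the real values from the working host's
    // /etc/hadoop/conf/mapred-site.xml
    conf.set("mapreduce.jobhistory.address", "jhs-host.example.com:10020");
    conf.set("mapreduce.jobhistory.webapp.address", "jhs-host.example.com:19888");
    conf.set("yarn.app.mapreduce.am.staging-dir", "/user");
    return conf;
  }
}
```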
09-23-2015
07:15 AM
> Wildcard addresses is being used on datanode/namenode
> dfs.client.use.datanode.hostname

This is your solution here, if (and only if) your client hosts can resolve the very same DN hostname, just over a different IP. Is that true in your environment? You mention you've tried this - could you elaborate? The setting needs to be applied in the HDFS client's configuration for it to properly take effect. Is your 'edge host' that lies outside the cluster, or your Java application (if it is run standalone), configured with this set to true in its hdfs-site.xml/Configuration object? A sketch of the latter follows.
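For the standalone Java case, roughly (the NameNode URI and path are placeholders):

```java
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HostnameBasedClient {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://nn-host.example.com:8020");  // placeholder NN
    // Connect to DataNodes by their hostnames, instead of the
    // (wildcard-bound/internal) IPs the NameNode reports back
    conf.setBoolean("dfs.client.use.datanode.hostname", true);
    try (FileSystem fs = FileSystem.get(conf);
         InputStream in = fs.open(new Path("/tmp/test.txt"))) {  // placeholder path
      System.out.println("First byte: " + in.read());
    }
  }
}
```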
09-22-2015
10:17 AM
Glad to know! Please consider marking the thread resolved, so others with a similar question can find a solution quicker. Feel free to post a new thread with any further questions.
09-22-2015
06:19 AM
Right, it was suggested as an optimisation aside from the summing question, given the described example. Does the bc command not solve your original question?
09-22-2015
02:26 AM
Yes, that could work too (or a file with them, passed via -f or such).
09-21-2015
11:56 PM
To add onto Wilfred's response, what is your CDH version? HDFS does cache all positive entries for 5 minutes, but negative caching wasn't supported until CDH 5.2.0 (via HADOOP-10755). See also http://archive.cloudera.com/cdh5/cdh/5/hadoop/hadoop-project-dist/hadoop-common/core-default.xml#hadoop.security.groups.negative-cache.secs (which lists negative caching's default TTL as 30s, vs. the positive cache's 300s). NSCD also does negative caching by default, which could explain why the problem is gone; that depends on how many negative (WARN group-lookup failure) entries you observe in the log.
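If you are on CDH 5.2.0+ and want to tune those TTLs, they are ordinary Hadoop configuration keys, read by the daemons from their own core-site.xml - as a quick illustration (the 300s value below is just an example):

```java
import org.apache.hadoop.conf.Configuration;

public class GroupCacheTtls {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // core-default.xml defaults: 300s positive cache, 30s negative cache
    System.out.println(conf.getLong("hadoop.security.groups.cache.secs", 300));
    System.out.println(conf.getLong("hadoop.security.groups.negative-cache.secs", 30));
    // Example: raise the negative-cache TTL to 5 minutes (the daemons read
    // this from their core-site.xml, not from client-side code like this)
    conf.setLong("hadoop.security.groups.negative-cache.secs", 300);
  }
}
```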