Member since: 07-31-2013
Posts: 1924
Kudos Received: 462
Solutions: 311
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 1990 | 07-09-2019 12:53 AM |
| | 11943 | 06-23-2019 08:37 PM |
| | 9197 | 06-18-2019 11:28 PM |
| | 10190 | 05-23-2019 08:46 PM |
| | 4611 | 05-20-2019 01:14 AM |
04-20-2016
02:07 AM
1 Kudo
The reduce phase of a bulk-load preparation job is used to align the output files with the regions of the target table. The number of reducers will always equal the number of regions in the target table at the time the job is launched. If you want more reducers, you will need to pre-split your table appropriately. Read up more on pre-splitting at http://hbase.apache.org/book.html#manual_region_splitting_decisions
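To illustrate, here is a minimal sketch of pre-splitting at table-creation time with the HBase Java client (assuming an HBase 1.x API; the table name, column family, and split points below are made-up placeholders):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      // Hypothetical table "mytable" with one column family "cf".
      HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("mytable"));
      desc.addFamily(new HColumnDescriptor("cf"));
      // Four split points -> five regions -> five reducers for the
      // bulk-load preparation job. Pick points that match your key layout.
      byte[][] splits = new byte[][] {
        Bytes.toBytes("20000000"),
        Bytes.toBytes("40000000"),
        Bytes.toBytes("60000000"),
        Bytes.toBytes("80000000")
      };
      admin.createTable(desc, splits);
    }
  }
}
```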
03-31-2016
08:48 PM
You'll need an HCat credential for regular Hive actions; the Hive2 credential is only for Hive2 actions. I'd also recommend using Hue to design workflows, as it automatically adds credentials when it detects they are needed, which eases your work.
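For reference, a sketch of what the HCat credential definition might look like in a workflow.xml; the metastore host, port, and Kerberos principal below are placeholders you would replace with your own values:

```xml
<credentials>
  <credential name="hcat_cred" type="hcat">
    <property>
      <name>hcat.metastore.uri</name>
      <!-- Placeholder: your metastore thrift URI -->
      <value>thrift://metastore-host:9083</value>
    </property>
    <property>
      <name>hcat.metastore.principal</name>
      <!-- Placeholder: your metastore Kerberos principal -->
      <value>hive/_HOST@EXAMPLE.COM</value>
    </property>
  </credential>
</credentials>
```

The Hive action then references it via its cred attribute, e.g. `<action name="my-hive-action" cred="hcat_cred">`.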
03-29-2016
08:37 AM
1 Kudo
The point about having the job.properties on the local filesystem applies only within the context of the Oozie CLI - it is mentioned explicitly to avoid confusion when writing a workflow XML (the XML needs to be on HDFS, but the properties are a local file read by the Oozie CLI when invoked - people often get confused by this relationship). When you use Hue, which in turn simply uses the Oozie REST API, you can use whatever mechanism Hue offers to manage your job.properties (as an HDFS file, defined within the workflow, etc.). The properties are used to resolve the workflow's variables and to supply it some necessary parameters. Once the job is submitted, the properties file or list of properties no longer matters to Oozie.
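As an illustration of that relationship, here is a rough sketch using Oozie's Java client API (the server URL, application path, and the inputDir variable are placeholder assumptions); the CLI does essentially the same thing with your local job.properties file:

```java
import java.util.Properties;
import org.apache.oozie.client.OozieClient;

public class SubmitExample {
  public static void main(String[] args) throws Exception {
    // Connect to the Oozie server (URL is a placeholder).
    OozieClient wc = new OozieClient("http://oozie-host:11000/oozie");

    // These properties play the same role as a local job.properties file:
    // they resolve variables in the workflow XML, which itself lives on HDFS.
    Properties conf = wc.createConfiguration();
    conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode/user/me/my-wf");
    conf.setProperty("inputDir", "/user/me/input");

    // Once submitted, Oozie keeps its own copy of these values; the local
    // properties no longer matter.
    String jobId = wc.run(conf);
    System.out.println("Submitted: " + jobId);
  }
}
```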
03-21-2016
04:25 AM
1 Kudo
If you've previously set manual overrides for MaxPermSize options in your configurations, you can safely remove them now that you have switched to JDK8. If you still have parts that use JDK7, leave them be and ignore the warning. The warnings themselves do not pose a problem, as JDK8 simply notes that it won't use the option anymore and continues to start up normally. However, a future Java version (JDK9 or 10) may treat it as an invalid option and fail to start. Read more on this JDK-level change at https://dzone.com/articles/java-8-permgen-metaspace
03-18-2016
03:10 PM
1 Kudo
In HDFS, the permissions model for owner and group follows the BSD rule: the owner is set to the authenticated user, but the group is inherited from the parent directory. This is documented in the Permissions Guide: http://archive.cloudera.com/cdh5/cdh/5/hadoop/hadoop-project-dist/hadoop-hdfs/HdfsPermissionsGuide.html#Overview

"""
When a file or directory is created, its owner is the user identity of the client process, and its group is the group of the parent directory (the BSD rule).
"""

Group Mapping is used purely on the authorisation side, not on the creation side as you are expecting. Since your /user/username directory's group is by default the username itself, that's the value you will naturally see for all groups. If you'd like that changed, you will need to chgrp the /user/username directory to username:user-group instead of username:username. Files created under it subsequently will then carry username:user-group.
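For illustration, a minimal sketch of that chgrp using the Hadoop Java FileSystem API (the shell equivalent would be `hdfs dfs -chgrp user-group /user/username`; the group name is a placeholder):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChgrpExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Equivalent of 'hdfs dfs -chgrp user-group /user/username'.
    // Passing null for the owner argument leaves the owner unchanged.
    fs.setOwner(new Path("/user/username"), null, "user-group");
    fs.close();
  }
}
```

New files and directories created under /user/username will then inherit the user-group group, per the BSD rule above.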
03-11-2016
01:41 AM
For the timeout part of the question, please take a look at http://www.cloudera.com/documentation/enterprise/latest/topics/admin_hbase_scanner_heartbeat.html
03-03-2016
09:01 AM
Thank you for following up as always, Srini!
03-03-2016
07:52 AM
One part of HBase replication is turning on the server-side configs that enable the feature. This can be done via the CM API exactly as Marcell described. The other part is the peer configuration, which is not doable via the CM API, as it is not a service-passed configuration but more of a runtime one. You will need to use HBase's Java API (the ReplicationAdmin class) directly for this: http://archive.cloudera.com/cdh5/cdh/5/hbase/apidocs/org/apache/hadoop/hbase/client/replication/ReplicationAdmin.html. To do this in Python, I'd guess you will need to use Jython or similar.
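As a rough sketch (assuming an HBase 1.x client; the peer id and ZooKeeper cluster key are placeholders), adding a peer via ReplicationAdmin looks something like:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.replication.ReplicationAdmin;
import org.apache.hadoop.hbase.replication.ReplicationPeerConfig;

public class AddPeerExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    ReplicationAdmin admin = new ReplicationAdmin(conf);
    try {
      // The cluster key is the peer's ZK quorum, ZK port, and znode parent.
      ReplicationPeerConfig peer = new ReplicationPeerConfig();
      peer.setClusterKey("zk1,zk2,zk3:2181:/hbase");
      // null tableCfs = no per-table restriction; column families set to
      // REPLICATION_SCOPE=1 will replicate to this peer.
      admin.addPeer("1", peer, null);
    } finally {
      admin.close();
    }
  }
}
```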
03-03-2016
06:46 AM
Thanks for reporting this. Upon investigating, there's a small bug in how we load/respect lib_dir in the agent. There are two ways to supply it: via CMF_AGENT_ARGS (--lib_dir=/path) inside /etc/default/cloudera-scm-agent, and via the lib_dir line inside /etc/cloudera-scm-agent/config.ini. The bug is that for the UUID file we seem to look at the default path before we load up the config. I'd have expected some form of error in your case rather than a silent ignore, especially if /var/lib/cloudera-scm-agent no longer exists, so I'd first make sure you have lib_dir configured without a # comment prefix in your config.ini, and if that is so, apply the workaround for the bug (specify it both in CMF_AGENT_ARGS and in config.ini). The internal bug report is OPSAPS-32501.

P.S. Make sure to copy over the uuid file before you erase /var/lib/cloudera-scm-agent. If you've lost it, no worries: you can recreate the uuid file under your lib_dir with content matching the "Host ID" field shown at the top of CM -> Hosts -> (the specific host's page). Failure to keep the same UUID will result in a duplicated host (though that can be fixed trivially by fixing the uuid file, restarting the agent once more, and then deleting the role-less duplicate host).
03-01-2016
01:09 AM
Thank you for explaining the need, Rmutsaers! The AM's IPC port is indeed used directly by clients and is controllable on the serving AM via the yarn.app.mapreduce.am.job.client.port-range config. It still has to be a range though, and the range must be chosen keeping in mind that it will also effectively limit the number of AMs you can run on a host. The AM's web port is also served on an ephemeral port, but this is a non-concern because clients do not access the AM web port directly; they go via the RM's proxy service (wherein the RM makes the HTTP GET requests to the actual AM port, from within the cluster). Does yarn.app.mapreduce.am.job.client.port-range not solve your need? There's no IPC proxying today to eliminate the range requirement, unfortunately. The downside of not having the IPC port range open is not fatal either, as the client can still learn that the job completed once it is moved to the Job History Server (the RM redirects the client to it).
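For illustration, a minimal sketch of setting the range on a job's client configuration (the 50100-50200 range is just an example value):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class PortRangeExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Restrict the MR AM's client IPC port to this range. Remember the
    // range size also caps how many concurrent AMs a single host can run.
    conf.set("yarn.app.mapreduce.am.job.client.port-range", "50100-50200");
    Job job = Job.getInstance(conf, "port-range-example");
    // ... set mapper/reducer/input/output as usual, then job.submit() ...
  }
}
```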