Member since
09-24-2015
178
Posts
113
Kudos Received
28
Solutions
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 3375 | 05-25-2016 02:39 AM |
| | 3590 | 05-03-2016 01:27 PM |
| | 839 | 04-26-2016 07:59 PM |
| | 14394 | 03-24-2016 04:10 PM |
| | 2019 | 02-02-2016 11:50 PM |
12-04-2015
01:09 AM
1 Kudo
Here is how I would do it, though I may be missing some requirements; please feel free to add more details (without revealing any secret sauce of your logic 😉).
Assumptions first -
Job 1 - Executes Step A and Step B at 00:01 AM every morning (two-step job).
Job 2 - Executes Step B every hour between 01:01 and 23:01 throughout the day (single-step job).
Note: The timings can obviously be adjusted, but the assumption here is that the execution time of the two-step job is fixed and mutually exclusive with the other 23 executions of the single-step job. These two steps could be any action supported by Oozie, such as Hive, Pig, Email, SSH, etc. So the workflow definitions will have a duplicate Step B action in both jobs.
Coordinator definitions - The exact time of execution and the frequency can be controlled by specifying the values of validity and frequency.
For Job 1,
Validity = 00:00 hours of the day when you want the job to start executing.
Frequency = ${coord:days(int n)} - see section 4.4.1, "The coord:days(int n) and coord:endOfDays(int n) EL functions", at http://oozie.apache.org/docs/4.2.0/CoordinatorFunctionalSpec.html
For Job 2,
Validity = 01:00 hours of the same day as Job 1.
Frequency = "* 1-23 * * *" - note that instead of a fixed frequency we are using cron-style syntax, which is super cool; see section 4.4.3, "Cron syntax in coordinator frequency", at http://oozie.apache.org/docs/4.2.0/CoordinatorFunctionalSpec.html
Hope this helps.
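Under those assumptions, Job 2's coordinator could be sketched roughly as below. The app name, app path, and start/end dates are hypothetical; also note the cron minute field is set to 1 so the job fires at one minute past each hour (a bare * in that field would fire every minute):

```xml
<!-- Hypothetical coordinator for Job 2: Step B only, hourly from 01:01 to 23:01.
     Name, path, and dates are illustrative placeholders. -->
<coordinator-app name="job2-stepB-hourly"
                 frequency="1 1-23 * * *"
                 start="2015-12-04T01:00Z" end="2016-12-04T01:00Z"
                 timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
  <action>
    <workflow>
      <!-- This workflow.xml contains only the Step B action -->
      <app-path>${nameNode}/apps/oozie/job2-stepB</app-path>
    </workflow>
  </action>
</coordinator-app>
```

Job 1's coordinator would look the same except for a `${coord:days(1)}` frequency, a 00:00 start, and a workflow containing both Step A and Step B.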
12-03-2015
09:45 PM
Without spending too much time, it appears to be a defect to me. @Balu any thoughts?
12-03-2015
09:13 PM
2 Kudos
Hi Ravi, As you know, the property dfs.datanode.handler.count defines the number of server threads for the DataNode, and it is set at the DataNode level. In other words, its value is driven more by the I/O requests hitting the DataNode than by the size of the cluster. So, hypothetically speaking, if you have a cluster (large or small) being used for an online-archiving use case where the data is not read very often, you do not need a large number of parallel threads. As the traffic / I/O goes up, there may be benefit in increasing the number of parallel threads on the DataNode. Here is the code that uses this property. If there is a way to isolate the heavy workers from the light workers, then you can create Ambari configuration groups to set different values for this property on different hosts.
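For reference, the property lives in hdfs-site.xml; a sketch is below, with an illustrative (not recommended) value - the stock default in Hadoop 2.x is 10:

```xml
<!-- hdfs-site.xml fragment: the value 20 is an example only; tune it based
     on observed DataNode I/O load, not cluster size (default is 10). -->
<property>
  <name>dfs.datanode.handler.count</name>
  <value>20</value>
  <description>Number of server threads for the DataNode.</description>
</property>
```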
12-03-2015
08:57 PM
Moved the question to "Governance and Lifecycle" track.
12-03-2015
08:53 PM
Ambari can only manage the principals and keytabs for the services managed by it. The principals and keytabs are actually provided as part of the configuration files with the stack definition.
For example, for Storm, looking at the stack definition, you can see -
....
"name": "storm_components",
"principal": {
"value": "${storm-env/storm_user}-${cluster_name}@${realm}",
"type": "user",
"configuration": "storm-env/storm_principal_name"
},
"keytab": {
"file": "${keytab_dir}/storm.headless.keytab",
"owner": {
"name": "${storm-env/storm_user}",
"access": "r"
},
.... Ambari does not support managing principals and keytabs of other components that are outside its purview.
12-03-2015
08:10 PM
1 Kudo
There is no clean way to do this within the same Oozie job. If the time when Step A and Step B have to be executed together is fixed, then IMHO it would be a better approach to set up two different Oozie jobs - one with both steps that runs once a day, and the other with Step B only that runs 23 times a day.
12-02-2015
09:32 PM
I've seen very little ext3 and mostly ext4 for on-prem deployments. AWS EBS is xfs by default. XFS has its advantages, but in a JBOD setup it doesn't really provide a lot of benefits.
12-01-2015
01:12 AM
4 Kudos
This post by Lester Martin sums it up really well - https://martin.atlassian.net/wiki/pages/viewpage.action?pageId=36044812 Here is a summary - Since most OS patching / upgrades require a reboot, it is best to schedule such an activity around a planned outage. It is also recommended to go through the exercise in a lower-level environment before applying the changes in a PROD environment. To apply the changes while the cluster stays up, the patch/upgrade will have to be applied in a rolling manner: first stop the components on the host from Ambari, then apply the changes, reboot the host, and start the Hadoop services from Ambari again. Repeat for each host. This process will have to be scripted for a large cluster. The steps of stopping and starting the components can be performed using the Ambari APIs.
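The stop/start part of that per-host loop could be sketched against the Ambari REST API roughly as below. The server URL, cluster name, hostname, and credentials are all placeholder assumptions, and the curl calls are left commented out since they need a live Ambari server:

```shell
#!/bin/sh
# Sketch of the per-host stop/start cycle via the Ambari REST API.
# URL, cluster, host, and credentials below are placeholders - replace them.
AMBARI_URL="http://ambari.example.com:8080"
CLUSTER="mycluster"
HOST="worker01.example.com"
AUTH="admin:admin"

# Setting a host component's state to INSTALLED stops it; STARTED starts it.
stop_body='{"RequestInfo":{"context":"Stop components for OS patching"},"Body":{"HostRoles":{"state":"INSTALLED"}}}'
start_body='{"RequestInfo":{"context":"Start components after patching"},"Body":{"HostRoles":{"state":"STARTED"}}}'

# Uncomment to run against a live Ambari server:
# curl -u "$AUTH" -H 'X-Requested-By: ambari' -X PUT \
#   -d "$stop_body" "$AMBARI_URL/api/v1/clusters/$CLUSTER/hosts/$HOST/host_components"
# ... patch the OS and reboot the host here ...
# curl -u "$AUTH" -H 'X-Requested-By: ambari' -X PUT \
#   -d "$start_body" "$AMBARI_URL/api/v1/clusters/$CLUSTER/hosts/$HOST/host_components"

echo "$stop_body"
echo "$start_body"
```

The `X-Requested-By` header is required by Ambari for modifying requests; the same PUT pattern loops over each host in turn.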
11-18-2015
05:49 PM
You mean DDL? Yeah, agreed. But the OP is asking to "load it into an existing Hive table" - so just an insert.
11-18-2015
05:44 PM
Great, looking forward to hearing the results.