Member since: 09-24-2015
Posts: 178
Kudos Received: 113
Solutions: 28
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 3498 | 05-25-2016 02:39 AM |
| | 3709 | 05-03-2016 01:27 PM |
| | 871 | 04-26-2016 07:59 PM |
| | 14720 | 03-24-2016 04:10 PM |
| | 2173 | 02-02-2016 11:50 PM |
12-04-2015
01:09 AM
1 Kudo
Here is how I would do it. If I am missing any requirement, please feel free to add more details (without revealing any secret sauce of your logic 😉).

Assumptions first:
Job 1 - Executes Step A and Step B at 00:01 every morning (two-step job).
Job 2 - Executes Step B every hour between 01:01 and 23:01 throughout the day (single-step job).

Note: The timings can obviously be adjusted, but the assumption here is that the execution time of the two-step job is fixed and is mutually exclusive with the other 23 executions of the single-step job. These two steps could be any action supported by Oozie, like Hive, Pig, Email, SSH etc. So the workflow definitions will have a duplicate Step B action in both jobs.

Coordinator definitions - The exact time of execution and the frequency can be controlled by specifying the values of validity and frequency.

For Job 1:
Validity = 00:00 hours of the day when you want the job to start executing.
Frequency = ${coord:days(int n)}, e.g. ${coord:days(1)} for once a day.
See section 4.4.1, "The coord:days(int n) and coord:endOfDays(int n) EL functions", at http://oozie.apache.org/docs/4.2.0/CoordinatorFunctionalSpec.html

For Job 2:
Validity = 01:00 hours of the same day as Job 1.
Frequency = frequency="1 1-23 * * *"
Note that instead of a fixed frequency we are using cron-type syntax, which is super cool. The minute field is set to 1 so that the job fires once per hour at one minute past; a * in that position would fire it every minute. See section 4.4.3, "Cron syntax in coordinator frequency", at http://oozie.apache.org/docs/4.2.0/CoordinatorFunctionalSpec.html

Hope this helps.
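To make the two coordinator definitions a bit more concrete, here is a rough Python sketch that generates the XML for both jobs. The job names, dates, and app paths are made up for illustration, and the `uri:oozie:coordinator:0.4` schema version is an assumption; check what your Oozie version expects. The start of Job 1 is set to 00:01 on the assumption that a fixed-frequency coordinator fires its first action at its start time.

```python
import xml.etree.ElementTree as ET

# Assumed schema version; adjust to whatever your Oozie release supports.
OOZIE_NS = "uri:oozie:coordinator:0.4"


def build_coordinator(name, frequency, start, end, app_path):
    """Return a minimal coordinator-app definition as an XML string."""
    root = ET.Element("coordinator-app", {
        "name": name,
        "frequency": frequency,
        "start": start,
        "end": end,
        "timezone": "UTC",
        "xmlns": OOZIE_NS,
    })
    action = ET.SubElement(root, "action")
    workflow = ET.SubElement(action, "workflow")
    ET.SubElement(workflow, "app-path").text = app_path
    return ET.tostring(root, encoding="unicode")


# Job 1: Step A + Step B once a day (hypothetical names, dates and paths).
job1_xml = build_coordinator(
    "job1-step-a-and-b", "${coord:days(1)}",
    "2015-12-05T00:01Z", "2016-12-05T00:01Z",
    "${nameNode}/apps/job1",
)

# Job 2: Step B only, at minute 1 of hours 1 through 23 (cron syntax).
job2_xml = build_coordinator(
    "job2-step-b-only", "1 1-23 * * *",
    "2015-12-05T01:00Z", "2016-12-05T01:00Z",
    "${nameNode}/apps/job2",
)

print(job1_xml)
print(job2_xml)
```

Each string can be saved as a coordinator.xml and submitted as its own Oozie job, with the Step B action duplicated in both workflow definitions.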
12-03-2015
09:45 PM
Without spending too much time, it appears to be a defect to me. @Balu any thoughts?
12-03-2015
09:13 PM
2 Kudos
Hi Ravi, As you know, the property dfs.datanode.handler.count defines the number of server threads for the datanode, so it is a datanode-level property. In other words, its value is driven more by the I/O requests to the datanode than by the size of the cluster. So, hypothetically speaking, if you have a cluster (large or small) used for an online archiving use case, where the data is not read very often, you do not need a large number of parallel threads. As the traffic / I/O goes up, there may be a benefit in increasing the number of parallel threads in the datanode. Here is the code that uses this property. If there is a way to isolate the heavy workers from the light workers, then you can create Ambari configuration groups to set different values of this property for each group.
12-03-2015
08:57 PM
Moved the question to "Governance and Lifecycle" track.
12-03-2015
08:53 PM
Ambari can only manage the principals and keytabs for the services managed by it. The principals and keytabs are actually provided as part of the configuration files with the stack definition.
For example, for Storm, looking at the stack, you can see -
    ....
    "name": "storm_components",
    "principal": {
      "value": "${storm-env/storm_user}-${cluster_name}@${realm}",
      "type": "user",
      "configuration": "storm-env/storm_principal_name"
    },
    "keytab": {
      "file": "${keytab_dir}/storm.headless.keytab",
      "owner": {
        "name": "${storm-env/storm_user}",
        "access": "r"
      },
    ....
Ambari does not support managing principals and keytabs of other components that are outside its purview.
12-03-2015
08:10 PM
1 Kudo
There is no clean way to do this within the same Oozie job. If the time when Step A and Step B have to be executed together is fixed, then IMHO it would be a better approach to set up two different Oozie jobs: one with both steps that runs once a day, and the other with Step B only that runs 23 times.
12-02-2015
09:32 PM
I've seen very little ext3 and mostly ext4 for on-prem deployments. AWS EBS is xfs by default. XFS has its advantages, but in a JBOD setup it doesn't really provide a lot of benefits.
12-01-2015
01:12 AM
4 Kudos
This post by Lester Martin sums it up really well - https://martin.atlassian.net/wiki/pages/viewpage.action?pageId=36044812

Here is a summary:
Since most OS patching / upgrades require a reboot, it is best to schedule such an activity around a scheduled outage. It is also recommended to go through the exercise in a lower-level environment before applying changes in a PROD environment.
To apply the changes while the cluster is up, the patch/upgrade has to be applied in a rolling manner: first stop the components on the host from Ambari, then apply the changes, reboot the host, and then start the Hadoop services from Ambari. Repeat for each host. This process will have to be scripted for a large cluster. The stop and start steps can be performed using the Ambari APIs.
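For the scripting part, here is a minimal Python sketch of the stop/start calls, assuming the Ambari REST endpoint `/api/v1/clusters/<cluster>/hosts/<host>/host_components` with the `INSTALLED` (stopped) and `STARTED` states. The Ambari URL, cluster name, host name, and credentials below are placeholders, and the function only builds the request so you can inspect it before sending anything.

```python
import base64
import json
import urllib.request


def host_components_request(ambari_url, cluster, host, state, user, password):
    """Build (but do not send) a PUT that moves all components on a host to `state`.

    state="INSTALLED" asks Ambari to stop the components, "STARTED" to start them.
    """
    verb = "Stop" if state == "INSTALLED" else "Start"
    body = {
        "RequestInfo": {"context": f"{verb} all components on {host}"},
        "Body": {"HostRoles": {"state": state}},
    }
    url = f"{ambari_url}/api/v1/clusters/{cluster}/hosts/{host}/host_components"
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode(),
        method="PUT",
        # Ambari rejects modifying calls that lack the X-Requested-By header.
        headers={"X-Requested-By": "ambari", "Authorization": f"Basic {token}"},
    )


# Placeholder values; replace with your own Ambari server, cluster and host.
stop_req = host_components_request(
    "http://ambari.example.com:8080", "mycluster",
    "worker1.example.com", "INSTALLED", "admin", "admin",
)
start_req = host_components_request(
    "http://ambari.example.com:8080", "mycluster",
    "worker1.example.com", "STARTED", "admin", "admin",
)
print(stop_req.get_method(), stop_req.full_url)
```

Sending is then a matter of `urllib.request.urlopen(stop_req)`, patching and rebooting the host, and `urllib.request.urlopen(start_req)`. The calls are asynchronous on the Ambari side, so a real script should wait for the stop to finish before rebooting.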
11-18-2015
05:49 PM
You mean DDL? Yeah, agreed. But OP is asking "load it into an existing Hive table" - so just insert.
11-18-2015
05:44 PM
Great, looking forward to hearing the results.