12-09-2015
06:54 PM
2 Kudos
There are a couple of considerations to take into account when using NameNode HA with Falcon and Oozie. In all cases, you need to use the nameservice ID, rather than a specific NameNode host, when referring to the NameNode in the cluster XML. The nameservice ID can be found in hdfs-site.xml as the suffix of the dfs.ha.namenodes.[nameservice ID] property. For multi-cluster installs, you need to set up the NameNode HA nameservice details for all clusters in all clusters. For example, if you have two clusters, hdfs-site.xml on both cluster one and cluster two will have two nameservice IDs; likewise, for three clusters, all three clusters would have three nameservice IDs. A two-cluster implementation would look similar to the following:
<property>
<name>dfs.ha.namenodes.hacluster1</name>
<value>c1nn1,c1nn2</value>
</property>
<property>
<name>dfs.ha.namenodes.hacluster2</name>
<value>c2nn1,c2nn2</value>
</property>
Now, when you set up Falcon, provide both cluster definitions on both clusters.
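As a rough illustration, a Falcon cluster entity for hacluster1 would then point its HDFS write endpoint at the nameservice ID rather than a single NameNode host. The hostnames, ports, versions, and paths below are placeholders, not values from this thread:
<cluster name="hacluster1" colo="dc1" description="" xmlns="uri:falcon:cluster:0.1">
  <interfaces>
    <!-- placeholder endpoints; the key point is that the write endpoint uses
         the nameservice ID (hacluster1) so Falcon survives a NameNode failover -->
    <interface type="readonly" endpoint="hftp://c1nn1.example.com:50070" version="2.7.1"/>
    <interface type="write" endpoint="hdfs://hacluster1" version="2.7.1"/>
    <interface type="execute" endpoint="c1rm.example.com:8050" version="2.7.1"/>
    <interface type="workflow" endpoint="http://c1oozie.example.com:11000/oozie/" version="4.2.0"/>
    <interface type="messaging" endpoint="tcp://c1falcon.example.com:61616?daemon=true" version="5.1.6"/>
  </interfaces>
  <locations>
    <location name="staging" path="/apps/falcon/hacluster1/staging"/>
    <location name="temp" path="/tmp"/>
    <location name="working" path="/apps/falcon/hacluster1/working"/>
  </locations>
</cluster>
The equivalent entity for hacluster2 would use hdfs://hacluster2 in its write interface, and both entities would be submitted to Falcon on both clusters.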
11-26-2015
01:09 AM
For those upvoting this answer: this is the correct answer for increasing memory for mapper YARN containers, but it will not work in cases where Hive optimizes by creating a local task. What happens is that Hive first generates a hash table of values for the map-side join on a local node, then uploads it to HDFS for distribution to all mappers that need the fast lookup table. It's the local task that is the problem here, and the only way to fix this is to bail on the map-side join optimization, or to change HADOOP_HEAPSIZE at a global level through Ambari. Not elegant, but it is a workaround.
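For reference, a minimal sketch of those two workarounds (the heap value is only an example). Disabling the conversion for a single script skips the local hash-table task entirely, at the cost of a slower reduce-side join:
SET hive.auto.convert.join=false;
The global alternative is to set HADOOP_HEAPSIZE (e.g. 2048) in hadoop-env through Ambari, which gives every Hadoop client JVM on the node, including the Hive local task, a larger heap.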
11-26-2015
01:03 AM
@Guilherme Braccialli, that doesn't increase the memory allocation for the local task. It's a percentage threshold at which the job is automatically killed, and it's already at 90% by default, so at this point the only option is to increase the local memory allocation. I tested the "HADOOP_HEAPSIZE" option from Ambari, and it works, but it's global.
11-25-2015
06:23 PM
Doesn't seem to work. I did the following:
$ export HADOOP_OPTS="-Xmx1024m"
$ hive -f test.hql > results.txt
...
Starting to launch local task to process map join; maximum memory = 511180800 = 0.5111808GB
...
11-24-2015
10:27 PM
2 Kudos
Is there a way in HDP >= v2.2.4 to increase the local task memory? I'm aware of disabling/limiting map-only join sizes, but we want to increase the memory, not limit it. Depending on the environment, the memory allocation will shift, but it appears to be entirely at YARN and Hive's discretion: "Starting to launch local task to process map join; maximum memory = 255328256 => ~0.25 GB". I've looked at/tried the following:
- hive.mapred.local.mem
- hive.mapjoin.localtask.max.memory.usage - this is simply a percentage of the local heap; I want to increase the memory, not limit it.
- mapreduce.map.memory.mb - only effective for non-local tasks
I found documentation suggesting 'export HADOOP_HEAPSIZE="2048"' to change the default, but this applies to the NodeManager. Is there any way to configure this on a per-job basis?
EDIT: To avoid duplication, the info I'm referencing comes from here: https://support.pivotal.io/hc/en-us/articles/207750748-Unable-to-increase-hive-child-process-max-heap-when-attempting-hash-join
It sounds like a per-job solution is not currently available with this bug.
Labels:
- Apache Hive
10-28-2015
12:28 AM
1 Kudo
Ok, so it's a 1-to-1 mapping of the DistCp functionality that we currently choose to expose (I added the features for maxMaps and mapBandwidth 🙂). Incidentally, in HDP 2.3 the Falcon UI does not have a way to include mirror job parameters; you can do it with the traditional feed definitions.
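To make that concrete, a rough sketch of where those parameters land in a traditional feed definition (the feed name and values are placeholders, and the rest of the feed entity is elided):
<feed name="custFeed" xmlns="uri:falcon:feed:0.1">
  ...
  <properties>
    <!-- cap the replication DistCp at 8 mappers and ~100 MB/s per mapper -->
    <property name="maxMaps" value="8"/>
    <property name="mapBandwidth" value="100"/>
  </properties>
</feed>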
10-28-2015
12:15 AM
3 Kudos
First, get rid of the hashtag in your path ("#startday"), assuming that's not a typo. The folder name examples you're referring to are actually showing sample token replacement patterns. For example, this:
<location type="data" path="/user/falcon/retentiondata/startday=${YEAR}-${MONTH}-${DAY}"/>
will resolve to something like this:
/user/falcon/retentiondata/startday=2015-10-27
for a daily feed that begins on 10/27 and runs. The next day's "instance" (using Falcon terms) would resolve to:
/user/falcon/retentiondata/startday=2015-10-28
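A minimal daily feed declaring that location might look roughly like the following; the feed name, cluster name, validity window, and retention are placeholders, and the ACL and schema elements are omitted for brevity:
<feed name="retentionFeed" xmlns="uri:falcon:feed:0.1">
  <frequency>days(1)</frequency>
  <timezone>UTC</timezone>
  <clusters>
    <cluster name="primaryCluster" type="source">
      <validity start="2015-10-27T00:00Z" end="2016-10-27T00:00Z"/>
      <retention limit="days(30)" action="delete"/>
    </cluster>
  </clusters>
  <locations>
    <!-- ${YEAR}, ${MONTH}, ${DAY} are substituted per instance, as shown above -->
    <location type="data" path="/user/falcon/retentiondata/startday=${YEAR}-${MONTH}-${DAY}"/>
  </locations>
</feed>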
10-27-2015
04:58 PM
1 Kudo
Do we have a detailed technical write-up on Falcon mirroring? It uses DistCp under the hood, and I can only assume it uses the -update option, but are there any exceptions to how precisely it follows the DistCp docs/functionality? I'm mostly concerned with partially-completed jobs that might have tmp files hanging around when the copy kicks off. I have a use case where the user would like to use mirroring to replicate 1..n feeds within a directory instead of setting up fine-grained feed replication, e.g.:
mirror job 1 = /data/cust/cust1
- /feed-1
- /feed-n
mirror job 2 = /data/cust/cust2
- /feed-1
- /feed-n
Any info is appreciated.
Labels:
- Apache Falcon
10-27-2015
03:01 PM
@Anderw Ahn, @Balu I have an additional question/point to Mayank's question about cluster layout. I understand DR as definitely requiring Oozie to be configured in both locations, because DistCp will run on the destination cluster and Hive replication will run on the source cluster. Isn't it also valid that a minimal Falcon install could be achieved by *only* setting up Falcon on the primary/source cluster? In this way, you define two clusters (primary, backup) and then simply schedule feeds and processes to run on the appropriate cluster; Falcon can schedule the job to run on Oozie either locally or remotely. Please confirm.
TL;DR - a single Falcon install can control two clusters but requires Oozie installed on both clusters.
09-24-2015
04:58 PM
3 Kudos
DB – MySQL worked great for an install at a large customer. There is some work to swap out the default after Ambari has already been configured; see the following KB article for more details: Moving Oozie to MySQL with Ambari. I haven't set up HA for Oozie, but I believe @dstreever@hortonworks.com was recently working on this; you'll need ZooKeeper for HA. We had over 1,000 bundles/coordinators/workflows running without any noticeable performance impact using default memory settings.
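For reference, pointing Oozie at MySQL ultimately comes down to the JPA properties in oozie-site (the hostname and password below are placeholders), plus putting the MySQL JDBC driver on Oozie's classpath:
oozie.service.JPAService.jdbc.driver=com.mysql.jdbc.Driver
oozie.service.JPAService.jdbc.url=jdbc:mysql://db-host.example.com:3306/oozie
oozie.service.JPAService.jdbc.username=oozie
oozie.service.JPAService.jdbc.password=<oozie-db-password>
When Ambari manages the cluster, these are set through the Oozie service configs rather than by editing oozie-site.xml directly.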