Member since: 09-24-2015
Posts: 22
Kudos Received: 31
Solutions: 6
My Accepted Solutions
Views | Posted
---|---
1680 | 05-31-2017 02:18 PM
2043 | 06-09-2016 08:19 PM
3221 | 02-03-2016 08:37 PM
15348 | 02-03-2016 08:26 PM
2351 | 12-09-2015 06:54 PM
12-09-2015
06:54 PM
2 Kudos
There are a couple of considerations to take into account when using NameNode HA with Falcon and Oozie. In all cases, you need to use the NameNode nameservice ID when referring to the NameNode in the cluster XML. This value can be found in hdfs-site.xml in the property dfs.ha.namenodes.[nameservice ID]. For multi-cluster installs, you need to set up every cluster's NameNode HA nameservice ID details in all clusters. For example, if you have two clusters, hdfs-site.xml on both cluster one and cluster two will contain both nameservice IDs. Likewise, for three clusters, all three clusters would contain all three nameservice IDs. A two-cluster implementation would look similar to the following:
<property>
<name>dfs.ha.namenodes.hacluster1</name>
<value>c1nn1,c1nn2</value>
</property>
<property>
<name>dfs.ha.namenodes.hacluster2</name>
<value>c2nn1,c2nn2</value>
</property>
Now, when you set up Falcon, provide both cluster definitions on both clusters.
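As a hedged illustration (the entity name, colo, hostnames, and versions below are placeholders, not from the original post), a Falcon cluster entity for hacluster1 would point its HDFS write endpoint at the nameservice ID rather than a single NameNode host:port:
<cluster name="cluster1" description="" colo="dc1" xmlns="uri:falcon:cluster:0.1">
  <interfaces>
    <!-- readonly endpoint shown with a single-host placeholder; adjust for your environment -->
    <interface type="readonly" endpoint="hftp://c1nn1-host:50070" version="2.7.1"/>
    <!-- with NameNode HA, the write endpoint uses hdfs://<nameservice ID>, not a NameNode host:port -->
    <interface type="write" endpoint="hdfs://hacluster1" version="2.7.1"/>
    <interface type="execute" endpoint="c1rm-host:8050" version="2.7.1"/>
    <interface type="workflow" endpoint="http://c1oozie-host:11000/oozie/" version="4.2.0"/>
    <interface type="messaging" endpoint="tcp://c1falcon-host:61616?daemon=true" version="5.1.6"/>
  </interfaces>
  <locations>
    <location name="staging" path="/apps/falcon/hacluster1/staging"/>
    <location name="temp" path="/tmp"/>
    <location name="working" path="/apps/falcon/hacluster1/working"/>
  </locations>
</cluster>
Define an equivalent entity for hacluster2 and submit both definitions to Falcon on both clusters.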
11-26-2015
01:09 AM
For those upvoting this answer: this is the correct answer for increasing memory for mapper YARN containers, but it will not work in cases where Hive optimizes by creating a local task. What happens is that Hive first generates a hash table of values for the map-side join on a local node, then uploads it to HDFS for distribution to all mappers that need the fast lookup table. It's the local task that is the problem here, and the only way to fix this is to bail on the map-side join optimization, or to change HADOOP_HEAPSIZE globally through Ambari. Not elegant, but it is a workaround.
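A minimal sketch of the first workaround (hive.auto.convert.join is the standard Hive switch for this optimization; whether you set it per session or in hive-site.xml is up to you):
-- disable automatic conversion to map-side joins so Hive never launches the local hash-table task
set hive.auto.convert.join=false;
-- the join then runs as a regular reduce-side join inside normal YARN containers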
11-26-2015
01:03 AM
@Guilherme Braccialli, that doesn't increase the memory allocation for the local task. It's a percentage threshold before the job is automatically killed, and it's already at 90% by default, so at this point the only option is to increase the local memory allocation. I tested the "HADOOP_HEAPSIZE" option from Ambari, and it works, but it's global.
11-25-2015
06:23 PM
Doesn't seem to work. Did the following:
$ export HADOOP_OPTS="-Xmx1024m"
$ hive -f test.hql > results.txt
...
Starting to launch local task to process map join;maximum memory = 511180800 = 0.5111808GB
...
11-24-2015
10:27 PM
2 Kudos
Is there a way in HDP >= v2.2.4 to increase the local task memory? I'm aware of disabling/limiting map-only join sizes, but we want to increase it, not limit it. Depending on the environment, the memory allocation will shift, but it appears to be entirely at YARN and Hive's discretion:
"Starting to launch local task to process map join;maximum memory = 255328256 => ~ 0.25 GB"
I've looked at/tried:
- hive.mapred.local.mem
- hive.mapjoin.localtask.max.memory.usage - this is simply a percentage of the local heap. I want to increase the memory, not limit it.
- mapreduce.map.memory.mb - only effective for non-local tasks
I found documentation suggesting 'export HADOOP_HEAPSIZE="2048"' to change the default, but this applied to the NodeManager. Is there any way to configure this on a per-job basis?
EDIT: To avoid duplication, the info I'm referencing comes from here: https://support.pivotal.io/hc/en-us/articles/207750748-Unable-to-increase-hive-child-process-max-heap-when-attempting-hash-join
Sounds like a per-job solution is not currently available with this bug.
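For reference, a sketch of the global workaround that other replies in this thread confirm (exactly where Ambari exposes HADOOP_HEAPSIZE varies by stack version, and 2048 MB is only an example value):
# set globally through Ambari's hadoop-env/hive-env template, then restart the affected services
export HADOOP_HEAPSIZE=2048
# the local-task banner ("maximum memory = ...") should then report a correspondingly larger heap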
Labels:
- Apache Hive
10-28-2015
12:28 AM
1 Kudo
Ok, so it's a 1-to-1 mapping of the DistCp functionality that we currently choose to expose (I added the maxMaps and mapBandwidth features 🙂). Incidentally, in HDP 2.3 the Falcon UI does not have a way to include mirror job parameters; you can do it with the traditional feed definitions.
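As a hedged sketch of passing those parameters through a traditional feed definition (the feed name, cluster names, paths, dates, and values are placeholders; maxMaps and mapBandwidth are the feed properties referred to above):
<feed name="custDataReplication" description="replicate cust1 data" xmlns="uri:falcon:feed:0.1">
  <frequency>days(1)</frequency>
  <clusters>
    <cluster name="primaryCluster" type="source">
      <validity start="2015-10-01T00:00Z" end="2016-10-01T00:00Z"/>
      <retention limit="days(30)" action="delete"/>
    </cluster>
    <cluster name="backupCluster" type="target">
      <validity start="2015-10-01T00:00Z" end="2016-10-01T00:00Z"/>
      <retention limit="days(30)" action="delete"/>
    </cluster>
  </clusters>
  <locations>
    <location type="data" path="/data/cust/cust1/${YEAR}-${MONTH}-${DAY}"/>
  </locations>
  <ACL owner="falcon" group="hadoop" permission="0755"/>
  <schema location="/none" provider="none"/>
  <properties>
    <!-- DistCp tuning knobs exposed by Falcon replication -->
    <property name="maxMaps" value="8"/>
    <property name="mapBandwidth" value="100"/>
  </properties>
</feed>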
10-28-2015
12:15 AM
3 Kudos
First, get rid of the hashtag in your path ("#startday"), assuming that's not a typo. The folder name examples you're referring to are actually showing sample token replacement patterns. For example, this:
<location type="data" path="/user/falcon/retentiondata/startday=${YEAR}-${MONTH}-${DAY}"/>
will resolve to something like this:
/user/falcon/retentiondata/startday=2015-10-27
for a daily feed that begins running on 10/27. The next day's "instance" (using Falcon terms) would resolve to:
/user/falcon/retentiondata/startday=2015-10-28
10-27-2015
04:58 PM
1 Kudo
Do we have a detailed technical write-up on Falcon mirroring? It uses DistCp under the hood, and I can only assume it uses the -update option, but are there any exceptions to how precisely it follows the DistCp docs/functionality? I'm mostly concerned with partially completed jobs that might have tmp files hanging around when the copy kicks off. I have a use case where the user would like to use mirroring to replicate 1..n feeds within a directory instead of setting up fine-grained feed replication, e.g.:
mirror job 1 = /data/cust/cust1
- /feed-1
- /feed-n
mirror job 2 = /data/cust/cust2
- /feed-1
- /feed-n
Any info is appreciated.
Labels:
- Apache Falcon
10-27-2015
03:01 PM
@Anderw Ahn, @Balu I have an additional question/point related to Mayank's question about cluster layout. I understand DR as definitely requiring Oozie to be configured in both locations, because distcp will run on the destination cluster and Hive replication will run on the source cluster. Isn't it also valid that a minimal Falcon install could be achieved by *only* setting up Falcon on the primary/source cluster? In this way, you define 2 clusters (primary, backup) and then simply schedule feeds and processes to run on the appropriate cluster; Falcon can schedule the job to run on Oozie either locally or remotely. Please confirm. TL;DR - a single Falcon install can control 2 clusters but requires Oozie installed on both clusters.
09-24-2015
04:58 PM
3 Kudos
DB – MySQL worked great for an install at a large customer. There is some work involved in swapping out the default database after Ambari has already been configured; see the following KB article for more details: Moving Oozie to MySQL with Ambari.
I haven’t set up HA for Oozie, but I believe @dstreever@hortonworks.com was recently working on this. You’ll need ZooKeeper for HA. We had over 1,000 various bundles/coordinators/workflows running without any noticeable performance impact using default memory settings.
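As a rough sketch of what the database swap involves (these are the standard Oozie JPA settings; the host, database name, and credentials below are placeholders):
<property>
  <name>oozie.service.JPAService.jdbc.driver</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>oozie.service.JPAService.jdbc.url</name>
  <value>jdbc:mysql://db-host:3306/oozie</value>
</property>
<property>
  <name>oozie.service.JPAService.jdbc.username</name>
  <value>oozie</value>
</property>
<property>
  <name>oozie.service.JPAService.jdbc.password</name>
  <value>oozie-password</value>
</property>
You'll also need the MySQL JDBC driver available to the Oozie server and a pre-created oozie database and user; see the KB article above for the Ambari-specific steps.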