Member since: 03-04-2015
Posts: 96
Kudos Received: 12
Solutions: 1

My Accepted Solutions
Title | Views | Posted
---|---|---
| 7664 | 01-04-2017 02:33 PM
| 14840 | 07-17-2015 03:11 PM
05-16-2016
03:56 PM
Running CDH 5.6. I thought setting up email alerting would be a straightforward process, but was stymied by the CMS config at "Alerts: Mail Server Username" and "Alerts: Mail Server Password". If the Alert Publisher is using standard SMTP, they shouldn't be required, no? sendmail is running on every node, and manually running mailx at the command line works. Why can't CMS send mail as the process owner? Hortonworks/Ambari does not have this issue. Thanks, Miles
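For reference, what I expected to be sufficient is something along these lines (the hostname setting name is from memory and may not be exact; values are placeholders):

Alerts: Mail Server Hostname = localhost
Alerts: Mail Server Username = (left blank)
Alerts: Mail Server Password = (left blank)

i.e. just point the Alert Publisher at the local sendmail and leave the credentials empty.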
05-13-2016
09:15 PM
Hi Jim! 🙂 Our project is still around and getting bigger. We are using both Cloudera and Hortonworks and building more dataflows. With increased complexity, we are finding Ambari more and more inadequate compared to Cloudera's full-featured commercial counterpart, Cloudera Manager. For Flume, there are only six metrics, four basic config attributes, and one big textbox for pasting in the config file. I have to hand-edit flume-env.sh to change the agent heap allocation. (With apologies to our hosts) While Hortonworks offers a goodie bag of the latest Apache applications, the primitive state of the management console is a deal-breaker. If Ambari cannot be improved soon, I strongly recommend you consider Cloudera (we are using the free version).
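For anyone hitting the same limitation: the hand-edit I mean is just adjusting the agent JVM options in flume-env.sh, e.g. (heap sizes here are only illustrative, not a recommendation):

export JAVA_OPTS="-Xms100m -Xmx2000m"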
04-07-2016
11:18 AM
Related topic: Jobs fail in Yarn with out of Java heap memory error ... where your colleague bcwalrus said, "That [yarn.nodemanager.vmem-check-enabled] shouldn't matter though. You said that the job died due to OOME. It didn't die because it got killed by NM." Is that what happened here, too? And what's the reason to set mapreduce.*.java.opts.max.heap in addition to mapreduce.*.memory.mb? Wouldn't it just introduce more potential conflict without much benefit? Thanks, Miles
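For context, the pattern I am asking about looks like this (values are illustrative only, not our actual settings):

mapreduce.map.memory.mb = 2048            # YARN container size for a map task
mapreduce.map.java.opts.max.heap = 1638   # JVM heap ceiling, roughly 80% of the container
mapreduce.reduce.memory.mb = 2048
mapreduce.reduce.java.opts.max.heap = 1638

My understanding is that the heap has to fit comfortably inside the container, since the NodeManager enforces memory.mb while the JVM enforces the heap - hence my question about whether maintaining both just creates room for them to disagree.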
04-07-2016
11:04 AM
Thanks for the clarification. As others pointed out, CM doesn't list yarn.nodemanager.vmem-check-enabled as a configurable parameter, but seems to default it to false (I can see it in my Oozie action job metadata). But then, does this mean the error "Diagnostics report from attempt_1459358870111_0185_m_000054_3: Container [pid=18971,containerID=container_1459358870111_0185_01_000210] is running beyond physical memory limits. Current usage: 1.0 GB of 1 GB physical memory used; 2.7 GB of 2.1 GB virtual memory used. Killing container." was triggered not by virtual memory overage but by physical memory? Which parameter should I tune? As an experiment, I am setting mapreduce.map.memory.mb = 3000 manually in the failing Hive2 action. It runs slowly but seems to work. Job counters show max physical usage per task at ~2.4GB, committed heap at 2.3GB, and virtual memory at 4.3GB. Reducer consumption trails the mappers by varying amounts. Do you have a better suggestion? Thanks again, Miles
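For completeness, the manual override is just a session setting at the top of the Hive script invoked by the Hive2 action, along these lines (exact placement is our convention, not a recommendation):

set mapreduce.map.memory.mb=3000;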
04-06-2016
10:43 PM
First, the attribute name looks like a typo - you mean yarn.nodemanager.vmem-check-enabled, no? Second, your recommendation contradicts the specific advice given in your own 2014 engineering blog, Apache Hadoop YARN: Avoiding 6 Time-Consuming "Gotchas". If that is no longer valid, please mark the article accordingly. Thanks, Miles
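To be explicit, the setting I am referring to would be applied roughly as follows, presumably via a yarn-site.xml configuration snippet (safety valve) in CM - I have not confirmed the exact path:

yarn.nodemanager.vmem-check-enabled = false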
03-01-2016
07:19 PM
Same issue with Hive action. It is frustrating and inefficient that such a common use case cannot be accommodated.
02-11-2016
02:31 PM
Exporting the variables in flume-env.sh seems to make them visible, and solves my immediate problem. Also, passing the variables in JAVA_OPTS ("... -Dvar=x") doesn't seem to make a difference in this case (but it is required if you want to use them in log4j.properties):

export ROLE=`<some script or command>`
export RACK=`<some script or command>`
...
JAVA_OPTS="-Xms60m -Xmx360m -Dhostname=${HOSTNAME} "

My larger questions still stand, though, and I'd welcome any comments.
02-11-2016
01:57 PM
Hi - I've checked the docs, read the O'Reilly book, Googled, and searched this forum, but did not find much that helps with what looks like it should be a common Flume use case: I want to ingest the log files of a distributed application that runs on multiple hosts. They behave like typical Unix or web server logs - they live in fixed directories and roll infrequently. I cannot modify the application or the log files themselves - the ingestion has to be totally non-invasive. So far so good:

1. The current Flume documentation recommends Spooling Directory Source over Exec Source for tailing logs, yet does not explain how to do that in a streaming fashion without modifying the source file. Spooling Directory Source requires that updates to the source file be completed and the file closed - it's batch- rather than stream-oriented - so we can't use it for typical actively-updated log files.

2. Exec Source should work, except that these log files don't have constant file names - e.g. app-<role>-<rack>.log. The log directory is NFS-mounted and can be shared by multiple hosts, so it can contain:

app-role1-rack1.log
app-role2-rack1.log
app-role1-rack2.log
app-role2-rack2.log
....

Hadoop itself has such examples:

$ ls -l /var/log/hbase/
total 40836
-rw-r--r-- 1 hbase hbase   139032 Nov 20 17:40 hbase-cmf-hbase-HBASERESTSERVER-hou76072.log.out
-rw-r--r-- 1 hbase hbase 27859661 Feb 11 15:20 hbase-cmf-hbase-REGIONSERVER-hou76072.log.out

I would like to configure these values in flume-env.sh and pass them into the agent config file, or have the command itself call some script to derive them dynamically. Exec Source has a shell option that seems to support this:

"The ‘shell’ config is used to invoke the ‘command’ through a command shell (such as Bash or Powershell). The ‘command’ is passed as an argument to ‘shell’ for execution. This allows the ‘command’ to use features from the shell such as wildcards, back ticks, pipes, loops, conditionals etc. In the absence of the ‘shell’ config, the ‘command’ will be invoked directly. Common values for ‘shell’: ‘/bin/sh -c’, ‘/bin/ksh -c’, ‘cmd /c’, ‘powershell -Command’, etc."

a1.sources.tailsource-1.type = exec
a1.sources.tailsource-1.shell = /bin/bash -c
a1.sources.tailsource-1.command = for i in /path/*.txt; do cat $i; done

However, I could not get it to work:

(flume-env.sh)
ROLE=`<some script or command>`
RACK=`<some script or command>`
...
JAVA_OPTS="-Xms60m -Xmx360m -Drole=${ROLE} -Drack=${RACK} -Dhostname=${HOSTNAME} "

(flume_spool_avro_agent.conf)
....
spool_avro_agent.sources.s2.channels = c1
spool_avro_agent.sources.s2.type = exec
spool_avro_agent.sources.s2.shell = /bin/bash -c
spool_avro_agent.sources.s2.command = tail -F /var/log/app-${role}-${rack}.log
....

I verified that $JAVA_OPTS is correct, but the values don't seem to be passed to the command line:

2016-02-11 15:46:14,175 |INFO |org.apache.flume.source.ExecSource$StderrReader | |- StderrLogger[1] = '+ tail -F /var/log/app--.log'
2016-02-11 15:46:14,176 |INFO |org.apache.flume.source.ExecSource$StderrReader | |- StderrLogger[2] = 'tail: cannot open `/var/log/app--.log' for reading: No such file or directory'

So, what exactly is the runtime environment for this Exec Source shell? What kind of constraints does it have (compared to, say, ssh)? Any insights from the trenches would be appreciated. Thanks!
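P.S. To make the "derive dynamically" option concrete, what I have in mind is roughly the following, where get_role.sh and get_rack.sh are hypothetical wrapper scripts of ours that bash would run via command substitution:

spool_avro_agent.sources.s2.command = tail -F /var/log/app-$(./get_role.sh)-$(./get_rack.sh).log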
Labels:
- Apache Flume
- Apache Hadoop
- Apache HBase
12-18-2015
09:15 AM
2 Kudos
I have the same concern. Cloudera is being used to build enterprise applications in large organizations. Typically these customers have well-established software development processes and policies, one of which is source control and build/release versioning. Oozie workflows are major software artifacts and, as such, need to be tracked in standard source control systems and managed by automated build/deployment tools. Instead of reinventing the wheel, it would be more useful for Hue to provide integration with industry-standard tools like SVN, Git, Ant and Maven - in a way becoming the Eclipse of Hadoop (ok, perhaps not the best analogy).

Another related point is workflow ownership and access. When a workflow is developed, it is owned by the individual developer. When we build complex workflows or bundles, we may need to incorporate workflows from multiple developers - suggesting group-level access. When development completes, the workflows are checked into the repository. Then, when the deployment tool checks out the latest build and installs the workflows into a UAT or production environment, the owner would be the production application account. And in case of problems, we need a way to roll back. Can Hue support this process?
08-27-2015
09:17 AM
I retried the LCE configs after upgrading to CDH 5.4.4 - it still doesn't work. Both Hive-on-Hue and Oozie failed in MR. One of the nodes has its usercache directories reconfigured:

drwxr-s--- 4 myao yarn 37 Mar 20 11:30 /s0/yarn/nm/usercache/myao
drwxr-s--- 4 myao yarn 37 Mar 20 11:30 /s1/yarn/nm/usercache/myao
drwxr-s--- 4 myao yarn 37 Mar 20 11:30 /s2/yarn/nm/usercache/myao

So I will stick with the $HADOOP_USER_NAME workaround for now. Cheers, Miles
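For anyone searching later: by the $HADOOP_USER_NAME workaround I mean just setting the variable in whatever environment launches the job, along these lines (<target user> is a placeholder; as far as I know this only takes effect with simple authentication, so it is a stopgap rather than a fix):

export HADOOP_USER_NAME=<target user>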