Created 01-28-2016 05:26 PM
We have set up a Falcon process that reads data from an HDFS location and writes the output, via a Pig process, into another HDFS location. The feeds and processes are running in the cluster, but I cannot see any output being generated.
My XML for the process is as below:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<process name="demo1Process" xmlns="uri:falcon:process:0.1">
    <tags>processName=demo1Process</tags>
    <clusters>
        <cluster name="Atlas-Demo1">
            <validity start="2016-01-28T20:51Z" end="2017-02-02T20:51Z"/>
        </cluster>
    </clusters>
    <parallel>2</parallel>
    <order>FIFO</order>
    <frequency>minutes(5)</frequency>
    <timezone>GMT+05:50</timezone>
    <inputs>
        <input name="inputfeed" feed="demo1Feed" start="yesterday(0,0)" end="today(-1,0)"/>
    </inputs>
    <outputs>
        <output name="outoutfeed" feed="demo1OutputFeed" instance="yesterday(0,0)"/>
    </outputs>
    <workflow name="select_airlines_data" version="pig-0.12.0" engine="pig" path="/falcon/demo1/code/demo1.pig"/>
    <retry policy="exp-backoff" delay="minutes(3)" attempts="2"/>
    <ACL owner="falcon" group="falcon" permission="0755"/>
</process>
My XML for the input feed is as below:
<feed xmlns='uri:falcon:feed:0.1' name='demo1InputFeed' description='demo1 input feed'>
    <tags>feed_name=demo1InputFeed</tags>
    <groups>input</groups>
    <frequency>minutes(1)</frequency>
    <timezone>GMT+05:50</timezone>
    <late-arrival cut-off='minutes(3)'/>
    <clusters>
        <cluster name='demo1cluster' type='source'>
            <validity start='2016-01-28T07:49Z' end='2017-02-01T07:49Z'/>
            <retention limit='days(2)' action='delete'/>
            <locations>
                <location type='data'> </location>
                <location type='stats'> </location>
                <location type='meta'> </location>
            </locations>
        </cluster>
    </clusters>
    <locations>
        <location type='data' path='/falcon/demo1/data/${YEAR}-${MONTH}'> </location>
        <location type='stats' path='/falcon/demo1/status'> </location>
        <location type='meta' path='/falcon/demo1/meta'> </location>
    </locations>
    <ACL owner='falcon' group='falcon' permission='0755'/>
    <schema location='none' provider='none'/>
    <properties>
        <property name='jobPriority' value='HIGH'> </property>
    </properties>
</feed>
My Input folder is (in HDFS)
/falcon/demo1/data/2016-01
Created 01-28-2016 07:02 PM
Nayan Paul: There are a couple of issues in your entity XMLs.
1> The granularity of the date pattern in the location path should be at least that of the feed's frequency.
2> yesterday(hours,minutes): As the name suggests, the yesterday EL expression picks up feed instances relative to the start of the previous day. The hours and minutes are added to 00:00 of yesterday. For example, yesterday(24,30) actually corresponds to 00:30 am of today, so for a nominal instance time of 2010-01-02T01:30Z it resolves to the 2010-01-02T00:30Z feed instance.
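A rough way to see this semantics (a simplified sketch for illustration, not Falcon's actual implementation):

```python
from datetime import datetime, timedelta

def el_yesterday(nominal, hours, minutes):
    """Rough model of the yesterday(h,m) EL expression: take midnight
    at the start of the previous day, then add the hour/minute offset.
    (Illustration only -- not Falcon's code.)"""
    midnight_yesterday = (nominal.replace(hour=0, minute=0, second=0, microsecond=0)
                          - timedelta(days=1))
    return midnight_yesterday + timedelta(hours=hours, minutes=minutes)

# For a process instance at 2010-01-02T01:30Z:
print(el_yesterday(datetime(2010, 1, 2, 1, 30), 24, 30))  # 2010-01-02 00:30:00
print(el_yesterday(datetime(2010, 1, 2, 1, 30), 0, 0))    # 2010-01-01 00:00:00
```

So yesterday(24,30) lands at 00:30 of "today", matching the example above.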
The input location path in your feed XML is /falcon/demo1/data/${YEAR}-${MONTH}, but the frequency is in minutes. Also, if you want to process a month's worth of data, please use the lastMonth or currentMonth EL expression.
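To illustrate the mismatch (a sketch of the pattern/frequency interaction, not Falcon internals): with a minutes(1) frequency but a month-granularity path pattern, every instance in the month resolves to the same directory, so Falcon cannot tell one instance's data from another's.

```python
from datetime import datetime, timedelta

# Pattern equivalent to /falcon/demo1/data/${YEAR}-${MONTH}
pattern = "/falcon/demo1/data/{0:%Y}-{0:%m}"
start = datetime(2016, 1, 28, 7, 49)

# Ten consecutive minutes(1) instances all map to one path:
paths = {pattern.format(start + timedelta(minutes=i)) for i in range(10)}
print(paths)  # {'/falcon/demo1/data/2016-01'} -- 10 instances, a single path
```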
Please refer to the EL expression doc for more details, and to the entity specification doc for entity details. Thanks!
Created 01-28-2016 05:42 PM
Perhaps there is an error in the Pig script? You will find the logs in the Oozie launcher action (the map task log) or in the Pig action that gets spawned off. Hue is convenient for getting at the logs, or you can go directly to the ResourceManager UI.
When a process is executed, you will see one job that is the launcher; it contains the parameters for the Pig script and any error returned by the pig command.
You will see a second job that is the actual Pig execution.
You should find the problem in one or the other.
If these jobs don't exist, you can also go to the Oozie UI and see why the actions are not being spawned off.
Created 01-28-2016 06:06 PM
If you run Pig Scripts manually, outside of Falcon, do you get an error?
Created 01-28-2016 06:40 PM
Thanks for the quick reply.
I just tested the Pig script, replacing $input and $output with actual HDFS paths, and the Pig job runs fine.
Also, my feed's input path is /falcon/demo1/data/${YEAR}-${MONTH}, whereas my actual HDFS path is /falcon/demo1/data/2016-01. Could this be a mismatch?
Created 01-28-2016 08:33 PM
The path looks good. ${YEAR} is replaced with the current year, and so on. However, what do you see when you look into the ResourceManager as described above?
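Roughly, the substitution works like this (a sketch for illustration, not Falcon's code; the resolve_path helper is hypothetical):

```python
from datetime import datetime

def resolve_path(pattern, instance):
    """Sketch of how ${YEAR}/${MONTH}/${DAY} variables in a feed
    location path are filled in from the feed instance time.
    (Illustration only -- not Falcon's implementation.)"""
    return (pattern
            .replace("${YEAR}", f"{instance.year:04d}")
            .replace("${MONTH}", f"{instance.month:02d}")
            .replace("${DAY}", f"{instance.day:02d}"))

print(resolve_path("/falcon/demo1/data/${YEAR}-${MONTH}",
                   datetime(2016, 1, 28)))
# -> /falcon/demo1/data/2016-01
```

So a January 2016 instance does resolve to /falcon/demo1/data/2016-01, matching the directory you created.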
Created 01-28-2016 06:47 PM
I would cross-check the following:
If you want data to be picked up for a particular process instance, the feed must be valid (read this as: the feed is expected to be populated) during that time, and the data must be in a directory that matches the expected pattern. Look at your Oozie coordinator actions for details on which HDFS paths are being waited on.
Created 02-01-2016 01:25 PM
Thanks for the help. I am able to run the Falcon process now.
Created 01-28-2016 07:10 PM
Can you hardcode the path to /falcon/demo1/data/2016-01?
Created 01-28-2016 07:49 PM
Can you look for any error codes/messages in the Oozie console (via Ambari), or provide the full stack trace, which usually contains output like: "Caused by: org.apache.falcon.FalconException:"? Also confirm the scripts and directories (absolute paths) are chmod'ed to 777, or at least 775.