
Falcon Process running but not processing Data in HDP2.3


We have set up a Falcon process that reads data from an HDFS location and saves the output, via a Pig script, into another HDFS location. The feeds and processes are running in the cluster, but I cannot see any output being generated.

My process XML is below:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<process name="demo1Process" xmlns="uri:falcon:process:0.1">
    <tags>processName=demo1Process</tags>
    <clusters>
        <cluster name="Atlas-Demo1">
            <validity start="2016-01-28T20:51Z" end="2017-02-02T20:51Z"/>
        </cluster>
    </clusters>
    <parallel>2</parallel>
    <order>FIFO</order>
    <frequency>minutes(5)</frequency>
    <timezone>GMT+05:50</timezone>
    <inputs>
        <input name="inputfeed" feed="demo1Feed" start="yesterday(0,0)" end="today(-1,0)"/>
    </inputs>
    <outputs>
        <output name="outoutfeed" feed="demo1OutputFeed" instance="yesterday(0,0)"/>
    </outputs>
    <workflow name="select_airlines_data" version="pig-0.12.0" engine="pig" path="/falcon/demo1/code/demo1.pig"/>
    <retry policy="exp-backoff" delay="minutes(3)" attempts="2"/>
    <ACL owner="falcon" group="falcon" permission="0755"/>
</process>

My input feed XML is below:

<feed xmlns='uri:falcon:feed:0.1' name='demo1InputFeed' description='demo1 input feed'>
    <tags>feed_name=demo1InputFeed</tags>
    <groups>input</groups>
    <frequency>minutes(1)</frequency>
    <timezone>GMT+05:50</timezone>
    <late-arrival cut-off='minutes(3)'/>
    <clusters>
        <cluster name='demo1cluster' type='source'>
            <validity start='2016-01-28T07:49Z' end='2017-02-01T07:49Z'/>
            <retention limit='days(2)' action='delete'/>
            <locations>
                <location type='data'> </location>
                <location type='stats'> </location>
                <location type='meta'> </location>
            </locations>
        </cluster>
    </clusters>
    <locations>
        <location type='data' path='/falcon/demo1/data/${YEAR}-${MONTH}'> </location>
        <location type='stats' path='/falcon/demo1/status'> </location>
        <location type='meta' path='/falcon/demo1/meta'> </location>
    </locations>
    <ACL owner='falcon' group='falcon' permission='0755'/>
    <schema location='none' provider='none'/>
    <properties>
        <property name='jobPriority' value='HIGH'> </property>
    </properties>
</feed>

My input folder in HDFS is:

/falcon/demo1/data/2016-01

1 ACCEPTED SOLUTION


Nayan Paul: There are a couple of issues in your entity XMLs.

1> The granularity of the date pattern in the location path should be at least that of the feed's frequency.

2> yesterday(hours,minutes): As the name suggests, the yesterday EL expression picks up feed instances relative to the start of the day yesterday. Hours and minutes are added to 00:00 of yesterday. Example: yesterday(24,30) actually corresponds to 00:30 of today, so for a process instance at 2010-01-02T01:30Z it resolves to the 2010-01-02T00:30Z feed.

The input location path in the feed XML is /falcon/demo1/data/${YEAR}-${MONTH}, but the frequency is in minutes. Also, if you want to process a month's worth of data, please use the lastMonth or currentMonth EL expression.
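For example, if the data really is organized per month under /falcon/demo1/data/${YEAR}-${MONTH}, a minimal sketch of an aligned feed (assuming monthly data; the other elements stay as in your feed above) would declare a monthly frequency so that the path granularity and the frequency match:

<feed xmlns='uri:falcon:feed:0.1' name='demo1InputFeed' description='demo1 input feed'>
    ...
    <frequency>months(1)</frequency>
    ...
    <locations>
        <location type='data' path='/falcon/demo1/data/${YEAR}-${MONTH}'/>
        ...
    </locations>
    ...
</feed>

Alternatively, keeping the minutes(1) frequency would require a finer-grained path such as /falcon/demo1/data/${YEAR}-${MONTH}-${DAY}/${HOUR}/${MINUTE}.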

Please refer to the EL expressions documentation for more details, and to the entity specification documentation for entity details. Thanks!


21 Replies

@Sowmya Ramesh

Thank you for the response. I am trying to recreate the feed XML as a monthly feed. Please find my XML below:

<feed xmlns='uri:falcon:feed:0.1' name='demo1InputFeed' description='demo1 input feed'>
    <tags>feed_name=demo1InputFeed</tags>
    <groups>input</groups>
    <frequency>months(1)</frequency>
    <timezone>GMT+05:50</timezone>
    <late-arrival cut-off='days(3)'/>
    <clusters>
        <cluster name='demo1cluster' type='source'>
            <validity start='2016-01-28T06:59Z' end='2017-02-01T06:59Z'/>
            <retention limit='months(2)' action='delete'/>
            <locations>
                <location type='data'> </location>
                <location type='stats'> </location>
                <location type='meta'> </location>
            </locations>
        </cluster>
    </clusters>
    <locations>
        <location type='data' path='/falcon/demo1/data/${YEAR}-${MONTH}'> </location>
        <location type='stats' path='/falcon/demo1/status'> </location>
        <location type='meta' path='/falcon/demo1/meta'> </location>
    </locations>
    <ACL owner='falcon' group='falcon' permission='0755'/>
    <schema location='none' provider='none'/>
    <properties>
        <property name='jobPriority' value='HIGH'> </property>
    </properties>
</feed>

However, every time I try to submit it I get the exception below:

Error: Feed demo1InputFeed's frequency: months(1), path pattern: FileSystemStorage{storageUrl='${nameNode}', locations=[org.apache.falcon.entity.v0.feed.Location@4b8f5469, org.apache.falcon.entity.v0.feed.Location@189f3fa7, org.apache.falcon.entity.v0.feed.Location@dfc9857]} does not match with group: input's frequency: minutes(1), date pattern: [${MONTH}, ${YEAR}] (FalconWebException:83)

Kindly let me know if you can help me in this.
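Looking at the error text, it is the <groups>input</groups> element that is being validated: Falcon expects every feed in a group to share the same frequency and date pattern, so a months(1) feed conflicts with a group whose existing feeds use minutes(1). A minimal sketch of one way around this, assuming the group membership is not otherwise needed, is simply to drop the element:

<feed xmlns='uri:falcon:feed:0.1' name='demo1InputFeed' description='demo1 input feed'>
    <tags>feed_name=demo1InputFeed</tags>
    <!-- <groups>input</groups> removed so the monthly frequency no longer conflicts with the group's minutes(1) feeds -->
    <frequency>months(1)</frequency>
    ...
</feed>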


Thanks for the responses.

I have created a new feed with a currentMonth interval. However, when I set up my process with this feed I get the following error. Any help would be greatly appreciated.

Error: Start instance currentMonth(0,0,0) of feed demo1FeedInputNew is before the start of feed Tue Dec 01 05:19:00 EST 2015 for cluster demo1Cluster-New (FalconWebException:83) 2016-01-29 07:48:56,050 ERROR - [1495236611@qtp-2044215423-59 - 2c35b9f5-049a-4636-bff1-4d7a951c6151:falcon:POST//entities/submit/process] ~ Action failed: Bad Request Error: default/org.apache.falcon.FalconWebException::org.apache.falcon.FalconException: Start instance currentMonth(0,0,0) of feed demo1FeedInputNew is before the start of feed Tue Dec 01 05:19:00 EST 2015 for cluster demo1Cluster-New

Super Collaborator

@Nayan Paul Can you share the process entity as well as the feed entity? From the exception, it looks like the EL value for the "start" attribute in the input element of the process entity is defined before the input feed entity's validity start datetime.

Master Mentor

@Nayan Paul Are you doing this in the Sandbox? Can you confirm you're not running into the issues described in this thread?


Nayan Paul: Can you provide the input feed and process entities used when the error was thrown? Falcon throws this error if the process's input feed start is before the input feed's validity start.

currentMonth(day,hour,minute): currentMonth refers to the start of the month of the process instance's start time. One thing to keep in mind is that day is added to the first day of the month, so the value of day is the number of days you want to add to the first of the month. For example, for an instance start time of 2010-01-12T01:30Z, currentMonth(3,2,40) corresponds to the feed instance at 2010-01-04T02:40Z, and currentMonth(0,0,0) means 2010-01-01T00:00Z.

It looks like currentMonth is evaluating to a date that is before the input feed's validity start.
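Since the error mentions a feed start of Tue Dec 01 05:19:00 EST 2015, one sketch of a fix (assuming the rest of the feed stays as-is) is to move the feed's validity start back to the very beginning of the month, so that currentMonth(0,0,0) can never resolve to a time before it:

<validity start='2015-12-01T00:00Z' end='2017-02-01T06:59Z'/>

The end value here is just carried over from the earlier feed definition; adjust it to your actual end date.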


Hi, thanks for the replies. I have changed all my feeds to monthly. Please find the feed and process XML below.

Now the problem is back to square one again: I can run the feeds and processes, but the process is not picking up the data from the folder and processing it.

InputFeed Data Path = /falcon/demo1/data/${YEAR}-${MONTH}

InputFeed Frequency = 1 month

Process Instance Start = currentMonth(0,0,0) End = currentMonth(31,0,0)

Actual Data path = /falcon/demo1/data/2016-01

Pig script:

-- $inputgroup and $outputgroup match the input/output names in the process entity;
-- Falcon substitutes them with the resolved feed paths when launching the workflow.
A = LOAD '$inputgroup' USING PigStorage(',')
    AS (trnid:chararray, custid:chararray, age:int, trndt:chararray, trntm:chararray, mcc:chararray, mcccode:int, amt:chararray);
B = FILTER A BY (mcc == 'Airlines');
STORE B INTO '$outputgroup';


@Nayan Paul: Your process XML has the validity as <validity start="2015-12-01T23:33Z" end="2018-01-03T23:33Z"/> and the frequency is every 5 minutes.

Instance Number | Process instance start time | Feeds to process [currentMonth(0,0,0) - currentMonth(31,0,0)]
1               | 2015-12-01T23:33Z           | 2015-12-01T00:00Z - 2015-12-31T00:00Z
2               | 2015-12-01T23:38Z           | 2015-12-01T00:00Z - 2015-12-31T00:00Z
3               | 2015-12-01T23:43Z           | 2015-12-01T00:00Z - 2015-12-31T00:00Z
...

As you can see, the process instances being generated are for 2015-12; feeds generated in 2016-01 will only be processed once the process instance start time reaches 2016-01-*. Please change the process validity start in the process XML to "2016-01-01T00:00Z". Thanks!
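In other words, the cluster block of the process entity would look something like this (keeping your existing end date):

<cluster name="demo1Cluster-New">
    <validity start="2016-01-01T00:00Z" end="2018-01-03T23:33Z"/>
</cluster>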

@Sowmya Ramesh Hi, thank you for your patience. I am very new to Falcon and am having issues sorting this out. I have changed my process time as you mentioned; however, I still do not see any output folder generated. Please find my process XML below:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<process name="demo1processNew" xmlns="uri:falcon:process:0.1">
    <tags>process_name=demo1processNew</tags>
    <clusters>
        <cluster name="demo1Cluster-New">
            <validity start="2016-01-01T00:00Z" end="2018-01-04T14:00Z"/>
        </cluster>
    </clusters>
    <parallel>1</parallel>
    <order>FIFO</order>
    <frequency>minutes(5)</frequency>
    <timezone>GMT+05:50</timezone>
    <inputs>
        <input name="inputgroup" feed="demo1FeedInputNew" start="currentMonth(0,0,0)" end="currentMonth(31,0,0)"/>
    </inputs>
    <outputs>
        <output name="outputgroup" feed="demo1OutputFeedNew" instance="currentMonth(0,0,0)"/>
    </outputs>
    <workflow name="demo1processNew" engine="pig" path="/falcon/demo1/code/demo1.pig"/>
    <retry policy="exp-backoff" delay="minutes(5)" attempts="1"/>
    <ACL owner="falcon" group="supoergroup" permission="0755"/>
</process>

Kindly advise. I would like to see one flow where my data is picked up from the /falcon/demo1/data/2016-01 folder and processed.


@Nayan Paul: If you look at the Oozie job launched when the Falcon process entity is scheduled, do you see any errors? Does running the Pig script outside Falcon work as intended?

Can you please attach the Falcon, Oozie and MR logs to debug this issue?

Expert Contributor

@Nayan Paul: I might have found the problem. Here is what you have for the process:

<validity start="2015-12-01T23:33Z" end="2018-01-03T23:33Z"/> </cluster>

...

<parallel>1</parallel> <order>FIFO</order> <frequency>minutes(5)</frequency>

The input feed you have is for 2016-01, almost a month after the validity start of 2015-12-01. You have one process instance running every 5 minutes, i.e. 288 instances per day, so on the order of 288 x 31 instances would be scheduled before getting to 2016-01. Your process instances in 2015-12 are therefore waiting for the 2015-12 input feed, and since you specified FIFO order they will keep everything waiting for almost a month. Create the 2015-12 input feed and you will see data being processed end to end. Let me know if this solves the issue.
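For reference, given the feed's data location template /falcon/demo1/data/${YEAR}-${MONTH}, those waiting 2015-12 instances are looking for data at:

/falcon/demo1/data/2015-12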