
Falcon Process running but not processing Data in HDP2.3

avatar

We have set up a Falcon process that reads data from an HDFS location and saves the output, via a Pig process, into another HDFS location. The feeds and processes are running in the cluster, but I cannot see any output being generated.

My XML for the process is as below:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<process name="demo1Process" xmlns="uri:falcon:process:0.1">
    <tags>processName=demo1Process</tags>
    <clusters>
        <cluster name="Atlas-Demo1">
            <validity start="2016-01-28T20:51Z" end="2017-02-02T20:51Z"/>
        </cluster>
    </clusters>
    <parallel>2</parallel>
    <order>FIFO</order>
    <frequency>minutes(5)</frequency>
    <timezone>GMT+05:50</timezone>
    <inputs>
        <input name="inputfeed" feed="demo1Feed" start="yesterday(0,0)" end="today(-1,0)"/>
    </inputs>
    <outputs>
        <output name="outoutfeed" feed="demo1OutputFeed" instance="yesterday(0,0)"/>
    </outputs>
    <workflow name="select_airlines_data" version="pig-0.12.0" engine="pig" path="/falcon/demo1/code/demo1.pig"/>
    <retry policy="exp-backoff" delay="minutes(3)" attempts="2"/>
    <ACL owner="falcon" group="falcon" permission="0755"/>
</process>

My XML for the input feed is as below:

<feed xmlns='uri:falcon:feed:0.1' name='demo1InputFeed' description='demo1 input feed'>
    <tags>feed_name=demo1InputFeed</tags>
    <groups>input</groups>
    <frequency>minutes(1)</frequency>
    <timezone>GMT+05:50</timezone>
    <late-arrival cut-off='minutes(3)'/>
    <clusters>
        <cluster name='demo1cluster' type='source'>
            <validity start='2016-01-28T07:49Z' end='2017-02-01T07:49Z'/>
            <retention limit='days(2)' action='delete'/>
            <locations>
                <location type='data'> </location>
                <location type='stats'> </location>
                <location type='meta'> </location>
            </locations>
        </cluster>
    </clusters>
    <locations>
        <location type='data' path='/falcon/demo1/data/${YEAR}-${MONTH}'> </location>
        <location type='stats' path='/falcon/demo1/status'> </location>
        <location type='meta' path='/falcon/demo1/meta'> </location>
    </locations>
    <ACL owner='falcon' group='falcon' permission='0755'/>
    <schema location='none' provider='none'/>
    <properties>
        <property name='jobPriority' value='HIGH'> </property>
    </properties>
</feed>

My input folder in HDFS is:

/falcon/demo1/data/2016-01

1 ACCEPTED SOLUTION

avatar

Nayan Paul: There are a couple of issues in your entity XMLs.

1> The granularity of the date pattern in the location path should be at least that of the feed's frequency.

2> yesterday(hours,minutes): As the name suggests, the EL expression yesterday picks up feed instances relative to the start of day yesterday. Hours and minutes are added to 00:00 of yesterday. Example: yesterday(24,30) actually corresponds to 00:30 am of today; for a process instance at 2010-01-02T01:30Z this means the 2010-01-02T00:30Z feed.

The input location path in your feed XML is /falcon/demo1/data/${YEAR}-${MONTH}, but the frequency is in minutes. Also, if you want to process a month's worth of data, please use the lastMonth or currentMonth EL expression.

Please refer to the EL expression doc for more details, and to the entity specification doc for entity details. Thanks!
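As an illustrative sketch of point 1> (the ${DAY}/${HOUR}/${MINUTE} path below is an assumption for illustration, not from the original XML): a minutes-granularity feed needs a date pattern at least as fine as its frequency.

```xml
<!-- Sketch only: for frequency minutes(1), the location date pattern
     must include minute-level granularity -->
<frequency>minutes(1)</frequency>
<locations>
    <location type='data'
              path='/falcon/demo1/data/${YEAR}-${MONTH}-${DAY}-${HOUR}-${MINUTE}'/>
</locations>
```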


21 REPLIES

avatar

Perhaps there is some error in the Pig script? You will find the logs in the Oozie launcher action (the map task log) or in the Pig action that gets spawned. Hue is convenient for getting the logs, or you can go directly to the ResourceManager UI.

When a process is executed you will have one job that is the launcher; it contains the parameters for the Pig script and any error returned by the pig command.

You will have a second job that is the actual Pig execution.

You should find the problem in one or the other.

If these jobs don't exist, you can also go to the Oozie UI and see why these actions are not spawned.

avatar
New Contributor

If you run the Pig script manually, outside of Falcon, do you get an error?

avatar

Thanks for the quick reply.

I just tested the Pig script, replacing $input and $output with the actual HDFS paths, and the Pig job runs fine.

Also, my feed's input path is /falcon/demo1/data/${YEAR}-${MONTH}, whereas my actual HDFS path is /falcon/demo1/data/2016-01. Could this be the mismatch?

avatar

The path looks good. ${YEAR} is replaced with the current year, and so on. However, what do you see when you look into the ResourceManager as described above?

avatar
Cloudera Employee

I would cross-check the following:

  • process validity start/end dates
  • input start/end dates
  • feed validity start/end dates
  • input path pattern
  • timezone

If you want data to be picked up for a particular process instance, the feed must be valid (read this as: the feed is expected to be populated) during that time, and the data must be in a directory that matches the expected pattern. Look at your Oozie coordinator actions for details on which HDFS paths are being waited for.


avatar

Thanks for the help. I am able to run the Falcon process now.

avatar
New Contributor

Can you hardcode the path to /falcon/demo1/data/2016-01?

avatar
New Contributor

Can you look for any error codes/messages in the Oozie console (via Ambari), or perhaps provide the full stack trace, which usually has output like "Caused by: org.apache.falcon.FalconException:"? Confirm the scripts and directories (absolute paths) are chmoded to '777', or at least '775'.

avatar
@Sowmya Ramesh

Thank you for the response. I am trying to recreate the feed XML with monthly feeds. Please find my XML below:

<feed xmlns='uri:falcon:feed:0.1' name='demo1InputFeed' description='demo1 input feed'>
    <tags>feed_name=demo1InputFeed</tags>
    <groups>input</groups>
    <frequency>months(1)</frequency>
    <timezone>GMT+05:50</timezone>
    <late-arrival cut-off='days(3)'/>
    <clusters>
        <cluster name='demo1cluster' type='source'>
            <validity start='2016-01-28T06:59Z' end='2017-02-01T06:59Z'/>
            <retention limit='months(2)' action='delete'/>
            <locations>
                <location type='data'> </location>
                <location type='stats'> </location>
                <location type='meta'> </location>
            </locations>
        </cluster>
    </clusters>
    <locations>
        <location type='data' path='/falcon/demo1/data/${YEAR}-${MONTH}'> </location>
        <location type='stats' path='/falcon/demo1/status'> </location>
        <location type='meta' path='/falcon/demo1/meta'> </location>
    </locations>
    <ACL owner='falcon' group='falcon' permission='0755'/>
    <schema location='none' provider='none'/>
    <properties>
        <property name='jobPriority' value='HIGH'> </property>
    </properties>
</feed>

However, every time I try to run it I get the exception below:

Error: Feed demo1InputFeed's frequency: months(1), path pattern: FileSystemStorage{storageUrl='${nameNode}', locations=[org.apache.falcon.entity.v0.feed.Location@4b8f5469, org.apache.falcon.entity.v0.feed.Location@189f3fa7, org.apache.falcon.entity.v0.feed.Location@dfc9857]} does not match with group: input's frequency: minutes(1), date pattern: [${MONTH}, ${YEAR}] (FalconWebException:83)

Kindly let me know if you can help me with this.
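The error above says the monthly feed still belongs to group input, whose frequency is minutes(1); feeds in the same group must share the same frequency and date pattern. A minimal sketch of one way out, assuming the group membership is not actually needed (illustrative, elided with "..."):

```xml
<!-- Sketch: drop the group so the months(1) frequency no longer
     clashes with group "input" (frequency minutes(1)) -->
<feed xmlns='uri:falcon:feed:0.1' name='demo1InputFeed' description='demo1 input feed'>
    <tags>feed_name=demo1InputFeed</tags>
    <!-- <groups>input</groups> removed -->
    <frequency>months(1)</frequency>
    ...
</feed>
```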

avatar

Thanks for the responses.

I have created a new feed with a currentMonth interval. However, when I set up my process with the feed I get the following error. Any help would be greatly appreciated.

Error: Start instance currentMonth(0,0,0) of feed demo1FeedInputNew is before the start of feed Tue Dec 01 05:19:00 EST 2015 for cluster demo1Cluster-New (FalconWebException:83)
2016-01-29 07:48:56,050 ERROR - [1495236611@qtp-2044215423-59 - 2c35b9f5-049a-4636-bff1-4d7a951c6151:falcon:POST//entities/submit/process] ~ Action failed: Bad Request Error: default/org.apache.falcon.FalconWebException::org.apache.falcon.FalconException: Start instance currentMonth(0,0,0) of feed demo1FeedInputNew is before the start of feed Tue Dec 01 05:19:00 EST 2015 for cluster demo1Cluster-New

avatar
Expert Contributor

@Nayan Paul Can you share the process entity as well as the feed entity? From the exception it looks like the EL value for the "start" attribute in the input element of the process entity resolves to a time before the input feed entity's validity start.

avatar
Mentor

@Nayan Paul are you doing this in the Sandbox? Can you confirm you're not running into the issues described in this thread?

avatar

Nayan Paul: Can you provide the input feed and process entities used when the error was thrown? Falcon throws this error if the process input feed start is before the input feed's validity start.

currentMonth(day,hour,minute): currentMonth takes its reference from the start of the month of the process instance start time. One thing to keep in mind is that day is added to the first day of the month; that is, the value of day is the number of days you want to add to the first day of the month. For example: for instance start time 2010-01-12T01:30Z, the EL currentMonth(3,2,40) corresponds to the feed created at 2010-01-04T02:40Z, and currentMonth(0,0,0) means 2010-01-01T00:00Z.

It looks like currentMonth is evaluating to a date that is before the input feed's validity start.
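For a monthly feed, a process input spanning the current month can be expressed like this (feed and input names taken from the error and thread above; illustrative sketch, not a verified fix):

```xml
<!-- currentMonth(0,0,0) = first day of the month at 00:00Z;
     the resolved start must not precede the feed's validity start -->
<input name="inputgroup" feed="demo1FeedInputNew"
       start="currentMonth(0,0,0)" end="currentMonth(31,0,0)"/>
```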

avatar

Hi, thanks for the replies. I have changed all my feeds to monthly. Please find the feed and process XML below.

Now the problem is back to square one: I can run the feeds and processes, but the process is not picking up the data from the folder and processing it.

InputFeed Data Path = /falcon/demo1/data/${YEAR}-${MONTH}

InputFeed Frequency = 1 month

Process Instance Start = currentMonth(0,0,0) End = currentMonth(31,0,0)

Actual Data path = /falcon/demo1/data/2016-01

Pig script:

A = LOAD '$inputgroup' USING PigStorage(',') AS (trnid:chararray, custid:chararray, age:int, trndt:chararray, trntm:chararray, mcc:chararray, mcccode:int, amt:chararray);
B = FILTER A BY (mcc == 'Airlines');
STORE B INTO '$outputgroup';

avatar

@Nayan Paul: Your process xml has the validity as <validity start="2015-12-01T23:33Z" end="2018-01-03T23:33Z"/>

and frequency is every 5 minutes.

Instance | Process instance start time | Feeds to process [currentMonth(0,0,0) - currentMonth(31,0,0)]
-------- | --------------------------- | --------------------------------------------------------------
1        | 2015-12-01T23:33Z           | 2015-12-01T00:00Z - 2015-12-31T00:00Z
2        | 2015-12-01T23:38Z           | 2015-12-01T00:00Z - 2015-12-31T00:00Z
3        | 2015-12-01T23:43Z           | 2015-12-01T00:00Z - 2015-12-31T00:00Z
...

As you can see, the process instances generated are for 2015-12; feeds generated in 2016-01 will only be processed once the process instance start time is 2016-01-*. Please change the process validity start in the process XML to "2016-01-01T00:00Z". Thanks!
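In other words, the suggested change amounts to the following cluster validity element (end date kept as in your process XML):

```xml
<validity start="2016-01-01T00:00Z" end="2018-01-03T23:33Z"/>
```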

avatar
@Sowmya Ramesh Hi, thank you for your patience. I am very new to Falcon and am having issues sorting this out. I have changed my process time as you mentioned; however, I still do not see any output folder generated. Please find my XML for the process below:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<process name="demo1processNew" xmlns="uri:falcon:process:0.1">
    <tags>process_name=demo1processNew</tags>
    <clusters>
        <cluster name="demo1Cluster-New">
            <validity start="2016-01-01T00:00Z" end="2018-01-04T14:00Z"/>
        </cluster>
    </clusters>
    <parallel>1</parallel>
    <order>FIFO</order>
    <frequency>minutes(5)</frequency>
    <timezone>GMT+05:50</timezone>
    <inputs>
        <input name="inputgroup" feed="demo1FeedInputNew" start="currentMonth(0,0,0)" end="currentMonth(31,0,0)"/>
    </inputs>
    <outputs>
        <output name="outputgroup" feed="demo1OutputFeedNew" instance="currentMonth(0,0,0)"/>
    </outputs>
    <workflow name="demo1processNew" engine="pig" path="/falcon/demo1/code/demo1.pig"/>
    <retry policy="exp-backoff" delay="minutes(5)" attempts="1"/>
    <ACL owner="falcon" group="supoergroup" permission="0755"/>
</process>

Kindly advise. I would like to see one flow where my data is picked up from the /falcon/demo1/data/2016-01 folder and executed.

avatar

@Nayan Paul: If you look at the Oozie job launched when the Falcon process entity is scheduled, do you see any errors? Does running the Pig script outside Falcon work as intended?

Can you please attach the Falcon, Oozie and MR logs to debug this issue?

avatar
Rising Star

@Nayan Paul : I might have found the problem. Here is what you have for the process.

<validity start="2015-12-01T23:33Z" end="2018-01-03T23:33Z"/> </cluster>

...

<parallel>1</parallel> <order>FIFO</order> <frequency>minutes(5)</frequency>

The input feed you have is 2016-01, almost a month after the validity start of 2015-12. You have one process instance running every 5 minutes, so that is approximately 288 x 31 instances that would have to be processed before reaching 2016-01. So your process instances in 2015-12 are waiting for the 2015-12 input feed... and will wait for almost a month, since you specified FIFO order. Create the 2015-12 input feed and you will see data being processed end to end. Let me know if this solves the issue.

avatar

@Balu: I already replied with the same analysis. I asked him to change the process start time to 2016-01 instead: https://community.hortonworks.com/answers/12696/view.html
