Member since 04-29-2016 | 192 Posts | 20 Kudos Received | 2 Solutions
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 1636 | 07-14-2017 05:01 PM |
| | 2775 | 06-28-2017 05:20 PM |
08-05-2017
01:37 PM
@rich @William Gonzalez Any updates on the Certified Professional Data Engineer (HCPDE)?
07-14-2017
05:01 PM
This is a known issue with GetHDFS - https://issues.apache.org/jira/browse/NIFI-2956, which is resolved in NiFi 1.1.0
07-14-2017
01:36 PM
Once I stop and start the GetHDFS processor, the expression for 'Directory' appears to be re-evaluated: the processor then correctly points to the previous day's directory and processes the files in it. This further confirms that the expression is evaluated only for the first scheduled run and not for subsequent runs. Is there a workaround to force the expression to be evaluated on each run?
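To make the suspected behavior concrete, here is a minimal toy model in Python (not NiFi code). It assumes the processor either caches the 'Directory' value when it is started or re-evaluates it on every trigger; the names `directory_for` and `directories_used` are hypothetical and exist only for this illustration.

```python
from datetime import datetime, timedelta
from typing import List


def directory_for(now: datetime) -> str:
    """Stand-in for the 'Directory' expression: the previous day's folder name."""
    previous = now - timedelta(days=1)
    return f"{previous:%Y_%m_%d}"


def directories_used(first_run: datetime, runs: int, evaluate_each_run: bool) -> List[str]:
    """Return the folder each daily run would read from (toy model, not NiFi code)."""
    cached = directory_for(first_run)  # value computed once, when the processor starts
    used = []
    for day in range(runs):
        trigger_time = first_run + timedelta(days=day)
        used.append(directory_for(trigger_time) if evaluate_each_run else cached)
    return used


start = datetime(2017, 7, 12, 0, 30)
print(directories_used(start, 3, evaluate_each_run=False))
print(directories_used(start, 3, evaluate_each_run=True))
```

With `evaluate_each_run=False` every run reads the same folder, which matches the "0 items, 0 of which were new" symptom; restarting the processor corresponds to computing `cached` again.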
07-12-2017
05:57 PM
Hello,

When a NiFi processor property includes Expression Language and the processor is scheduled to run at certain intervals, is the expression evaluated for each scheduled run, or only once for the first run?

The reason I'm asking: I have a GetHDFS processor that is scheduled to run once daily, and its 'Directory' property includes Expression Language. Since I want the processor to point to the previous day's directory, I have set the property as follows:

/user/nifitest/${now():toNumber():minus(86400000):format('yyyy')}/${now():toNumber():minus(86400000):format('MM')}/${now():toNumber():minus(86400000):format('yyyy_MM_dd')}

The expression evaluates correctly to the directory that was created the previous day; for example, today's run (7-12-2017) would point to /user/nifitest/2017/07/2017_07_11.

On the first run after it is scheduled, the GetHDFS processor starts at the scheduled time and works perfectly: it processes all the files in the previous day's directory. On subsequent scheduled runs, however, it does not find any files. I was not able to find the exact directory path the processor points to in the NiFi log, but this is what the log shows:

2017-06-30 08:18:00,000 ERROR [NiFi logging handler] org.apache.nifi.StdErr [Timer-Driven Process Thread-10] INFO org.apache.nifi.processors.hadoop.GetHDFS - GetHDFS[id=b0d21ab8-1001-1159-15dd-4d380d420cab] Kerberos ticket age exceeds threshold [14400 seconds] attempting to renew ticket for user nifitest/dcdrlhadoop1a.mdanderson.edu@MDANDERSON.EDU
2017-06-30 08:18:00,057 ERROR [NiFi logging handler] org.apache.nifi.StdErr [Timer-Driven Process Thread-10] INFO org.apache.nifi.processors.hadoop.GetHDFS - GetHDFS[id=b0d21ab8-1001-1159-15dd-4d380d420cab] Kerberos relogin successful or ticket still valid
2017-06-30 08:18:00,154 ERROR [NiFi logging handler] org.apache.nifi.StdErr [Timer-Driven Process Thread-6] INFO org.apache.nifi.processors.standard.GetHTTP - GetHTTP[id=19a2140b-1178-102e-de2f-9e978bc6b90a] content not retrieved because server returned HTTP Status Code 304: Not Modified
2017-06-30 08:18:00,182 ERROR [NiFi logging handler] org.apache.nifi.StdErr [Timer-Driven Process Thread-10] INFO org.apache.nifi.processors.hadoop.GetHDFS - GetHDFS[id=b0d21ab8-1001-1159-15dd-4d380d420cab] Obtained file listing in 181 milliseconds; listing had 0 items, 0 of which were new

The first run of the processor after it is scheduled works perfectly, but the subsequent runs do not. This makes me suspect that the 'Directory' property is evaluated once and that the same value is reused for each subsequent scheduled run, so the processor points to the same directory every time. The log message "Obtained file listing in 181 milliseconds; listing had 0 items, 0 of which were new" is what makes me think it is still pointing to the first run's directory.

I was expecting the processor to evaluate the 'Directory' property for each scheduled run. Does it do that? If not, how do I make this work? Since Get* processors do not accept inbound connections, I cannot calculate the 'Directory' value first in an UpdateAttribute processor and pass the correct value to GetHDFS.

Thanks in advance.
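For reference, the date arithmetic in the 'Directory' expression can be checked outside NiFi. The following is a minimal Python sketch of the same calculation (subtract 86400000 milliseconds from the current time, then format the year, month, and date); `previous_day_directory` is a hypothetical helper name, and the sketch ignores time zones and daylight-saving shifts.

```python
from datetime import datetime, timedelta


def previous_day_directory(now: datetime, base: str = "/user/nifitest") -> str:
    """Mirror the 'Directory' expression: go back 24 hours, then format the path parts."""
    previous = now - timedelta(milliseconds=86_400_000)  # same offset as minus(86400000)
    return f"{base}/{previous:%Y}/{previous:%m}/{previous:%Y_%m_%d}"


# A run on 7-12-2017 resolves to the directory for 7-11-2017.
print(previous_day_directory(datetime(2017, 7, 12, 0, 30)))
# /user/nifitest/2017/07/2017_07_11
```

The path only changes from day to day if the calculation is repeated at each run, which is exactly what the question is about.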
Labels: Apache NiFi
06-28-2017
05:20 PM
Thanks to @Bryan Bende: I needed to change the batch size property in GetHDFS to read all files in the directory. https://community.hortonworks.com/questions/108547/need-clarification-on-how-nifi-processors-run-with.html#answer-109798
06-28-2017
05:13 PM
Thank you, it's the batch schedule that needed to be changed in my case.
06-28-2017
04:50 PM
@Bryan Bende Thank you. This is my use case: GetHDFS is on a CRON schedule to run daily at 12:30 am and process files that were inserted into an HDFS directory; these files would have been created the previous day. The GetHDFS processor does start at 12:30 am, as expected, but not all files in the directory are processed. So it seems the processor is not staying in the running state until all the files are processed. Is that the expected behavior, since you are saying "the processor does not remain running"? So:
1) At what time (how long after starting) does the processor stop?
2) How do you control how long the processor stays running after it is triggered to start (in my case, to let all files be processed)?
3) When the processor is not running, does the icon on the processor (in the NiFi UI) change to the stopped state?
Thanks.
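As a rough mental model of why a single CRON trigger can leave files behind, the Python sketch below assumes that each trigger pulls at most a fixed number of files (the batch size mentioned in the accepted answer) and then returns. It is a toy simulation, not NiFi code; `trigger_once`, `triggers_needed`, the file names, and the counts of 250 and 100 are all hypothetical.

```python
from collections import deque
from typing import Deque, List


def trigger_once(pending: Deque[str], batch_size: int) -> List[str]:
    """Pull at most batch_size files from the pending listing for one trigger."""
    pulled = []
    while pending and len(pulled) < batch_size:
        pulled.append(pending.popleft())
    return pulled


def triggers_needed(total_files: int, batch_size: int) -> int:
    """Count how many triggers it takes to drain the directory (batch_size must be positive)."""
    pending = deque(f"file_{i:03d}" for i in range(total_files))
    count = 0
    while pending:
        trigger_once(pending, batch_size)
        count += 1
    return count


pending = deque(f"file_{i:03d}" for i in range(250))
first = trigger_once(pending, batch_size=100)
print(len(first), len(pending))   # 100 150
print(triggers_needed(250, 100))  # 3
```

Under this assumption, a once-a-day trigger only drains the directory if the batch size is at least as large as the number of files, or if the processor is triggered repeatedly.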
06-28-2017
02:45 PM
Hello, I'm trying to understand how NiFi processor "runs" work with the CRON scheduler. I understand that by default a processor runs all the time (the default 0 sec "Timer driven" schedule on all processors). When a processor is set to a CRON driven schedule, I understand that the schedule dictates when the processor is triggered to run. But once the processor is triggered, how long does it stay running? Does it stop after a certain amount of time? The CRON run schedule only specifies when and how often the processor should be triggered to start; where do you specify how long it should run before it stops? For example, let's say I set a Get* processor to run daily at 1 am. Once the system time reaches 1 am, the processor starts running, but does it ever stop after the scheduler starts it, or does it stay running? If it stays running, then it doesn't need to be triggered by the scheduler again the next day at 1 am, because it would already be running, right? If it does stop, how long after starting does it stop, and where do you specify how long the processor should run? Thank you.
Labels: Apache NiFi
06-26-2017
03:09 PM
I'm not sure why I need to schedule the GetHDFS processor to run continuously (I set it to run every 15 seconds), but this schedule exhausts all files in the directory:
0/15 * * * * ?
Since I'm loading files the next day (the GetHDFS directory path points to the previous day's directory), this resolves the issue I was facing.
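For anyone unsure how to read that schedule: the first field of a Quartz-style cron expression is the seconds field, and `0/15` means "start at second 0, then every 15 seconds". A minimal Python sketch of that expansion follows; `expand_seconds_field` is a hypothetical helper that handles only the `start/step` form, not a full cron parser.

```python
from typing import List


def expand_seconds_field(field: str) -> List[int]:
    """Expand a 'start/step' seconds field into the seconds at which it fires."""
    start_text, step_text = field.split("/")
    start = 0 if start_text == "*" else int(start_text)  # '*' as start is treated as 0
    return list(range(start, 60, int(step_text)))


cron = "0/15 * * * * ?"
seconds_field = cron.split()[0]  # first field of the expression is seconds
print(expand_seconds_field(seconds_field))  # [0, 15, 30, 45]
```

So the processor is triggered four times every minute, which gives it many chances to pick up whatever a single trigger left behind.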
06-26-2017
03:06 PM
@Shashank Chandhok Actually, the files I'm trying to process are from the day before. In the directory path of the GetHDFS processor, I'm using Expression Language to point to the directory that was created yesterday, and the files in that directory are from yesterday. So when the CRON scheduler starts at 12:30 am, all the files that need to be processed should already be in that directory.