Member since
10-31-2018
02-04-2019
03:48 AM
I have a NiFi flow which goes thusly: ListFile -> FetchFile -> HashContent -> DetectDuplicate -> UpdateAttribute (several of them) -> PutS3Object
I'm ingesting some files daily, and I'd like the flow to send only new files through the pipeline. Hence my duplicate problem.

ListFile config: screen-shot-2019-02-03-at-45106-pm.png

The file-fetching processors are pulling in multiple duplicates. For example, there are multiple MBA0001.txt or MBA0023.txt coming through. Those are also the file names of each flow file. I've set DetectDuplicate to detect duplicates based on ${filename}, but the processor does not filter anything out and sends the same number of files on to the next stage.

DetectDuplicate config: screen-shot-2019-02-02-at-22259-pm.png

So if 50 files go into DetectDuplicate and, say, 25 are duplicates, 50 still go through. I don't get it. Any idea why? The documentation has not been helpful.

For FetchFile, when the not.found connection is cleared, the processor does not send on duplicates. But if I don't manually empty that connection, it sends on everything, duplicates and all. This is scheduled to run once a day, and I set it to delete files once pulled.

Here's that `FetchFile` config: screen-shot-2019-02-03-at-42321-pm.png
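To make the expected behavior concrete, here is a minimal Python sketch of the filtering described above: an item passes only the first time its key is seen. It is not NiFi's implementation. The `filter_new` and `content_hash` helpers and the sample flow files are hypothetical, and the in-memory set merely stands in for whatever cache the duplicate detector consults.

```python
import hashlib


def filter_new(items, key):
    """Yield only the items whose key has not been seen earlier in this run."""
    seen = set()  # stands in for the cache a duplicate detector consults
    for item in items:
        k = key(item)
        if k in seen:
            continue  # duplicate: drop it
        seen.add(k)
        yield item


def content_hash(item):
    """Key on the content bytes rather than on the file name."""
    return hashlib.sha256(item["content"]).hexdigest()


# Hypothetical flow files; the names echo the ones mentioned in the post.
flow_files = [
    {"filename": "MBA0001.txt", "content": b"alpha"},
    {"filename": "MBA0023.txt", "content": b"beta"},
    {"filename": "MBA0001.txt", "content": b"alpha"},
]

by_name = list(filter_new(flow_files, key=lambda f: f["filename"]))
by_hash = list(filter_new(flow_files, key=content_hash))
print(len(by_name), len(by_hash))
```

Keying on the name treats a changed file with the same name as a duplicate, while keying on a content hash does not, so the choice of key decides what "new" means for the daily run.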
01-30-2019
11:23 PM
I fully typed out this question here: https://stackoverflow.com/questions/54450058/nifi-running-python-web-scraper-through-executecommandstream-executeprocess-p

The gist is that I have a Python web-scraping script in my Docker container, and I'm trying to have a processor run it, scrape what I need, and send the result on down my pipeline. The problem is that I can't get it to run without throwing "command not found" errors, and I have no idea how to get the system to recognize my Python script. Python 3 is installed in the container. The SO link above fully explains my issue.

I've taken a look at this: https://community.hortonworks.com/questions/178561/can-anyone-provide-an-example-of-a-python-script-e.html (a good starting place, but not truly germane to the issue).
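A "command not found" error usually means the name being invoked does not resolve on the PATH of the process doing the invoking. As a hedged diagnostic sketch (not a fix specific to NiFi), the Python below checks whether a command name resolves and runs a script through an explicit interpreter path instead of relying on PATH. The helper names `resolve_command` and `run_script` are hypothetical, and the sketch assumes Python 3.7 or later.

```python
import os
import shutil
import subprocess
import sys


def resolve_command(name):
    """Return the absolute path of `name` if it is found on PATH, else None."""
    return shutil.which(name)


def run_script(interpreter, script_path, timeout=60):
    """Run a script with an explicit interpreter path and return its stdout."""
    result = subprocess.run(
        [interpreter, script_path],
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout


if __name__ == "__main__":
    # Print what this process can actually see; a container or service
    # account may have a different PATH than an interactive shell.
    print("PATH =", os.environ.get("PATH", ""))
    print("python3 resolves to:", resolve_command("python3"))
    print("this interpreter is:", sys.executable)
```

Running the diagnostic part inside the container, as the same user that runs the flow, shows which absolute interpreter path could be given to the processor in place of a bare command name.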
12-11-2018
10:00 PM
This question is really a follow-up to @Timothy Spann's guide series on Stanford NLP and its use in NiFi.

Problem: I have NiFi up in AWS, and I also have the Stanford CoreNLP jar file running in an ECS task. I can't get them connected. My current flow is this:
1) GenerateFlowFile - with custom text: "Testing because I have no idea how this works?" (just under 50B)
2) InvokeHTTP - POST, with URL = http://xx.xxx.xx.xxx:port (IP and port; throws no errors)
3) ???? - I currently have the original and response relationships connected to a LogAttribute, to see what comes out.

For response, when I list the queue, the flowfile appears empty in the content viewer, and when I download the file, it just gives me the Apache Tika license agreement. Original just carries that message as an attribute.

How do I call *entity* analysis? I know the NLP is running over in that ECS task. I have no idea how to form a correct URL call, or what type of processor must come after InvokeHTTP. If I am asking the wrong question or a dumb question, please let me know. Thanks
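For reference, here is a minimal Python sketch of what a request for entity (NER) annotations can look like, assuming the jar is running as the CoreNLP HTTP server (`StanfordCoreNLPServer`), which takes the annotator list as a URL-encoded `properties` JSON parameter and the raw text as the POST body. The helper names are hypothetical, and the base URL is left as a parameter because the post does not give the real address.

```python
import json
import urllib.parse
import urllib.request


def build_corenlp_url(base_url, annotators=("tokenize", "ssplit", "pos", "lemma", "ner")):
    """Build a server URL whose `properties` parameter asks for NER as JSON."""
    props = {"annotators": ",".join(annotators), "outputFormat": "json"}
    query = urllib.parse.urlencode({"properties": json.dumps(props)})
    return f"{base_url.rstrip('/')}/?{query}"


def annotate(base_url, text, timeout=30):
    """POST raw text to the server and return the parsed JSON annotation."""
    request = urllib.request.Request(
        build_corenlp_url(base_url),
        data=text.encode("utf-8"),
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def entities(annotation):
    """Collect (word, NER tag) pairs for tokens tagged as entities."""
    return [
        (token["word"], token["ner"])
        for sentence in annotation.get("sentences", [])
        for token in sentence.get("tokens", [])
        if token.get("ner", "O") != "O"
    ]
```

The same URL shape could be tried as the InvokeHTTP remote URL, with the flowfile content as the POST body; whether that matches this particular deployment is an assumption to verify against the running server.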
12-05-2018
05:11 PM
I have a connected SFTP server, and I am trying to route files based on type: `.csv`, `.tsv`, and `.xlsx`. For now, I'm just uploading test files through the command line.

My flow is: GetSFTP (with correct hostname, etc.) -> RouteOnAttribute -> LogAttribute (will dump elsewhere soon; this is just for testing)

My problem, I think, is that I created a property in `RouteOnAttribute` incorrectly: screen-shot-2018-12-05-at-120805-pm.png

Am I correct in assuming that this does not actually pick up on the `.csv` because it is not technically part of the filename? What would be the correct expression to route on the file type? Thanks!
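The routing rule being asked about can be sketched in Python as a lookup on the file name's extension. This only illustrates the logic; it is not NiFi configuration, and the `ROUTES` table and `route_for` helper are hypothetical. In NiFi Expression Language the analogous test would presumably be something like `${filename:endsWith('.csv')}`, on the assumption that the `filename` attribute carries the extension.

```python
import os

# Hypothetical mapping from extension to route name, mirroring the three
# file types listed in the post.
ROUTES = {".csv": "csv", ".tsv": "tsv", ".xlsx": "xlsx"}


def route_for(filename):
    """Return the route for a file name, or 'unmatched' if none applies."""
    _, ext = os.path.splitext(filename)
    return ROUTES.get(ext.lower(), "unmatched")


for name in ["report.csv", "DATA.TSV", "book.xlsx", "notes.txt"]:
    print(name, "->", route_for(name))
```

Lower-casing the extension is a design choice in this sketch; a case-sensitive comparison would send `DATA.TSV` to the unmatched route.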