Member since 04-08-2019 | 5 Posts | 0 Kudos Received | 0 Solutions
04-10-2019 07:16 PM
I think I understood all that. Thanks for taking the time to write it all out.
04-10-2019 05:20 PM
@Matt Clarke @Shu Thanks. You and Shu are correct, but I still don't understand why my DetectDuplicate processor didn't work. If you could continue to indulge me, I have a few more questions for you.

1.) For testing purposes, I left the ListS3 processor the same and set the DetectDuplicate processor to Primary Node. DetectDuplicate only sent one copy of each file onward, but it simply ignored the other 3 copies of the messages and left them in the queue. So I guess it processed only messages produced on the local node. I had assumed the queues work like message queues and that any of the nodes could pull any message from the preceding queue, so I'm confused. Is the DetectDuplicate processor different in this regard? Or maybe it has to do with my DistributedMapCacheClientService being set to localhost? Does each node keep its own copy of the cache?

2.) More questions about the DistributedMapCaches. I found a reply from 2016 by Pierre Villard. He said: "It is advised to run the DistributedMapCacheServer on the NCM, then, in DistributedMapCacheClientService, instead of localhost, you can use the IP address of your NCM." OK, I guess this is saying that I should set up one CacheServer on one node, the NCM, and then point the CacheClientService at the NCM? How do I get the CacheServer onto a specific node? Is that what clicking in the upper right corner and selecting Controller Settings does? If I put the CacheServer there, does that put it on the NCM? And then I set up the CacheClientService in the Process Group? I'm just guessing here. If I put the CacheServer in the Process Group, will that cause NiFi to make separate caches on each node?

3.) I guess I should also ask how I figure out which server or IP address is the NCM? Probably set in a config file somewhere; that is probably in the documentation.

4.) If I have several different Process Groups running, do I have to worry about port conflicts between groups? If I set up DistributedMapCacheServers in more than one Process Group, should I set the CacheServer and ClientService to use a different port for each Process Group? (Of course each pair must have the same port number.) If I set up a CacheServer by clicking in the upper left-hand corner and selecting Controller Settings, do I want that to use a port different from any CacheServer I set up in a Process Group? Or am I barking up the wrong tree, and in fact I normally only want to set up one CacheServer for the entire cluster?

I understand that is a lot of questions, but I at least tried to be specific to the things I encountered. In any case, I suspect a lot of people might find such answers useful, so if you or anyone else could find the time to answer at least some of them, it would be most appreciated. Thanks again.
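On the port-conflict question, the underlying constraint is just TCP: two listeners on the same host cannot bind the same port, which is why two enabled DistributedMapCacheServer services with the same port on the same node would conflict. A minimal sketch of that behavior (plain sockets, not NiFi code; the OS-assigned port stands in for whatever port the cache server would use):

```python
import socket

# First listener takes a port (the OS picks a free one for the demo).
first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
first.bind(("127.0.0.1", 0))
first.listen(1)
port = first.getsockname()[1]

# A second listener on the same host/port fails with "Address already
# in use" -- the same reason a second DistributedMapCacheServer on the
# same node must be given a different port.
second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    second.bind(("127.0.0.1", port))
    conflict = False
except OSError:
    conflict = True
finally:
    second.close()
    first.close()

print("conflict:", conflict)
```

This is also why servers in different Process Groups on the same cluster each need their own port, while a single shared CacheServer for the whole cluster sidesteps the issue entirely.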
04-09-2019 05:49 PM
@Shu Thanks. That was it. I had assumed that since this was a development instance it was standalone, and I just ignored all the mentions of clusters and Primary Nodes in the other posts. A quick call to nifi-api/flow/cluster/summary confirmed this is a cluster. I do have a few more questions about DetectDuplicate processors, which I'll post in reply to Matt Clarke's post. If you could take a look, I'd appreciate it. Thanks again.
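For anyone else doing this check: the summary endpoint returns JSON that can be inspected as below. This is only a sketch; the `clusterSummary` field names are an assumption based on a typical NiFi response, and the sample payload here is hard-coded rather than fetched, so substitute a real GET against your instance:

```python
import json

# Assumed shape of a response body from GET /nifi-api/flow/cluster/summary.
# Replace this literal with the actual response from your instance.
sample_body = """
{
  "clusterSummary": {
    "connectedNodes": "4 / 4",
    "connectedNodeCount": 4,
    "totalNodeCount": 4,
    "clustered": true,
    "connectedToCluster": true
  }
}
"""

summary = json.loads(sample_body)["clusterSummary"]

def describe(summary):
    """Return a short description of whether the instance is clustered."""
    if summary.get("clustered"):
        return (f"clustered: {summary['connectedNodeCount']} of "
                f"{summary['totalNodeCount']} nodes connected")
    return "standalone instance"

print(describe(summary))
```

A clustered instance like the one described above would report all connected nodes, while a true standalone development instance would show `"clustered": false`.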
04-09-2019 12:46 AM
I sure hope someone can help me, because otherwise I'm sunk. I'm using the ListS3 processor to feed a FetchS3 processor, but when I list my test directory, which contains one file, I get 4 messages in my queue instead of one. (I also get 4 messages with the directory name, but I filter those out.) The 4 messages have the same filename, and I can't find any reason why this is happening. I searched the site but can find no mention of duplicate messages (or FlowFiles) from the ListS3 processor, which is odd because it has been so easy to create this condition. Multiple developers encountered it; it wasn't just one flow. I encountered it when creating flows on a Linux server (version 1.6) and on my local Windows 10 machine (version 1.8). It happens every time, with various buckets.

If I examine the queue in a connection between processors, I can see the 4 messages, and they all have the same filename. The weirdest thing is that the position is the same for each of the 4 messages. When I list the directory, I see 4 messages with position 1 for the directory and 4 messages with position 2 for the file. I had wondered if this meant that it really was only 1 message displayed in a strange way, but I don't think so. First, each message has a unique UUID. Second, after the FetchS3 I do a PutSFTP, and the first message succeeds while the next 3 fail with a failure to rename the dot-file and then a failure to delete it when attempting to clean up. So it must be trying to copy the same file 4 times in parallel, and the last 3 are locked/conflicted and fail. It is strange, though: I changed the Back Pressure to 1 and saw that 4 messages still show up in the queue. I have to start the next processor to get the next group of 4 into the queue. It never stops at just one message.

So, assuming the messages are real, I tried using the DetectDuplicate processor. It took me a few tries to get it set up, but it eventually said it was ready to go. For Cache Entry Identifier I put $(unknown). Seems pretty straightforward: if any message comes in with the same filename before the Age Off Duration has passed, it should be marked as a duplicate. But they always go through as non-duplicate. Always. I tried Cache The Entry Identifier as both true and false; no effect.

I wondered if I had the Distributed Cache Service set up correctly. My first attempt at creating it at the global scope said there was a conflict with the port, so after more reading I saw that I could make services at my Process Group scope. In my Process Group, I created the controller services DistributedMapCacheClientService and DistributedMapCacheServer and enabled them. I've tried them at the default port 4557, and I even tried changing the port on both of them to 4558 to see if that helped. No effect. Is there something I'm missing? It seems pretty straightforward: define one of each type on the same port. I thought it was a little strange that I don't tell the two services about each other. I guess just giving them the same port connects them?

I don't understand why I always get 4 messages per file. I even tried setting the global NiFi settings to 1 for both Maximum Timer Driven Thread Count and Maximum Event Driven Thread Count, but it still always ran with a 4 next to the symbol in the processor that says it is doing something. And I don't understand why my DetectDuplicate doesn't filter out the duplicates. It seems quite simple: put the $(unknown) in that field, and every message marked non-duplicate should have a unique filename. But I get nothing in the duplicate queue. Does anyone have any idea what is going on?
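For what it's worth, the test DetectDuplicate performs boils down to an atomic check-and-record against a shared cache: the first FlowFile to write a given key is non-duplicate, and later FlowFiles with the same key see the existing entry. A minimal Python sketch of that idea (not NiFi's actual code; the string keys here stand in for whatever the Cache Entry Identifier expression evaluates to, and the dict stands in for the shared DistributedMapCacheServer — if each node uses its own localhost cache instead of one shared server, each node makes this decision independently):

```python
# Stand-in for the shared map cache held by the cache server.
cache = {}

def is_duplicate(key):
    """Return True if key was seen before; otherwise record it.

    Roughly a get-and-put-if-absent against the shared cache, which
    is the essence of DetectDuplicate's decision.
    """
    if key in cache:
        return True
    cache[key] = True
    return False

# First sighting of each filename is non-duplicate; repeats are flagged.
results = [is_duplicate(k) for k in ["a.csv", "a.csv", "b.csv", "a.csv"]]
print(results)
```

The key point this illustrates: if the identifier expression evaluates to the same literal string for every FlowFile, everything after the first is a duplicate, but if it evaluates to an empty or per-FlowFile-unique value, nothing ever matches and everything passes as non-duplicate.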
Labels:
Apache NiFi