Created 05-06-2019 09:32 PM
We have a data flow in which standardized transaction logs are generated as XML files in a particular directory on a VM. We want to publish these logs to a Kafka topic. We are using the following NiFi/MiNiFi flow:
ListFile ---> FetchFile (completion strategy - Move File) ---> PublishKafka
Each transaction translates to a single XML file. At peak hours, over a million files are generated per hour, i.e. approximately 300 files per second. Our goal is to achieve the above flow using MiNiFi.
ListFile offers two listing strategies: Tracking Timestamps and Tracking Entities. Initially, the data flow was created with the Tracking Timestamps option selected. With it, we observed that some files were never picked up by NiFi / MiNiFi. The files that were picked up were eventually moved to a different directory, but 2% - 5% of the files were not picked up (and thus not published to the Kafka topic). The behavior was observed both in NiFi and with the corresponding YAML file in MiNiFi.
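One possible explanation for the missed files, which nothing in this thread confirms, is that some files become visible with a last-modified time older than the newest timestamp already recorded. The sketch below is a deliberately simplified model of the two strategies, not ListFile's actual implementation; `FileEntry`, `list_by_timestamp`, `list_by_entity` and the sample file names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileEntry:
    name: str
    mtime: int  # last-modified time in arbitrary integer ticks


def list_by_timestamp(scans):
    """Simplified model: remember only the newest mtime seen so far and
    list files whose mtime is strictly newer than it."""
    latest = -1
    listed = []
    for scan in scans:
        fresh = [f for f in scan if f.mtime > latest]
        listed.extend(fresh)
        if fresh:
            latest = max(f.mtime for f in fresh)
    return listed


def list_by_entity(scans):
    """Simplified model: remember every (name, mtime) pair already listed
    and list anything not seen before."""
    seen = set()
    listed = []
    for scan in scans:
        for f in scan:
            key = (f.name, f.mtime)
            if key not in seen:
                seen.add(key)
                listed.append(f)
    return listed


# Each inner list holds the files that became visible at one directory scan.
scans = [
    [FileEntry("tx-001.xml", 100), FileEntry("tx-002.xml", 101)],
    # tx-003.xml shows up late but carries an mtime older than tx-002.xml
    [FileEntry("tx-003.xml", 100), FileEntry("tx-004.xml", 102)],
]

missed = set(list_by_entity(scans)) - set(list_by_timestamp(scans))
print(sorted(f.name for f in missed))
```

In this model the timestamp-based listing skips `tx-003.xml`, while the entity-based listing picks up all four files, which would be consistent with the difference we saw between the two strategies.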
We then tried the Tracking Entities option in ListFile's Listing Strategy, set a DistributedMapCacheClientService in the Entity Tracking State Cache property, and configured a DistributedMapCacheServer with default ports. This configuration worked in the NiFi flow: we tested it by generating a million files over the span of one hour, and all file contents were published to the Kafka topic. We then attempted the same by converting the NiFi flow to a MiNiFi YAML file, and there it failed with errors such as DistributedMapCacheClientService being unable to connect to localhost:4557 (the default hostname and port for DistributedMapCacheServer). We tried to create the controller service using the REST API, but that seems to work only in NiFi, not in MiNiFi.
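To tell a missing cache server apart from a misconfigured client, a plain TCP check against the host and port from the error message can help. This is a minimal sketch using only the Python standard library; `port_is_open` is a hypothetical helper, and it only shows whether something accepts connections on that port, not that the listener is actually a DistributedMapCacheServer.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and name-resolution failures.
        return False


if __name__ == "__main__":
    # localhost:4557 is the address reported in the MiNiFi error above.
    print(port_is_open("localhost", 4557))
```

If this prints `False` on the MiNiFi host, nothing is listening on that port, which would match the connection error from the client service.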
So my questions are:
1) Is there a way to configure and start the DistributedMapCacheServer controller service in a MiNiFi instance?
2) Is there a way to host DistributedMapCacheServer separately (by running some command on its NAR file)?
3) If there is a different approach to transferring file contents to Kafka without losing any transaction files, kindly suggest it.
Created on 05-07-2019 01:32 PM - edited 08-17-2019 03:35 PM
1. The DistributedMapCacheServer (DMC server) has no direct linkage to a dataflow built in NiFi. Thus, when you create a template of your flow in NiFi for the purpose of generating your MiNiFi YAML file, the DMC server will not be included in the template. Only the DMC client service will be included. I have not tried to manually add the DMC server to the YAML file, but it is likely possible. MiNiFi does not have a REST API where you can send commands to add additional components like you can in NiFi.
2. The only way to host a DMC server is via a NiFi instance. There is no way to execute a NAR file.
3. A major limitation of using the DMC server is its lack of HA. If the NiFi instance hosting the DMC server crashes, you lose all your cached data (assuming the crash is not recoverable). The recommended path is to configure your ListFile processor to use one of the other available external cache servers that offer HA capability.
These alternate cache services are set up independently of NiFi or MiNiFi and will offer you what you need to support Entity Tracking in your ListFile processor running on MiNiFi.
Thank you,
Matt
If you found that this answer addressed your question, please take a moment to log in and click the "ACCEPT" link.