Member since 09-04-2017 · 19 Posts · 1 Kudos Received · 0 Solutions
05-25-2018
07:02 AM
@Bryan Bende I tried the multiple-files approach before GetFile, and it worked really fast. I have four follow-ups: two observations and two questions.

1) My flow is now "GetFile -> PublishKafka_0_11 -> PutFile" (see attached pic horton.png). GetFile reads the folder containing the 2000 files (the original large file was CSV; the split files no longer have a .csv extension, or any extension at all). Each file is published to the Kafka topic new_profiler_testing and, on success, routed to PutFile, which writes it into a folder called output. This generates the log files in kafka-logs, which I used to inspect the topic. Counting lines in the segment file 00000000000000062200.log gives 1231264, while the output folder contains 1561 files. As you may have noticed in the picture, congestion builds up again at PutFile after PublishKafka_0_11 reports success. I want to verify that all 2 million records are in the Kafka topic. How do I do that? When I open the log file in kafka-logs it also contains gibberish content. Should I open a consumer console and pipe it through wc -l, or is there a way to do this within NiFi?

2) I ran the flow twice to make sure it was working, and the first time something strange happened. The output folder contained files like xzbcl.json, xzbcm.json, xzbcn.json, xzbco.json, xzbcp.json, xzbcq.json, xzbcr.json, xzbcs.json, xzbcy.json, xzbcz.json, and also xzbcl, xzbcm, xzbcn, xzbco, xzbcp, xzbcq, xzbcr, xzbcs, xzbcy, xzbcz, along with the other normal files. When I opened them, they were in JSON format. Here is a snippet:

[{"timestamp":1526590200,"tupvalues":[0.1031287352688188,0.19444490419773347,0.06724761719024923,0.008715105948727752,0.273251449860885,0.09916421288937546,0.12308943665971132,0.017852488055395015,0.05039765141148139,0.11335172723104833,0.03305334889471589,0.041821925451222756,0.08485309865154911,0.09606502178530299,0.06843417769071786,0.024991363178388175,0.2800309262376106,0.1926730050165331,0.2785879089696489,0.211383486088693...]}]

Why did this happen, and how? Also, the log created that time was 785818 lines (00000000000000060645.log). Is it possible that the number of records written into a topic varies between runs? Incidentally, this JSON format is exactly what I would like my Kafka topic to contain, but I have not managed to achieve it, as described in my post https://community.hortonworks.com/questions/191753/csv-to-json-conversion-error.html?childToView=191615#answer-191615

3) If NiFi reads ten files from the same folder, in what order is the data read? Is it read one file after another, in order, and pushed to Kafka, or is it sent randomly? I ask because I have a Kafka Streams program that groups by timestamp values, e.g. averaging the CPU usage of today's 10am-11am data across all ten files.

4) Is there a way to time the output into the Kafka topic? I would like to know how long GetFile takes to read the files and publish them to the topic completely, until it holds all 2 million records.
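On question 1: the segment files under kafka-logs are Kafka's binary log format, so the "gibberish" bytes around each message are expected and wc -l on them is unreliable. Piping a console consumer through wc -l (e.g. `kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic new_profiler_testing --from-beginning | wc -l`) does work. To cross-check against the files NiFi wrote, a minimal Python sketch (the folder layout is assumed from the description above; adjust to the real one):

```python
import glob
import json

def count_records(folder):
    """Count records across the files in a folder.

    Assumes files starting with '[' hold a JSON array of objects
    (as in the snippet above) and other files hold one record per
    line -- both assumptions about the layout, not guarantees.
    """
    total = 0
    for path in glob.glob(folder + "/*"):
        with open(path) as f:
            text = f.read().strip()
        if text.startswith("["):          # JSON-array file
            total += len(json.loads(text))
        else:                             # plain file, one record per line
            total += sum(1 for line in text.splitlines() if line)
    return total
```

Comparing this count for the input folder against the consumer-side count tells you whether all 2 million records reached the topic.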
05-17-2018
03:43 PM
1 Kudo
The Run Schedule controls when the NiFi framework executes a processor. The default of timer driven, 0 seconds, means execute as fast as possible whenever there is data available in the incoming queue; if no data is there, the processor doesn't execute. The rate of the data also depends on what the processor does during one execution. For example, let's say a queue has 100 flow files in it and you set the processor to run every 5 minutes. Some processors grab a batch of files during one execution, so even though the processor executes once, it may grab 50 of those flow files. It also depends on whether your flow files have multiple logical messages in the content. If you have 1 record per flow file, and the processor only grabs 1 flow file at a time (most take only one), then the Run Schedule does control the rate. You can look at the ControlRate processor as well.
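The batching arithmetic above can be sketched as follows (a deliberate simplification: real NiFi scheduling also depends on concurrent tasks, back pressure, and per-processor batching behavior):

```python
def executions_to_drain(queued, batch_size, run_schedule_sec):
    """Scheduled executions (and elapsed seconds) needed to drain a
    queue when each execution grabs up to batch_size flow files.
    A simplification of NiFi's timer-driven scheduling."""
    runs = -(-queued // batch_size)      # ceiling division
    return runs, runs * run_schedule_sec

# 100 queued flow files, 50 grabbed per execution, 5-minute schedule:
# drained in 2 executions; with a batch size of 1 it takes 100 executions,
# so the Run Schedule then directly controls the rate.
```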
05-21-2018
04:01 AM
Hi @Matt Burgess, the output is still on one line instead of multiple lines, even though I have tried what you described above. I used ReplaceText with this search value: (\[)(\{\"timestamp\"\:15123[0-9]+),(\"tupvalues\"\:\[([0-9]+\.[a-zA-Z0-9-]+),([0-9]+\.[a-zA-Z0-9-]+)([0-9]+\.[a-zA-Z0-9-]+)([0-9]+\.[a-zA-Z0-9-]+)([0-9]+\.[a-zA-Z0-9-]+)([0-9]+\.[a-zA-Z0-9-]+)([0-9]+\.[a-zA-Z0-9-]+)([0-9]+\.[a-zA-Z0-9-]+)([0-9]+\.[a-zA-Z0-9-]+)([0-9]+\.[a-zA-Z0-9-]+)([0-9]+\.[a-zA-Z0-9-]+)([0-9]+\.[a-zA-Z0-9-]+)([0-9]+\.[a-zA-Z0-9-]+)([0-9]+\.[a-zA-Z0-9-]+)([0-9]+\.[a-zA-Z0-9-]+)([0-9]+\.[a-zA-Z0-9-]+)([0-9]+\.[a-zA-Z0-9-]+)([0-9]+\.[a-zA-Z0-9-]+)([0-9]+\.[a-zA-Z0-9-]+)([0-9]+\.[a-zA-Z0-9-]+)([0-9]+\.[a-zA-Z0-9-]+)([0-9]+\.[a-zA-Z0-9-]+)([0-9]+\.[a-zA-Z0-9-]+)([0-9]+\.[a-zA-Z0-9-]+)([0-9]+\.[a-zA-Z0-9-]+)([0-9]+\.[a-zA-Z0-9-]+)([0-9]+\.[a-zA-Z0-9-]+)([0-9]+\.[a-zA-Z0-9-]+)([0-9]+\.[a-zA-Z0-9-]+)([0-9]+\.[a-zA-Z0-9-]+)([0-9]+\.[a-zA-Z0-9-]+)([0-9]+\.[a-zA-Z0-9-]+)([0-9]+\.[a-zA-Z0-9-]+)\]\})(\,) with the replacement value $2, $3, followed by SplitText to split line by line. But the output is still the same. I also tried the solution you gave in https://community.hortonworks.com/questions/109064/nifi-replace-text-how-t0-replace-string-with.html and substituted the expression [\[\]](\{|\}), but that produces output with no square brackets at the beginning or inside the array. I know it's been almost a week, but I still have not gotten the hang of it.
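An alternative to enumerating every field group in one giant ReplaceText pattern: since the content is a JSON array, a much smaller search value such as `\},\s*\{` replaced with `}\n{` (plus trimming the outer `[` and `]`) splits it into one object per line, as long as the objects contain no nested objects (here tupvalues is an array of numbers, so they don't). The intended transformation, sketched in plain Python rather than a NiFi processor:

```python
import json

def array_to_lines(text):
    """Split a one-line JSON array like [{...},{...}] into one JSON
    object per line. Assumes the content parses as a JSON array of
    objects -- the format shown in the snippet earlier in the thread."""
    records = json.loads(text)
    return "\n".join(json.dumps(r, separators=(",", ":")) for r in records)
```

Each output line is then a standalone record that SplitText (or a record-oriented processor) can route individually.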
04-16-2018
04:47 AM
I tried something in the meanwhile as well. Here is the screenshot of the flow. In this flow I simply split the records using a regular expression and then extracted what was needed via the success connectors. I will also try one of the methods you mentioned and report back here. @Shu