Member since 09-04-2017 · 19 Posts · 1 Kudos Received · 0 Solutions
05-25-2018
07:02 AM
@Bryan Bende I tried splitting into multiple files before GetFile, and it worked really fast. I have four follow-up questions: two are observations and two are research questions.

1) Now that my flow is "GetFile -> PublishKafka_0_11 -> PutFile" (see the attached picture horton.png), GetFile reads the folder containing 2000 files (the original large file was a CSV; these files no longer seem to have a .csv extension, or any extension at all). Each file is published straight to the Kafka topic new_profiler_testing and, on success, routed to PutFile, which writes it into a folder called output. This generates log files under kafka-logs, which I used to check the topic new_profiler_testing. If I count the lines in the segment file 00000000000000062200.log I get 1231264, while the number of files written to the output folder is 1561. You may have noticed from the picture that congestion builds up again in front of PutFile after PublishKafka_0_11 reports success. I want to verify that all 2 million records are in the Kafka topic; how do I do that? When I open the log file in kafka-logs it also contains gibberish content. Do you think I should open a consumer console in parallel and pipe it through wc -l, or is there a way I can do this in NiFi?

2) I ran this flow twice to make sure it works, and the first time something strange happened. The output folder contained files like xzbcl.json, xzbcm.json, xzbcn.json, xzbco.json, xzbcp.json, xzbcq.json, xzbcr.json, xzbcs.json, xzbcy.json, xzbcz.json and also xzbcl, xzbcm, xzbcn, xzbco, xzbcp, xzbcq, xzbcr, xzbcs, xzbcy, xzbcz, along with the other normal files. When I opened them they were in JSON format. Here is a snippet:

[{"timestamp":1526590200,"tupvalues":[0.1031287352688188,0.19444490419773347,0.06724761719024923,0.008715105948727752,0.273251449860885,0.09916421288937546,0.12308943665971132,0.017852488055395015,0.05039765141148139,0.11335172723104833,0.03305334889471589,0.041821925451222756,0.08485309865154911,0.09606502178530299,0.06843417769071786,0.024991363178388175,0.2800309262376106,0.1926730050165331,0.2785879089696489,0.211383486088693...]}]

Why did this happen, and how? Also, the log segment created that time, 00000000000000060645.log, had 785818 lines. Is it possible that the number of records written into a topic varies over time and is susceptible to change? This JSON format is also what I would ideally want my Kafka topic to be in, but I have not been able to get that working, as described in my post https://community.hortonworks.com/questions/191753/csv-to-json-conversion-error.html?childToView=191615#answer-191615

3) If NiFi reads ten files from the same folder, how is the data read? Is it read one after the other, in order, and pushed to Kafka, or is it sent randomly? I want to know because I have a program written with Kafka Streams that needs to group by timestamp values, e.g. today's 10am-11am data from all ten folders averaged for their CPU usage.

4) Is there a way to time the output into the Kafka topic? I would like to know how long it takes for GetFile to read the files and send them to the Kafka topic until it contains all 2 million records.
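One way to verify the record count without reading the raw segment files under kafka-logs (which mix binary framing with the payload) is to compare the beginning and end offsets of every partition of the topic. A minimal sketch of that idea, assuming the kafka-python client is installed and a broker is reachable at localhost:9092 (both assumptions, not from the post):

```python
# count_topic_records.py -- sum (end offset - beginning offset) over all partitions.
# Assumes: pip install kafka-python, broker at localhost:9092 (adjust as needed).
from kafka import KafkaConsumer, TopicPartition

TOPIC = "new_profiler_testing"

consumer = KafkaConsumer(bootstrap_servers="localhost:9092")
partitions = [TopicPartition(TOPIC, p) for p in consumer.partitions_for_topic(TOPIC)]

beginning = consumer.beginning_offsets(partitions)  # earliest retained offset per partition
end = consumer.end_offsets(partitions)              # next offset to be written per partition

total = sum(end[tp] - beginning[tp] for tp in partitions)
print(f"{TOPIC}: {total} records")
consumer.close()
```

This counts what the broker currently retains, so it only equals the number published if retention or cleanup has not removed any segments.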
05-23-2018
02:29 PM
hortonworks2.png Here is the picture of the node A processor that I have attached. Ideally I want one input topic to receive 20 million records from a local file sent via a NiFi processor. I think your idea of splitting it into chunks across multiple files should work too.
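For reference, splitting the large file into fixed-size chunks before GetFile picks them up can be done outside NiFi with a few lines of script. A minimal sketch in Python, where the input path, output directory, and chunk size are all placeholders, not values from the post:

```python
# split_csv.py -- split a large newline-delimited file into chunks of LINES_PER_CHUNK lines.
# Paths and chunk size are placeholders; adjust to the real file and folder.
import os

SRC = "records.csv"          # hypothetical 1.2 GB input file
OUT_DIR = "chunks"           # folder GetFile would watch
LINES_PER_CHUNK = 10_000

os.makedirs(OUT_DIR, exist_ok=True)
chunk, count = None, 0
with open(SRC, "r") as src:
    for i, line in enumerate(src):
        if i % LINES_PER_CHUNK == 0:          # start a new chunk file
            if chunk:
                chunk.close()
            chunk = open(os.path.join(OUT_DIR, f"part_{count:05d}.csv"), "w")
            count += 1
        chunk.write(line)
if chunk:
    chunk.close()
```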
05-23-2018
02:07 PM
I have waited overnight, and it is still stuck in the same state. Should I increase the back pressure threshold from the 1 GB you mentioned to 2 GB and then check?
05-23-2018
04:19 AM
hortonworks-1.png Hey @Bryan Bende, thanks for replying. This is the flow in the image; I thought you might be able to tell better if you see it. Before taking the dump, I tried to start the PublishKafka processor but could not, because I get the error "No eligible components are selected. Please select the components to be started and ensure they are no longer running." The start option is also not available when I right-click on the processor. I still took the dump as you asked; it is attached here: dump.txt. Please suggest a method to send the data.
05-22-2018
09:10 AM
I have a node "A" with a Remote Process Group that reads a file of roughly 1.2 GB containing 20 million records, and on node "B" this file is received via an input port and passed on to PublishKafka_0_11. However, as soon as I do this, the data is sent from A and received at B, but it remains permanently queued in front of the PublishKafka processor. To check that my flow was right, I tried a 53.4 kB file instead, and that is sent to the processor successfully and into the topic named "INPUT_TOPIC". Here are the problems: 1) With the 1.2 GB file, the data never seems to reach the topic. 2) After the 1.2 GB file, the input port hangs or stops responding, and the PublishKafka_0_11 processor also stops responding. 3) I used the cat command manually to write into the topic "INPUT_TOPIC" and read it back with the console consumer. However, when I check the logs for INPUT_TOPIC, there are two log segments containing different text (almost binary gibberish), and wc -l reports different counts for each, adding up to more than 20 million lines. I have tried deleting the topic and starting afresh as well, but the output is the same. Can someone help me with this situation? My purpose is to load a Kafka input topic with my 20 million records, no more, no less than 20 million.
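Since the goal is to land exactly 20 million records in the topic, one way to take the stuck flow out of the picture while debugging is a small producer that reads the file line by line and publishes each line as one Kafka message, so the published count equals the line count. A minimal sketch, assuming kafka-python, a broker at localhost:9092, and a newline-delimited input file (all assumptions, not details from the post):

```python
# load_topic.py -- publish each non-empty line of a file as one message to INPUT_TOPIC.
# Assumes: pip install kafka-python, broker at localhost:9092, newline-delimited records.
from kafka import KafkaProducer

TOPIC = "INPUT_TOPIC"
SRC = "records.csv"  # hypothetical path to the 20-million-record file

producer = KafkaProducer(bootstrap_servers="localhost:9092")
sent = 0
with open(SRC, "rb") as f:
    for line in f:
        line = line.rstrip(b"\n")
        if line:                          # skip empty lines so the count stays exact
            producer.send(TOPIC, value=line)
            sent += 1

producer.flush()   # block until all buffered messages are delivered
producer.close()
print(f"published {sent} messages to {TOPIC}")
```

Comparing `sent` with the offset-based count on the consumer side then shows whether anything was dropped, independent of what the raw log segments look like.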
Labels:
- Apache Kafka
- Apache NiFi
05-21-2018
04:01 AM
Hi @Matt Burgess, the output is still on one line instead of multiple lines, even though I have tried what you mentioned above. I used ReplaceText with this regex:

(\[)(\{\"timestamp\"\:15123[0-9]+),(\"tupvalues\"\:\[([0-9]+\.[a-zA-Z0-9-]+),([0-9]+\.[a-zA-Z0-9-]+)([0-9]+\.[a-zA-Z0-9-]+)([0-9]+\.[a-zA-Z0-9-]+)([0-9]+\.[a-zA-Z0-9-]+)([0-9]+\.[a-zA-Z0-9-]+)([0-9]+\.[a-zA-Z0-9-]+)([0-9]+\.[a-zA-Z0-9-]+)([0-9]+\.[a-zA-Z0-9-]+)([0-9]+\.[a-zA-Z0-9-]+)([0-9]+\.[a-zA-Z0-9-]+)([0-9]+\.[a-zA-Z0-9-]+)([0-9]+\.[a-zA-Z0-9-]+)([0-9]+\.[a-zA-Z0-9-]+)([0-9]+\.[a-zA-Z0-9-]+)([0-9]+\.[a-zA-Z0-9-]+)([0-9]+\.[a-zA-Z0-9-]+)([0-9]+\.[a-zA-Z0-9-]+)([0-9]+\.[a-zA-Z0-9-]+)([0-9]+\.[a-zA-Z0-9-]+)([0-9]+\.[a-zA-Z0-9-]+)([0-9]+\.[a-zA-Z0-9-]+)([0-9]+\.[a-zA-Z0-9-]+)([0-9]+\.[a-zA-Z0-9-]+)([0-9]+\.[a-zA-Z0-9-]+)([0-9]+\.[a-zA-Z0-9-]+)([0-9]+\.[a-zA-Z0-9-]+)([0-9]+\.[a-zA-Z0-9-]+)([0-9]+\.[a-zA-Z0-9-]+)([0-9]+\.[a-zA-Z0-9-]+)([0-9]+\.[a-zA-Z0-9-]+)([0-9]+\.[a-zA-Z0-9-]+)([0-9]+\.[a-zA-Z0-9-]+)([0-9]+\.[a-zA-Z0-9-]+)\]\})(\,)

with the replacement value $2, $3, followed by SplitText to split the content line by line. But the output is still the same. I also tried the solution you gave in the post https://community.hortonworks.com/questions/109064/nifi-replace-text-how-t0-replace-string-with.html and tried substituting the expression [\[\]](\{|\}), but that gives me an output with no square brackets at the beginning or inside the array. I know it has been almost a week, but I still have not got the hang of it.
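For what the ReplaceText step is trying to do, the same transformation can be expressed with a much simpler pair of substitutions: strip the enclosing array brackets and insert a newline between adjacent records. A minimal Python sketch of that idea, using an invented sample payload; the same two patterns could go into a ReplaceText Search Value / Replacement Value pair, and the approach assumes the substring "},{" never appears inside a record:

```python
# array_to_lines.py -- turn [{...},{...},...] into one JSON object per line using regex.
import re

payload = '[{"timestamp":1512312021,"tupvalues":[0.8,0.0]},{"timestamp":1512312022,"tupvalues":[207.8,0.2]}]'

body = re.sub(r"^\s*\[|\]\s*$", "", payload)   # drop only the outer [ and ]
lines = re.sub(r"\},\s*\{", "}\n{", body)      # newline between consecutive records
print(lines)
# {"timestamp":1512312021,"tupvalues":[0.8,0.0]}
# {"timestamp":1512312022,"tupvalues":[207.8,0.2]}
```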
05-17-2018
12:39 PM
@Matt Burgess I was able to do this and it worked perfectly. However, there is just one small request. The data I finally receive in PutFile is all on one line. I tried to insert a newline after each record ends, but the Message Demarcator property in PublishKafka_0_11 (entering a newline there via Shift+Enter) is not helping my situation either. I figured it is because the content is one JSON array, like [{"timestamp":1512.., "tupvalues":[1,2,3,4...]}, {"timestamp":1512.., "tupvalues":[1,2,3,4...]}, {"timestamp":1512.., "tupvalues":[1,2,3,4...]}.....], with the closing square bracket right at the very end. Whereas the required output is one record per line, somewhat like this:

{"timestamp":"1512312021","tupvalues":[0.8,0.0,18244.0,3176.0,0.0,122.0,11.0,0.0,0.0,100052.0,1783.0,4.0,59.0,1.0,3252224.0,1.8681856E7,2777088.0,999424.0,0.0,524288.0,0.0,487424.0,740352.0,0.0,1.0,0.04,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0]}
{"timestamp":"1512312022","tupvalues":[207.8,0.2,3778460.0,309000.0,0.0,22342.0,27.0,0.0,0.0,1.06732936E8,25623.0,36.0,749.0,110.0,3.19459328E8,3.87224371E9,1.17956608E8,7110656.0,0.0,2.87654298E9,0.0,2.0957184E8,2.46372352E8,0.0,3.0,1.95,1.23,0.0,0.0,3.0,6.0,0.0,0.0,0.0,0.0]}

Any suggestions? Do you think SplitJson or SplitRecord should now be introduced?
05-17-2018
12:31 PM
I would like to know whether Run Schedule stands for the rate at which the processor publishes or writes into another processor, like PutFile. I am publishing to a Kafka topic from which a Kafka Streams application is consuming, and so on. For performance testing, I would like to fix the rate at which records are written into the topic, for example 100 records (log lines) per second. Can anybody suggest how?
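Run Schedule controls how often the processor is triggered, not a records-per-second rate, so a fixed publish rate for a performance test usually has to be throttled explicitly (NiFi's ControlRate processor is the in-flow way to do that). For comparison, a minimal sketch of a throttled publish loop in Python, assuming kafka-python, a broker at localhost:9092, and an invented topic and file name:

```python
# rate_limited_publish.py -- publish roughly RATE records per second to a topic.
# Assumes: pip install kafka-python, broker at localhost:9092; topic/file names are made up.
import time
from kafka import KafkaProducer

TOPIC = "perf_test_topic"   # hypothetical topic for the performance test
RATE = 100                  # target records per second
INTERVAL = 1.0 / RATE

producer = KafkaProducer(bootstrap_servers="localhost:9092")
with open("records.csv", "rb") as f:          # hypothetical newline-delimited input
    next_send = time.monotonic()
    for line in f:
        producer.send(TOPIC, value=line.rstrip(b"\n"))
        next_send += INTERVAL
        delay = next_send - time.monotonic()
        if delay > 0:
            time.sleep(delay)                 # pace sends to roughly RATE per second
producer.flush()
producer.close()
```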
Labels:
- Apache Kafka
- Apache NiFi
05-12-2018
01:07 PM
1 Kudo
I have a CSV file of the format: 1512340821, 26,576.09, 39824, 989459.009.. and so on, 35 fields in total. Each of these columns is a long or double in the Avro schema. I have used the ConvertRecord processor in NiFi, which first reads the data using an Avro schema and then produces the JSON-format data. My goal is to have the JSON come out like the following:

{"timestamp":"1512312024","tupvalues":[112.5,0.0,1872296.0,134760.0,0.0,7134.0,19.0,0.0,0.0,3.8136152E7,13703.0,18.0,111.0,37.0,1.38252288E8,1.91762842E9,5.9564032E7,4055040.0,0.0,1.41528269E9,0.0,8.0539648E7,9.5470592E7,0.0,2.0,0.76,0.44,0.0,0.0,2.0,2.0,0.0,0.0,0.0,0.0]}

My original data does not have headers, but I expect the output to be of the format {key, value}, where the key is the timestamp and the values are the other columns. Here is the Avro schema that I put in the Avro schema registry:

{
  "type": "record",
  "namespace": "testavro.schema",
  "name": "test",
  "fields": [
    { "type": "double", "name": "timestamp" },
    { "name": "tupvalues", "type": { "type": "array", "items": "double" } }
  ]
}

I used this website -- https://json-schema-validator.herokuapp.com/avro.jsp -- to check the schema and it reports success. But when it is applied in the Avro registry, the data is not picked up. I get an error along these lines: "Cannot create value [26] of type java.lang.String to object array for field tupvalues." Any sort of help is appreciated. I am a newbie to writing Avro schemas, and I have a feeling that is where I am going wrong.
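The error suggests the reader is handing the remaining CSV columns to the schema one at a time as strings (the "[26]" in the message matches the second value of the sample row), whereas the schema expects a single field holding an array of 34 doubles, so the flat row and the nested schema do not line up directly. To illustrate the shape the schema describes, here is a minimal Python sketch that builds it from one CSV row by hand; the file name is a placeholder and this is not the NiFi mechanism, just the target structure:

```python
# csv_row_to_record.py -- build {"timestamp": ..., "tupvalues": [...]} from a flat CSV row.
# Illustrates the nested shape the Avro schema describes; file name is a placeholder.
import csv
import json

with open("input.csv", newline="") as f:
    for row in csv.reader(f):
        values = [float(v) for v in row]      # 35 numeric columns, no header
        record = {
            "timestamp": values[0],           # first column -> timestamp
            "tupvalues": values[1:],          # remaining 34 columns -> array
        }
        print(json.dumps(record))
```

In other words, the flat columns seem to need folding into one array field somewhere in the flow; the schema itself only describes the nested result.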
Labels:
- Apache NiFi
04-16-2018
04:47 AM
I too tried something in the meanwhile; here is the screenshot of the flow. In this flow I simply split the text using a regular expression and then extracted what was needed via the success connections. I will surely try one of the methods you mentioned as well and get back here. @Shu