Support Questions

Find answers, ask questions, and share your expertise

NiFi: JSON to CSV to Hive

avatar
Expert Contributor

I have a use case where JSON files are read from an API, transformed to CSV and imported to Hive tables, however my flow fails at the replace text processor. Can you give some advice on the configuration of the processor or on where my approach fails?

InvokeHTTP --> EvaluateJsonPath --> ReplaceText --> MergeContent --> UpdateAttribute --> PutHDFS

My flow does several HTTP calls with InvokeHTTP (Each call with different ID), extracts attributes from each JSON that is returned (each JSON is unique) and then creates the csv's in the ReplaceText processor as following:

${attribute1},${attribute2},${attribute3},${attribute4},${attribute5},${attribute6},${attribute7}

However after the MergeContent processor inthe merged CSV there is really a lot of duplicate data while all incoming JSONs contain unique data.

45405-capture.png

45404-repltext.png

45403-capture.png

45401-capture.png

1 ACCEPTED SOLUTION

avatar
Master Guru
@balalaika

I suspect duplicates are from Replace Text processor you have configured

Evaluation Mode

Line-by-Line

That means let's take the your json having more than 1 new line, Replace text processor is going to be Replace the whole line with

Replacement Value

${attribute1}${attribute2}${attribute3}

Example:-

Input:-

{
"features": [{
"feature": {
"paths": [[[214985.27600000054,
427573.33100000024],
[215011.98900000006,
427568.84200000018],
[215035.35300000012,
427565.00499999896],
[215128.48900000006,
427549.4290000014],
[215134.43699999899,
427548.65599999949],
[215150.86800000072,
427546.87900000066],
[215179.33199999854,
427544.19799999893]]]
},
"attributes": {
"attribute1": "value",
"attribute2": "value",
"attribute3": "value",
"attribute4": "value"

}
}]
}

In this input json message we are having 27 lines and My evaluate Json Path configs are same as you mentioned in comments.

44436-eval-json-attribute.png

Replace Text Configs:-

44437-replace.png

Output:-

44438-output.png

As output we got 27 lines because we are having evaluation mode as line by line.

If you change the Evaluation mode to Entire text then

Output:-

44439-output.png

And you are having json message in one line i.e

{"features":[{"feature":{"paths":[[[214985.27600000054,427573.33100000024],[215011.98900000006,427568.84200000018],[215035.35300000012,427565.00499999896],[215128.48900000006,427549.4290000014],[215134.43699999899,427548.65599999949],[215150.86800000072,427546.87900000066],[215179.33199999854,427544.19799999893]]]},"attributes":{"attribute1":"value","attribute2":"value","attribute3":"value","attribute4":"value",}}]}

Then if you keep replace text configs as line by line or entire text it doesn't matter because we are having just one line as input to the processor and we will get result from replace text as

44440-output.png

Try to change the configs as per your Input Json Message and run again the processor.

Let us know if the processor still resulting duplicate data.

View solution in original post

6 REPLIES 6

avatar
Expert Contributor

i ve no idea why my screenshots are doubleposted, whatever i tried to fix it fails 🙂

avatar
Master Guru

Can you share an example or two of incoming JSON data, your config for EvaluateJSONPath, and an example of the flow file after MergeContent (perhaps setting number of entries much lower to fit here)?

avatar
Expert Contributor

Hi @Matt Burgess, here is an example of the incoming JSON files, all have same attributes:

{
 "features": [
  {
   "feature": {
    "paths": [
     [
      [
       214985.27600000054,
       427573.33100000024
      ],
      [
       215011.98900000006,
       427568.84200000018
      ],
      [
       215035.35300000012,
       427565.00499999896
      ],
      [
       215128.48900000006,
       427549.4290000014
      ],
      [
       215134.43699999899,
       427548.65599999949
      ],
      [
       215150.86800000072,
       427546.87900000066
      ],
      [
       215179.33199999854,
       427544.19799999893
      ]
     ]
    ]
   },
   "attributes": {
    "attribute1": "value",
    "attribute2": "value",
    "attribute3": "value",
    "attribute4": "value",
   }
  }
 ]
}

EvaluateJSONpath:

44434-path.png

Where i add properties for each attribute i want to parse:

attribute1: $.features[0].attributes.attribute1 etc. etc.

ReplaceText:

44435-repltext.png

I think something goes wrong in my configuration here, because even before the MergeContent the single CSVs created per JSON file contain hundreds of duplicate rows, whereas it should be just one row per CSV that they are gonna be later merged into a big CSV file.

avatar
Master Guru
@balalaika

I suspect duplicates are from Replace Text processor you have configured

Evaluation Mode

Line-by-Line

That means let's take the your json having more than 1 new line, Replace text processor is going to be Replace the whole line with

Replacement Value

${attribute1}${attribute2}${attribute3}

Example:-

Input:-

{
"features": [{
"feature": {
"paths": [[[214985.27600000054,
427573.33100000024],
[215011.98900000006,
427568.84200000018],
[215035.35300000012,
427565.00499999896],
[215128.48900000006,
427549.4290000014],
[215134.43699999899,
427548.65599999949],
[215150.86800000072,
427546.87900000066],
[215179.33199999854,
427544.19799999893]]]
},
"attributes": {
"attribute1": "value",
"attribute2": "value",
"attribute3": "value",
"attribute4": "value"

}
}]
}

In this input json message we are having 27 lines and My evaluate Json Path configs are same as you mentioned in comments.

44436-eval-json-attribute.png

Replace Text Configs:-

44437-replace.png

Output:-

44438-output.png

As output we got 27 lines because we are having evaluation mode as line by line.

If you change the Evaluation mode to Entire text then

Output:-

44439-output.png

And you are having json message in one line i.e

{"features":[{"feature":{"paths":[[[214985.27600000054,427573.33100000024],[215011.98900000006,427568.84200000018],[215035.35300000012,427565.00499999896],[215128.48900000006,427549.4290000014],[215134.43699999899,427548.65599999949],[215150.86800000072,427546.87900000066],[215179.33199999854,427544.19799999893]]]},"attributes":{"attribute1":"value","attribute2":"value","attribute3":"value","attribute4":"value",}}]}

Then if you keep replace text configs as line by line or entire text it doesn't matter because we are having just one line as input to the processor and we will get result from replace text as

44440-output.png

Try to change the configs as per your Input Json Message and run again the processor.

Let us know if the processor still resulting duplicate data.

avatar
Expert Contributor

Hi @Shu yes that was exactly the problem, now the individual CSVs are created just fine but in the meantime another problem occured. When the individual CSVs are merged with the MergedContent processor then the Merged CSV is all in one line instead of seperate lines. Is there a way to bypass this?

MergeContent:

44442-merge.png

avatar
Master Guru

@balalaika
For that case you need to specify Demarcator property as

Shift+enter
Configs:-

44443-merge.png

For merge content reference

https://community.hortonworks.com/questions/149047/nifi-how-to-handle-with-mergecontent-processor.ht...