Support Questions

foivos · ‎12-13-2017

I have a use case where JSON files are read from an API, transformed to CSV and imported into Hive tables. However, my flow fails at the ReplaceText processor. Can you give some advice on the configuration of the processor, or on where my approach goes wrong?

InvokeHTTP --> EvaluateJsonPath --> ReplaceText --> MergeContent --> UpdateAttribute --> PutHDFS

My flow makes several HTTP calls with InvokeHTTP (each call with a different ID), extracts attributes from each JSON that is returned (each JSON is unique) and then creates the CSVs in the ReplaceText processor as follows:

${attribute1},${attribute2},${attribute3},${attribute4},${attribute5},${attribute6},${attribute7}

However, after the MergeContent processor the merged CSV contains a lot of duplicate data, even though all incoming JSONs contain unique data.

Shu_ashu · ‎12-14-2017

@balalaika

I suspect the duplicates come from the ReplaceText processor, where you have configured Evaluation Mode as Line-by-Line.

That means that if your JSON spans more than one line, ReplaceText replaces every line with the Replacement Value:

${attribute1}${attribute2}${attribute3}

Example:-

Input:-

{
"features": [{
"feature": {
"paths": [[[214985.27600000054,
427573.33100000024],
[215011.98900000006,
427568.84200000018],
[215035.35300000012,
427565.00499999896],
[215128.48900000006,
427549.4290000014],
[215134.43699999899,
427548.65599999949],
[215150.86800000072,
427546.87900000066],
[215179.33199999854,
427544.19799999893]]]
},
"attributes": {
"attribute1": "value",
"attribute2": "value",
"attribute3": "value",
"attribute4": "value"

}
}]
}

This input JSON message has 27 lines, and my EvaluateJsonPath configuration is the same as the one you mentioned in the comments.

Replace Text Configs:-

Output:-

The output has 27 lines because the Evaluation Mode is Line-by-Line: each input line is replaced with the same value, which is where the duplicate rows come from.

If you change the Evaluation Mode to Entire text, then:

Output:-

And if your JSON message is all on one line, i.e.

{"features":[{"feature":{"paths":[[[214985.27600000054,427573.33100000024],[215011.98900000006,427568.84200000018],[215035.35300000012,427565.00499999896],[215128.48900000006,427549.4290000014],[215134.43699999899,427548.65599999949],[215150.86800000072,427546.87900000066],[215179.33199999854,427544.19799999893]]]},"attributes":{"attribute1":"value","attribute2":"value","attribute3":"value","attribute4":"value"}}]}

Then it doesn't matter whether ReplaceText is set to Line-by-Line or Entire text, because the processor receives just one line as input and the result is a single line.
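To make the difference between the two modes concrete, here is a rough Python sketch that mimics the behaviour described above. It is not NiFi code: the function names, the shortened JSON and the values `a` and `b` are made up for illustration, and it assumes the search pattern matches the whole line (Line-by-Line) or the whole content (Entire text).

```python
import re

# Hypothetical stand-ins for the flow file content and the evaluated replacement value.
content = '{\n"attributes": {\n"attribute1": "a",\n"attribute2": "b"\n}\n}'
replacement = "a,b"  # what ${attribute1},${attribute2} would evaluate to


def replace_line_by_line(text, value):
    # Every line is matched and replaced on its own, so N input lines give N output lines.
    return "\n".join(re.sub(r"^.*$", value, line, count=1) for line in text.split("\n"))


def replace_entire_text(text, value):
    # The whole content is treated as a single string and replaced once.
    return re.sub(r"(?s)^.*$", value, text, count=1)


line_by_line = replace_line_by_line(content, replacement)
entire_text = replace_entire_text(content, replacement)

print(len(line_by_line.split("\n")), "lines in Line-by-Line mode")
print(len(entire_text.split("\n")), "line in Entire text mode")
```

With a pretty-printed JSON the first mode repeats the CSV row once per input line, while the second produces exactly one row.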

Try changing the configuration to suit your input JSON message and run the processor again.

Let us know if the processor still produces duplicate data.

foivos · ‎12-13-2017

I've no idea why my screenshots are double-posted; whatever I tried to fix it fails 🙂

mburgess · ‎12-13-2017

Can you share an example or two of incoming JSON data, your config for EvaluateJSONPath, and an example of the flow file after MergeContent (perhaps setting number of entries much lower to fit here)?

foivos · ‎12-14-2017

Hi @Matt Burgess, here is an example of the incoming JSON files; they all have the same attributes:

{
 "features": [
  {
   "feature": {
    "paths": [
     [
      [
       214985.27600000054,
       427573.33100000024
      ],
      [
       215011.98900000006,
       427568.84200000018
      ],
      [
       215035.35300000012,
       427565.00499999896
      ],
      [
       215128.48900000006,
       427549.4290000014
      ],
      [
       215134.43699999899,
       427548.65599999949
      ],
      [
       215150.86800000072,
       427546.87900000066
      ],
      [
       215179.33199999854,
       427544.19799999893
      ]
     ]
    ]
   },
   "attributes": {
    "attribute1": "value",
    "attribute2": "value",
    "attribute3": "value",
    "attribute4": "value"
   }
  }
 ]
}

EvaluateJsonPath:

Here I add a property for each attribute I want to parse:

attribute1: $.features[0].attributes.attribute1, and so on for the other attributes.
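For anyone following along, this is roughly what that path selects. It is a small Python sketch using only the standard `json` module rather than a JSONPath library, and the shortened document and the values `value1` and `value2` are made up for illustration.

```python
import json

# Shortened, hypothetical version of the JSON posted above.
raw = (
    '{"features": [{"feature": {"paths": []}, '
    '"attributes": {"attribute1": "value1", "attribute2": "value2"}}]}'
)

doc = json.loads(raw)

# $.features[0].attributes.attribute1 means: take the first element of "features",
# then its "attributes" object, then the key "attribute1".
attributes = doc["features"][0]["attributes"]
extracted = {name: attributes[name] for name in ("attribute1", "attribute2")}

print(extracted)
```

Each extracted value ends up as a flow file attribute with the property's name, which is what the `${attribute1}` references in ReplaceText read.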

ReplaceText:

I think something goes wrong in my configuration here, because even before MergeContent the single CSVs created per JSON file contain hundreds of duplicate rows, whereas each CSV should contain just one row; the CSVs are later merged into one big CSV file.

foivos · ‎12-14-2017

Hi @Shu, yes, that was exactly the problem. The individual CSVs are now created just fine, but in the meantime another problem occurred: when the individual CSVs are merged with the MergeContent processor, the merged CSV is all on one line instead of separate lines. Is there a way around this?

MergeContent:

Shu_ashu · ‎12-14-2017

@balalaika
In that case you need to set the Demarcator property to a newline. To enter it, press Shift+Enter in the property value field.
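As a rough illustration of what the Demarcator does, here is a Python sketch (not NiFi code). The `merge` function and the sample rows are hypothetical, and it assumes a plain concatenation-style merge where the demarcator is placed between consecutive entries.

```python
# Hypothetical single-row CSV contents, one per incoming flow file.
rows = ["1,a,x", "2,b,y", "3,c,z"]


def merge(contents, demarcator=""):
    # Contents are joined in order, with the demarcator between consecutive entries.
    return demarcator.join(contents)


without_demarcator = merge(rows)
with_newline = merge(rows, "\n")

print(without_demarcator)  # all rows run together on one line
print(with_newline)        # one row per line
```

Without a demarcator the rows run together on a single line, which matches the symptom described; with a newline each row lands on its own line.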

Configs:-

For reference on MergeContent, see:

https://community.hortonworks.com/questions/149047/nifi-how-to-handle-with-mergecontent-processor.ht...

Cloudera Community

NiFi: JSON to CSV to Hive