Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

How can I convert a fixed width file into Json using controller services?

avatar
New Contributor

I have an input fixed width file as below
John 32 M New York 100
I want to convert it in to a Json like below
{
"Name": "John",
"Age": 32,
"Gender": "M",
"City": "New York",
"Country": 100
}


1 ACCEPTED SOLUTION

avatar
Super Guru

There is no magic solution for those scenarios and no one solution fits all out of Nifi that I can think of. You have to understand the nature of the input before you start consuming it and you have to provide the solution catered to this input. Sometimes if you are lucky you can combine multiple scenarios into one flow but that still depends on the complexity of the input. Even thought in your first scenario the second option I proposed seem to be simple enough and it did the job, for your second example its more complex and I dont think the out of the box GrokReader will be able to handle such complexity, therefore the first option of using the ExtractText Processor will work better because you can customize your regex as needed. For example, based on the text you provided:

 

 

 

JohnCena32 Male New York USA813668

 

 

 

I can use the following regex:

 

 

 

[A-Z][a-z]+[A-Z][a-z]+\d+\s(?:Male|Female|M|F)\s[A-Z][a-z]+(?:\s[A-Z][a-z]+)?\s[A-Za-z]+\d+

 

 

 

In the ExtractText processor I will define a dynamic property for each attribute (city, age, firstname...etc.) and surround the segment of the pattern that corresponds to the value with a parenthesis to extract as matching group. For Example:

Age:

[A-Z][a-z]+[A-Z][a-z]+(\d+)\s(?:Male|Female|M|F)\s[A-Z][a-z]+(?:\s[A-Z][a-z]+)?\s[A-Za-z]+\d+

FirstName:

([A-Z][a-z]+)[A-Z][a-z]+\d+\s(?:Male|Female|M|F)\s[A-Z][a-z]+(?:\s[A-Z][a-z]+)?\s[A-Za-z]+\d+

Gender:

[A-Z][a-z]+[A-Z][a-z]+\d+\s((?:Male|Female|M|F))\s[A-Z][a-z]+(?:\s[A-Z][a-z]+)?\s[A-Za-z]+\d+

Country:

[A-Z][a-z]+[A-Z][a-z]+\d+\s(?:Male|Female|M|F)\s[A-Z][a-z]+(?:\s[A-Z][a-z]+)?\s([A-Za-z]+)\d+

And so on...

This should give you the attribute you need. Then you can use the AttributeToJson processor to get the json output and finally if you want to convert the data to the proper type you can either user JoltTransformation  or QueryRecord with cast as shown above.

One final note: If you know how to use some external libraries in python for example or groovy or any of the supported code script in the ExecuteScript processor then you can use that to write your custom code to create the required fllowfile\attributes that will help you downstream to generate the final output.

If that helps please accept solution.

Thanks

 

View solution in original post

5 REPLIES 5

avatar
Community Manager

@Chai09, Welcome to our community! To help you get the best possible answer, I have tagged our NiFi experts @cotopaul @SAMSAL @MattWho @steven-matison  who may be able to assist you further.

Please feel free to provide any additional information or details about your query, and we hope that you will find a satisfactory solution to your question.



Regards,

Vidya Sargur,
Community Manager


Was your question answered? Make sure to mark the answer as the accepted solution.
If you find a reply useful, say thanks by clicking on the thumbs up button.
Learn more about the Cloudera Community:

avatar
New Contributor

@VidyaSargur  Thank you for the warm welcome! I appreciate being a part of this community.

avatar
Super Guru

Hi @Chai09 ,

There are different options to do this:

1- You can do ExtractText -> AttributeToJson. In the ExtractText you specify the attribute (Name, Age...) and the regex to capture each attribute value. The AttributeToJson processor you can list the attribute that you want to convert into Json in the AttributeList property and set the Destination to "flowfile-content" to get the Json you need.  To learn how you to use the ExtractText  please refer to:

https://community.cloudera.com/t5/Support-Questions/How-to-ExtractText-from-flow-file-using-Nifi-Pro...

However, You might find the Age and Country values are given to you as string values vs integer, if that is OK with you then you dont need to do anything, If you need them to be integer then you have to use another processor like QueryRecord where you can cast the string into integer or JoltTransformationJson processor where you can do conversion using the following jolt spec:

 

[
  {
    "operation": "modify-overwrite-beta",
    "spec": {
      "Age": "=toInteger(@(1,Age))",
      "Country": "=toInteger(@(1,Country))"
    }
  }
]

 

 

2- The easiest way I found if you dont like to use Regex is using QueryRecord with RecordReader set as GrokReader and RecordWrite set as JsonRecordSetWriter as follows

SAMSAL_0-1699635705346.png

The JsonRecord is dynamic property that will define the relationship that will produce the Json output you are looking for and it has the following value:

 

select Name
,CAST(AgeN as SMALLINT) as Age
,Gender
,City
,CAST(CountryN as SMALLINT) as Country
 from FLOWFILE

 

The GrokReader service is used to help you read unstructured data such as log files and its configured as the following:

SAMSAL_2-1699635881126.png

The Grok Expressions Property is set to the following:

 

%{WORD:Name} %{NUMBER:AgeN} %{WORD:Gender} %{DATA:City} %{NUMBER:CountryN}

 

The Grok Expression uses predefined Regex for the given types: WORD, NUMBER, DATA...etc. For more info: https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-reaord-serialization-services...

The JsonRecordSetWriter service is configured as follows :

SAMSAL_3-1699636047849.png

This will produce the Json you are looking for the correct data types. Notice how I needed to still cast the Age,Country to INT despite they are being defined as Number in the Grok Expression and that is because JsonRecordSetWriter will still convert everything to string unless you provide an Avro Schema.

If that helps please accept solution.

Thanks

 

 

avatar
New Contributor

Dear @SAMSAL ,

I extend my sincere appreciation for your invaluable insights and guidance on the query. Your expertise has been instrumental in implementing the first method, and I'm grateful for your prompt and effective assistance.

Regarding the second method you suggested, it's working seamlessly for our current requirements. However, I'm curious about its applicability to scenarios where a file contains data that requires splitting and segregation.

For Instance, consider the input
JohnCena32 Male New York USA813668

And the desired JSON output:
{
"firstname": "John",
"lastname": "Cena",
"age": 32,
"gender": "Male",
"city": "New York",
"country": "USA",
"mobile": 813668
}

Could you please shed light on whether the second method remains effective in handling data structures like this?

Thank you once again for your time and expertise.

 

avatar
Super Guru

There is no magic solution for those scenarios and no one solution fits all out of Nifi that I can think of. You have to understand the nature of the input before you start consuming it and you have to provide the solution catered to this input. Sometimes if you are lucky you can combine multiple scenarios into one flow but that still depends on the complexity of the input. Even thought in your first scenario the second option I proposed seem to be simple enough and it did the job, for your second example its more complex and I dont think the out of the box GrokReader will be able to handle such complexity, therefore the first option of using the ExtractText Processor will work better because you can customize your regex as needed. For example, based on the text you provided:

 

 

 

JohnCena32 Male New York USA813668

 

 

 

I can use the following regex:

 

 

 

[A-Z][a-z]+[A-Z][a-z]+\d+\s(?:Male|Female|M|F)\s[A-Z][a-z]+(?:\s[A-Z][a-z]+)?\s[A-Za-z]+\d+

 

 

 

In the ExtractText processor I will define a dynamic property for each attribute (city, age, firstname...etc.) and surround the segment of the pattern that corresponds to the value with a parenthesis to extract as matching group. For Example:

Age:

[A-Z][a-z]+[A-Z][a-z]+(\d+)\s(?:Male|Female|M|F)\s[A-Z][a-z]+(?:\s[A-Z][a-z]+)?\s[A-Za-z]+\d+

FirstName:

([A-Z][a-z]+)[A-Z][a-z]+\d+\s(?:Male|Female|M|F)\s[A-Z][a-z]+(?:\s[A-Z][a-z]+)?\s[A-Za-z]+\d+

Gender:

[A-Z][a-z]+[A-Z][a-z]+\d+\s((?:Male|Female|M|F))\s[A-Z][a-z]+(?:\s[A-Z][a-z]+)?\s[A-Za-z]+\d+

Country:

[A-Z][a-z]+[A-Z][a-z]+\d+\s(?:Male|Female|M|F)\s[A-Z][a-z]+(?:\s[A-Z][a-z]+)?\s([A-Za-z]+)\d+

And so on...

This should give you the attribute you need. Then you can use the AttributeToJson processor to get the json output and finally if you want to convert the data to the proper type you can either user JoltTransformation  or QueryRecord with cast as shown above.

One final note: If you know how to use some external libraries in python for example or groovy or any of the supported code script in the ExecuteScript processor then you can use that to write your custom code to create the required fllowfile\attributes that will help you downstream to generate the final output.

If that helps please accept solution.

Thanks