Created on 06-02-2017 03:12 PM - edited 08-17-2019 12:40 PM
With the release of Apache NiFi 1.2.0, the JoltTransformJson processor became more powerful thanks to an upgrade of the Jolt library (to version 0.1.0) and the introduction of Expression Language (EL) support. Users can now create dynamic specifications for JSON transformation and perform some data manipulation tasks entirely within the processor. Internal caching has also been added to improve overall performance.
Let’s take the example of transforming the Twitter JSON payload shown below:
{"created_at":"Wed Mar 29 02:53:48 +0000 2017","id":846918283102081024,"id_str":"846918283102081024","text":"CSUB falls to Georgia Tech 76-61 in NIT semifinal game. @Bakersfieldcali @BVarsityLive @CSUBAthletics @CSUB_MBB\u2026 https:\/\/t.co\/9e5dQesIbg","display_text_range":[0,140],"source":"\u003ca href=\"http:\/\/twitter.com\" rel=\"nofollow\"\u003eTwitter Web Client\u003c\/a\u003e","truncated":true,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":2918922812,"id_str":"2918922812","name":"Felix Adamo","screen_name":"tbcpix","location":"Bakersfield Californian","url":null,"description":"Newspaper Photographer","protected":false,"verified":false,"followers_count":677,"friends_count":247,"listed_count":12,"favourites_count":1366,"statuses_count":3576,"created_at":"Thu Dec 04 18:46:27 +0000 2014","utc_offset":null,"time_zone":null,"geo_enabled":false,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"C0DEED","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_tile":false,"profile_link_color":"1DA1F2","profile_sidebar_border_color":"C0DEED","profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_image":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/570251877397180416\/jL2kuB4f_normal.png","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/570251877397180416\/jL2kuB4f_normal.png","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/2918922812\/1483041284","default_profile":true,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"is_quote_status":false,"extended_tweet":{"full_text":"
CSUB falls to Georgia Tech 76-61 in NIT semifinal game. @Bakersfieldcali @BVarsityLive @CSUBAthletics @CSUB_MBB @csubnews https:\/\/t.co\/yV2AHFdVLc","display_text_range":[0,121],"entities":{"hashtags":[],"urls":[],"user_mentions":[{"screen_name":"Bakersfieldcali","name":"The Bakersfield Cali","id":33055408,"id_str":"33055408","indices":[56,72]},{"screen_name":"BVarsityLive","name":"BVarsityLive","id":762418351,"id_str":"762418351","indices":[73,86]},{"screen_name":"CSUBAthletics","name":"CSUB Athletics","id":51115996,"id_str":"51115996","indices":[87,101]},{"screen_name":"CSUB_MBB","name":"\ud83c\udfc0CSUB Men's Hoops\ud83c\udfc0","id":2897931481,"id_str":"2897931481","indices":[102,111]},{"screen_name":"csubnews","name":"CSU Bakersfield","id":209666415,"id_str":"209666415","indices":[112,121]}],"symbols":[],"media":[{"id":846918121248047104,"id_str":"846918121248047104","indices":[122,145],"media_url":"http:\/\/pbs.twimg.com\/media\/C8Dbi0rUwAAiffu.jpg","media_url_https":"https:\/\/pbs.twimg.com\/media\/C8Dbi0rUwAAiffu.jpg","url":"https:\/\/t.co\/yV2AHFdVLc","display_url":"pic.twitter.com\/yV2AHFdVLc","expanded_url":"https:\/\/twitter.com\/tbcpix\/status\/846918283102081024\/photo\/1","type":"photo","sizes":{"medium":{"w":1200,"h":608,"resize":"fit"},"large":{"w":2048,"h":1038,"resize":"fit"},"small":{"w":680,"h":345,"resize":"fit"},"thumb":{"w":150,"h":150,"resize":"crop"}}},{"id":846918179397906433,"id_str":"846918179397906433","indices":[122,145],"media_url":"http:\/\/pbs.twimg.com\/media\/C8DbmNTVMAEvpd3.jpg","media_url_https":"https:\/\/pbs.twimg.com\/media\/C8DbmNTVMAEvpd3.jpg","url":"https:\/\/t.co\/yV2AHFdVLc","display_url":"pic.twitter.com\/yV2AHFdVLc","expanded_url":"https:\/\/twitter.com\/tbcpix\/status\/846918283102081024\/photo\/1","type":"photo","sizes":{"large":{"w":2048,"h":1213,"resize":"fit"},"medium":{"w":1200,"h":711,"resize":"fit"},"small":{"w":680,"h":403,"resize":"fit"},"thumb":{"w":150,"h":150,"resize":"crop"}}}]},"extended_entities":{"medi
a":[{"id":846918121248047104,"id_str":"846918121248047104","indices":[122,145],"media_url":"http:\/\/pbs.twimg.com\/media\/C8Dbi0rUwAAiffu.jpg","media_url_https":"https:\/\/pbs.twimg.com\/media\/C8Dbi0rUwAAiffu.jpg","url":"https:\/\/t.co\/yV2AHFdVLc","display_url":"pic.twitter.com\/yV2AHFdVLc","expanded_url":"https:\/\/twitter.com\/tbcpix\/status\/846918283102081024\/photo\/1","type":"photo","sizes":{"medium":{"w":1200,"h":608,"resize":"fit"},"large":{"w":2048,"h":1038,"resize":"fit"},"small":{"w":680,"h":345,"resize":"fit"},"thumb":{"w":150,"h":150,"resize":"crop"}}},{"id":846918179397906433,"id_str":"846918179397906433","indices":[122,145],"media_url":"http:\/\/pbs.twimg.com\/media\/C8DbmNTVMAEvpd3.jpg","media_url_https":"https:\/\/pbs.twimg.com\/media\/C8DbmNTVMAEvpd3.jpg","url":"https:\/\/t.co\/yV2AHFdVLc","display_url":"pic.twitter.com\/yV2AHFdVLc","expanded_url":"https:\/\/twitter.com\/tbcpix\/status\/846918283102081024\/photo\/1","type":"photo","sizes":{"large":{"w":2048,"h":1213,"resize":"fit"},"medium":{"w":1200,"h":711,"resize":"fit"},"small":{"w":680,"h":403,"resize":"fit"},"thumb":{"w":150,"h":150,"resize":"crop"}}}]}},"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"urls":[{"url":"https:\/\/t.co\/9e5dQesIbg","expanded_url":"https:\/\/twitter.com\/i\/web\/status\/846918283102081024","display_url":"twitter.com\/i\/web\/status\/8\u2026","indices":[113,136]}],"user_mentions":[{"screen_name":"Bakersfieldcali","name":"The Bakersfield Cali","id":33055408,"id_str":"33055408","indices":[56,72]},{"screen_name":"BVarsityLive","name":"BVarsityLive","id":762418351,"id_str":"762418351","indices":[73,86]},{"screen_name":"CSUBAthletics","name":"CSUB Athletics","id":51115996,"id_str":"51115996","indices":[87,101]},{"screen_name":"CSUB_MBB","name":"\ud83c\udfc0CSUB Men's 
Hoops\ud83c\udfc0","id":2897931481,"id_str":"2897931481","indices":[102,111]}],"symbols":[]},"favorited":false,"retweeted":false,"possibly_sensitive":false,"filter_level":"low","lang":"en","timestamp_ms":"1490756028329"}
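In our case we want to accomplish several things when transforming this data in JoltTransformJson: keep only the fields needed for the final schema and rename them, lower-case the tweet text, supply default values for in_reply_to fields that are null, and add a flow_file_id field.
Once the data has been transformed, it will land on the file system as well as in a MongoDB repository.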
Basic Flow of Twitter Data Transformation and Storage
Here's a close-up of the specification in use:
[
  {
    "operation": "shift",
    "spec": {
      "${id.var}": "tweet_id",
      "text": "tweet_text",
      "in_reply_to_*": "&"
    }
  },
  {
    "operation": "modify-overwrite-beta",
    "spec": {
      "tweet_text": "=toLower"
    }
  },
  {
    "operation": "modify-default-beta",
    "spec": {
      "~in_reply_to_status_id": 0,
      "~in_reply_to_status_id_str": "",
      "~in_reply_to_user_id": "",
      "~in_reply_to_user_id_str": 0,
      "~in_reply_to_screen_name": ""
    }
  },
  {
    "operation": "default",
    "spec": {
      "flow_file_id": "${uuid}"
    }
  }
]
In the above you’ll see we’ve accomplished this with a chain specification containing four operations (shift, modify-overwrite, modify-default, and default). The shift defines the fields needed for the final schema and maps those fields to new labels. Note that the shift’s specification uses Expression Language on the left side (${id.var}), which evaluates to a value populated by the UpdateAttribute processor (this value could also be populated from the Variable Registry). The Jolt library then attempts to match that value to the corresponding label in the incoming JSON data and changes it to the new label (in this case “tweet_id”) on the right.
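To make the order of events concrete, here is a minimal Python sketch of the idea: the ${...} reference is resolved first, and only then is the resulting spec matched against the incoming keys. This is not how NiFi or Jolt is implemented; resolve_el and simple_shift are hypothetical helpers that handle top-level keys only, the value "id" for id.var is an assumption (the article does not state it), and the tweet is a shortened, made-up sample.

```python
import json
import re


def resolve_el(spec_text, attributes):
    """Substitute each ${name} reference with its attribute value (simplified EL)."""
    return re.sub(r"\$\{([^}]+)\}", lambda m: attributes[m.group(1)], spec_text)


def simple_shift(record, shift_spec):
    """Top-level-only imitation of a shift: rename matched keys, drop the rest."""
    output = {}
    for key, value in record.items():
        for pattern, target in shift_spec.items():
            if pattern.endswith("*"):
                matched = key.startswith(pattern[:-1])  # trailing * = prefix wildcard
            else:
                matched = key == pattern
            if matched:
                # "&" means "keep the matched key name"
                output[key if target == "&" else target] = value
                break
    return output


# Shift spec from the article, still containing the EL reference.
spec_text = '{"${id.var}": "tweet_id", "text": "tweet_text", "in_reply_to_*": "&"}'
# Assumption: UpdateAttribute sets id.var to "id".
attributes = {"id.var": "id"}
shift_spec = json.loads(resolve_el(spec_text, attributes))

# Hypothetical, heavily shortened tweet.
tweet = {
    "id": 846918283102081024,
    "text": "CSUB falls to Georgia Tech",
    "in_reply_to_screen_name": "tbcpix",
    "lang": "en",
}
shifted = simple_shift(tweet, shift_spec)
print(shifted)
```

Fields that match no entry in the shift spec (here "lang") do not appear in the output, which is how the shift narrows the payload down to the final schema.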
The next operation uses modify-overwrite to ensure that the Jolt lower-case function (toLower) is applied to all incoming tweet text. We then use a modify-default operation that applies default values to the in_reply_to fields if those values are null. Finally, we use a basic default operation to create the new flow_file_id field, applying Expression Language on the right of the field name to populate the flow file id entry dynamically.
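The effect of these three operations can be sketched in plain Python. This only imitates the behavior as described above, not the Jolt implementation: apply_remaining_operations is a hypothetical helper, the record is a made-up result of the earlier shift, and the uuid string is a placeholder for whatever ${uuid} resolves to.

```python
def apply_remaining_operations(record, null_defaults, flow_file_uuid):
    """Imitate modify-overwrite (toLower), modify-default and default on a flat record."""
    result = dict(record)  # leave the input untouched

    # modify-overwrite with =toLower: always overwrite the text with its lower-cased form.
    if isinstance(result.get("tweet_text"), str):
        result["tweet_text"] = result["tweet_text"].lower()

    # modify-default as the article describes it: fill a field only when it is null
    # (this sketch treats an absent field the same way).
    for key, value in null_defaults.items():
        if result.get(key) is None:
            result[key] = value

    # default: add flow_file_id only if the record does not already carry one.
    result.setdefault("flow_file_id", flow_file_uuid)
    return result


# Hypothetical output of the shift step.
shifted = {
    "tweet_id": 846918283102081024,
    "tweet_text": "CSUB Falls to Georgia Tech",
    "in_reply_to_status_id": None,
    "in_reply_to_screen_name": None,
}
null_defaults = {"in_reply_to_status_id": 0, "in_reply_to_screen_name": ""}

final = apply_remaining_operations(shifted, null_defaults, "hypothetical-flow-file-uuid")
print(final)
```

The lower-casing runs unconditionally, while the two default-style steps only fill gaps, so existing non-null values are never replaced by them.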
JoltTransformJson Advanced UI with Chain Specification
New Test Attributes Modal for testing Expression Language used in Specifications
The Advanced UI (shown above) has also been enhanced to allow testing of specifications with expression language (specifically to provide test attributes that need to be resolved during testing). This gives users greater insight into how a flow will behave without relying on any external dependencies such as flow file attributes or variable registry entries.
Example of Transformed JSON (shown in Provenance)
Looking to give this a try? Feel free to download the example template on GitHub Gist here and import it into NiFi. The template includes the specification described above, which you can tweak to test out various scenarios. Also, if you have any questions about transforming JSON in Apache NiFi with Jolt, please comment below or reach out to the community on the Apache NiFi mailing list.
Created on 10-10-2017 12:22 PM
Hi, I want to load the complete specification from a database. To do that, I created processors that load it from the database and put it in a property, but when I reference that property as the specification, I get an invalid error.
I stored the specification in a "SPECS" property, but referencing it as "${SPECS}" does not work. I also tried the Advanced option and creating a property.
I think replacing the whole specification through a property is not yet fully supported by the Jolt processor's expression language?
Created on 10-23-2018 09:10 AM
Created on 03-12-2023 01:40 AM
{
  "operation": "default",
  "spec": {
    "*": {
      "flow_file_id": "${uuid}"
    }
  }
}
This spec generates the same uuid for every object in the JSON array.
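A likely explanation, sketched here under the assumption that the processor resolves ${uuid} against the flow file's attributes once, before Jolt ever sees the spec: by the time the "*" wildcard is applied to each array element, the spec already holds a single literal value. The helpers resolve_el and default_each below are hypothetical and only imitate that order of events.

```python
import json
import re
import uuid


def resolve_el(spec_text, attributes):
    """Substitute each ${name} reference with its attribute value (simplified EL)."""
    return re.sub(r"\$\{([^}]+)\}", lambda m: attributes[m.group(1)], spec_text)


def default_each(elements, spec):
    """Apply the defaults under "*" to every element; existing keys win over defaults."""
    per_element_defaults = spec["*"]
    return [{**per_element_defaults, **element} for element in elements]


spec_text = '{"*": {"flow_file_id": "${uuid}"}}'
# One flow file carries one uuid attribute, no matter how many objects its content holds.
attributes = {"uuid": str(uuid.uuid4())}

# The reference is replaced a single time, on the spec text itself.
resolved_spec = json.loads(resolve_el(spec_text, attributes))

tweets = [{"tweet_id": 1}, {"tweet_id": 2}, {"tweet_id": 3}]  # made-up array content
tagged = default_each(tweets, resolved_spec)

distinct_ids = {t["flow_file_id"] for t in tagged}
print(len(distinct_ids))
```

Under that assumption the behavior is expected rather than a bug: the value identifies the flow file, not the individual objects inside it.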