
With the release of Apache NiFi 1.2.0, the JoltTransformJson processor became more powerful, thanks to an upgrade of the Jolt library (to version 0.1.0) and the introduction of Expression Language (EL) support. Users can now create dynamic specifications for JSON transformation and perform some data manipulation tasks entirely within the processor. Internal caching has also been added to improve overall performance.

Let’s take as an example the transformation of the Twitter JSON payload shown below:

{"created_at":"Wed Mar 29 02:53:48 +0000 2017","id":846918283102081024,"id_str":"846918283102081024","text":"CSUB falls to Georgia Tech 76-61 in NIT semifinal game. @Bakersfieldcali @BVarsityLive @CSUBAthletics @CSUB_MBB\u2026 https:\/\/t.co\/9e5dQesIbg","display_text_range":[0,140],"source":"\u003ca href=\"http:\/\/twitter.com\" rel=\"nofollow\"\u003eTwitter Web Client\u003c\/a\u003e","truncated":true,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":2918922812,"id_str":"2918922812","name":"Felix Adamo","screen_name":"tbcpix","location":"Bakersfield Californian","url":null,"description":"Newspaper Photographer","protected":false,"verified":false,"followers_count":677,"friends_count":247,"listed_count":12,"favourites_count":1366,"statuses_count":3576,"created_at":"Thu Dec 04 18:46:27 +0000 2014","utc_offset":null,"time_zone":null,"geo_enabled":false,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"C0DEED","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_tile":false,"profile_link_color":"1DA1F2","profile_sidebar_border_color":"C0DEED","profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_image":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/570251877397180416\/jL2kuB4f_normal.png","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/570251877397180416\/jL2kuB4f_normal.png","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/2918922812\/1483041284","default_profile":true,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"is_quote_status":false,"extended_tweet":{"full_text":"
CSUB falls to Georgia Tech 76-61 in NIT semifinal game. @Bakersfieldcali @BVarsityLive @CSUBAthletics @CSUB_MBB @csubnews https:\/\/t.co\/yV2AHFdVLc","display_text_range":[0,121],"entities":{"hashtags":[],"urls":[],"user_mentions":[{"screen_name":"Bakersfieldcali","name":"The Bakersfield Cali","id":33055408,"id_str":"33055408","indices":[56,72]},{"screen_name":"BVarsityLive","name":"BVarsityLive","id":762418351,"id_str":"762418351","indices":[73,86]},{"screen_name":"CSUBAthletics","name":"CSUB Athletics","id":51115996,"id_str":"51115996","indices":[87,101]},{"screen_name":"CSUB_MBB","name":"\ud83c\udfc0CSUB Men's Hoops\ud83c\udfc0","id":2897931481,"id_str":"2897931481","indices":[102,111]},{"screen_name":"csubnews","name":"CSU Bakersfield","id":209666415,"id_str":"209666415","indices":[112,121]}],"symbols":[],"media":[{"id":846918121248047104,"id_str":"846918121248047104","indices":[122,145],"media_url":"http:\/\/pbs.twimg.com\/media\/C8Dbi0rUwAAiffu.jpg","media_url_https":"https:\/\/pbs.twimg.com\/media\/C8Dbi0rUwAAiffu.jpg","url":"https:\/\/t.co\/yV2AHFdVLc","display_url":"pic.twitter.com\/yV2AHFdVLc","expanded_url":"https:\/\/twitter.com\/tbcpix\/status\/846918283102081024\/photo\/1","type":"photo","sizes":{"medium":{"w":1200,"h":608,"resize":"fit"},"large":{"w":2048,"h":1038,"resize":"fit"},"small":{"w":680,"h":345,"resize":"fit"},"thumb":{"w":150,"h":150,"resize":"crop"}}},{"id":846918179397906433,"id_str":"846918179397906433","indices":[122,145],"media_url":"http:\/\/pbs.twimg.com\/media\/C8DbmNTVMAEvpd3.jpg","media_url_https":"https:\/\/pbs.twimg.com\/media\/C8DbmNTVMAEvpd3.jpg","url":"https:\/\/t.co\/yV2AHFdVLc","display_url":"pic.twitter.com\/yV2AHFdVLc","expanded_url":"https:\/\/twitter.com\/tbcpix\/status\/846918283102081024\/photo\/1","type":"photo","sizes":{"large":{"w":2048,"h":1213,"resize":"fit"},"medium":{"w":1200,"h":711,"resize":"fit"},"small":{"w":680,"h":403,"resize":"fit"},"thumb":{"w":150,"h":150,"resize":"crop"}}}]},"extended_entities":{"medi
a":[{"id":846918121248047104,"id_str":"846918121248047104","indices":[122,145],"media_url":"http:\/\/pbs.twimg.com\/media\/C8Dbi0rUwAAiffu.jpg","media_url_https":"https:\/\/pbs.twimg.com\/media\/C8Dbi0rUwAAiffu.jpg","url":"https:\/\/t.co\/yV2AHFdVLc","display_url":"pic.twitter.com\/yV2AHFdVLc","expanded_url":"https:\/\/twitter.com\/tbcpix\/status\/846918283102081024\/photo\/1","type":"photo","sizes":{"medium":{"w":1200,"h":608,"resize":"fit"},"large":{"w":2048,"h":1038,"resize":"fit"},"small":{"w":680,"h":345,"resize":"fit"},"thumb":{"w":150,"h":150,"resize":"crop"}}},{"id":846918179397906433,"id_str":"846918179397906433","indices":[122,145],"media_url":"http:\/\/pbs.twimg.com\/media\/C8DbmNTVMAEvpd3.jpg","media_url_https":"https:\/\/pbs.twimg.com\/media\/C8DbmNTVMAEvpd3.jpg","url":"https:\/\/t.co\/yV2AHFdVLc","display_url":"pic.twitter.com\/yV2AHFdVLc","expanded_url":"https:\/\/twitter.com\/tbcpix\/status\/846918283102081024\/photo\/1","type":"photo","sizes":{"large":{"w":2048,"h":1213,"resize":"fit"},"medium":{"w":1200,"h":711,"resize":"fit"},"small":{"w":680,"h":403,"resize":"fit"},"thumb":{"w":150,"h":150,"resize":"crop"}}}]}},"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"urls":[{"url":"https:\/\/t.co\/9e5dQesIbg","expanded_url":"https:\/\/twitter.com\/i\/web\/status\/846918283102081024","display_url":"twitter.com\/i\/web\/status\/8\u2026","indices":[113,136]}],"user_mentions":[{"screen_name":"Bakersfieldcali","name":"The Bakersfield Cali","id":33055408,"id_str":"33055408","indices":[56,72]},{"screen_name":"BVarsityLive","name":"BVarsityLive","id":762418351,"id_str":"762418351","indices":[73,86]},{"screen_name":"CSUBAthletics","name":"CSUB Athletics","id":51115996,"id_str":"51115996","indices":[87,101]},{"screen_name":"CSUB_MBB","name":"\ud83c\udfc0CSUB Men's 
Hoops\ud83c\udfc0","id":2897931481,"id_str":"2897931481","indices":[102,111]}],"symbols":[]},"favorited":false,"retweeted":false,"possibly_sensitive":false,"filter_level":"low","lang":"en","timestamp_ms":"1490756028329"}

In our case we want to accomplish several things when transforming this data in JoltTransformJson:

  1. Create a subset of the JSON data that contains the id, the tweet text, the in_reply_to fields, and a new flow_file_id field
  2. Match the “id” field in the Twitter payload using a flow file attribute and rename it to a new label (tweet_id)
  3. Set the tweet text to all lower case
  4. Set default values for any in_reply_to fields that are null
  5. Add the flow file's unique id to the JSON data

Once the data has been transformed, it will land on the file system as well as in a MongoDB repository.

Figure (15934-jolt-el-fullflow.png): Basic Flow of Twitter Data Transformation and Storage

Here's a close-up of the specification in use:

[{
  "operation": "shift",
  "spec": {
    "${id.var}": "tweet_id",
    "text": "tweet_text",
    "in_reply_to_*": "&"
  }
}, {
  "operation": "modify-overwrite-beta",
  "spec": {
    "tweet_text": "=toLower"
  }
}, {
  "operation": "modify-default-beta",
  "spec": {
    "~in_reply_to_status_id": 0,
    "~in_reply_to_status_id_str": "",
    "~in_reply_to_user_id": "",
    "~in_reply_to_user_id_str": 0,
    "~in_reply_to_screen_name": ""
  }
}, {
  "operation": "default",
  "spec": {
    "flow_file_id": "${uuid}"
  }
}]

In the spec above, this is accomplished with a chain specification containing four operations (shift, modify-overwrite-beta, modify-default-beta, and default). The shift defines the fields needed for the final schema and translates those fields into new labels. Note that the shift’s specification uses Expression Language on the left side (${id.var}), which evaluates to a value populated by the UpdateAttribute processor (this value could also be populated from the Variable Registry). The Jolt library then attempts to match that value to the corresponding label in the incoming JSON data and changes it to the new label (in this case “tweet_id”) on the right.
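As a rough illustration of the shift step, the following plain-Python sketch emulates the two ideas described here: substituting an attribute value into the specification text, then renaming matching top-level fields. It does not use Jolt or NiFi. The helper names (resolve_el, shift), the attribute value "id", and the trimmed sample tweet are assumptions made for this example, and the sketch only handles simple ${name} references and top-level keys. It keeps null values as-is and makes no claim about how Jolt itself treats them.

```python
import json
import re
from fnmatch import fnmatchcase


def resolve_el(spec_text, attributes):
    """Replace each simple ${name} reference with the matching attribute value."""
    return re.sub(r"\$\{([^}]+)\}", lambda m: attributes[m.group(1)], spec_text)


def shift(record, spec):
    """Copy matching top-level fields to new labels; '&' keeps the matched name."""
    out = {}
    for pattern, target in spec.items():
        for key, value in record.items():
            if fnmatchcase(key, pattern):
                out[key if target == "&" else target] = value
    return out


# Shift spec as written in the processor, before attribute substitution.
spec_text = '{"${id.var}": "tweet_id", "text": "tweet_text", "in_reply_to_*": "&"}'

# Assumed attribute value, standing in for what UpdateAttribute would set.
attributes = {"id.var": "id"}

# Trimmed-down tweet (hypothetical subset of the payload shown earlier).
tweet = {
    "id": 846918283102081024,
    "text": "CSUB falls to Georgia Tech",
    "in_reply_to_user_id": None,
    "lang": "en",
}

spec = json.loads(resolve_el(spec_text, attributes))
shifted = shift(tweet, spec)
print(shifted)
```

Fields that match no entry in the spec (here, lang) are simply left out of the result, which is how the shift narrows the payload to the desired subset.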

The next operation uses modify-overwrite-beta to apply the Jolt lower-case function (=toLower) to the incoming tweet text. We then use a modify-default-beta operation to apply default values to the in_reply_to fields if those values are null. Finally, we use a basic default operation to create the new flow_file_id field, with Expression Language on the right of the field name (${uuid}) dynamically supplying the flow file id.
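The effect of these three operations can be approximated in plain Python, as in the sketch below. This is an emulation for illustration only, not Jolt: the helper names (modify_overwrite_lower, apply_defaults) are hypothetical, the input is a hand-written stand-in for the output of the shift, and a random UUID stands in for the flow file's uuid attribute.

```python
import uuid


def modify_overwrite_lower(record, field):
    """Overwrite the field with its lower-cased value when it is a string."""
    out = dict(record)
    if isinstance(out.get(field), str):
        out[field] = out[field].lower()
    return out


def apply_defaults(record, defaults):
    """Fill in a default only where the field is missing or null."""
    out = dict(record)
    for key, value in defaults.items():
        if out.get(key) is None:
            out[key] = value
    return out


# Hand-written stand-in for the record produced by the shift operation.
shifted = {
    "tweet_id": 846918283102081024,
    "tweet_text": "CSUB falls to Georgia Tech 76-61 in NIT semifinal game.",
    "in_reply_to_status_id": None,
    "in_reply_to_screen_name": None,
}

# Stand-in for the flow file's uuid attribute (assumption for this sketch).
flow_file_uuid = str(uuid.uuid4())

result = modify_overwrite_lower(shifted, "tweet_text")
result = apply_defaults(result, {"in_reply_to_status_id": 0, "in_reply_to_screen_name": ""})
result = apply_defaults(result, {"flow_file_id": flow_file_uuid})
print(result)
```

Existing non-null values such as tweet_id pass through untouched, since defaults are only applied where a value is absent or null.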

Figure (15935-jolt-el-advancedui.png): JoltTransformJson Advanced UI with Chain Specification

Figure (15936-jolt-el-testattributes.png): New Test Attributes Modal for testing Expression Language used in Specifications

The Advanced UI (shown above) has also been enhanced to support testing specifications that use Expression Language: users can supply test attributes to be resolved during testing. This gives users greater insight into how a flow will behave without relying on any external dependencies such as flow file attributes or Variable Registry entries.

Figure (15937-jolt-el-transformed-content.png): Example of Transformed JSON (shown in Provenance)

Looking to give this a try? Feel free to download the example template on GitHub Gist here and import it into NiFi. The template includes the specification described above, which you can tweak to test out various scenarios. If you have any questions about transforming JSON in Apache NiFi with Jolt, please comment below or reach out to the community on the Apache NiFi mailing list.

Comments
New Contributor

Hi, I want to load the complete specification file from a database. To do that, I created processors to load it from the database and put it in a property, but when I specify that property as the specification, I get an invalid error.

I stored the specification in the "SPECS" property, but specifying it through "${SPECS}" is not working. I also tried through the Advanced option and property creation.

I think replacing the whole specification content through a property is not fully supported by the current Expression Language support in the Jolt processor?

Rising Star

@Yolanda M. Davis

Hi Yolanda, thanks for all the information about NiFi JoltTransform, it helped a lot! I hope you keep going :-)

Last update: 08-17-2019 12:40 PM (revision 2 of 2)