- Subscribe to RSS Feed
- Mark as New
- Mark as Read
- Bookmark
- Subscribe
- Printer Friendly Page
- Report Inappropriate Content
Created on 07-06-2016 04:24 PM - edited 08-17-2019 11:31 AM
Query JSON using Spark
Imagine you are ingesting JSON msgs but each one has different tag names or even a different structure. This is very common because JSON is a flexible nested structure. However we commonly interact with data in a flat table like structure using SQL. The decision becomes to either parse the dynamic data into a physical schema (on write) or apply a schema at runtime (on read). Ultimately the decision will likely be made based on the number of writes vs reads. However there is one major advantage to using Spark to apply schema on read to JSON events, it alleviates the parsing step. Typically you have to hand code all the tags in the JSON msgs and map each one to a schema column. This may require meeting with upstream teams or third parties to get the DDL/xsd or schema definition. It also doesn't protect you from msgs you haven't seen or new tags being added to existing JSON structures. Sparks schema on read handles all of this as well as flattens the structure into a SQL queryable table. In the example below there are 3 different JSON msgs each with different tags and structures. If the goal is to normalize the data for a specific reporting or data science task you may be better off defining a physical schema where items like price and strikePrice are converged to a common column that makes sense in both contexts. However if your goal is to process or serve msgs like a msg bus, or if you find that it is better to query stocks separately from options because the attributes should not be interpreted and you do not want to become the author of the data you are processing then this could be an ideal approach. (A non-authoritative, low maintenance approach that is queryable)
{"tradeId":"123", "assetClass":"stock", "transType":"buy", "price":"22.34", "stockAttributes":{ "5avg":"20.12","52weekHi":"27.56" } }
{"tradeId":"456", "assetClass":"future", "transType":"sell", "strikePrice":"40.00", "contractType": "forward", "account":{ "city":"Columbus","state":"Ohio", "zip":"21000" } }
{"tradeId":"789", "assetClass":"option", "transType":"buy", "strikePrice":"35.75", "account":{ "accountType":"retail","city":"Columbus","state":"Ohio" } }
1.0 The below image shows the 3 different JSON msgs (stock,option,future) with different attributes and structures.
2.0 Here you can query all of the data or any segment of the data using SQL.
Full code on zephub - code link
Pros:
Data tags and structure are always in sync with provider
No data loss
No parsing layer (code effort), faster time to market
No authoring, naming or defining columns
Cons:
SQL reads will be slower than a physically flattened and written table
Deserialization cost and can't benefit from modern day columnar operations
Compression - "don't use JSON" video from summit https://www.youtube.com/watch?v=tB28rPTvRiI&feature=youtu.be&t=20m3s
Created on 07-07-2016 02:42 AM
- Mark as Read
- Mark as New
- Bookmark
- Permalink
- Report Inappropriate Content
Great article.