- Subscribe to RSS Feed
- Mark Question as New
- Mark Question as Read
- Float this Question for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page
Best Practice - JSON to Avro, data type preserving
- Labels:
-
Apache NiFi
Created ‎01-26-2017 05:37 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi All, What is the best approach to convert a JSON to AVRO preserving the source datatypes. My source JSON has a field with timestamp (value would look like 2017-01-26T00:00:00-05:00) which I need to eventually insert to a hive table with column type timestamp. When I infer the schema, I get String for the timestamp field. Is there some pre-formatting that I can do on the timestamp field so that it gets inferred as timestamp field. Current flow is as below - JSON>>AVRO(infer/manually add schema)>>Streaming Insert to hive
Created ‎01-26-2017 05:56 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
What are you using for your AVRO Schema? If you specify that the field as timestamp.
https://avro.apache.org/docs/1.8.0/spec.html#Timestamp+%28millisecond+precision%29
Can you post an example JSON file, AVRO Schema, data provenance of the run, Hive DDL
Standard Format "YYYY-MM-DD HH:MM:SS.fffffffff
"
For the schema, copy the inferred one from a data provenance run and then change the type from string to timestamp and save that schema and use that for the next run.
If there's empty strings or null in the first 20 records, I am thinking you will get String as a type.
Created ‎01-26-2017 06:31 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
My Sample JSON File is
{ "timestamp": "2017-01-26T00:00:00-05:00", "c1": 73.0, "c2": 36.5, "c3": 43.8, "c4": 0.1, "c5": 75.4, "c6": 997.8, "c7": 0.5, "c8": 4.58, "c9": 43.8, "c10": 1.5, "c11": 40.6, "postal_code": "08863", "country": "us" }And the Avro schema inferred by NiFi is
{ "type": "record", "name": "date", "fields": [ { "name": "timestamp", "type": "string" }, { "name": "c1", "type": "double", "doc": "Type inferred from '73'" }, { "name": "c2", "type": "double", "doc": "Type inferred from '36.5'" }, { "name": "c3", "type": "double", "doc": "Type inferred from '43.8'" }, { "name": "c4", "type": "double", "doc": "Type inferred from '0'" }, { "name": "c5", "type": "double", "doc": "Type inferred from '75.4'" }, { "name": "c6", "type": "double", "doc": "Type inferred from '997.8'" }, { "name": "c7", "type": "double", "doc": "Type inferred from '0'" }, { "name": "c8", "type": "double", "doc": "Type inferred from '4.58'" }, { "name": "c9", "type": "double", "doc": "Type inferred from '43.8'" }, { "name": "c10", "type": "double", "doc": "Type inferred from '1.5'" }, { "name": "c11", "type": "double", "doc": "Type inferred from '40.6'" }, { "name": "postal_code", "type": "string", "doc": "Type inferred from '\"08863\"'" }, { "name": "country", "type": "string", "doc": "Type inferred from '\"us\"'" } ] }
Created ‎01-26-2017 07:57 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
@Timothy Spann : I tried editing the avro schema manually adding "type": "timestamp-millis" in place of "string". However the processor does not accept this and notifies a "Schema Validation Failure"
Created ‎01-26-2017 08:40 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
try as timestamp and not timestamp-millis. It may not support Avro 1.8
Created ‎01-27-2017 06:45 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thanks @Timothy Spann The Avro version is 1.7.7 (looking at the pom.xml). Is there a timestamp type in avro 1.7.7? I guess not. I could see the logicalType timestamp-* in 1.8 Should I assume that the timestamp would not work.
Created ‎01-27-2017 07:08 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
What version of HDF are you using? can you upgrade?
No timestamps in 1.7.7 http://avro.apache.org/docs/1.7.7/spec.html#schema_primitive
I seem to remember that being added in 1.8
Here are some AVRO notes I did last year for a meetup: https://github.com/airisdata/avroparquet
Created ‎01-27-2017 09:39 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
@Timothy Spann This is NiFi 1.1.1. I believe avro is still 1.7.7. Not sure if it is 1.8.
Created ‎01-26-2017 08:40 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Created ‎01-26-2017 06:06 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
{ "type" : "record", "namespace" : "hivetable", "name" : "hivetable", "fields" : [ { "name" : "status" , "type" : "string" }, { "name" : "rtimestamp" , "type" : "timestamp-millis" }, { "name" : "country" , "type" : "string" }, { "name" : "ip" , "type" : "string" } ] }
