Support Questions

Find answers, ask questions, and share your expertise
Announcements
Welcome to the upgraded Community! Read this blog to see What’s New!

Best Practice - JSON to Avro, data type preserving

avatar
Super Collaborator

Hi All, What is the best approach to convert a JSON to AVRO preserving the source datatypes. My source JSON has a field with timestamp (value would look like 2017-01-26T00:00:00-05:00) which I need to eventually insert to a hive table with column type timestamp. When I infer the schema, I get String for the timestamp field. Is there some pre-formatting that I can do on the timestamp field so that it gets inferred as timestamp field. Current flow is as below - JSON>>AVRO(infer/manually add schema)>>Streaming Insert to hive

21 REPLIES 21

avatar
Super Guru

What are you using for your AVRO Schema? If you specify that the field as timestamp.

https://avro.apache.org/docs/1.8.0/spec.html#Timestamp+%28millisecond+precision%29

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-Timestamps...

Can you post an example JSON file, AVRO Schema, data provenance of the run, Hive DDL

Standard Format "YYYY-MM-DD HH:MM:SS.fffffffff"

For the schema, copy the inferred one from a data provenance run and then change the type from string to timestamp and save that schema and use that for the next run.

If there's empty strings or null in the first 20 records, I am thinking you will get String as a type.

avatar
Super Collaborator

My Sample JSON File is

{
    "timestamp": "2017-01-26T00:00:00-05:00",
    "c1": 73.0,
    "c2": 36.5,
    "c3": 43.8,
    "c4": 0.1,
    "c5": 75.4,
    "c6": 997.8,
    "c7": 0.5,
    "c8": 4.58,
    "c9": 43.8,
    "c10": 1.5,
    "c11": 40.6,
    "postal_code": "08863",
    "country": "us"
}
And the Avro schema inferred by NiFi is
{
    "type": "record",
    "name": "date",
    "fields": [
        {
            "name": "timestamp",
            "type": "string"
        },
        {
            "name": "c1",
            "type": "double",
            "doc": "Type inferred from '73'"
        },
        {
            "name": "c2",
            "type": "double",
            "doc": "Type inferred from '36.5'"
        },
        {
            "name": "c3",
            "type": "double",
            "doc": "Type inferred from '43.8'"
        },
        {
            "name": "c4",
            "type": "double",
            "doc": "Type inferred from '0'"
        },
        {
            "name": "c5",
            "type": "double",
            "doc": "Type inferred from '75.4'"
        },
        {
            "name": "c6",
            "type": "double",
            "doc": "Type inferred from '997.8'"
        },
        {
            "name": "c7",
            "type": "double",
            "doc": "Type inferred from '0'"
        },
        {
            "name": "c8",
            "type": "double",
            "doc": "Type inferred from '4.58'"
        },
        {
            "name": "c9",
            "type": "double",
            "doc": "Type inferred from '43.8'"
        },
        {
            "name": "c10",
            "type": "double",
            "doc": "Type inferred from '1.5'"
        },
        {
            "name": "c11",
            "type": "double",
            "doc": "Type inferred from '40.6'"
        },
        {
            "name": "postal_code",
            "type": "string",
            "doc": "Type inferred from '\"08863\"'"
        },
        {
            "name": "country",
            "type": "string",
            "doc": "Type inferred from '\"us\"'"
        }
    ]
}

avatar
Super Collaborator

@Timothy Spann : I tried editing the avro schema manually adding "type": "timestamp-millis" in place of "string". However the processor does not accept this and notifies a "Schema Validation Failure"

avatar
Super Guru

try as timestamp and not timestamp-millis. It may not support Avro 1.8

avatar
Super Collaborator

Thanks @Timothy Spann The Avro version is 1.7.7 (looking at the pom.xml). Is there a timestamp type in avro 1.7.7? I guess not. I could see the logicalType timestamp-* in 1.8 Should I assume that the timestamp would not work.

avatar
Super Guru

What version of HDF are you using? can you upgrade?

No timestamps in 1.7.7 http://avro.apache.org/docs/1.7.7/spec.html#schema_primitive

I seem to remember that being added in 1.8

Here are some AVRO notes I did last year for a meetup: https://github.com/airisdata/avroparquet

avatar
Super Collaborator

@Timothy Spann This is NiFi 1.1.1. I believe avro is still 1.7.7. Not sure if it is 1.8.

avatar
Super Guru
{
   "type" : "record",
   "namespace" : "hivetable",
   "name" : "hivetable",
   "fields" : [
      { "name" : "status" , "type" : "string" },
      { "name" : "rtimestamp" , "type" : "timestamp-millis" },
      { "name" : "country" , "type" : "string" }, 
      { "name" : "ip" , "type" : "string" }
   ]
}

avatar
New Contributor

Can u please help on when we convert json to avro format how can we define logicaltype in avro

avatar
New Contributor

While converting from json to avro format,how to get logicaltype in avro format.

And to get logicaltype in avro format,what we need to add in json data .

avatar
Contributor

Here's an exmaple of what a Timestamp schema entry would look like: https://avro.apache.org/docs/1.8.0/spec.html#Timestamp+%28millisecond+precision%29

{ "type": "record",

"namespace": "test",

"name": "test1",

"fields": [

{ "name": "ID", "type": "string" },

{ "name": "V", "type": "float" },

{ "name": "T", "type":

{ "type": "long", "logicalType": "timestamp-millis" }

} ]

}

avatar
Expert Contributor
@Arun A K

why not use an the new Record oriented processors, store the schema locally in nifi using the avroschemaregistry.

https://community.hortonworks.com/questions/113959/use-nifi-to-change-the-format-of-numeric-date-and...

avatar
Super Collaborator

Thanks @Karthik Narayanan, I can use them now, but the question was when we had not released NiFi 1.2

avatar
Expert Contributor

as per the avro doc, timestamp-millis is a logicaltype, so you have to use something like this...

{"name":"timestamp","type": {"type": "string", "logicalType": "timestamp-millis"}

see if that works. https://github.com/mtth/avsc/wiki/Advanced-usage#logical-types

avatar
Super Collaborator

Thanks @Karthik Narayanan, same reason that I was on a NiFi that had Avro 1.7 the logicalType doesn't work. If I remember correctly, it threw an invalid avro schema error.

avatar
Expert Contributor

hmm, strange which version of nifi were you on?

avatar
Super Collaborator

@Karthik Narayanan it was NiFi 1.1 and looking at the pom.xml, I am assuming there is an avro 1.7 dependency. The logical time stamp stuff was introduced in avro 1.8

avatar
Expert Contributor

i tried on 1.1 it worked fine.

json2avro.xml

{    
"type": "record",
    "name": "testdata",
    "fields": [
        {
            "name": "timestamp",
            "type": {"type":"string","logicalType":"timestamp-millis"}
        },
        {
            "name": "c1",
            "type": "double"
        },
        {
            "name": "c2",
            "type": "double"
        },
        {
            "name": "c3",
            "type": "double"
        },
        {
            "name": "c4",
            "type": "double"
        },
        {
            "name": "c5",
            "type": "double"
        },
        {
            "name": "c6",
            "type": "double"
        },
        {
            "name": "c7",
            "type": "double"
        },
        {
            "name": "c8",
            "type": "double"
        },
        {
            "name": "c9",
            "type": "double"
        },
        {
            "name": "c10",
            "type": "double"
        },
        {
            "name": "c11",
            "type": "double"
        },
        {
            "name": "postal_code",
            "type": "string"
        },
        {
            "name": "country",
            "type": "string"
        }
    ]
}

Labels