Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

Best Practice - JSON to Avro, data type preserving

avatar
Super Collaborator

Hi All, What is the best approach to convert a JSON to AVRO preserving the source datatypes. My source JSON has a field with timestamp (value would look like 2017-01-26T00:00:00-05:00) which I need to eventually insert to a hive table with column type timestamp. When I infer the schema, I get String for the timestamp field. Is there some pre-formatting that I can do on the timestamp field so that it gets inferred as timestamp field. Current flow is as below - JSON>>AVRO(infer/manually add schema)>>Streaming Insert to hive

21 REPLIES 21

avatar
Master Guru

What are you using for your AVRO Schema? If you specify that the field as timestamp.

https://avro.apache.org/docs/1.8.0/spec.html#Timestamp+%28millisecond+precision%29

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-Timestamps...

Can you post an example JSON file, AVRO Schema, data provenance of the run, Hive DDL

Standard Format "YYYY-MM-DD HH:MM:SS.fffffffff"

For the schema, copy the inferred one from a data provenance run and then change the type from string to timestamp and save that schema and use that for the next run.

If there's empty strings or null in the first 20 records, I am thinking you will get String as a type.

avatar
Super Collaborator

My Sample JSON File is

{
    "timestamp": "2017-01-26T00:00:00-05:00",
    "c1": 73.0,
    "c2": 36.5,
    "c3": 43.8,
    "c4": 0.1,
    "c5": 75.4,
    "c6": 997.8,
    "c7": 0.5,
    "c8": 4.58,
    "c9": 43.8,
    "c10": 1.5,
    "c11": 40.6,
    "postal_code": "08863",
    "country": "us"
}
And the Avro schema inferred by NiFi is
{
    "type": "record",
    "name": "date",
    "fields": [
        {
            "name": "timestamp",
            "type": "string"
        },
        {
            "name": "c1",
            "type": "double",
            "doc": "Type inferred from '73'"
        },
        {
            "name": "c2",
            "type": "double",
            "doc": "Type inferred from '36.5'"
        },
        {
            "name": "c3",
            "type": "double",
            "doc": "Type inferred from '43.8'"
        },
        {
            "name": "c4",
            "type": "double",
            "doc": "Type inferred from '0'"
        },
        {
            "name": "c5",
            "type": "double",
            "doc": "Type inferred from '75.4'"
        },
        {
            "name": "c6",
            "type": "double",
            "doc": "Type inferred from '997.8'"
        },
        {
            "name": "c7",
            "type": "double",
            "doc": "Type inferred from '0'"
        },
        {
            "name": "c8",
            "type": "double",
            "doc": "Type inferred from '4.58'"
        },
        {
            "name": "c9",
            "type": "double",
            "doc": "Type inferred from '43.8'"
        },
        {
            "name": "c10",
            "type": "double",
            "doc": "Type inferred from '1.5'"
        },
        {
            "name": "c11",
            "type": "double",
            "doc": "Type inferred from '40.6'"
        },
        {
            "name": "postal_code",
            "type": "string",
            "doc": "Type inferred from '\"08863\"'"
        },
        {
            "name": "country",
            "type": "string",
            "doc": "Type inferred from '\"us\"'"
        }
    ]
}

avatar
Super Collaborator

@Timothy Spann : I tried editing the avro schema manually adding "type": "timestamp-millis" in place of "string". However the processor does not accept this and notifies a "Schema Validation Failure"

avatar
Master Guru

try as timestamp and not timestamp-millis. It may not support Avro 1.8

avatar
Super Collaborator

Thanks @Timothy Spann The Avro version is 1.7.7 (looking at the pom.xml). Is there a timestamp type in avro 1.7.7? I guess not. I could see the logicalType timestamp-* in 1.8 Should I assume that the timestamp would not work.

avatar
Master Guru

What version of HDF are you using? can you upgrade?

No timestamps in 1.7.7 http://avro.apache.org/docs/1.7.7/spec.html#schema_primitive

I seem to remember that being added in 1.8

Here are some AVRO notes I did last year for a meetup: https://github.com/airisdata/avroparquet

avatar
Super Collaborator

@Timothy Spann This is NiFi 1.1.1. I believe avro is still 1.7.7. Not sure if it is 1.8.

avatar
Master Guru
{
   "type" : "record",
   "namespace" : "hivetable",
   "name" : "hivetable",
   "fields" : [
      { "name" : "status" , "type" : "string" },
      { "name" : "rtimestamp" , "type" : "timestamp-millis" },
      { "name" : "country" , "type" : "string" }, 
      { "name" : "ip" , "type" : "string" }
   ]
}