How to validate a JSON file with a JSON Schema?

Super Collaborator

Hi,

I need to ingest only the JSON files that follow a valid schema.

I am trying to do this with the ValidateRecord processor.

I am supplying the same schema to both the JsonTreeReader and the JsonRecordSetWriter.

I am not using Avro because my input contains _ in the names.

(I came up with this schema by modifying the input file to remove the _, running InferAvroSchema on it, and then changing both back to use _ so that they match the input file.)

My schema and files match, but the files are being routed to the invalid relationship. Is there anything wrong in what I am doing?

Schema:

{ "type": "record",
  "name": "iHist",
  "fields": [
    { "name": "file_name", "type": "string" },
    { "name": "plant", "type": "string" },
    { "name": "collector", "type": "string" },
    { "name": "name", "type": "string" },
    { "name": "unique_id", "type": "string" },
    { "name": "description", "type": "string" },
    { "name": "general_1", "type": "string" },
    { "name": "general_2", "type": "string" },
    { "name": "general_3", "type": "string" },
    { "name": "general_4", "type": "string" },
    { "name": "general_5", "type": "string" },
    { "name": "data_points", "type":
      { "type": "array",
        "items":
          { "type": "record",
            "name": "data_points",
            "fields": [
              { "name": "timestamp", "type": "string" },
              { "name": "value", "type": "string" },
              { "name": "quality", "type": "string" }
            ]
          } } } ] }

Data file:

{ "file-name": "tp-tcollec.tag.json",
  "plant": "P11A3",
  "collector": "test_Collector",
  "name": "tag_SAFETY_MARGN.F_CV",
  "unique-id": "1532358720761",
  "description": "test",
  "general-1": "",
  "general-2": "",
  "general-3": "",
  "general-4": "",
  "general-5": "",
  "datapoints": [
    { "timestamp": "2016-07-19T10:25:43.000Z", "value": "177", "quality": "100" },
    { "timestamp": "2016-07-19T10:25:42.000Z", "value": "177", "quality": "100" },
    { "timestamp": "2016-07-19T10:25:41.000Z", "value": "177", "quality": "100" } ] }

I just need to validate whether the input file follows the schema. Is there a better way to do this?

[Attached screenshots: 82450-jrecsetwriter.jpg, 82449-jtreereader.jpg, 82448-validate.jpg]

1 ACCEPTED SOLUTION

Master Guru

I'm not sure what you meant about changing the "-" to "_" with InferAvroSchema and such, but here's a different approach (assuming you have HDF 3.1 / Apache NiFi 1.5.0+) that lets you use the "correct" Avro schema even though the field names contain Avro-invalid characters:

Create an AvroSchemaRegistry controller service and add a property called "mySchema" (or whatever you want to call it) with the original schema as its value. Then set the "Validate Field Names" property to false (this property was added in NiFi 1.5.0 via NIFI-4612); that allows field names such as "general-1" without throwing an error. In your JsonTreeReader, set the access strategy to "Use Schema Name", select your AvroSchemaRegistry in the "Schema Registry" property, and enter "mySchema" as the value of the "Schema Name" property. The JSON writer can use Inherit Schema, so you don't need to put the schema in there either. When the schema comes from the schema registry with Validate Field Names set to false, you can use it even when the field names are not Avro-valid.
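
As a quick check outside NiFi, a minimal Python sketch like the following can compare the field names a schema declares with the keys a sample record actually contains. It is not part of NiFi: the helpers schema_field_names and compare_names are made up for illustration, and the schema and record are abbreviated copies of the ones posted in the question. With those inputs, the schema declares names such as file_name and data_points while the data uses file-name and datapoints, which is likely why the records end up in 'invalid'.

```python
import json


def schema_field_names(schema):
    """Return the top-level field names declared in an Avro record schema."""
    return [field["name"] for field in schema["fields"]]


def compare_names(schema, record):
    """Return (names only in the schema, names only in the record), both sorted."""
    declared = set(schema_field_names(schema))
    present = set(record)
    return sorted(declared - present), sorted(present - declared)


# Abbreviated copy of the schema from the question (underscored names).
schema = json.loads("""
{ "type": "record", "name": "iHist", "fields": [
    { "name": "file_name", "type": "string" },
    { "name": "plant", "type": "string" },
    { "name": "unique_id", "type": "string" },
    { "name": "general_1", "type": "string" },
    { "name": "data_points", "type": { "type": "array", "items":
        { "type": "record", "name": "data_points", "fields": [
            { "name": "timestamp", "type": "string" } ] } } } ] }
""")

# Abbreviated copy of the data file from the question (hyphenated names).
record = json.loads("""
{ "file-name": "tp-tcollec.tag.json",
  "plant": "P11A3",
  "unique-id": "1532358720761",
  "general-1": "",
  "datapoints": [ { "timestamp": "2016-07-19T10:25:43.000Z" } ] }
""")

missing, unexpected = compare_names(schema, record)
print("declared in schema but absent from data:", missing)
print("present in data but not in schema:", unexpected)
```

Any name that shows up in either list needs to be reconciled, either by writing the schema with the names exactly as they appear in the data (which is what the registry approach above allows) or by renaming the fields in the data.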


REPLIES

Super Collaborator

Hi @Matt Burgess,

Any idea what I am doing wrong in the above case?

Super Collaborator

@Matt Burgess,

Thank you. I didn't know about Validate Field Names.

Super Collaborator

Hi @Matt Burgess,

Is there any way I can use ValidateRecord just to check whether the file follows a schema and then route it to valid?

I don't know why we need a "Record Writer" for validation. It changes the file format slightly, for example by reordering the tags.

I just want the input file passed through as the output, with its contents unchanged, if it is valid. Or is there another way I can achieve this?

Regards,

Sai

Master Guru

ValidateRecord is more about validating individual records than about validating the entire flow file. If some records are valid and some are invalid, each group is routed to the corresponding relationship. For invalid records, however, we can't use the same record writer as for valid records, because we know it would fail (the records are invalid), so there is a second RecordWriter for invalid records. You might use it to try to record the field names or something. By the time ValidateRecord knows an individual record is invalid, it doesn't know that the record came in as Avro (for example), nor does it know that you might want it to go out as Avro. That's the flexibility and power of the Record Reader/Writer paradigm, but in this case the tradeoff is that you currently can't treat the entire flow file as valid or invalid.

It may make sense to add an "Invalid Record Strategy" property to choose between "Individual Records", which uses the RecordWriters (the current behavior), and "Original FlowFile", which would ignore the RecordWriters and instead transfer the entire incoming flow file as-is to the 'invalid' relationship. Please feel free to file an improvement Jira for this capability.
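
To make the difference between the two strategies concrete, here is a small standalone Python sketch. It is not NiFi code and does not use any NiFi API: route, is_valid, REQUIRED_FIELDS and the strategy names are all hypothetical, and the per-record rule (three string fields) is only an assumption borrowed from the data points in the question. The "original" branch also passes a fully valid file through untouched, which goes slightly beyond the property described above but matches what was asked for.

```python
import json

# Hypothetical per-record rule: every record needs these fields as strings.
REQUIRED_FIELDS = {"timestamp", "value", "quality"}


def is_valid(record):
    """A record is valid when it is an object with each required field as a string."""
    return isinstance(record, dict) and all(
        isinstance(record.get(name), str) for name in REQUIRED_FIELDS
    )


def route(content, strategy="individual"):
    """Route a JSON array of records to 'valid' and/or 'invalid' outputs.

    'individual': records are split and re-serialized per relationship,
                  so formatting and key order of the input are not preserved.
    'original':   the incoming text is passed through byte-for-byte, to 'valid'
                  only if every record passes, otherwise to 'invalid'.
    """
    records = json.loads(content)
    if strategy == "original":
        target = "valid" if all(is_valid(r) for r in records) else "invalid"
        return {target: content}
    routed = {"valid": [], "invalid": []}
    for record in records:
        routed["valid" if is_valid(record) else "invalid"].append(record)
    # Only emit a relationship when it received at least one record.
    return {name: json.dumps(rs) for name, rs in routed.items() if rs}


mixed = '[{"timestamp": "t", "value": "1", "quality": "100"}, {"timestamp": "t"}]'
print(route(mixed))              # one record per relationship, re-serialized
print(route(mixed, "original"))  # whole input goes to 'invalid', unchanged
```

The trade-off is the one described above: per-record routing salvages the good records but rewrites them, while whole-file routing keeps the content intact at the cost of rejecting a file for a single bad record.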