Creating large database schemas can be a very complicated task. In this article, I share how I used NiFi to fully automate a monstrous one. For my project, I needed to create very large Avro schemas and corresponding Hive tables for five or more data sources, each with 400 to 500 or more CSV columns. Creating the initial schemas by hand would have been a nightmare, and managing schema changes over time is an even bigger task. My answer was the Schema Generator API, built with NiFi and Schema Registry.
The template includes:
A sample call to create a Schema Registry entity (demo).
A sample call to parse data columns (22 string columns).
Lots of helpful labels with notes.
The following are the template setup instructions:
Download the template, upload it to NiFi, and drag it onto your NiFi canvas.
Make sure a Schema Registry is set up and reachable from NiFi.
Edit the following variables on the Schema Generator Demo process group:
schemaGeneratorApiHost
schemaGeneratorApiPort
schemaRegistryUrl
hiveDatabaseName
hiveDatabaseConnectionUrl (jdbc string)
hiveConfigurationResources (path to hive-site.xml)
Enable the controller services in the Schema Generator API process group:
StandardHttpContextMap for HandleHttpRequest and HandleHttpResponse
HiveConnectionPool for PutHiveQL
Start the Schema Generator API process group.
Navigate to the samples and execute Sample Call 1, then Sample Call 2, by switching the appropriate GenerateFlowFile processor on and off. These two processors are disabled by default because each one should be switched on and then immediately off again. They are the only two processors that should not run continuously. Disable them again when you are done.
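Before typing the variable values into NiFi, it can help to sanity-check them. The following is a minimal Python sketch of such a check, not part of the template. Only the variable names come from the list above. Every value shown (hosts, ports, paths) is a hypothetical placeholder, and the `jdbc:hive2://` prefix assumes a HiveServer2 connection.

```python
from urllib.parse import urlparse

# Hypothetical example values; only the variable names come from the template.
variables = {
    "schemaGeneratorApiHost": "localhost",
    "schemaGeneratorApiPort": "8099",
    "schemaRegistryUrl": "http://localhost:7788/api/v1",
    "hiveDatabaseName": "demo",
    "hiveDatabaseConnectionUrl": "jdbc:hive2://localhost:10000/demo",
    "hiveConfigurationResources": "/etc/hive/conf/hive-site.xml",
}


def check_variables(v):
    """Return a list of problems; an empty list means the values look plausible."""
    problems = []
    for name, value in v.items():
        if not value.strip():
            problems.append(f"{name} is empty")
    port = v.get("schemaGeneratorApiPort", "")
    if not (port.isdigit() and 1 <= int(port) <= 65535):
        problems.append("schemaGeneratorApiPort must be a number from 1 to 65535")
    if urlparse(v.get("schemaRegistryUrl", "")).scheme not in ("http", "https"):
        problems.append("schemaRegistryUrl should start with http:// or https://")
    # Assumption: the Hive connection goes through HiveServer2.
    if not v.get("hiveDatabaseConnectionUrl", "").startswith("jdbc:hive2://"):
        problems.append("hiveDatabaseConnectionUrl should be a jdbc:hive2:// string")
    if not v.get("hiveConfigurationResources", "").endswith("hive-site.xml"):
        problems.append("hiveConfigurationResources should point to hive-site.xml")
    return problems


print(check_variables(variables))
```

An empty list means nothing obviously wrong was found. It does not prove that the services are actually reachable from NiFi.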
This is just a basic demonstration to get you started with Schema Registry and data source schema automation. Parts of this template are also helpful for anyone who needs to automate creating Avro schemas or Hive schemas for large CSVs, which can still be done without Schema Registry. The demo above has been tested with up to 500 columns and includes mapping various column types to Hive data types.
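To make the idea concrete outside of NiFi, here is a minimal Python sketch of turning a list of columns into an Avro schema and a matching Hive table definition. It is an illustration under stated assumptions, not the template's actual logic. The `TYPE_MAP` table, the function names, the nullable-union field style, and the `STORED AS AVRO` clause are all my own choices for this example.

```python
import json

# Assumed mapping from a simple type label to (Avro type, Hive type).
TYPE_MAP = {
    "string": ("string", "STRING"),
    "int": ("int", "INT"),
    "long": ("long", "BIGINT"),
    "float": ("float", "FLOAT"),
    "double": ("double", "DOUBLE"),
    "boolean": ("boolean", "BOOLEAN"),
}


def build_avro_schema(name, columns, namespace="demo"):
    """Build an Avro record schema (as a dict) from (column, type_label) pairs."""
    fields = []
    for col, type_label in columns:
        avro_type = TYPE_MAP[type_label][0]
        # Nullable union with a null default, so empty CSV cells remain valid.
        fields.append({"name": col, "type": ["null", avro_type], "default": None})
    return {"type": "record", "name": name, "namespace": namespace, "fields": fields}


def build_hive_ddl(database, table, columns):
    """Build a CREATE TABLE statement from the same (column, type_label) pairs."""
    cols = ",\n  ".join(f"`{col}` {TYPE_MAP[t][1]}" for col, t in columns)
    return (
        f"CREATE TABLE IF NOT EXISTS `{database}`.`{table}` (\n  {cols}\n) "
        "STORED AS AVRO"
    )


# 22 string columns, mirroring the size of the sample call in the demo.
columns = [(f"col_{i}", "string") for i in range(1, 23)]
schema = build_avro_schema("demo", columns)
ddl = build_hive_ddl("default", "demo", columns)

print(json.dumps(schema, indent=2))
print(ddl)
```

Driving both outputs from one column list keeps the Avro schema and the Hive table in step, which matters most when a source has hundreds of columns.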
Important Information
The template was built and tested on NiFi 1.9, on a single-node NiFi cluster with a local Schema Registry installed.
The Schema Registry UI does not have full capability. Learn the API so you can work with your schemas directly, for example to delete one. See my previous post, Using the Schema Registry API, for detailed API info.
Versioning schemas forward and backward can be very problematic. Be warned.
Use proper and consistent table and column naming conventions. Complicated column names will break Avro and Hive. Problem characters include, but are not limited to: spaces, /, \, $, *, [, ], (, ), etc.
Schema Registry entities and their associated Avro schemas can be used in NiFi record readers, through the HortonworksSchemaRegistry controller service, and in other controller services.
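The naming advice above can be applied to CSV headers before any schema is generated. Below is a small Python sketch of one way to do that. The function names and the specific rules (lowercasing, replacing unsafe runs with an underscore, prefixing names that start with a digit, suffixing duplicates) are assumptions for this example, not what the template does. The target pattern follows the Avro rule that a name starts with a letter or underscore and continues with letters, digits, or underscores.

```python
import re


def sanitize_column_name(raw):
    """Turn an arbitrary CSV header into a lowercase name matching [a-z_][a-z0-9_]*."""
    # Replace each run of characters outside letters, digits, and "_" with one "_".
    name = re.sub(r"[^A-Za-z0-9_]+", "_", raw.strip())
    name = name.strip("_").lower()
    if not name:
        name = "col"
    if name[0].isdigit():
        # Avro names may not start with a digit.
        name = "col_" + name
    return name


def sanitize_all(headers):
    """Sanitize every header and suffix repeats so the names stay unique.

    Simple suffixing only: it does not guard against a header that already
    sanitizes to one of the suffixed names.
    """
    seen = {}
    result = []
    for header in headers:
        base = sanitize_column_name(header)
        count = seen.get(base, 0)
        seen[base] = count + 1
        result.append(base if count == 0 else f"{base}_{count + 1}")
    return result


print(sanitize_all(["Order ID", "Price ($)", "a/b", "a\\b", "2019 total"]))
# prints ['order_id', 'price', 'a_b', 'a_b_2', 'col_2019_total']
```

Keep a record of the original header next to each sanitized name, so the mapping back to the source file is never lost.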