Community Articles
Find and share helpful community-sourced technical articles
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

The creation of large database schemas can be a very complicated task.  In this article, I am going to share how I used NiFi to fully automate a monstrous task.  For my project, I needed to create very large Avro Schemas, and corresponding Hive Tables for five or more Data Sources, each having from 400-500+ different CSV columns.   Doing this manually would have been a nightmare just to manage the initial schema creations.  Managing schema changes an even bigger task over time.   My answer was the Schema Generator API using NiFi and Schema Registry.  

 

Please reference the following NiFi Template:

Schema Generator API Demo Template 

Schema Generator APISchema Generator API

The above NiFi template provides the following capabilities:

  1. A NiFi API is capable of the following:
    • Accepting POST for a new table, given the tables data columns and data types is in CSV Format 
    • Creating Schema Registry Entity (POST: create)
    • Creating Schema Registry Avro Schema (POST: parse)
    • Creating Hive HQL Schema
    • Executing Hive HQL Statement
  2. A sample call to create Schema Registry Entity (demo).
  3. A sample call to parse Data Columns (22 string columns).
  4. Lots of helpful labels with notes.

The following are the template setup instructions:

  1. Download, Upload Template, and Drag Template to your NiFi Pallette.
  2. Make sure a Schema Registry is Setup within reach of NiFi.
  3. Edit the following Schema Generator Demo Process Group's Variables:
    • schemaGeneratorApiHost
    • schemaGeneratorApiPort
    • schemaRegistryUrl
    • hiveDatabaseName
    • hiveDatabaseConnectionUrl (jdbc string)
    • hiveConfigurationResources (path to hive-site.xml)
  4. Enable controller services in Schema Generator API process group:
    • StandardHttpContextMap for HandleHttpRequest & Response
    • HiveConnectionPool for PutHiveQl
  5. Start Schema Generator API Processor group.
  6. Navigate to samples and execute Sample Call 1, then 2 by switching appropriate GenerateFlowFile On/Off.  These 2 proc are disabled by default as you should switch them On and Off immediately. These are the only 2 proc that should not always run.  Disable them again when done.

This is just a basic demonstration to get you started with Schema Registry and Data Source Schema Automation.  Parts of this template are also helpful for anyone who needs to automate creating Avro Schemas and/or Hive Schemas for large CSVs which could still be done without Schema Registry.   The demo above has been tested up to 500 columns and includes mapping various different column types to hive data types.

Important Information

  1. The template is built and tested on NiFi 1.9, Single Node Nifi Cluster, with local Schema Registry Installed.
  2. The Schema Registry UI doesn't have full capability.  Learn the API to work with your Schemas Directly.  For Example:  Delete.  See my previous post Using the Schema Registry API for detailed API info.
  3. Versioning Schema Forward and Backward can be very problematic.  Be Warned.
  4. Use a proper and consistent table and column naming conventions. Complicated column names will break Avro and Hive.  Example characters include but not limited to: spaces, /, \, $, *, [, ], (, ), etc.  
  5. Schema Registry Entities and Associated Avro Schemas can be used in NiFi Record Readers, using HortonworksSchemaRegistry, and other Controller Services.
210 Views
0 Kudos
Don't have an account?
Coming from Hortonworks? Activate your account here
Version history
Revision #:
3 of 3
Last update:
‎04-08-2020 03:35 AM
Updated by:
 
Top Kudoed Authors