Image Data Flow for Industrial Imaging


Ingest and store manufacturing quality assurance images, measurements, and metadata in a cost-effective platform that makes retrieval simple and can support analytics in the future.


In high-speed manufacturing, imaging systems may be used to identify material imperfections, monitor thermal state, or identify when tolerances are exceeded. Many commercially available systems automate measurement and reporting of specific tests, but combining results from multiple instrumentation vendors, longer-term storage, process analytics, and comprehensive auditability require different technology.

Using HDF’s NiFi and HDP’s HDFS, Hive or HBase, and Zeppelin, one can build a cost-effective and performant solution to store and retrieve these images, as well as provide a platform for machine learning based on that data.

Sample files and code, including the Zeppelin notebook, can be found on this GitHub repository:


HDF 3.0 or later (NiFi

HDP 2.6.5 or later (Hadoop 2.6.3+ and Hive 1.2.1+)

Spark or later

Zeppelin 0.7.2+


  1. Get the files to a filesystem accessible to NiFi. In this case, we are assuming the source system can get the files to a local directory (e.g., via an NFS mount).
  2. Ingest the image and data files to long-term storage
    1. Use a ListFile processor to scrape the directory. In this example, the collected data files are in a root directory, with each manufacturing run’s image files placed in a separate subdirectory. We’ll use that location to separate out the files later.
    2. Use a FetchFile to pull the files listed in the flowfile generated by our ListFile. FetchFile can move all the files to an archive directory once they have been read.
    3. Since we’re using a different path for the images versus the original source data, we’ll split the flow using an UpdateAttribute to store the file type, then route them to two PutHDFS processors to place them appropriately. Note that PutHDFS_images uses the automatically parsed original ${path} to reproduce the source folder structure.
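
The routing decision in step 2 can be sketched outside NiFi as plain Python. This is only an illustration of the logic, not the flow's actual configuration: the two HDFS root directories, the list of image extensions, and the function names are assumptions made for the example. The `path` argument stands in for NiFi's `${path}` attribute, which ListFile sets to the directory relative to the scanned root.

```python
from pathlib import PurePosixPath

# Hypothetical target roots; the article does not state the real HDFS paths.
IMAGE_ROOT = PurePosixPath("/data/imaging/images")
DATA_ROOT = PurePosixPath("/data/imaging/raw")

# Assumed set of image extensions, for illustration only.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}


def classify(filename: str) -> str:
    """Return 'image' or 'data', like the file-type attribute set by UpdateAttribute."""
    suffix = PurePosixPath(filename).suffix.lower()
    return "image" if suffix in IMAGE_EXTENSIONS else "data"


def hdfs_target(path: str, filename: str) -> str:
    """Build the destination path for one listed file.

    `path` is relative to the scanned root, e.g. "run_0042/" for a run
    subdirectory or "./" for the root itself.
    """
    if classify(filename) == "image":
        # Images keep their per-run subdirectory under the image root.
        return str(IMAGE_ROOT / path / filename)
    # Data files all land in one flat directory.
    return str(DATA_ROOT / filename)
```

For example, `hdfs_target("run_0042/", "img_001.png")` keeps the run folder in the destination, while `hdfs_target("./", "results.csv")` places the data file directly under the data root.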


  3. Parse the data files to make them available for SQL queries
    1. Beginning with only the csv flowfiles, the ExtractGrok processor is used to pick one field from the second line of the flowfile (skipping the header row). This field is referenced by expression language that sets the schema name we will use to parse the flowfile.
    2. A RouteOnAttribute processor checks the schema name using a regex to determine whether the flowfile format is one that requires additional processing to parse. In the example, flowfiles identified as the “sem_meta_10083” schema are routed to the processor group “Preprocess-SEM-particle.” This processor group contains the steps for parsing nested arrays within the csv flowfile.
    3. Within the “Preprocess-SEM-particle” processor group, the flowfile is parsed using a temporary schema. A temporary schema can be helpful to parse some sections of a flowfile row (or tuple) while leaving others for later processing.


    4. The flowfile is split into individual records by a SplitRecord processor. SplitRecord is similar to a SplitJson or SplitText processor, but it uses NiFi’s record-oriented parsers to identify each record rather than relying strictly on length or linebreaks.
    5. A JoltTransformJSON processor uses the powerful JOLT language to parse a section of the csv file with nested arrays. In this case, a semicolon-delimited array of comma-separated values is reformatted to valid JSON then split out into separate flowfiles by an EvaluateJsonPath processor. This JOLT transform uses an interesting combination of JOLT wildcards and repeated processing of the same path to handle multiple possible formats.
    6. Once formatted by a record-oriented processor such as ConvertRecord or SplitRecord, the flowfile can be reformatted easily as Avro, then inserted into a Hive table using a PutHiveStreaming processor. PutHiveStreaming can be configured to ignore extra fields in the source flowfile or target Hive table so that many overlapping formats can be written to a table with a superset of columns in Hive. In this example, the 10083-formatted flowfiles are inserted row-by-row, and the particle and 10021-formatted flowfiles are inserted in bulk.
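
To make the parsing in step 3 concrete, here is a minimal Python sketch of two of its ideas: deriving a schema name from one field of the first data row, and expanding a semicolon-delimited array of comma-separated values into one record per element. It does not reproduce the ExtractGrok pattern or the JOLT specification used in the flow. The column names, the field position, and the helper names are hypothetical, and the `sem_meta_` prefix is borrowed from the schema name mentioned above purely for illustration.

```python
import csv
import io


def schema_name(csv_text: str, field_index: int = 0, prefix: str = "sem_meta_") -> str:
    """Derive a schema name from one field of the second line (first data row)."""
    rows = csv.reader(io.StringIO(csv_text))
    next(rows)                  # skip the header row
    first_record = next(rows)   # second line of the file
    return prefix + first_record[field_index].strip()


def expand_nested(record: dict, array_field: str, columns: list) -> list:
    """Turn 'a,b,c;d,e,f' in `array_field` into one dict per element.

    Each output dict repeats the parent record's other fields, so every
    element can travel on as its own record. Values stay as strings.
    """
    parent = {k: v for k, v in record.items() if k != array_field}
    expanded = []
    for element in record[array_field].split(";"):
        element = element.strip()
        if not element:
            continue  # tolerate a trailing semicolon
        values = [v.strip() for v in element.split(",")]
        child = dict(parent)
        child.update(zip(columns, values))
        expanded.append(child)
    return expanded


# Hypothetical sample: one data row with two nested particle measurements.
SAMPLE = 'format,run_id,particles\n10083,R42,"1,0.5,0.7;2,1.1,0.9"\n'
```

With the sample above, `schema_name(SAMPLE)` yields `sem_meta_10083`, and passing the row read by `csv.DictReader` to `expand_nested(row, "particles", ["particle_id", "x", "y"])` yields two records that both carry `run_id` `R42`.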


  4. Create a simple interface to retrieve individual images for review. The browser-based Zeppelin notebook can natively render images stored in SQL tables or in HDFS.
    1. The notebook begins with some basic queries to view the data loaded from the imaging subsystems.
    2. The first example paragraph uses SQL to pull a specific record from the manufacturing run, then looks for the matching file on HDFS by its timestamp.


    3. The second set of example paragraphs uses an HTML/Angular form to collect the information, then display the matching image.


    4. The third set of sample paragraphs demonstrates how to obtain the image via Scala for analysis or display.
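
The lookup in step 4, matching a record to its image by timestamp, can be sketched as follows. The notebook itself uses SQL and Scala; this Python version only illustrates the matching logic. The file naming convention (a timestamp at the end of the file stem) and the function name are assumptions, since the article does not describe how the image files are named.

```python
from datetime import datetime
from typing import Iterable, Optional

# Hypothetical naming convention: <prefix>_<YYYYmmddTHHMMSS>.<ext>
STAMP_FORMAT = "%Y%m%dT%H%M%S"


def find_image(record_time: datetime, paths: Iterable[str]) -> Optional[str]:
    """Return the first path whose file stem ends with the record's timestamp."""
    stamp = record_time.strftime(STAMP_FORMAT)
    for path in paths:
        name = path.rsplit("/", 1)[-1]   # file name without directories
        stem = name.rsplit(".", 1)[0]    # drop the extension
        if stem.endswith(stamp):
            return path
    return None  # no image recorded for this timestamp
```

In a notebook, `paths` would come from a listing of the run's image directory on HDFS, and the returned path would then be read and rendered.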




  • Storing microscopy data in HDF5/USID Schema to make it available for analysis using standard libraries
  • Applying TensorFlow to microscopy data for image analysis
Version history
Last update:
‎08-17-2019 05:03 AM