
Business Need

I needed to extract links from web pages using JSoup. I originally wrote a microservice for my NiFi MP3 Jukebox, but then built a custom NiFi processor instead. It's a pretty simple process, and there are many great articles on how to do it, referenced below.

Custom NiFi Processor Development Process

One thing I found useful was having at least one JUnit test that runs your processor. Deploying takes a while, especially if you need to deploy to a cluster of servers, so a local test shortens the feedback loop. I found a lot of great NiFi custom processor unit and integration tests online (see the reference area). It's really easy to develop custom processors in Java. In your tests you can feed in real files and get real files out. Keep a saved copy of the expected valid output so you can compare the processor's output against it. Your JUnit tests can be triggered from Jenkins or other build tools, so NiFi custom processors fit into your standard development process (TDD, CI, CD, autodeploy).

To deploy the NAR, SCP the NAR file to the lib directory on every NiFi node and restart NiFi. If you want to deploy your flow templates, you can do so with this tool.


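package com.dataflowdeveloper.processors.process;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.UnsupportedEncodingException;
import java.util.List;

import org.apache.nifi.util.MockFlowFile;
import org.apache.nifi.util.TestRunner;
import org.apache.nifi.util.TestRunners;
import org.junit.Before;
import org.junit.Test;

public class LinkProcessorTest {

 private TestRunner testRunner;

 // Create a fresh runner for the processor before each test.
 @Before
 public void init() {
  testRunner = TestRunners.newTestRunner(LinkProcessor.class);
 }

 @Test
 public void testProcessor() {
  // URL of the page whose links should be extracted.
  testRunner.setProperty("url", "");
  try {
   // Queue a test file as the incoming flow file.
   testRunner.enqueue(new FileInputStream(new File("src/test/resources/test.csv")));
  } catch (FileNotFoundException e) {
   e.printStackTrace();
  }

  // Execute the processor once so it can route output to its relationships.
  testRunner.run();

  List<MockFlowFile> successFiles = testRunner.getFlowFilesForRelationship(LinkProcessor.REL_SUCCESS);

  // Print the content of each flow file routed to success.
  for (MockFlowFile mockFile : successFiles) {
   try {
    System.out.println("FILE:" + new String(mockFile.toByteArray(), "UTF-8"));
   } catch (UnsupportedEncodingException e) {
    e.printStackTrace();
   }
  }
 }
}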
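The test above only prints the output. To follow the advice about comparing against a saved copy of the valid file, something like the sketch below could work. It uses only the JDK; the class name GoldenFileCheck, the method matchesExpected, and the temporary file standing in for a saved expected file are all hypothetical and not part of the original project.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

/** Hypothetical helper: compares processor output with a saved expected file. */
final class GoldenFileCheck {

    /** Returns true when the actual bytes equal the contents of the expected file. */
    static boolean matchesExpected(byte[] actual, Path expectedFile) throws IOException {
        byte[] expected = Files.readAllBytes(expectedFile);
        return Arrays.equals(expected, actual);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a saved expected file kept with the test resources (assumed, not from the project).
        Path expected = Files.createTempFile("expected", ".json");
        Files.write(expected, "[{\"link\":\"a\",\"descr\":\"b\"}]".getBytes(StandardCharsets.UTF_8));

        // Stand-in for the bytes returned by mockFile.toByteArray() in the test above.
        byte[] actual = "[{\"link\":\"a\",\"descr\":\"b\"}]".getBytes(StandardCharsets.UTF_8);

        System.out.println("matches: " + matchesExpected(actual, expected));
        Files.delete(expected);
    }
}
```

Inside the JUnit test, the same check could be wrapped in an assertion on each MockFlowFile returned for REL_SUCCESS, so the build fails when the output drifts from the saved copy.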


Step 1: Ingest or parse a URL via NiFi (the URL could also be sourced from a file).


Step 2: LinkProcessor. The url property can be hardcoded or set with an expression that reads attributes from the previous processor.


Step 3: Use UpdateAttribute to rename the resulting file (add a .json extension and make the name unique).
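
Step 3 happens inside the flow with UpdateAttribute, not in Java. Purely to illustrate the naming scheme, here is a minimal sketch that assumes the unique name is a random UUID with a .json suffix, which is what the example file name later in this article looks like. JsonFileNames and uniqueJsonName are hypothetical names.

```java
import java.util.UUID;

/** Hypothetical illustration of a unique, JSON-suffixed file name. */
final class JsonFileNames {

    /** Builds a name made of a random UUID followed by ".json". */
    static String uniqueJsonName() {
        return UUID.randomUUID() + ".json";
    }

    public static void main(String[] args) {
        // Prints a different name on every run.
        System.out.println(uniqueJsonName());
    }
}
```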

Building the Link Processor

Run the included script (on Linux or OSX), or run mvn install. The build requires JDK 8, Maven, and internet access.

Deploy the NAR:

scp nifi-process-nar/target/nifi-linkextractor-nar-1.0-SNAPSHOT.nar PLACE:/place/

Alternatively, copy it to your local NIFI/lib directory. You can also get a release of the NAR from GitHub.

Maven Build Script (pom.xml)

<?xml version="1.0" encoding="UTF-8"?>
<!--
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements. See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License. You may obtain a copy of the License at
  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  See the License for the specific language governing permissions and
  limitations under the License.
-->
<project xmlns="" xmlns:xsi="" xsi:schemaLocation="">

  <!-- jsoup HTML parser library @ -->

</project>


Upgrades Recommended

When I add this to a production flow, I will upsert the output into Phoenix or convert it to ORC for a Hive table. See my other articles listed below for examples of doing just that.

Example File

hdfs dfs -cat /linkprocessor/379875e9-5d99-4f88-82b1-fda7cdd7bc98.json
[{"link":"","descr":""},{"link":"","descr":""},{"link":"","descr":""},{"link":"","descr":""},{"link":"DataFlow Developer","descr":""},{"link":"Programmable OCR with Tesseract","descr":""},{"link":"Python Text Searchinv","descr":""},{"link":"Simple Bayes Text Classifier","descr":""},{"link":"Python Wrapper for Tesseract","descr":""},{"link":"Lector","descr":""},{"link":"VietOCR","descr":""},{"link":"OCRivist","descr":""},{"link":"TesseractGUI","descr":""},{"link":"Tesseract4J","descr":""},{"link":"Java wrapper for Tesseract","descr":""},{"link":"Basic Tesseract OCR Engine","descr":""},{"link":"Command Line Example Usage","descr":""},{"link":"Tesseract OCR Wiki","descr":""},{"link":"OCR Engine PDF","descr":""},{"link":"Homebrew","descr":""},{"link":"Running Tesseract from NiFI","descr":""},
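
To make the output shape above more concrete, here is a rough sketch of turning anchors in an HTML string into a JSON array of link/descr objects. The real processor uses JSoup; to stay dependency-free this sketch swaps in a simple regular expression, which is much less robust on real-world HTML. It also assumes that link holds the href and descr holds the anchor text, which may not match the processor's actual mapping. LinkJsonSketch and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: regex-based link extraction emitting link/descr JSON. */
final class LinkJsonSketch {

    // Matches <a ... href="URL" ...>TEXT</a>; a simplification, not a real HTML parser.
    private static final Pattern ANCHOR = Pattern.compile(
            "<a\\s[^>]*?href=\"([^\"]*)\"[^>]*>(.*?)</a>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Escapes only backslashes and double quotes; control characters are not handled. */
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    /** Returns a JSON array with one {"link":...,"descr":...} object per anchor found. */
    static String toJson(String html) {
        List<String> items = new ArrayList<>();
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            String link = m.group(1);
            // Drop nested tags and collapse whitespace in the anchor text.
            String descr = m.group(2).replaceAll("<[^>]*>", "").replaceAll("\\s+", " ").trim();
            items.add("{\"link\":\"" + escape(link) + "\",\"descr\":\"" + escape(descr) + "\"}");
        }
        return "[" + String.join(",", items) + "]";
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"https://example.com/a\">First</a> and "
                + "<a class=\"x\" href=\"https://example.com/b\"><b>Second</b></a></p>";
        System.out.println(toJson(html));
    }
}
```

For production use, a proper HTML parser such as JSoup is the better choice, since it copes with unquoted attributes, relative URLs, and malformed markup that this pattern ignores.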

Source Code

Articles for Storing to Phoenix, ORC and More


Example Flow File


Last update: 08-17-2019 07:10 AM