
Automated data ingest with NiFi

Super Collaborator

Suppose I want to take as many manual steps as possible out of the ingestion process for trivial tasks. What would I need to reduce the dependency on the Data Engineering team?

 

A trivial flow I may want to automate could be that an analyst has:

1. A directory on an FTP server where headerless CSV files come in

2. A Hive database in which a new table should be created to store these files

 

Currently the main steps to take are:

1. Clarifying the specifications

2. Creating the flow

3. Testing the flow

4. Deploying the flow to production

 

The main goal is to reduce the workload, but introducing standardization would also be great.

I have some thoughts, but additional inputs or examples are welcome.


- Dennis Jaheruddin

If this answer helped, please mark it as 'solved' and/or if it is valuable for future readers please apply 'kudos'. Also check out my technical portfolio at https://portfolio.jaheruddin.nl

Super Collaborator

Here is an overview of the key steps in achieving this:

 

1. Clarifying the specifications

Someone will always need to define the specifications. However, this does not mean you must follow an iterative process between the business user and the data engineering team, where questions are asked as you run into them. Instead, introduce a structured way of gathering that information, and make sure it is intuitive enough to be completed without involvement of the core team.

 

Essentially you need an input form. It is up to you which tool you build it in (Excel, a small website, a dedicated form-creation tool); just make sure it is easy for your target audience to use, and that you can easily read what people filled in programmatically.

 

In the example case these parts of the form would be the most important (a small spec sketch follows the list):

a. Source information (e.g. ftp server url, directory)

b. Target information (e.g. hive database, table)

c. Target Metadata (e.g. Column names, data types, descriptions, tags)
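
For illustration, the filled-in form could be exported as a small spec file that the rest of the automation reads. This is a minimal sketch; the file layout and field names below are assumptions for the example, not part of any standard.

```python
import json

# Hypothetical spec produced by the input form (field names are assumptions)
spec = json.loads("""
{
  "source": {"ftp_host": "ftp.example.com", "directory": "/incoming/sales"},
  "target": {"hive_database": "analytics", "hive_table": "sales_raw"},
  "columns": [
    {"name": "order_id", "type": "BIGINT", "description": "Order identifier"},
    {"name": "amount",   "type": "DOUBLE", "description": "Order amount in EUR"},
    {"name": "order_ts", "type": "STRING", "description": "Order timestamp"}
  ]
}
""")

# Everything the flow generator needs can now be read programmatically,
# e.g. to build the CREATE TABLE statement for the Hive target.
cols = ", ".join(f"{c['name']} {c['type']}" for c in spec["columns"])
ddl = (f"CREATE TABLE IF NOT EXISTS "
       f"{spec['target']['hive_database']}.{spec['target']['hive_table']} ({cols}) "
       f"ROW FORMAT DELIMITED FIELDS TERMINATED BY ','")
print(ddl)
```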

 

2. Creating the flow

Rather than designing each flow by hand, define once how your flow should be built based on an arbitrary input. Of course you will need to define this for every type of flow that you want to support, but I would start with just one flow, and then decide whether you want to expand the logic to support more similar flows, or keep independent logic for the different types of flows you want to automate.

 

Once you know what the flow should look like, you no longer need manual design; you can just generate it programmatically. There are two ways to attack the problem, depending on how flexible you want to be (a short sketch follows the list below):

a. Using templates

b. Directly generating the elements on the canvas, for instance with NiPyApi (nipyapi)
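
As a rough sketch of option (b): something along these lines is possible with the nipyapi client. It assumes an unsecured NiFi at localhost:8080, and the processor property names and exact call signatures should be verified against the NiFi and nipyapi versions you run.

```python
import nipyapi
from nipyapi.nifi import ProcessorConfigDTO

nipyapi.config.nifi_config.host = 'http://localhost:8080/nifi-api'  # assumed local NiFi

# Create a dedicated process group for this generated flow
root_id = nipyapi.canvas.get_root_pg_id()
root_pg = nipyapi.canvas.get_process_group(root_id, 'id')
pg = nipyapi.canvas.create_process_group(root_pg, 'generated_sales_ingest', location=(400.0, 400.0))

# Drop a GetFTP processor on the canvas, configured from the spec gathered earlier
get_ftp = nipyapi.canvas.create_processor(
    pg,
    nipyapi.canvas.get_processor_type('GetFTP'),
    location=(400.0, 400.0),
    name='fetch_csv_from_ftp',
    config=ProcessorConfigDTO(properties={
        'Hostname': 'ftp.example.com',      # values would come from the input form
        'Remote Path': '/incoming/sales',
    })
)

# A second processor to land the files (here PutHDFS), then connect the two
put_hdfs = nipyapi.canvas.create_processor(
    pg,
    nipyapi.canvas.get_processor_type('PutHDFS'),
    location=(400.0, 600.0),
    name='land_csv_on_hdfs'
)
nipyapi.canvas.create_connection(get_ftp, put_hdfs, relationships=['success'])
```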

 

After deciding on your logic, you will likely also want to deploy the flow. To avoid duplication I will only cover that in the later point on deploying to production.

 

3. Testing the flow

This is quite a broad topic. The official documentation certainly covers part of it, but don't get confused by the parts that are geared towards testing your own custom processors. Roughly it comes down to this:

a. You want to check if the deployment succeeded and everything is running (easy)

b. You want to check your nonfunctionals (very situational)

c. You want to check if the outcome roughly looks like you expect (hard)

 

For the latter, you could think about checking the number of output records, and things like 'if I feed a row with non-null content, does my output contain any null values?' or 'does the last value in the file land in the last cell of the table?'. This is very much dependent on the situation, but fortunately you can gather inspiration on what to check by looking at your traditional ETL tests.
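
As one example of such a check, here is a minimal sketch of a row-count comparison between a known input file and the target table. The hostnames, paths and table names are assumptions, and pyhive is just one of several ways to query Hive from Python.

```python
from pyhive import hive  # one way to query Hive from Python

# Hypothetical test: does the row count in the target table match the source file?
with open('/tmp/sample_input/sales_2023.csv') as f:
    expected_rows = sum(1 for line in f if line.strip())  # headerless CSV: every line is a record

conn = hive.connect(host='hive-server.example.com', port=10000, database='analytics')
cursor = conn.cursor()
cursor.execute('SELECT COUNT(*) FROM sales_raw')
actual_rows = cursor.fetchone()[0]

assert actual_rows == expected_rows, (
    f'Row count mismatch: table has {actual_rows}, source file has {expected_rows}'
)
print('Row count check passed')
```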

 

4. Deploying the flow to production

The good news is that you can use any automation tool of choice to achieve all of the above, and deploy to production after a (hopefully automated) successful test. Useful tools for this are again NiPyApi and the Variable Registry. Here is a blog by @Pierre Villard illustrating this, including actual examples.
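
To make that concrete, a minimal sketch of a template-based deploy with nipyapi plus the Variable Registry. All hostnames, file paths and the process group name are assumptions for the example, and the exact calls should be checked against the nipyapi version you use.

```python
import nipyapi

# Point at the production NiFi (URL and paths are assumptions)
nipyapi.config.nifi_config.host = 'https://nifi-prod.example.com/nifi-api'

root_id = nipyapi.canvas.get_root_pg_id()

# 1. Upload the flow template that was generated and tested earlier
template = nipyapi.templates.upload_template(root_id, '/tmp/generated_flows/sales_ingest.xml')

# 2. Instantiate it on the production canvas
nipyapi.templates.deploy_template(root_id, template.id, loc_x=0.0, loc_y=0.0)

# 3. Fill the Variable Registry of the new process group with environment-specific values
pg = nipyapi.canvas.get_process_group('sales_ingest', 'name')
nipyapi.canvas.update_variable_registry(pg, [
    ('ftp.hostname', 'ftp.example.com'),
    ('hive.database', 'analytics'),
])

# 4. Start everything in the process group
nipyapi.canvas.schedule_process_group(pg.id, scheduled=True)
```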


- Dennis Jaheruddin

If this answer helped, please mark it as 'solved' and/or if it is valuable for future readers please apply 'kudos'. Also check out my technical portfolio at https://portfolio.jaheruddin.nl