Community Articles
Find and share helpful community-sourced technical articles
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.
Labels (2)
Expert Contributor

With the arrival of the Hive3Streaming processors, performance has never been better between NiFi and Hive3. Below, we'll be taking a look at the PutHive3Streaming (separate from the PutHiveStreaming processor) processor and how it can fit into a basic Change Data Capture workflow. We will be performing only inserts on the source Hive table and then carrying over those inserts into a destination Hive table. Updates are also supported through this process with the addition of a 'last_updated_datetime' timestamp, but that is out of scope for this article. This is meant to simulate copying data from one HDP cluster to another - we're using the same cluster as the source and destination here however.

Here is the completed flow. Take note that most of this flow is just keeping track of the latest ID that we've seen, so that we can pull that back out of HDFS periodically and query Hive for records that were added beyond that latest ID.

93134-wq7ml.png

These last two processors, SelectHive3QL and PutHive3Streaming, are the ones doing the heavy lifting. They are the ones first getting the data from the source table (based on the pre-determined latest ID) and then inserting that retrieved data into the destination Hive table.

93129-kzxkp.png

Here's the configuration for the SelectHive3QL processor. Note the ${curr_id} variable used in the HiveQL Select Query field. That ensures our query will be dynamic.

93132-zllzn.png

This is the configuration for the PutHive3Streaming processor. Nothing special here - we've configured the Hive Configuration Resources with the hive-site.xml file and used the Avro format (above) to retrieve data and (below) to write it back out.

93135-dujv7.png

Here's the state of the table as we first check the destination table:

93130-janli.png

And then insert a record into the source table:

93133-geiiv.png

Here's the new state of the table:

93131-iueou.png

Here is the full NiFi flow:

puthive3streaming-cdc-flow.xml

In conclusion, we've shown one of the ways you can utilize the new PutHive3Streaming processors to perform quick inserts into Hive and at a broader level perform Change Data Capture.

1,185 Views
Don't have an account?
Coming from Hortonworks? Activate your account here
Version history
Revision #:
2 of 2
Last update:
‎08-17-2019 05:45 AM
Updated by:
 
Contributors
Top Kudoed Authors