I recently did a PoC with a customer to integrate NiFi with CDH; part of this was creating external tables in Hive on the newly loaded data. In this article I will share the approaches and useful workarounds, show how to customise your own NiFi build for backwards compatibility, and provide a pre-built CDH-compatible Hive bundle for you to download and try.
So first, why is this necessary?
Well, the short answer is that the minimum version of Hive supported by NiFi 1.x is 1.2.x, but CDH uses a fork of Hive 1.1.x, which introduces two common backwards-compatibility challenges:
The first is that it uses an older version of Thrift, so we need to configure NiFi to use this same version if we want to talk directly.
The second is that new features introduced after version 1.1.0 aren't available in the CDH release, so we have to stop NiFi from looking for them.
The obvious other option here is to work with CDH Hive indirectly, and thus we come to the workarounds.
It is very common in PoCs to not have all the software and configuration parameters exactly as you would like them to be, and to have no time to wait for change control to allow installs and firewall modifications. One of the great things about NiFi is the flexibility to quickly work around roadblocks, so here's the list of workarounds investigated:
The WebHCat service provides a REST API to run Hive queries, which we could've accessed using the NiFi HTTP processors; unfortunately the port was blocked at the firewall.
The Beeline client could've been run via the NiFi Execute processors; however the NiFi server was outside the test CDH cluster and there was no available license for installing another gateway, nor time for the change control.
Stream the Hive queries into a bash runner via an SSH tunnel to an existing edge node on the test CDH cluster, using the NiFi ExecuteStreamCommand processor; this works, but breaks various rules.
Modify the NiFi-Hive processors to be Cloudera compatible, if not officially supported...
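The SSH workaround can be sketched roughly as follows. This is a minimal illustration rather than the runner used in the PoC: the edge node address, the JDBC URL and both function names are placeholders, and it assumes Beeline is on the edge node's PATH and will read HiveQL from standard input.

```shell
# Placeholder values (assumptions): replace with your edge node and HiveServer2 URL.
EDGE_NODE="${EDGE_NODE:-nifi@edge01.example.com}"
HIVE_JDBC_URL="${HIVE_JDBC_URL:-jdbc:hive2://hiveserver2.example.com:10000/default}"

# Build the command line that runs on the edge node.
remote_beeline_cmd() {
  printf "beeline -u '%s'" "$HIVE_JDBC_URL"
}

# Forward HiveQL from stdin to Beeline on the edge node.
# With DRY_RUN=1 the ssh command is printed instead of executed.
run_hive_over_ssh() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf 'ssh %s "%s"\n' "$EDGE_NODE" "$(remote_beeline_cmd)"
  else
    ssh "$EDGE_NODE" "$(remote_beeline_cmd)"
  fi
}
```

Something like echo 'SHOW TABLES;' | run_hive_over_ssh would then send a statement through. From NiFi, a script along these lines could be the command for ExecuteStreamCommand, with the flowfile content carrying the HiveQL.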
A pre-built NiFi-Hive bundle for CDH 5.10.0:
Note that I have only tested the Hive bundle functionality against CDH 5.10.0; I have not tested any of the other processors such as HDFS or Kafka, nor other CDH versions. Neither I nor Hortonworks offer guarantees that this or other services will work against CDH, and you should thoroughly test things before trusting them with important data.
Here is a Hive bundle I've built for CDH 5.10.0. Just copy it into your nifi/lib directory and restart the service, and you should be able to connect the PutHiveQL and SelectHiveQL processors to your HiveServer2 service. (dropbox link to file)
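As a rough sketch, the copy-and-restart step might look like the following. The NIFI_HOME default, the function name and the SKIP_RESTART switch are assumptions made for illustration; only bin/nifi.sh restart is the standard NiFi control script.

```shell
# Assumed install location: adjust NIFI_HOME to match your environment.
NIFI_HOME="${NIFI_HOME:-/opt/nifi}"

# Copy a downloaded Hive NAR into NiFi's lib directory and restart NiFi.
install_hive_nar() {
  nar_file="$1"
  # Stop early if the NAR or the lib directory is missing.
  [ -f "$nar_file" ] || { echo "NAR not found: $nar_file" >&2; return 1; }
  [ -d "$NIFI_HOME/lib" ] || { echo "No lib directory under $NIFI_HOME" >&2; return 1; }
  cp "$nar_file" "$NIFI_HOME/lib/"
  # Restart so NiFi picks up the new bundle; skipped when SKIP_RESTART=1.
  if [ "${SKIP_RESTART:-0}" != "1" ]; then
    "$NIFI_HOME/bin/nifi.sh" restart
  fi
}
```

You would call it with the path to the downloaded file, for example install_hive_nar ./nifi-hive-nar-<version>.nar, where the filename is whatever the download is actually called.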
How to create your own Cloudera-compatible NiFi Hive Bundle:
The following instructions were tested on a CentOS 7 VM.
ssh <build server FQDN>
sudo su -
yum update -y
yum install -y wget
# Install Maven, Java 1.8 and Git to meet the minimum NiFi build requirements.
wget http://repos.fedorapeople.org/repos/dchen/apache-maven/epel-apache-maven.repo -O /etc/yum.repos.d/epel-apache-maven.repo
yum install -y git java-1.8.0-openjdk apache-maven
git clone https://github.com/Chaffelson/nifi.git
# Move into the cloned repository before switching branch and building.
cd nifi
git checkout nifi-1.1.x-cdhHiveBundle
mvn -T C2.0 clean install -Pcloudera -Dhive.version=1.1.0-cdh5.10.0 -Dhive.hadoop.version=2.6.0-cdh5.10.0 -Dhadoop.version=2.6.0-cdh5.10.0 -DskipTests
# Once the build finishes, start the NiFi assembly built under nifi-assembly/target,
# then browse to http://<build server FQDN>:8080/nifi to test your new Hive bundle.
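If you only want the Hive bundle from the build, to drop into an existing NiFi install, a small helper like this could locate it in the source tree. The function name is made up for illustration, and it assumes the bundle follows the usual nifi-hive-nar-<version>.nar naming under a Maven target directory.

```shell
# Hypothetical helper: print the first Hive NAR found under a built source tree.
find_hive_nar() {
  src_root="${1:-.}"
  # Only consider build output, i.e. files inside a Maven target directory.
  find "$src_root" -type f -name 'nifi-hive-nar-*.nar' -path '*/target/*' | head -n 1
}
```

Run from the top of the cloned repository, find_hive_nar . should print the path of the NAR to copy into nifi/lib.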
I have created a branch of NiFi 1.1.x and modified it so that the Hive bundle is backwards compatible with CDH, and rolled in an updated fix or two for your convenience. Here's a link to the diff
You may need to change the listed CDH versions to match your environment. I suggest you use the CDH Maven Repository documentation pages to look up the right version strings.
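To make the version swap less error-prone, the build command can be generated from a single CDH version number. This is only a sketch: cdh_build_cmd is a made-up helper, and it assumes your CDH 5.x release still ships Hive 1.1.0 and Hadoop 2.6.0, which you should confirm against the CDH Maven Repository pages.

```shell
# Hypothetical helper: print the Maven build command for a given CDH 5.x release.
# Assumption: the release is based on Hive 1.1.0 and Hadoop 2.6.0.
cdh_build_cmd() {
  cdh_version="$1"
  hive_version="1.1.0-cdh${cdh_version}"
  hadoop_version="2.6.0-cdh${cdh_version}"
  printf 'mvn -T C2.0 clean install -Pcloudera -Dhive.version=%s -Dhive.hadoop.version=%s -Dhadoop.version=%s -DskipTests\n' \
    "$hive_version" "$hadoop_version" "$hadoop_version"
}
```

Calling cdh_build_cmd 5.10.0 prints the same command used in the build steps above; pass a different release number to get the matching version strings.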