Community Articles

I recently did a PoC with a customer to integrate NiFi with CDH; part of this was creating external tables in Hive on the newly loaded data. In this article I will share the approaches, useful workarounds, and how to customise your own NiFi build for backwards compatibility, and provide a pre-built CDH-compatible Hive bundle for you to download and try.

So first, why is this necessary?

Well, the short answer is that NiFi 1.x's minimum supported version of Hive is 1.2.x, while CDH uses a fork of Hive 1.1.x, which introduces two common backwards-compatibility challenges:

  1. The first is that CDH Hive uses an older version of Thrift, so we need to configure NiFi to use that same version if we want the two to talk directly.
  2. The second is that new features introduced after version 1.1.0 aren't available in the CDH release, so we have to stop NiFi from looking for them.
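To see the mismatch for yourself, you can check which Hive and Thrift versions your CDH cluster actually ships. A quick sketch, assuming a standard parcel install (adjust the path if CDH was installed from packages):

```shell
# Report the CDH Hive version string (the -cdh suffix shows the fork)
hive --version

# List the Thrift library bundled with the CDH parcel; NiFi must be built
# against the same Thrift line to talk to this HiveServer2 directly
ls /opt/cloudera/parcels/CDH/jars/ | grep -i libthrift
```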

The obvious other option here is to work with CDH Hive indirectly, and thus we come to the workarounds.

Workarounds:

It is very common in PoCs to not have all the software and configuration parameters exactly as you would like them to be, and to have no time to wait for change control to allow installs and firewall modifications. One of the great things about NiFi is the flexibility to quickly work around roadblocks, so here's the list of workarounds investigated:

  1. The WebHCat service provides a REST API to run Hive queries, which we could have accessed using the NiFi HTTP processors; unfortunately the port was blocked at the firewall.
  2. The Beeline client could've been run via the NiFi Execute processors; however the NiFi server was outside the test CDH cluster and there was no available license for installing another gateway, nor time for the change control.
  3. Stream the Hive queries in a bash runner via an SSH tunnel into an existing edge node on the test CDH cluster using NiFi ExecuteStream processors; this works, but breaks various rules.
  4. Modify the NiFi-Hive processors to be Cloudera compatible, if not officially supported...
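As an illustration of workaround 3, the approach boils down to piping the flowfile content (a HiveQL statement) over SSH into beeline on the edge node. A minimal sketch with placeholder hosts and credentials, not the exact command used in the PoC:

```shell
# Placeholder host names; in NiFi this command line would sit in an
# ExecuteStreamCommand processor, with the flowfile content arriving on stdin.
echo "SHOW TABLES;" | \
  ssh etl@edgenode.example.com \
    'beeline -u "jdbc:hive2://hs2.example.com:10000/default" -f /dev/stdin'
```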

A pre-built NiFi-Hive bundle for CDH 5.10.0:

Note that I have only tested the Hive bundle functionality against CDH 5.10.0, not any of the other processors (such as HDFS or Kafka) nor other CDH versions. Neither I nor Hortonworks guarantees that this or other services will work against CDH, and you should thoroughly test things before trusting them with important data.

Here is a Hive bundle I've built for CDH 5.10.0. Just copy it into your nifi/lib directory and restart the service, and you should be able to connect the PutHiveQL and SelectHiveQL processors to your Hive2 service. (dropbox link to file)
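Installing the bundle is just a file copy plus a restart. A sketch assuming NiFi lives in /opt/nifi and the NAR carries the 1.1.1-SNAPSHOT version produced by the build steps below (adjust both for your install):

```shell
NIFI_HOME=/opt/nifi                                    # assumed install path
cp nifi-hive-nar-1.1.1-SNAPSHOT.nar "$NIFI_HOME/lib/"  # drop the bundle into lib
"$NIFI_HOME/bin/nifi.sh" restart                       # restart so NiFi loads it
```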

How to create your own Cloudera-compatible NiFi Hive Bundle:

The following instructions were tested on a CentOS 7 VM.

ssh <build server FQDN>
sudo su -
yum update -y
yum install -y wget
# Install Maven, Java 1.8, and Git to meet the minimum NiFi build requirements.
wget http://repos.fedorapeople.org/repos/dchen/apache-maven/epel-apache-maven.repo -O /etc/yum.repos.d/epel-apache-maven.repo
yum install -y git java-1.8.0-openjdk apache-maven
logout
git clone https://github.com/Chaffelson/nifi.git
cd nifi
git checkout nifi-1.1.x-cdhHiveBundle
mvn -T 2.0C clean install -Pcloudera -Dhive.version=1.1.0-cdh5.10.0 -Dhive.hadoop.version=2.6.0-cdh5.10.0 -Dhadoop.version=2.6.0-cdh5.10.0 -DskipTests
nifi-assembly/target/nifi-1.1.1-SNAPSHOT-bin/nifi-1.1.1-SNAPSHOT/bin/nifi.sh start
# browse to http://<build server FQDN>:8080/nifi to test your new hive bundle
  • I have created a branch of NiFi 1.1.x and modified it so that the Hive bundle is backwards compatible with CDH, and rolled in an updated fix or two for your convenience; here's a link to the diff.
  • You may need to change the listed CDH versions to match your environment; I suggest you use the CDH Maven Repository documentation pages.
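For example, to target a hypothetical CDH 5.12.0 cluster you would swap the version strings in the Maven command. These exact artifact versions are illustrative only; confirm them against the CDH Maven repository documentation:

```shell
# Same build as above, with the CDH version suffixes changed to match
# the target cluster's parcel version
mvn -T 2.0C clean install -Pcloudera \
  -Dhive.version=1.1.0-cdh5.12.0 \
  -Dhive.hadoop.version=2.6.0-cdh5.12.0 \
  -Dhadoop.version=2.6.0-cdh5.12.0 \
  -DskipTests
```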
Comments

Hi @Dan Chaffelson,

I had the backwards-compatibility issue, followed your steps, and pasted the nifi-hive-nar into my NiFi 1.1.2 instance. Now SelectHiveQL is able to connect and query the table, but it only gives me the headers (column names) and doesn't retrieve the data. My query was select * from table limit 100. Any idea why? The nifi-app.log wasn't updated either.


Hi @Raghav Ramakrishann, sorry, I only just saw this comment as I've been away on paternity leave. Can you share the version of CDH you're connecting to, and your service parameters? I might be able to troubleshoot a bit.


Hi @Dan Chaffelson, sorry for not updating my comment. I was able to troubleshoot it: it was an issue on the CDH side, not with the NAR file. It's working for me now. Thanks for sharing this article. Really helped me out! 🙂


Glad to hear it!


For connecting NiFi with Hive on Cloudera with Kerberos, you can use JDBC. Configure a DBCPConnectionPool as follows:

Database Connection URL: jdbc:hive2://<host>:10000;AuthMech=1;KrbRealm=<kerberos realm>;KrbHostFQDN=_HOST;KrbServiceName=hive

Database Driver Class Name: com.cloudera.hive.jdbc41.HS2Driver

Database Driver Location: the location of the Cloudera JDBC jar files

After that you can use PutSQL, GetSQL and ConvertJSONtoSQL.
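Putting the pieces together, a fully formed connection URL for the Cloudera Hive JDBC driver would look something like the following. The host and realm are placeholders for your environment; the jdbc:hive2:// scheme is what the HS2Driver class expects:

```
jdbc:hive2://hiveserver.example.com:10000;AuthMech=1;KrbRealm=EXAMPLE.COM;KrbHostFQDN=_HOST;KrbServiceName=hive
```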



Can confirm the DBCPConnectionPool approach suggested here by @Rudolf Schimmel works. We did run into issues when using Java 10 (uncaught exception: java.lang.NoClassDefFoundError: org/apache/thrift/TException, even though libthrift was specified); using Java 8 worked.