Coming to NiFi, we will make use of the following processors:
1. ListHDFS + FetchHDFS processors – While configuring the ListHDFS and FetchHDFS processors, we need to make sure that both processors run on the primary node only, so that flow files are not duplicated across the cluster nodes.
2. ConvertJSONToAvro processor – The PutHiveStreaming processor accepts input in the Avro format only, so any JSON input must first be converted to Avro.
Let's construct the NiFi flow as below: ListHDFS --> FetchHDFS --> ConvertJSONToAvro --> PutHiveStreaming
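The ConvertJSONToAvro processor also needs an Avro record schema to drive the conversion. A minimal sketch for a hypothetical input with an id and a name field (the record and field names here are assumptions; match them to the fields in your own JSON):

```json
{
  "type": "record",
  "name": "UserRecord",
  "fields": [
    {"name": "id", "type": "int"},
    {"name": "name", "type": "string"}
  ]
}
```

The field names and types must line up with the columns of the target Hive table, since PutHiveStreaming maps Avro fields to columns by name.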
Configuring the PutHiveStreaming processor
Set the processor properties as follows:
Hive Metastore URI – Should be of the format thrift://&lt;Hive Metastore host&gt;:9083. Note that the Hive metastore host is not the same as the Hive server host.
Hive Configuration Resources – Paths to the Hadoop and Hive configuration files, i.e. core-site.xml, hdfs-site.xml and hive-site.xml. We need to copy these files to all the NiFi nodes and point this property at them.
Database Name – The Hive database to which you want to connect.
Table Name – The table into which you want to insert the data. Note that Hive streaming places the following requirements on the table:
a. ORC is the only format supported currently, so your table must be created with "stored as orc".
b. transactional = "true" must be set in the table properties of the create statement.
c. The table must be bucketed but not sorted, so the create statement must include "clustered by (colName) into (n) buckets".
Auto-Create Partitions – If set to true, Hive partitions will be created automatically.
Kerberos Principal – The Kerberos principal to authenticate as.
Kerberos Keytab – The path to the Kerberos keytab file for the principal.
This completes the configuration. Now we can start the processors to insert data from HDFS into Hive.
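Once the flow is running, ingestion can be spot-checked from beeline or the Hive CLI with a simple count (the table name here is hypothetical):

```sql
-- Row count should grow as PutHiveStreaming commits new transactions
SELECT COUNT(*) FROM streaming_user_data;
```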