Configuring Hive (streaming)
The Hive Streaming API allows data to be pumped continuously into Hive. Incoming data can be committed continuously in small batches of records into an existing Hive partition or table. Once data is committed, it is immediately visible to all subsequently initiated Hive queries. Streaming support is built on top of ACID-based insert/update support in Hive.
Streaming Requirements
The following settings are required to enable ACID support for streaming; hive.support.concurrency belongs in hive-site.xml, while the transactional and bucketing requirements apply at table creation (the hive-site.xml values are spelled out after the list):
tblproperties("transactional"="true") must be set on the table during creation.
hive.support.concurrency=true
The Hive table must be bucketed, but not sorted. So something like “clustered by (colName) into 10 buckets” must be specified during table creation. The number of buckets is ideally the same as the number of streaming writers.
The user running the client streaming process must have the necessary permissions to write to the table or partition and to create partitions in the table.
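For reference, the hive-site.xml piece of the list above amounts to the values below. They are written as session-level SET statements purely to spell out the expected values; in practice, set them server-side. The transaction-manager line is an extra setting not listed above that Hive ACID tables typically also need, so verify it against your distribution's defaults. A sketch of the matching table DDL appears in the Configuration section.

-- ACID-related settings for streaming; in practice these live in hive-site.xml
-- and are shown as SET statements only to make the expected values explicit.
SET hive.support.concurrency=true;

-- Not listed in the requirements above, but ACID tables generally also require
-- the DB transaction manager; confirm whether your distribution enables it by default.
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;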
Limitations
Out of the box, the streaming API currently supports only delimited input data (such as CSV, tab-separated, etc.) and JSON (strict syntax) formatted data. Support for other input formats can be added through additional implementations of the RecordWriter interface.
Currently, ORC is the only supported format for the destination table.

Creating the Hive database and tables – I will create all tables as the nifi user, since this is the user that will ingest the data. Because I am using Ranger, I do not expect any permission issues.
Configuration
The table definition depends on the data being captured. For example, the data here arrives as comma-separated records of the form “hostname,lat,long,year,month,day,hour,min,sec,temp,pressure”, so the table is described accordingly; a sketch of the DDL follows below.
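As a sketch of what that table description might look like (assuming a database named sensor_data, a table named weather, and column types guessed from the sample record), the DDL below includes the required transactional, bucketing, and ORC clauses. Adjust the names, types, and HDFS location for your environment.

-- Hypothetical database and table names; the LOCATION below is a placeholder
-- and should be changed for your environment.
CREATE DATABASE IF NOT EXISTS sensor_data
LOCATION '/warehouse/tablespace/managed/hive/sensor_data.db';

-- Transactional, bucketed (but not sorted) ORC table matching the record
-- "hostname,lat,long,year,month,day,hour,min,sec,temp,pressure".
-- Column types are guesses; adjust as needed.
CREATE TABLE IF NOT EXISTS sensor_data.weather (
  `hostname` STRING,
  `lat`      DOUBLE,
  `long`     DOUBLE,
  `year`     INT,
  `month`    INT,
  `day`      INT,
  `hour`     INT,
  `min`      INT,
  `sec`      INT,
  `temp`     DOUBLE,
  `pressure` DOUBLE
)
CLUSTERED BY (`hostname`) INTO 10 BUCKETS
STORED AS ORC
TBLPROPERTIES ("transactional"="true");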
1. Create the Hive database and table (change the location for your environment) and enter the details in the Hive3Streaming processor.
2. At this point we are almost done; the only remaining step is to start the NiFi workflow and MiNiFi. If data does not start flowing, check the service logs for errors or warnings. Once the flow is running, you will see data move through the NiFi pipeline and eventually land in HDFS via Hive. You can then log in to beeline or use Zeppelin to run SQL queries on this table; a couple of example checks follow.
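As a quick sanity check from beeline or a Zeppelin JDBC paragraph, queries along the lines below confirm that records are arriving (the database and table names follow the hypothetical DDL sketched above).

-- Row count confirms that streamed batches are being committed.
SELECT COUNT(*) AS row_count FROM sensor_data.weather;

-- Peek at a handful of records to verify the column mapping.
SELECT * FROM sensor_data.weather LIMIT 10;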