Configuring Hive (streaming)

The Hive Streaming API allows data to be pumped continuously into Hive. Incoming data can be committed in small batches of records into an existing Hive partition or table, and once committed it is immediately visible to all subsequent Hive queries. Streaming support is built on top of ACID-based insert/update support in Hive.


Streaming Requirements

The following settings are required in hive-site.xml to enable ACID support for streaming (a minimal hive-site.xml sketch follows the list):

  1. hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager
  2. hive.compactor.initiator.on = true (see the Hive documentation on compaction for more details)
  3. hive.compactor.worker.threads > 0
  4. "stored as orc" must be specified during table creation. Only the ORC storage format is currently supported.
  5. tblproperties("transactional"="true") must be set on the table during creation.
  6. hive.support.concurrency = true
  7. The Hive table must be bucketed, but not sorted, so something like "clustered by (colName) into 10 buckets" must be specified during table creation. Ideally, the number of buckets matches the number of streaming writers.
  8. The user running the client streaming process must have the necessary permissions to write to the table or partition and to create partitions in the table.
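For reference, a minimal hive-site.xml sketch with the properties listed above might look like this (the worker thread count is an example value and should be tuned for your cluster):

<property>
  <name>hive.support.concurrency</name>
  <value>true</value>
</property>
<property>
  <name>hive.txn.manager</name>
  <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
</property>
<property>
  <name>hive.compactor.initiator.on</name>
  <value>true</value>
</property>
<property>
  <name>hive.compactor.worker.threads</name>
  <value>1</value>
</property>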

Limitations

Out of the box, the streaming API currently supports only delimited input data (such as CSV or tab-separated records) and JSON (strict syntax). Support for other input formats can be provided by additional implementations of the RecordWriter interface.

Currently, ORC is the only supported format for the destination table.

Creating the Hive database and tables: I will create all tables as the nifi user, since this is the user that will ingest the data. Because I am using Ranger, I do not expect any permission issues.


Configuration


The table structure depends on the data we are capturing. For example, the data here arrives as comma-separated records of the form "hostname,lat,long,year,month,day,hour,min,sec,temp,pressure", so the table below is defined accordingly.
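For illustration only, a single made-up record in this format might look like the following (the values are hypothetical and only meant to show the field order and delimiter):

sensor-host-01,48.85,2.35,2019,4,19,10,15,30,21.5,1012.8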


1. Create the Hive database and table (adjust the location as needed) and enter these details in the Hive3Streaming processor.


CREATE DATABASE sensor_data;

-- create the table inside the sensor_data database
USE sensor_data;

CREATE TABLE `sensor_data_orc`(
  `hostname` string,
  `lat` float,
  `long` float,
  `year` int,
  `month` int,
  `day` int,
  `hour` int,
  `min` int,
  `second` int,
  `temp` float,
  `hum` float)
CLUSTERED BY (day)
INTO 2 BUCKETS
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
TBLPROPERTIES (
  'transactional'='true');




2. At this point we are almost done; the only remaining step is to start the NiFi workflow and MiNiFi. If data does not start flowing, check the service logs for errors or warnings. Once everything is running, you will see data moving through the NiFi pipeline and eventually landing in HDFS via Hive. You can then log in to beeline, or use Zeppelin, to run SQL queries on this table (a quick sanity check is sketched below).
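A quick sanity check from beeline or Zeppelin, assuming the database and table created in the DDL above (the results obviously depend on what your pipeline has already ingested):

-- switch to the database created earlier
USE sensor_data;
-- confirm that streamed records are arriving
SELECT COUNT(*) FROM sensor_data_orc;
-- inspect a few rows
SELECT * FROM sensor_data_orc LIMIT 10;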


Happy hadooping!


Links to series

Part 1, Part 2, Part 3, Part 4, Part 5
