New Contributor
Posts: 3
Registered: ‎05-06-2014

Kite HBase Dataset with Hive External Table



We are using Kafka, Flume, Kite, and Morphlines to ETL data to our CDH 5.4.2 cluster.  Our real dataset has about 50 fields in it - this simplified example should help illustrate the problem (and my question).  Suppose we have the following fields:

- id
- customerName
- customerAddress


and we create an avro schema for that:


{"namespace" : "test",
 "type" : "record",
 "name" : "customers",
 "fields" : [
   {"name" : "id", "type" : "long"},
   {"name" : "customerName", "type" : "string"},
   {"name" : "customerAddress", "type" : "string"}


and the kite dataset on an HBase table:


# create partition strategy
kite-dataset -v partition-config id:copy -s schema.avsc -o partition.json
# create mapping-config
kite-dataset -v mapping-config id:key customerName:common customerAddress:common \
   -s schema.avsc -p partition.json -o mapping.json
# create dataset
kite-dataset create dataset:hbase:$ZKHOSTS/customers --use-hbase -s schema.avsc \
  --partition-by partition.json \
  --mapping mapping.json


Data is sent to Kafka as simple lines of text:


1<TAB>John Doe<TAB>123 Main St
2<TAB>Jane Smith<TAB>456 Main St


In our Flume config, we are using the org.apache.flume.source.kafka.KafkaSource.  We run a morphline to parse the input and turn it into avro:


morphlines : [
  id: customer
  importCommands : ["org.kitesdk.**", "org.apache.solr.**"]
  commands : [
      readCSV {
        separator : "\t"
        columns : [
        ignoreFirstLine : false
        charset : UTF-8
      toAvro {
        schemaFile: schema.avsc


Also in the Flume config, we configure a DatasetSink:

agent.sinks.customer-dataset.type = org.apache.flume.sink.kite.DatasetSink
agent.sinks.customer-dataset.kite.dataset.uri = dataset:hbase:$ZKHOSTS/customers


The messages stream in to HBase fine.  When using the kite-dataset show command, the data is returned as expected.


Our problem arises when trying to view the data with Hive.  We create the table using the HBaseStorageHandler:


  customerName string,
  customerAddress string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
  "hbase.columns.mapping" = ":key,common:customerName,common:customerAddress");


When we select rows from this table, there are extra characters at the beginning of the strings:




ID   Customer Name   Customer Address
1    .John Doe       $123 Main St
2    .Jane Smith    $456 Main St


Looking at HBase directly, it appears that these values have been stored as binary, possibly a binary representation of a Java string with the byte length at the beginning:


hbase(main):001:0> scan 'customers'
ROW                          COLUMN+CELL
 000000001\x00\x00           column=_s:w, timestamp=nnnnn, value=\x00\x00\x00\x00\x00\x00\x01
 000000001\x00\x00           column=common:customerName, timestamp=nnnnn, value=\x00.John Doe


Looking at the kite source in kite-data-hbase, it seems that all the serialization is in binary.


Is it possible to either 1) serialize these values as strings, or 2) configure the HBaseStorageHandler, or perhaps a different StorageHandler, to read these values for Hive?


Thanks for any help anyone can provide!

New Contributor
Posts: 3
Registered: ‎02-12-2016

Re: Kite HBase Dataset with Hive External Table

Try this:


CREATE EXTERNAL TABLE CUSTOMERS(id BIGINT, customerName string, customerAddress string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping"=":key,common:customerName,common:customerAddress", "" = "binary")


Add "" = "binary" on SERDEPROPERTIES.


The Kite SDK is a collection of docs, sample code, APIs, and tools to make Hadoop application development faster. Learn more at