Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

Kite HBase Dataset with Hive External Table

Highlighted

Kite HBase Dataset with Hive External Table

New Contributor

Hello,

 

We are using Kafka, Flume, Kite, and Morphlines to ETL data to our CDH 5.4.2 cluster.  Our real dataset has about 50 fields in it - this simplified example should help illustrate the problem (and my question).  Suppose we have the following fields:

- id
- customerName
- customerAddress

 

and we create an avro schema for that:

 

{"namespace" : "test",
 "type" : "record",
 "name" : "customers",
 "fields" : [
   {"name" : "id", "type" : "long"},
   {"name" : "customerName", "type" : "string"},
   {"name" : "customerAddress", "type" : "string"}
  ]
}

 

and the kite dataset on an HBase table:

 

# create partition strategy
kite-dataset -v partition-config id:copy -s schema.avsc -o partition.json
# create mapping-config
kite-dataset -v mapping-config id:key customerName:common customerAddress:common \
   -s schema.avsc -p partition.json -o mapping.json
# create dataset
kite-dataset create dataset:hbase:$ZKHOSTS/customers --use-hbase -s schema.avsc \
  --partition-by partition.json \
  --mapping mapping.json

 

Data is sent to Kafka as simple lines of text:

 

1<TAB>John Doe<TAB>123 Main St
2<TAB>Jane Smith<TAB>456 Main St

 

In our Flume config, we are using the org.apache.flume.source.kafka.KafkaSource.  We run a morphline to parse the input and turn it into avro:

 

morphlines : [
  id: customer
  importCommands : ["org.kitesdk.**", "org.apache.solr.**"]
  commands : [
    {
      readCSV {
        separator : "\t"
        columns : [
          id,
          customerName,
          customerAddress
        ]
        ignoreFirstLine : false
        charset : UTF-8
      }
    }
    {
      toAvro {
        schemaFile: schema.avsc
      }
    }
  ]
]

 

Also in the Flume config, we configure a DatasetSink:

agent.sinks.customer-dataset.type = org.apache.flume.sink.kite.DatasetSink
agent.sinks.customer-dataset.kite.dataset.uri = dataset:hbase:$ZKHOSTS/customers

 

The messages stream in to HBase fine.  When using the kite-dataset show command, the data is returned as expected.

 

Our problem arises when trying to view the data with Hive.  We create the table using the HBaseStorageHandler:

 

CREATE EXTERNAL TABLE CUSTOMERS(id BIGINT,
  customerName string,
  customerAddress string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key,common:customerName,common:customerAddress");

 

When we select rows from this table, there are extra characters at the beginning of the strings:

 

SELECT * FROM CUSTOMERS;

 

ID   Customer Name   Customer Address
1    .John Doe       $123 Main St
2    .Jane Smith    $456 Main St

 

Looking at HBase directly, it appears that these values have been stored as binary, possibly a binary representation of a Java string with the byte length at the beginning:

 

hbase(main):001:0> scan 'customers'
ROW                          COLUMN+CELL
 000000001\x00\x00           column=_s:w, timestamp=nnnnn, value=\x00\x00\x00\x00\x00\x00\x01
 000000001\x00\x00           column=common:customerName, timestamp=nnnnn, value=\x00.John Doe
 ...
 ...

 

Looking at the kite source in kite-data-hbase, it seems that all the serialization is in binary.

 

Is it possible to either 1) serialize these values as strings, or 2) configure the HBaseStorageHandler, or perhaps a different StorageHandler, to read these values for Hive?

 

Thanks for any help anyone can provide!

1 REPLY 1

Re: Kite HBase Dataset with Hive External Table

New Contributor

Try this:

 

CREATE EXTERNAL TABLE CUSTOMERS(id BIGINT, customerName string, customerAddress string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping"=":key,common:customerName,common:customerAddress", "hbase.table.default.storage.type" = "binary")

 

Add "hbase.table.default.storage.type" = "binary" on SERDEPROPERTIES.