
Log parsing and loading to Hive/Impala tables

Re: Log parsing and loading to Hive/Impala tables

Expert Contributor

Thanks Joey for the detailed information! I'm using the latest CDH 5.2.1 and will go through the steps to get things going.

Re: Log parsing and loading to Hive/Impala tables

Expert Contributor

What should kite.repo.uri be set to? I've created the dataset, and its HDFS location is "//nameservice/user/hive/warehouse/logs", but I'm not clear on what value kite.repo.uri should take. Please let me know. Thanks!

Re: Log parsing and loading to Hive/Impala tables

Contributor
You should use a Hive repo URI:

repo:hive://<metastore server host>:<metastore server port>

This assumes that you created the table using the Kite command-line
tool, which would have been something like:

kite-dataset create dataset:hive://<metastore server host>:<metastore server port>/logs
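
For reference, a fuller sketch of that command that also supplies the Avro schema up front; the host, port, and the logs.avsc file name below are placeholders, not from this thread:

# Hypothetical example: create the "logs" dataset in the Hive metastore,
# passing a local Avro schema file (logs.avsc is a placeholder name)
kite-dataset create dataset:hive://metastore-host:9083/logs --schema logs.avsc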

If you created the table without using Kite but you're sure that the
table uses the same Avro schema as your Flume configuration, then you
can use an HDFS repo:

repo:hdfs://nameservice/user/hive/warehouse/logs
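
To tie this back to the Flume side, here's a minimal sketch of a DatasetSink configuration using such a repo URI; the agent, sink, and channel names (tier1, kite-sink, channel1) and the metastore host are placeholders, not from this thread:

# Hypothetical Flume agent snippet for the Kite DatasetSink
tier1.sinks.kite-sink.type = org.apache.flume.sink.kite.DatasetSink
tier1.sinks.kite-sink.channel = channel1
tier1.sinks.kite-sink.kite.repo.uri = repo:hive://metastore-host:9083
tier1.sinks.kite-sink.kite.dataset.name = logs
tier1.sinks.kite-sink.kite.batchSize = 100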

-Joey

Re: Log parsing and loading to Hive/Impala tables

Expert Contributor

I'm running into this NPE error:

(org.apache.flume.SinkRunner$PollingRunner.run:160) - Unable to deliver event. Exception follows.
org.apache.flume.EventDeliveryException: java.lang.NullPointerException
    at org.apache.flume.sink.kite.DatasetSink.process(DatasetSink.java:310)
    at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:68)
    at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
    at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:187)
    at com.google.common.cache.LocalCache.get(LocalCache.java:3964)
    at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3969)
    at com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4829)
    at org.apache.flume.sink.kite.DatasetSink.schema(DatasetSink.java:346)
    at org.apache.flume.sink.kite.DatasetSink.deserialize(DatasetSink.java:330)
    at org.apache.flume.sink.kite.DatasetSink.process(DatasetSink.java:262)

I do notice the output record, which seems fine and conforms to the Avro schema, with the same string fields:

[{_attachment_body=[[B@94cf495], userId=[112233], userLogin=[abcd]}]

Please let me know if I should expect to see anything else in the output record. Thanks!

Re: Log parsing and loading to Hive/Impala tables

Contributor

The Event needs to have a header set with the schema literal or a schema URL:

"The only supported serialization is avro, and the record schema must be passed in the event headers, using either flume.avro.schema.literal with the JSON schema representation or flume.avro.schema.url with a URL where the schema may be found (hdfs:/... URIs are supported). This is compatible with the Log4jAppender flume client and the spooling directory source’s Avro deserializer using deserializer.schemaType = LITERAL."

The JSON example uses an interceptor to set the header, so you could probably copy that:

https://github.com/kite-sdk/kite-examples/blob/master/json/flume.properties#L35-L41
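
Along the lines of that example, a minimal sketch of a static interceptor that attaches the schema URL header; the source name and the HDFS schema path below are placeholders, not from this thread:

# Hypothetical interceptor snippet: set flume.avro.schema.url on every event
tier1.sources.source1.interceptors = attach-schema
tier1.sources.source1.interceptors.attach-schema.type = static
tier1.sources.source1.interceptors.attach-schema.key = flume.avro.schema.url
tier1.sources.source1.interceptors.attach-schema.value = hdfs:/user/flume/schemas/log.avsc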

Re: Log parsing and loading to Hive/Impala tables

Expert Contributor

Thanks Joey!! I also had to whitelist 'flume*' in the removeFields command, and that helped store the Avro files to HDFS.
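
For anyone following along, that whitelist lives in the morphline configuration. A rough sketch of a removeFields command that keeps the flume.avro.schema.* headers; the exact field list is an assumption, not the poster's actual config:

# Hypothetical morphline command: drop unknown fields but keep the
# flume.* headers the DatasetSink needs to locate the Avro schema
{
  removeFields {
    blacklist : ["regex:.*"]
    whitelist : ["regex:flume.*", "literal:userId", "literal:userLogin", "literal:_attachment_body"]
  }
}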


Re: Log parsing and loading to Hive/Impala tables

Contributor
That's great!

If you're interested in describing your final solution as a guest blog
post, feel free to contact me directly: joey at cloudera.com.

-Joey