Expert Contributor
Posts: 139
Registered: 07-21-2014

Re: Log parsing and loading to Hive/Impala tables

Thanks, Joey, for the detailed information! I'm using the latest CDH 5.2.1 and will go through the steps to get things going.

Expert Contributor
Posts: 139
Registered: 07-21-2014

Re: Log parsing and loading to Hive/Impala tables

What should kite.repo.uri be set to? I've created the dataset, and its HDFS location is "//nameservice/user/hive/warehouse/logs", but it's not clear to me what kite.repo.uri should be. Please let me know. Thanks!

Cloudera Employee
Posts: 26
Registered: 07-08-2013

Re: Log parsing and loading to Hive/Impala tables

You should use a Hive repo URI:

repo:hive://<metastore server host>:<metastore server port>

This assumes that you created the table using the Kite command-line tool, which would have been something like:

kite-dataset create dataset:hive://<metastore server host>:<metastore server port>/logs

If you created the table without using Kite but you're sure that the table uses the same Avro schema as your Flume configuration, then you can use an HDFS repo:

repo:hdfs://nameservice/user/hive/warehouse/logs
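
For reference, the corresponding sink block in flume.properties would look something like this (a sketch; the agent1/kite/mem-channel names and the metastore host/port are placeholders for your setup):

# DatasetSink writing to a Kite dataset in the Hive metastore
agent1.sinks.kite.type = org.apache.flume.sink.kite.DatasetSink
agent1.sinks.kite.channel = mem-channel
# 9083 is the default Hive metastore port
agent1.sinks.kite.kite.repo.uri = repo:hive://metastore-host:9083
agent1.sinks.kite.kite.dataset.name = logs
agent1.sinks.kite.kite.batchSize = 100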

-Joey

Expert Contributor
Posts: 139
Registered: 07-21-2014

Re: Log parsing and loading to Hive/Impala tables

I'm running into this NPE:

(org.apache.flume.SinkRunner$PollingRunner.run:160) - Unable to deliver event. Exception follows.
org.apache.flume.EventDeliveryException: java.lang.NullPointerException
    at org.apache.flume.sink.kite.DatasetSink.process(DatasetSink.java:310)
    at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:68)
    at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
    at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:187)
    at com.google.common.cache.LocalCache.get(LocalCache.java:3964)
    at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3969)
    at com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4829)
    at org.apache.flume.sink.kite.DatasetSink.schema(DatasetSink.java:346)
    at org.apache.flume.sink.kite.DatasetSink.deserialize(DatasetSink.java:330)
    at org.apache.flume.sink.kite.DatasetSink.process(DatasetSink.java:262)

The output record looks fine to me and conforms to the Avro schema, with the same string fields:

[{_attachment_body=[[B@94cf495], userId=[112233], userLogin=[abcd]}]

Please let me know if I should expect to see anything else in the output record. Thanks!

Cloudera Employee
Posts: 26
Registered: 07-08-2013

Re: Log parsing and loading to Hive/Impala tables

The Event needs to have a header set with the schema literal or a schema URL:

"The only supported serialization is avro, and the record schema must be passed in the event headers, using either flume.avro.schema.literal with the JSON schema representation or flume.avro.schema.url with a URL where the schema may be found (hdfs:/... URIs are supported). This is compatible with the Log4jAppender flume client and the spooling directory source’s Avro deserializer using deserializer.schemaType = LITERAL."

The JSON example uses an interceptor to set the header, so you could probably copy that:

https://github.com/kite-sdk/kite-examples/blob/master/json/flume.properties#L35-L41
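
In your setup that would be a static interceptor on the source that attaches the schema URL to every event, along these lines (a sketch; the src name and the .avsc path in HDFS are placeholders for wherever you store your schema):

# attach the Avro schema URL header that the DatasetSink requires
agent1.sources.src.interceptors = attach-schema
agent1.sources.src.interceptors.attach-schema.type = static
agent1.sources.src.interceptors.attach-schema.key = flume.avro.schema.url
agent1.sources.src.interceptors.attach-schema.value = hdfs:/user/flume/schemas/logs.avsc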

Expert Contributor
Posts: 139
Registered: 07-21-2014

Re: Log parsing and loading to Hive/Impala tables

Thanks, Joey!! I also had to whitelist 'flume*' in the removeFields command, and that got the Avro files stored to HDFS.
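
For anyone else who hits this, the morphline command ended up roughly like this (a sketch; the literal field names are the ones from my record above and will differ for other schemas):

# keep the flume.avro.schema.* headers plus the record's own fields;
# everything else is dropped
{
  removeFields {
    whitelist : ["glob:flume*", "literal:userId", "literal:userLogin", "literal:_attachment_body"]
  }
}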

Cloudera Employee
Posts: 26
Registered: 07-08-2013

Re: Log parsing and loading to Hive/Impala tables

That's great!

If you're interested in describing your final solution as a guest blog post, feel free to contact me directly: joey at cloudera.com.

-Joey
