<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Druid kafka ingestion from Hive - HDP 3.0 - Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Druid-kafka-ingestion-from-Hive-HDP-3-0/m-p/226446#M84706</link>
    <description>&lt;P&gt;Question: ingesting JSON events from a Kafka topic into Druid through a Hive 3 external table (DruidStorageHandler) on HDP 3.0. The Druid indexing task starts but every event is unparseable; the resolution is that the timestamp field must be named __time and all JSON field and column names must be lowercase.&lt;/P&gt;</description>
    <pubDate>Fri, 16 Sep 2022 13:51:53 GMT</pubDate>
    <dc:creator>mlamairesse</dc:creator>
    <dc:date>2022-09-16T13:51:53Z</dc:date>
    <item>
      <title>Druid kafka ingestion from Hive - HDP 3.0</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Druid-kafka-ingestion-from-Hive-HDP-3-0/m-p/226446#M84706</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I'm trying to ingest event data from Kafka to Druid using the new Hive/Druid/Kafka integration in Hive 3&lt;BR /&gt;(see &lt;A href="https://cwiki.apache.org/confluence/display/Hive/Druid+Integration" rel="nofollow noopener noreferrer" target="_blank"&gt;https://cwiki.apache.org/confluence/display/Hive/Druid+Integration&lt;/A&gt;, section "Druid Kafka Ingestion from Hive").&lt;/P&gt;&lt;P&gt;I've got events in JSON format in a Kafka topic with the following structure:&lt;/P&gt;&lt;PRE&gt;{
  "timestamp": "2018-11-04T22:43:10Z",
  "machine1": "RXI901",
  "machine2": "RXI902",
  "priority": "74",
  "level": "[e.warning]",
  "machine3": "RXI900",
  "Protocol": "TCP",
  "SrcIP": "109.26.211.73",
  "OriginalClientIP": "::",
  "DstIP": "192.168.104.96",
  "SrcPort": "36711",
  "DstPort": "54",
  "TCPFlags": "0x0",
  "IngressInterface": "s3p4",
  "EgressInterface": "s3p3",
  "IngressZone": "INUTILISE",
  "EgressZone": "INUTILISE",
  "DE": "Primary Detection Engine (f77608a0-0e20-11e6-91d7-88d7e001637c)",
  "Policy": "Default Access Control",
  "ConnectType": "Start",
  "AccessControlRuleName": "Unknown",
  "AccessControlRuleAction": "Allow",
  "PrefilterPolicy": "Unknown",
  "UserName": "No Authentication Required",
  "InitiatorPackets": 1,
  "ResponderPackets": 0,
  "InitiatorBytes": 80,
  "ResponderBytes": 0,
  "NAPPolicy": "Network Analysis",
  "DNSResponseType": "No Error",
  "Sinkhole": "Unknown",
  "URLCategory": "Unknown",
  "URLReputation": "Risk unknown"
}&lt;/PRE&gt;&lt;P&gt;To ingest them from Kafka, I've created the following external table in Hive, matching the JSON structure of the messages:&lt;/P&gt;&lt;PRE&gt;CREATE EXTERNAL TABLE ssh_druid_kafka (
 `__time` timestamp,
 `machine1` string,
 `machine2` string,
 `priority` string,
 `level` string,
 `machine3` string,
 `Protocol` string,
 `SrcIP` string,
 `OriginalClientIP` string,
 `DstIP` string,
 `SrcPort` string,
 `DstPort` string,
 `TCPFlags` string,
 `IngressInterface` string,
 `EgressInterface` string,
 `IngressZone` string,
 `EgressZone` string,
 `DE` string,
 `Policy` string,
 `ConnectType` string,
 `AccessControlRuleName` string,
 `AccessControlRuleAction` string,
 `PrefilterPolicy` string,
 `UserName` string,
 `InitiatorPackets` int,
 `ResponderPackets` int,
 `InitiatorBytes` int,
 `ResponderBytes` int,
 `NAPPolicy` string,
 `DNSResponseType` string,
 `Sinkhole` string,
 `URLCategory` string,
 `URLReputation` string
)
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
 TBLPROPERTIES (
 "kafka.bootstrap.servers" = "[kafka host]:6667",
 "kafka.topic" = "log_schema_raw",
 "druid.kafka.ingestion.useEarliestOffset" = "true",
 "druid.kafka.ingestion.maxRowsInMemory" = "20",
 "druid.kafka.ingestion.startDelay" = "PT5S",
 "druid.kafka.ingestion.period" = "PT30S",
 "druid.kafka.ingestion.consumer.retries" = "2"
);

ALTER TABLE ssh_druid_kafka SET TBLPROPERTIES("druid.kafka.ingestion" = 'START');
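
-- Note: per the Hive wiki page linked above, "druid.kafka.ingestion" is
-- believed to also accept 'STOP' and 'RESET', to pause the Druid Kafka
-- supervisor or reset its consumer offsets, e.g.:
--   ALTER TABLE ssh_druid_kafka SET TBLPROPERTIES("druid.kafka.ingestion" = 'STOP');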
&lt;/PRE&gt;&lt;P&gt;I'm getting an indexing task in the Druid supervisor...&lt;/P&gt;&lt;P&gt;&lt;img src="https://community.cloudera.com/t5/image/serverpage/image-id/15359iC730E2B341858527/image-size/medium?v=v2&amp;amp;px=400" alt="93517-screen-shot-2018-11-05-at-000525.png" /&gt;&lt;/P&gt;&lt;P&gt;=&amp;gt; but no data source in the Druid Broker 😞&lt;/P&gt;&lt;P&gt;On closer look at the task logs in the Druid Supervisor, I see parsing errors:&lt;/P&gt;&lt;PRE&gt;2018-11-04T23:06:06,305 ERROR [MonitorScheduler-0] io.druid.segment.realtime.RealtimeMetricsMonitor - [60] Unparseable events! Turn on debug logging to see exception stack trace.
2018-11-04T23:09:06,306 ERROR [MonitorScheduler-0] io.druid.segment.realtime.RealtimeMetricsMonitor - [60] Unparseable events! Turn on debug logging to see exception stack trace.
...&lt;/PRE&gt;&lt;P&gt;&lt;BR /&gt;Questions:&lt;/P&gt;&lt;P&gt;1. How do I enable debug logging on tasks?&lt;BR /&gt;=&amp;gt; I've tried setting the log4j level to DEBUG in the Ambari Druid tab. That does affect the log levels of the components, but it doesn't seem to affect the indexing tasks.&lt;/P&gt;&lt;P&gt;2. What format does Druid expect when using the Kafka Indexing Service?&lt;/P&gt;&lt;P&gt;Am I missing something?&lt;/P&gt;&lt;P&gt;Thanks for your help.&lt;/P&gt;</description>
      <pubDate>Fri, 16 Sep 2022 13:51:53 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Druid-kafka-ingestion-from-Hive-HDP-3-0/m-p/226446#M84706</guid>
      <dc:creator>mlamairesse</dc:creator>
      <dc:date>2022-09-16T13:51:53Z</dc:date>
    </item>
    <item>
      <title>Re: Druid kafka ingestion from Hive - HDP 3.0</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Druid-kafka-ingestion-from-Hive-HDP-3-0/m-p/226447#M84707</link>
      <description>&lt;P&gt;Looked at the code, and in its current state the timestamp column is hard-coded to be __time; that is why you are getting the exceptions, since your field is called timestamp.&lt;/P&gt;&lt;P&gt;&lt;A href="https://github.com/apache/hive/blob/a51e6aeaf816bdeea5e91ba3a0fab8a31b3a496d/druid-handler/src/java/org/apache/hadoop/hive/druid/DruidStorageHandler.java#L301" target="_blank"&gt;https://github.com/apache/hive/blob/a51e6aeaf816bdeea5e91ba3a0fab8a31b3a496d/druid-handler/src/java/org/apache/hadoop/hive/druid/DruidStorageHandler.java#L301&lt;/A&gt;&lt;/P&gt;&lt;P&gt;If this is the case, it is a serious limitation and needs to be fixed. &lt;A rel="user" href="https://community.cloudera.com/users/10777/nbangarwa.html" nodeid="10777"&gt;@Nishant Bangarwa&lt;/A&gt;, what do you think?&lt;/P&gt;
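&lt;P&gt;Concretely (a minimal sketch, assuming the handler really does key the timestamp off a field named __time), the Kafka messages would need to carry __time rather than timestamp:&lt;/P&gt;&lt;PRE&gt;{
  "__time": "2018-11-04T22:43:10Z",
  "machine1": "RXI901",
  ...
}&lt;/PRE&gt;</description>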
      <pubDate>Wed, 07 Nov 2018 04:19:55 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Druid-kafka-ingestion-from-Hive-HDP-3-0/m-p/226447#M84707</guid>
      <dc:creator>sbouguerra</dc:creator>
      <dc:date>2018-11-07T04:19:55Z</dc:date>
    </item>
    <item>
      <title>Re: Druid kafka ingestion from Hive - HDP 3.0</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Druid-kafka-ingestion-from-Hive-HDP-3-0/m-p/226448#M84708</link>
      <description>&lt;P&gt;Hi &lt;A rel="user" href="https://community.cloudera.com/users/12341/sbouguerra.html" nodeid="12341"&gt;@Slim&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Thanks, yep, that was it.&lt;/P&gt;&lt;P&gt;Another little quirk, identified with the help of Charles Bernard:&lt;/P&gt;&lt;P&gt;- All names in the JSON object must be in &lt;STRONG&gt;lower case&lt;/STRONG&gt; for them to be parsed&lt;/P&gt;&lt;P&gt;- A corollary is that all column names must also be in lower case (see the sketch below)&lt;/P&gt;
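&lt;P&gt;Putting both fixes together, a corrected version of the table from the question would look roughly like this (a sketch, abbreviated to a few columns; the full list follows the original DDL, just lowercased):&lt;/P&gt;&lt;PRE&gt;CREATE EXTERNAL TABLE ssh_druid_kafka (
 `__time` timestamp,
 `machine1` string,
 `protocol` string,
 `srcip` string,
 `initiatorpackets` int,
 -- ... remaining columns from the original DDL, all lowercase ...
 `urlreputation` string
)
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES (
 "kafka.bootstrap.servers" = "[kafka host]:6667",
 "kafka.topic" = "log_schema_raw",
 "druid.kafka.ingestion.useEarliestOffset" = "true"
);&lt;/PRE&gt;</description>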
      <pubDate>Thu, 08 Nov 2018 03:05:21 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Druid-kafka-ingestion-from-Hive-HDP-3-0/m-p/226448#M84708</guid>
      <dc:creator>mlamairesse</dc:creator>
      <dc:date>2018-11-08T03:05:21Z</dc:date>
    </item>
    <item>
      <title>Re: Druid kafka ingestion from Hive - HDP 3.0</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Druid-kafka-ingestion-from-Hive-HDP-3-0/m-p/226449#M84709</link>
      <description>&lt;P&gt;Correct, &lt;A rel="user" href="https://community.cloudera.com/users/13183/mlamairesse.html" nodeid="13183"&gt;@Matthieu Lamairesse&lt;/A&gt;: Druid is case-sensitive while Hive is not, so to make it work you need to make sure that all the columns are lowercase.&lt;/P&gt;</description>
      <pubDate>Thu, 08 Nov 2018 03:07:13 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Druid-kafka-ingestion-from-Hive-HDP-3-0/m-p/226449#M84709</guid>
      <dc:creator>sbouguerra</dc:creator>
      <dc:date>2018-11-08T03:07:13Z</dc:date>
    </item>
  </channel>
</rss>

