Member since
08-15-2016
189
Posts
63
Kudos Received
22
Solutions
10-15-2018
12:06 PM
@dvillarreal Thanks for this, very useful! Changing the principal on the beeline connect string to "principal=HTTP/_HOST@SUPPORT.COM" is something I forgot when implementing this hiveserver2 access pattern.
05-01-2018
09:56 PM
Great article!
12-18-2017
09:46 AM
Don't forget to make changes on Zeppelin > Interpreter > Livy (& Livy2) as well. Add:
zeppelin.livy.ssl.trustStore /etc/path/to/your/truststore.jks
zeppelin.livy.ssl.trustStorePassword <password1234>
Also don't forget to change http -> https in this property:
zeppelin.livy.url https://your-host:8998
These properties apply to both the livy and livy2 interpreters.
12-18-2017
09:41 AM
I can confirm that on HDP-2.6.2.0-205 / Zeppelin 0.7.2, Zeppelin supports the Livy and Livy2 servers over SSL.
12-12-2017
12:08 PM
@Abraham Johnson @vpoornalingam There is still another reason and cure for this scenario (HDP-2.6.2.0-205). It can also happen that Ambari is looking for the pid files in the wrong place. In my case the pid files were actually located at:
/var/run/hadoop/hdfs-<clustername>/hadoop-hdfs-<clustername>-namenode.pid
while the ambari-agent would look at:
/var/run/hadoop/hdfs/hadoop-hdfs-hdfs-namenode.pid
In this state, with both the directory and the pid file name wrong, Ambari does not detect a running HDFS service, and you would also not be able to (re)start it. The pid file location is deduced from this snippet in hadoop-env.sh:
export HADOOP_PID_DIR={{hadoop_pid_dir_prefix}}/$USER
I have yet to find out why Ambari decided to change the value of $USER all of a sudden.
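When debugging this, a quick side-by-side of the expected and actual pid locations narrows things down fast. A hedged sketch; the expected path below is the default from my case, so adjust the prefix for your cluster:

```shell
# Compare where ambari-agent expects the NameNode pid file with the pid
# files actually present on disk under the pid prefix.
PID_PREFIX=/var/run/hadoop
EXPECTED=$PID_PREFIX/hdfs/hadoop-hdfs-hdfs-namenode.pid

echo "ambari-agent expects: $EXPECTED"
echo "pid files actually present:"
find "$PID_PREFIX" -name '*-namenode.pid' 2>/dev/null || true
```

If the two differ, either fix the $USER resolution in hadoop-env.sh or, as a stopgap, symlink the real pid file to the expected location.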
10-09-2017
06:46 PM
4 Kudos
How you can easily create a Kibana dashboard for monitoring your Metron deployment

Suppose you have all your Metron topologies up-and-running and error free. Your work is not finished then. Metron is meant to be a (near) real time alerting system, so how do you guarantee that events flow through the Metron chain of Kafka topics and Storm topologies in a timely manner? What if some events belonging to some threat are hung up in one of the Kafka topics? How do you keep track of topology performance when the throughput of Metron increases over time?

The way to check this is to constantly monitor the consumer lags of the Kafka topics, like this:
/usr/hdp/2.5.3.0-37/kafka/bin/kafka-consumer-offset-checker.sh --zookeeper $ZOOKEEPER_HOST:2181 --security-protocol SASL_PLAINTEXT --topic indexing --group indexing
[2017-10-09 21:10:20,278] WARN WARNING: ConsumerOffsetChecker is deprecated and will be dropped in releases following 0.9.0. Use ConsumerGroupCommand instead. (kafka.tools.ConsumerOffsetChecker$)
Group Topic Pid Offset logSize Lag Owner
indexing indexing 0 511907295 511907378 83 none
indexing indexing 1 511907258 511907387 129 none
indexing indexing 2 511907294 511907381 87 none
indexing indexing 3 511907138 511907384 246 none
indexing indexing 4 511907226 511907378 152 none
indexing indexing 5 511907271 511907376 105 none
indexing indexing 6 511907250 511907381 131 none
indexing indexing 7 511907249 511907379 130 none
indexing indexing 8 511907248 511907383 135 none
indexing indexing 9 511907283 511907381 98 none
indexing indexing 10 511907260 511907383 123 none
indexing indexing 11 511907127 511907379 252 none
We'll assume the cluster is Kerberized. Or alternatively:
echo "security.protocol=SASL_PLAINTEXT" > /tmp/consgroupcmd.prop
/usr/hdp/2.5.3.0-37/kafka/bin/kafka-consumer-groups.sh --bootstrap-server $BROKER_LIST --new-consumer --describe --command-config /tmp/consgroupcmd.prop --group enrichments
GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG
enrichments enrichments 8 260926936 260927064 128
enrichments enrichments 3 260926857 260927060 203
enrichments enrichments 0 260927011 260927062 51
enrichments enrichments 1 260926916 260927068 152
enrichments enrichments 6 260926971 260927059 88
enrichments enrichments 10 260926914 260927066 152
enrichments enrichments 9 260926970 260927059 89
enrichments enrichments 11 260926978 260927066 88
enrichments enrichments 4 260926985 260927066 81
enrichments enrichments 5 260926945 260927067 122
enrichments enrichments 7 260926979 260927066 87
enrichments enrichments 2 260926984 260927065 81
[2017-10-09 21:32:28,060] WARN TGT renewal thread has been interrupted and will exit. (org.apache.kafka.common.security.kerberos.KerberosLogin)
The downside of this last command, which uses Kafka's ConsumerGroupCommand, is that it will only return output when the consumer group is active.

In this case both the indexing and the enrichments consumer groups look healthy, as the lag is only around 100-200 per partition. Low lags are good, while high consumer lags may signal some kind of problem:
- The topology might not have enough resources (anymore) to cope with the high input (Kafka production) into its source topic
- The topology might be deactivated or crashed, or have a problem with its KafkaSpout configuration

When your consumer lag keeps increasing, you might run into a state where the topic offset it was consuming from is no longer on disk, because the topic aged out per its retention settings (either 'retention.ms' or 'retention.bytes' was exceeded). You should always configure the topic to prevent this from happening. If it happens nonetheless, manual intervention is required to set things in motion again.

But the most compelling argument to start monitoring Metron consumer lags is that you don't want the events signaling a threat to be queued in some Kafka topic behind millions of other events. By the time the alert eventually reaches the alerts UI, it could already be too late to contain the threat.

It makes sense to schedule the kafka-consumer-offset-checker command and push the results into Elasticsearch (ES) to populate a dashboard. In this how-to I am using a bash script to collect the metrics, leverage Logstash (also part of the ES stack) to send the results to ES, and use Kibana to visualize. The script will be like this:

#!/bin/bash
BROKER_LIST=broker-1:6667,broker-2:6667,broker-3:6667
ZOOKEEPER_HOSTS=zk-host1:2181,zk-host2:2181,zk-host3:2181
KAFKA_HOME=/usr/hdp/2.5.3.0-37/kafka
kinit -kt /etc/security/keytabs/metron.headless.keytab metron@<YOUR-REALM>
get_consumer_lag () {
TOPIC=`echo $1 | cut -d':' -f1`
GROUP=`echo $1 | cut -d':' -f2`
$KAFKA_HOME/bin/kafka-run-class.sh kafka.tools.GetOffsetShell --security-protocol SASL_PLAINTEXT --topic $TOPIC --broker-list $BROKER_LIST --time -2 | sed -n "/$TOPIC/p" | sed -e "s/:/\t/g" > /tmp/kafka_min_offset
TOPIC_MIN_OFFSET=`cat /tmp/kafka_min_offset | cut -f3 | paste -sd+ | bc`
$KAFKA_HOME/bin/kafka-consumer-offset-checker.sh --zookeeper $ZOOKEEPER_HOSTS --security-protocol SASL_PLAINTEXT --topic $TOPIC --group $GROUP | tr -s " " | cut -d' ' --output-delimiter=' ' -f 1,3,4,5,6 | sed -e '1d' > /tmp/kafka_lag_summary
TOPIC_MAX_OFFSET=`cat /tmp/kafka_lag_summary | cut -f4 | paste -sd+ | bc`
CONSUMER_OFFSET=`cat /tmp/kafka_lag_summary | cut -f3 | paste -sd+ | bc`
CONSUMER_LAG=`cat /tmp/kafka_lag_summary | cut -f5 | paste -sd+ | bc`
TOPIC_SIZE=$(( $TOPIC_MAX_OFFSET-$TOPIC_MIN_OFFSET ))
CONSUMER_REL_MIN=`echo "scale=2; ( $CONSUMER_OFFSET-$TOPIC_MIN_OFFSET ) / $TOPIC_SIZE" | bc -l`
#CONSUMER_GROUP TOPIC TOPIC_MIN_OFFSET TOPIC_MAX_OFFSET TOPIC_SIZE CONSUMER_OFFSET CONSUMER_REL_MIN CONSUMER_LAG
echo "$GROUP $TOPIC $TOPIC_MIN_OFFSET $TOPIC_MAX_OFFSET $TOPIC_SIZE $CONSUMER_OFFSET $CONSUMER_REL_MIN $CONSUMER_LAG"
}
for CONSUMER_TOPIC_AND_GROUP in "suricata:suricata_parser" "checkpoint_lea:checkpoint_lea_parser" "enrichments:enrichments" "indexing:profiler" "indexing:indexing" "indexing:metron"
do
get_consumer_lag $CONSUMER_TOPIC_AND_GROUP
done
1. Copy the script, customize the BROKER_LIST and ZOOKEEPER_HOSTS variables and the list of "<topic>:<consumer_group>" combinations to loop over. Save it at a location; I will save it at /home/ubuntu/kafka_summary.sh
2. Don't forget to make the new script executable.
Some explanation: the script fires 2 Kafka CLI commands per <topic>:<consumer_group>. Moreover it strips the stdout, makes the data columnar, aggregates over topic partitions (if any) and performs some calculations to arrive at 1 summary line.
3. Run the script once and check the output. For each <topic>:<cons_group> combination, it should look like this:
[2017-10-09 22:47:22,432] WARN TGT renewal thread has been interrupted and will exit. (org.apache.kafka.common.security.kerberos.KerberosLogin)
[2017-10-09 22:47:23,328] WARN WARNING: ConsumerOffsetChecker is deprecated and will be dropped in releases following 0.9.0. Use ConsumerGroupCommand instead. (kafka.tools.ConsumerOffsetChecker$)
indexing indexing 5129160011 6143890404 1014730393 6143708348 .99 182056
Let's focus on the line with the numbers. The columns should be read as:
CONSUMER_GROUP TOPIC TOPIC_MIN_OFFSET TOPIC_MAX_OFFSET TOPIC_SIZE CONSUMER_OFFSET CONSUMER_REL_MIN CONSUMER_LAG
The columns that need some explanation:
TOPIC_MIN_OFFSET - The lowest available topic offset in Kafka. The data comes from the GetOffsetShell tool. This offset will increase as Kafka ages out log segments according to the retention settings. The start offsets of all partitions are aggregated.
TOPIC_MAX_OFFSET - The ever increasing offset at the most recent end of the topic, as new events are produced. The offsets of all partitions are aggregated.
TOPIC_SIZE - TOPIC_MAX_OFFSET - TOPIC_MIN_OFFSET
CONSUMER_OFFSET - The offset the consumer group is currently at, aggregated over all partitions.
CONSUMER_REL_MIN - Indicates how far the consumer has progressed relative to TOPIC_MIN_OFFSET. The formula is ( $CONSUMER_OFFSET-$TOPIC_MIN_OFFSET ) / $TOPIC_SIZE. This is a measure of how far the consumer is away from 'the edge' where data is deleted from the 'old' end of the topic. When that happens, event data is lost. High values (0.99) signal that the consumer is as far from the edge as it can be, while low values (< 0.10) call for immediate action.
CONSUMER_LAG - How far the consumer group is behind TOPIC_MAX_OFFSET, aggregated over all partitions.
All these metrics could also be evaluated at the partition level, but I chose NOT to keep that level of detail as it complicates matters dramatically later on in Kibana. As a first defense against consumer problems, aggregating over partitions is good enough. Just beware that sometimes lag can be high for one partition while the consumer only has a small lag on another; usually this is the harbinger of greater problems shortly after.
The line chart below shows the time series for just the CONSUMER_LAG metric over 36 hours in Kibana. The kafka_summary.sh script was run every 5 minutes. Now we can get some interesting insights about our Metron deployment.
The y-axis (consumer lag) has a 'square root' scale type, so values in the upper part of the graph are actually worse than they appear. The consumer groups 'suricata_parser' and 'checkpoint_lea_parser' make up the parser topologies. Both have negligible consumer lags over time, so they seem more than capable of coping with the incoming volume of sensor events. The same is true for the indexing topology.

In this instance the enrichment topology was restarted and configured to reprocess the entire enrichments topic from the earliest available offset in Kafka (start all over). Although the lag was very high in the beginning, at 800 million events, the topology was able to catch up after 19 hours of processing, at roughly 6:30 in the morning. This actually tells us something interesting about the maximum capacity of the enrichment topology; something you would want to know anyway before taking Metron to production. After catching up it was able to keep up, apart from a few non-serious spikes between 10:00 and 14:00.

The blue consumer named 'metron' is exemplary of how things can go wrong if you don't monitor. This group consumes from the indexing topic and is actually not a Metron-related consumer but a NiFi consumer that takes the data elsewhere. We can see that the lag for 'metron' increases at roughly the same rate as the lag for 'enrichments' decreases, because they share the indexing topic as source and sink. Worrying is that 'metron' cannot keep the pace, by far, at least not until 14:00 on the second day.

So what happened at around 16:00 on the first day, when the lag was at a whopping 1 billion (!!) events? It seems as though the NiFi process suddenly came alive and consumed and committed a huge volume within 15 minutes. But further investigation showed that this was not what happened. Actually, 'old' topic log segments were deleted by Kafka retention policies at the point where the slow NiFi process was consuming.
The consumer offset was lost, and the auto.offset.reset property in the NiFi Kafka consumer made it revert to the latest available offset in the indexing topic. This stresses the point of monitoring consumer lags closely: data and possible threats could have been lost in this sudden leap through the topic. This should be prevented at all costs.

After the big leap the consumer was still not fast enough to keep up with the indexing topic, and again had a lag of nearly 500 million 17 hours later. Then the NiFi process got tuned, which increased its capacity. Moreover, the influx into the indexing topic diminished sharply after the enrichment topology caught up at 07:00, so the NiFi process was under far less stress. It should still be kept under scrutiny and may need further tuning.

So we have seen that monitoring the consumer lags can be very telling about the Metron topologies and other processes that have a relation to the Metron Kafka topics. In the next part of this series I will show how to embed the script kafka_summary.sh in a Logstash pipeline to push the metrics into ES and from there into a Kibana dashboard.
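To get a feel for how such a pipeline could hang together, here is a minimal sketch using Logstash's exec input to run the collection script on the 5-minute interval used for the chart above. This is only an illustration, not the author's actual part 2: the ES host, index name, and field names are assumptions.

```text
input {
  exec {
    # Run the collection script every 300 s (5 minutes)
    command => "/home/ubuntu/kafka_summary.sh"
    interval => 300
  }
}
filter {
  # The script emits one summary line per <topic>:<consumer_group>;
  # split them into separate events and parse the 8 columns.
  split { }
  grok {
    match => { "message" => "%{NOTSPACE:group} %{NOTSPACE:topic} %{NUMBER:topic_min_offset:int} %{NUMBER:topic_max_offset:int} %{NUMBER:topic_size:int} %{NUMBER:consumer_offset:int} %{NUMBER:consumer_rel_min:float} %{NUMBER:consumer_lag:int}" }
  }
}
output {
  elasticsearch {
    hosts => ["es-host:9200"]                      # assumption: your ES endpoint
    index => "kafka-consumer-lag-%{+YYYY.MM.dd}"   # assumption: daily index naming
  }
}
```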
06-13-2017
10:37 PM
@Vedant Jain Very interesting article, and useful. I am looking, though, for a way to do the same for new consumers that store their offsets in the __consumer_offsets topic. Do you know how to manage that offset?
12-29-2016
04:42 PM
3 Kudos
Running a hadoop client on Mac OS X and connecting to a Kerberized cluster poses some extra challenges.
I suggest using brew, the Mac package manager, to conveniently install the Hadoop package:
$ brew search hadoop
$ brew install hadoop
This will install the latest (Apache) Hadoop distro (2.7.3 at the time of writing). Minor version differences from your HDP version will not matter.
You may test the installation by running a quick 'hdfs dfs -ls /' on HDFS. Without further configuration a local single-node 'cluster' will be assumed.
We now have to point the client to the real HDP cluster. In order to do so, copy the full contents of the config files below from any HDP node:
Source:
/etc/hadoop/{hdp-version}/0/hadoop-env.sh
/etc/hadoop/{hdp-version}/0/core-site.xml
/etc/hadoop/{hdp-version}/0/hdfs-site.xml
/etc/hadoop/{hdp-version}/0/yarn-site.xml
Target:
/usr/local/Cellar/hadoop/2.7.3/libexec/etc/hadoop/hadoop-env.sh
/usr/local/Cellar/hadoop/2.7.3/libexec/etc/hadoop/core-site.xml
/usr/local/Cellar/hadoop/2.7.3/libexec/etc/hadoop/hdfs-site.xml
/usr/local/Cellar/hadoop/2.7.3/libexec/etc/hadoop/yarn-site.xml
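The copy can be scripted. A dry-run sketch (the hostname and HDP version are placeholders for your environment; drop the 'echo' to actually copy):

```shell
# Print the scp commands that would copy the four client configs from an HDP
# node into the brew-installed Hadoop. Hostname and version are placeholders.
HDP_NODE=hdp-node.example.com
HDP_VERSION=2.x.y.z-NNN
SRC=/etc/hadoop/${HDP_VERSION}/0
DST=/usr/local/Cellar/hadoop/2.7.3/libexec/etc/hadoop

for f in hadoop-env.sh core-site.xml hdfs-site.xml yarn-site.xml; do
  echo scp "${HDP_NODE}:${SRC}/${f}" "${DST}/${f}"   # drop 'echo' to run for real
done
```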
If we now try to access the Kerberized cluster we get an error like below:
Caused by: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:687)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:650)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:737)
at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1528)
at org.apache.hadoop.ipc.Client.call(Client.java:1451)
... 28 more
Sure, we need to kinit first, so we do:
$ kinit test@A.EXAMPLE.COM
test@A.EXAMPLE.COM's password:
$ hdfs dfs -ls /
We still get the same error, so what is going on?
It makes sense to add this extra option (-Dsun.security.krb5.debug=true) to hadoop-env.sh now, to enable Kerberos debug log output:
export HADOOP_OPTS="-Djava.net.preferIPv4Stack=true -Dsun.security.krb5.debug=true ${HADOOP_OPTS}"
Now the debug output provides some clues:
$ hdfs dfs -ls /
Java config name: null
Native config name: /Library/Preferences/edu.mit.Kerberos
Loaded from native config
16/12/29 17:02:51 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>>KinitOptions cache name is /tmp/krb5cc_502
>> Acquire default native Credentials
default etypes for default_tkt_enctypes: 23 16.
>>> Found no TGT's in LSA
By default the HDFS client looks for Kerberos tickets at /tmp/krb5cc_502, where '502' is the uid of the relevant user. The other thing to look at is 'Native config name: /Library/Preferences/edu.mit.Kerberos'; this is where your local Kerberos configs are sourced from. Another valid config source would be '/etc/krb5.conf', depending on your local installation. You can mirror this local config from the /etc/krb5.conf file of any HDP node.
Now if we look at the default ticket cache on a Mac OS X it seems to point to another location:
$ klist
Credentials cache: API:XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXXX
Principal: test@A.EXAMPLE.COM
Issued Expires Principal
Dec 29 17:02:45 2016 Dec 30 03:02:45 2016 krbtgt/A.EXAMPLE.COM@A.EXAMPLE.COM
The pointer 'API:XXXXXX-XXXXX-XXXX-XXXXX' signals Mac OS X's memory-based credential cache for Kerberos. On a *nix distro it would typically say something like 'Ticket cache: FILE:/tmp/krb5cc_502'. The location to store the ticket cache can be set by the environment variable KRB5CCNAME (FILE: / DIR: / API: / KCM: / MEMORY:), but that is beyond the scope of this article. This is why the HDFS client could not find any ticket.
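As an alternative to passing '-c' to kinit each time, the cache location can also be pinned for the whole shell session via KRB5CCNAME. A sketch; the uid-based filename mirrors the default path the Hadoop client looks for:

```shell
# Point all Kerberos tools in this session at the file-based ticket cache the
# Hadoop client expects (/tmp/krb5cc_<uid>). id -u yields the current user's uid.
export KRB5CCNAME="FILE:/tmp/krb5cc_$(id -u)"
echo "$KRB5CCNAME"
# Subsequent kinit / klist / hdfs invocations in this shell will use this cache.
```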
If the HDFS client looks for the ticket cache at '/tmp/krb5cc_502', we can simply make Mac OS X cache a validated Kerberos ticket there, like this:
$ kinit -c FILE:/tmp/krb5cc_502 test@A.EXAMPLE.COM
test@A.EXAMPLE.COM's password:
Or likewise with a keytab:
$ kinit -c FILE:/tmp/krb5cc_502 -kt ~/Downloads/smokeuser.headless.keytab ambari-qa-socgen_shadow@MIT.KDC.COM
Check the ticket cache the same way:
$ klist -c /tmp/krb5cc_502
Credentials cache: FILE:/tmp/krb5cc_502
Principal: test@A.EXAMPLE.COM
Issued Expires Principal
Dec 29 17:31:29 2016 Dec 30 03:31:29 2016 krbtgt/A.EXAMPLE.COM@A.EXAMPLE.COM
If you try to list hdfs again now it should look something like this:
$ hdfs dfs -ls /user
Java config name: null
Native config name: /Library/Preferences/edu.mit.Kerberos
Loaded from native config
16/12/29 17:34:27 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>>KinitOptions cache name is /tmp/krb5cc_502
>>>DEBUG <CCacheInputStream> client principal is test@A.EXAMPLE.COM
>>>DEBUG <CCacheInputStream> server principal is krbtgt/A.EXAMPLE.COM@A.EXAMPLE.COM
>>>DEBUG <CCacheInputStream> key type: 18
>>>DEBUG <CCacheInputStream> auth time: Thu Dec 29 17:31:29 CET 2016
>>>DEBUG <CCacheInputStream> start time: Thu Dec 29 17:31:29 CET 2016
>>>DEBUG <CCacheInputStream> end time: Fri Dec 30 03:31:29 CET 2016
>>>DEBUG <CCacheInputStream> renew_till time: Thu Jan 05 17:31:27 CET 2017
>>> CCacheInputStream: readFlags() FORWARDABLE; RENEWABLE; INITIAL; PRE_AUTH;
>>>DEBUG <CCacheInputStream> client principal is test@A.EXAMPLE.COM
>>>DEBUG <CCacheInputStream> server principal is X-CACHECONF:/krb5_ccache_conf_data/fast_avail/krbtgt/A.EXAMPLE.COM@A.EXAMPLE.COM@MIT.KDC.COM
>>>DEBUG <CCacheInputStream> key type: 0
>>>DEBUG <CCacheInputStream> auth time: Thu Dec 29 17:31:21 CET 2016
>>>DEBUG <CCacheInputStream> start time: null
>>>DEBUG <CCacheInputStream> end time: Thu Dec 29 17:31:21 CET 2016
>>>DEBUG <CCacheInputStream> renew_till time: null
>>> CCacheInputStream: readFlags()
>>> KrbCreds found the default ticket granting ticket in credential cache.
>>> Obtained TGT from LSA: Credentials:
client=test@A.EXAMPLE.COM
server=krbtgt/A.EXAMPLE.COM@A.EXAMPLE.COM
authTime=20161229163129Z
startTime=20161229163129Z
endTime=20161230023129Z
renewTill=20170105163127Z
flags=FORWARDABLE;RENEWABLE;INITIAL;PRE-AUTHENT
EType (skey)=18
(tkt key)=18
16/12/29 17:34:30 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
Found ticket for test@A.EXAMPLE.COM to go to krbtgt/A.EXAMPLE.COM@A.EXAMPLE.COM expiring on Fri Dec 30 03:31:29 CET 2016
Entered Krb5Context.initSecContext with state=STATE_NEW
Found ticket for test@A.EXAMPLE.COM to go to krbtgt/A.EXAMPLE.COM@A.EXAMPLE.COM expiring on Fri Dec 30 03:31:29 CET 2016
Service ticket not found in the subject
>>> Credentials acquireServiceCreds: main loop: [0] tempService=krbtgt/MIT.KDC.COM@A.EXAMPLE.COM
default etypes for default_tgs_enctypes: 23 16.
>>> CksumType: sun.security.krb5.internal.crypto.RsaMd5CksumType
>>> EType: sun.security.krb5.internal.crypto.Aes256CtsHmacSha1EType
>>> KdcAccessibility: reset
......
....S H O R T E N E D..
......
Found 4 items
drwxrwx--- - ambari-qa hdfs 0 2016-12-19 21:56 /user/ambari-qa
drwxr-xr-x - centos centos 0 2016-11-30 12:07 /user/centos
drwx------ - hdfs hdfs 0 2016-11-29 12:38 /user/hdfs
drwxrwxrwx - j.knulst hdfs 0 2016-12-29 13:40 /user/j.knulst
So directing your Kerberos tickets on Mac OS X to the anticipated ticket cache with the '-c' switch will help a lot.
10-20-2016
09:28 PM
@Ana Gillan This approach works! I have done the HDFS and Hive services now. There is always something to wish for, it seems 🙂 For Hive, it turns out that the deny/allow conditions extension only appears on the "Access" tab of the Hive policies, not on the "Masking" and "Row Filter" tabs. Can this be done as well? Or is it on the roadmap?
10-20-2016
09:24 PM
@Ana Gillan No, it works now. It was an error in the json file that caused the 404.