Member since
08-15-2016
189
Posts
63
Kudos Received
22
Solutions
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 2819 | 01-02-2018 09:11 AM |
 | 1149 | 12-04-2017 11:37 AM |
 | 1110 | 10-03-2017 11:52 AM |
 | 14974 | 09-20-2017 09:35 PM |
 | 798 | 09-12-2017 06:50 PM |
01-22-2019
07:48 AM
@Krzysztof Zarzycki If I remember correctly, it was quite simple in the end. Just make sure you have the snippet below in your topology: <provider>
<role>identity-assertion</role>
<name>Default</name>
<enabled>true</enabled>
</provider>
This should get you the default behaviour where Knox propagates your identity forward to the proxied service instead of its own.
... View more
12-05-2018
04:14 PM
@Ramisetty Venkatesh It's easy, just execute which kinit
on *nix and the output is what should go into the hook script part at:
echo "The cluster is secure, calling kinit ..."
kinit_cmd="/usr/bin/kinit -kt $HDFS_KEYTAB $HDFS_PRINCIPAL"
But 9 times out of 10 /usr/bin/kinit will be just fine.
... View more
12-04-2018
12:09 PM
@Ricardo Junior Thanks for your answer to yourself 🙂 It helped me after many, many hours of Kafka debugging. BTW, in my case it was exactly the same scenario: Kerberos -> De-Kerberize -> Re-Kerberize. Thanks
... View more
11-21-2018
10:16 PM
@Arindam Choudhury I think the only way to do it is to set the Hiveserver2 to 'No authentication' mode. If you really want anyone to connect anonymously that is what you could do.
... View more
11-21-2018
09:02 PM
@Amit Nandi I almost gave up on performing the last step with Hive, thinking I needed Spark/Scala to do it. And then it just worked. But the same can no doubt be done with Spark. Maybe next time.
... View more
11-21-2018
06:47 PM
3 Kudos
THIS IS ACTUALLY AN ARTICLE, NOT A QUESTION. I want to show you the power of some built-in Hive functions to transform JSON data that is 'normalized' (optimized for transport) into a denormalized format which is much more suitable for data analysis. This demo has been tested on HDP-3.0.1.0 with Hive 3.1.0, but should be portable to lower Hive versions. Suppose we have this inbound data, which might represent some inbound experiment test data: {
"data" : {
"receipt_time" : "2018-09-28T10:00:00.000Z",
"site" : "Los Angeles",
"measures" : [ {
"test_id" : "C23_PV",
"metrics" : [ {
"val1" : [ 0.76, 0.75, 0.71 ],
"temp" : [ 0, 2, 5 ],
"TS" : [ 1538128801336, 1538128810408, 1538128818420 ]
} ]
},
{
"test_id" : "HBI2_XX",
"metrics" : [ {
"val1" : [ 0.65, 0.71 ],
"temp" : [ 1, -7],
"TS" : [ 1538128828433, 1538128834541 ]
} ]
}]
}
}
There are 3 nested arrays in this 1 JSON record. It is pretty-printed above to give a feel for the data structure, but remember we need to feed it to Hive as 1 line per JSON record only:
{"data":{"receipt_time":"2018-09-28T10:00:00.000Z","site":"LosAngeles","measures":[{"test_id":"C23_PV","metrics":[{"val1":[0.76,0.75,0.71],"temp":[0,2,5],"TS":[1538128801336,1538128810408,1538128818420]}]},{"test_id":"HBI2_XX","metrics":[{"val1":[0.65,0.71],"temp":[1,-7],"TS":[1538128828433,1538128834541]}]}]}}
The goal of the Hive transformations is to get to the layout below:
receipt_time | site | test_id | val1 | temp | TS
------------------------------------------------------------------------------------
2018-09-28T10:00:00.000Z | Los Angeles | C23_PV | 0.76 | 0 | 1538128801336
2018-09-28T10:00:00.000Z | Los Angeles | C23_PV | 0.75 | 2 | 1538128810408
2018-09-28T10:00:00.000Z | Los Angeles | C23_PV | 0.71 | 5 | 1538128818420
2018-09-28T10:00:00.000Z | Los Angeles | HBI2_XX | 0.65 | 1 | 1538128828433
2018-09-28T10:00:00.000Z | Los Angeles | HBI2_XX | 0.71 | -7 | 1538128834541
Note that the 1 JSON record has been exploded into 5 rows (the sum of the sizes of the innermost arrays across the 'measures' entries: 3 + 2) and the innermost JSON keys (val1, temp, TS) have been transposed to top-level columns. So how do we go about this? First we need a Hive table overlay that understands the JSON structure:
CREATE EXTERNAL TABLE IF NOT EXISTS ds.json_serde(
data struct<
receipt_time: STRING,
site: STRING,
measures: ARRAY<
struct< test_id: STRING,
metrics: ARRAY<
struct< val1: array<DOUBLE>,
temp: array<SMALLINT>,
TS: array<BIGINT>
> >
>
>
>
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '/user/hive/external/json_serde'
TBLPROPERTIES ("transactional"="false");
It is a Hive EXTERNAL table because that makes it much easier to hand it (insert) a file containing the JSON strings. To do that we create a local file that contains just the JSON one-liner (without line endings, copy it from above) and upload that file ('/home/cloudbreak/hive_json/source.json' in my case) to the folder of the external table we just created:
hdfs dfs -mkdir -p /user/hive/external/json_serde
hdfs dfs -put /home/cloudbreak/hive_json/source.json /user/hive/external/json_serde
Test the Hive table:
hive> select data.site, data.measures[0].metrics[0].temp from ds.json_serde;
It should return:
INFO : OK
+-------------+----------+
| site | temp |
+-------------+----------+
| LosAngeles | [0,2,5] |
+-------------+----------+
1 row selected (0.826 seconds)
0: jdbc:hive2://spark1-e0.zmq0bv3frkfuhfsbkcz>
Now we begin transforming the data:
SELECT b.*, a.data.receipt_time, a.data.site from ds.json_serde a LATERAL VIEW OUTER inline(a.data.measures) b;
The inline function will do 2 things here:
1. Explode the JSON into as many rows as there are array members in a.data.measures, 2 rows in this case
2. Create a new column for each JSON key that exists on the top level of the array members, in this case 'test_id' and 'metrics' of the 'measures' array objects
*You can also exchange 'inline(a.data.measures)' for 'explode(a.data.measures)' in the statement above to see the difference.
The output should look like this:
INFO : OK
+------------+----------------------------------------------------+---------------------------+-------------+
| b.test_id | b.metrics | receipt_time | site |
+------------+----------------------------------------------------+---------------------------+-------------+
| C23_PV | [{"val1":[0.76,0.75,0.71],"temp":[0,2,5],"ts":[1538128801336,1538128810408,1538128818420]}] | 2018-09-28T10:00:00.000Z | LosAngeles |
| HBI2_XX | [{"val1":[0.65,0.71],"temp":[1,-7],"ts":[1538128828433,1538128834541]}] | 2018-09-28T10:00:00.000Z | LosAngeles |
+------------+----------------------------------------------------+---------------------------+-------------+
2 rows selected (0.594 seconds)
0: jdbc:hive2://spark1-e0.zmq0bv3frkfuhfsbkcz>
Note that the 'receipt_time' and 'site' fields have been propagated (or denormalized) onto every row. That is something we wanted. Because of the nested arrays we need to take this 1 step further:
SELECT c.receipt_time, c.site, c.test_id, d.* FROM (SELECT b.*, a.data.receipt_time, a.data.site from ds.json_serde a LATERAL VIEW OUTER inline(a.data.measures) b) c LATERAL VIEW OUTER inline(c.metrics) d;
This statement might look daunting, but if you look carefully I am just doing the very same thing again, now on a subquery that is exactly the previous query.
*You could also materialize the result of the first LATERAL query in a Hive table with a CTAS statement.
The result should now look like this:
INFO : OK
+---------------------------+-------------+------------+-------------------+----------+----------------------------------------------+
| c.receipt_time | c.site | c.test_id | d.val1 | d.temp | d.ts |
+---------------------------+-------------+------------+-------------------+----------+----------------------------------------------+
| 2018-09-28T10:00:00.000Z | LosAngeles | C23_PV | [0.76,0.75,0.71] | [0,2,5] | [1538128801336,1538128810408,1538128818420] |
| 2018-09-28T10:00:00.000Z | LosAngeles | HBI2_XX | [0.65,0.71] | [1,-7] | [1538128828433,1538128834541] |
+---------------------------+-------------+------------+-------------------+----------+----------------------------------------------+
2 rows selected (0.785 seconds)
0: jdbc:hive2://spark1-e0.zmq0bv3frkfuhfsbkcz>
It is beginning to look a lot better, but there is 1 last problem to solve: the arrays inside 'metrics' (val1, temp, TS) are always of equal size, and we want the first member of the 'val1' array to be connected/merged with the first member of the 'temp' array, and so on. There is a creative way to do this. For readability I will now materialize the second query statement into an intermediary table named 'ds.json_serde_II':
CREATE TABLE ds.json_serde_II AS SELECT c.receipt_time, c.site, c.test_id, d.* FROM (SELECT b.*, a.data.receipt_time, a.data.site from ds.json_serde a LATERAL VIEW OUTER inline(a.data.measures) b) c LATERAL VIEW OUTER inline(c.metrics) d;
*Make sure you get the same result by running 'select * from ds.json_serde_II;'
From here it takes only 1 step to get to the desired end result:
SELECT a.receipt_time, a.site, a.test_id, a.temp[b.pos] as temp, a.TS[b.pos] as TS, b.* from ds.json_serde_II a LATERAL VIEW OUTER posexplode(a.val1) b;
It will result in:
INFO : OK
+---------------------------+-------------+------------+-------+----------------+--------+--------+
| a.receipt_time | a.site | a.test_id | temp | ts | b.pos | b.val |
+---------------------------+-------------+------------+-------+----------------+--------+--------+
| 2018-09-28T10:00:00.000Z | LosAngeles | C23_PV | 0 | 1538128801336 | 0 | 0.76 |
| 2018-09-28T10:00:00.000Z | LosAngeles | C23_PV | 2 | 1538128810408 | 1 | 0.75 |
| 2018-09-28T10:00:00.000Z | LosAngeles | C23_PV | 5 | 1538128818420 | 2 | 0.71 |
| 2018-09-28T10:00:00.000Z | LosAngeles | HBI2_XX | 1 | 1538128828433 | 0 | 0.65 |
| 2018-09-28T10:00:00.000Z | LosAngeles | HBI2_XX | -7 | 1538128834541 | 1 | 0.71 |
+---------------------------+-------------+------------+-------+----------------+--------+--------+
5 rows selected (1.716 seconds)
0: jdbc:hive2://spark1-e0.zmq0bv3frkfuhfsbkcz>
This needs some explaining: the function posexplode does the same thing as explode (creating as many rows as there are members in the array argument), but it also yields a positional number, which is the zero-based index of the array member. We can use this positional index to link to the corresponding members of the other arrays 'temp' and 'TS' (all first members together on 1 row, all second members on the next row, etc.). The expression a.temp[b.pos] just walks the JSON/Hive path to the corresponding value in the other array. The value of b.pos is known and resolved correctly because Hive first takes care of the exploding and then joins the results back to the main query where b.pos is needed. Happy data processing!
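P.S. For reference, the three steps above can also be chained into one statement without the intermediate table. This is an untested sketch that only reuses the table and column names from this article:
-- Untested sketch: steps 1 and 2 as subqueries, step 3 (posexplode) on the outside
SELECT e.receipt_time,
       e.site,
       e.test_id,
       f.val         AS val1,
       e.temp[f.pos] AS temp,
       e.TS[f.pos]   AS TS
FROM (
  SELECT c.receipt_time, c.site, c.test_id, d.*
  FROM (
    SELECT b.*, a.data.receipt_time, a.data.site
    FROM ds.json_serde a
    LATERAL VIEW OUTER inline(a.data.measures) b
  ) c
  LATERAL VIEW OUTER inline(c.metrics) d
) e
LATERAL VIEW OUTER posexplode(e.val1) f;
The two inner subqueries are literally the queries shown earlier; the outer posexplode performs the final merge by array position.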
... View more
Labels:
- Apache Hive
10-21-2018
07:10 PM
Did you install the Java JCE Unlimited strength policy jars ?
... View more
10-15-2018
03:48 PM
I am implementing the following access pattern from Tableau (server or desktop) to Hiveserver2:
Tableau Desktop MacOS + Kerberos --> Knox (topology with Kerberos/hadoop-auth) --> Hiveserver2 Kerberos
I got this to work, but the only thing that is lacking is 'user impersonation'. What I mean by that is behaviour similar to what you get when using beeline with the ;hive.server2.proxy.user=<end-user> switch, so that Ranger policies for <end-user> are applied and not those for the service user that is Kerberos-authenticated to Hiveserver2 (that would be 'knox' in this case). So that is the caveat at the moment: all Hive interactions show up as user 'knox' in the Ranger audits and not as the <end-user>. On the Windows version of the Hortonworks Hive ODBC driver there is an input box for 'Delegation UID' which seems to be just the option I am after, but on the OSX version of the driver it is different. You can manage many options in the odbc.ini file:
cat /Library/hortonworks/hive/Setup/odbc.ini
[ODBC]
# Specify any global ODBC configuration here such as ODBC tracing.
[ODBC Data Sources]
Hortonworks Hive=Hortonworks Hive ODBC Driver
[Hortonworks Hive]
# Description: DSN Description.
# This key is not necessary and is only to give a description of the data source.
Description=Hortonworks Hive ODBC Driver DSN
# Driver: The location where the ODBC driver is installed to.
Driver=/Library/hortonworks/hive/lib/universal/libhortonworkshiveodbc.dylib
# When using No Service Discovery, specify the IP address or host name of the Hive server.
# When using ZooKeeper as the Service Discovery Mode, specify a comma-separated list of ZooKeeper
# servers in the following format:
# <zk_host1:zk_port1>,<zk_host2:zk_port2>,...
HOST=
# The TCP port Hive server is listening. This is not required when using ZooKeeper as the service
# discovery mode as the port is specified in the HOST connection attribute.
PORT=
# The name of the database schema to use when a schema is not explicitly specified in a query.
Schema=default
# Set to 0 to when connecting directory to Hive Server 2 (No Service Discovery).
# Set to 1 to do Hive Server 2 service discovery using ZooKeeper.
# Note service discovery is not support when using Hive Server 1.
ServiceDiscoveryMode=0
# The namespace on ZooKeeper under which Hive Server 2 znodes are added. Required only when doing
# HS2 service discovery with ZooKeeper (ServiceDiscoveryMode=1).
ZKNamespace=
# Set to 1 if you are connecting to Hive Server 1. Set to 2 if you are connecting to Hive Server 2.
HiveServerType=2
# The authentication mechanism to use for the connection.
# Set to 0 for No Authentication
# Set to 1 for Kerberos
# Set to 2 for User Name
# Set to 3 for User Name and Password
# Note only No Authentication is supported when connecting to Hive Server 1.
AuthMech=1
# The Thrift transport to use for the connection.
# Set to 0 for Binary
# Set to 1 for SASL
# Set to 2 for HTTP
# Note for Hive Server 1 only Binary can be used.
ThriftTransport=2
# When this option is enabled (1), the driver does not transform the queries emitted by an
# application, so the native query is used.
# When this option is disabled (0), the driver transforms the queries emitted by an application and
# converts them into an equivalent from in HiveQL.
UseNativeQuery=0
# Set the UID with the user name to use to access Hive when using AuthMech 2 to 8.
UID=
# The following is settings used when using Kerberos authentication (AuthMech 1 and 10)
# The fully qualified host name part of the of the Hive Server 2 Kerberos service principal.
# For example if the service principal name of you Hive Server 2 is:
# hive/myhs2.mydomain.com@EXAMPLE.COM
# Then set KrbHostFQDN to myhs2.mydomain.com
KrbHostFQDN=_HOST
# The service name part of the of the Hive Server 2 Kerberos service principal.
# For example if the service principal name of you Hive Server 2 is:
# hive/myhs2.mydomain.com@EXAMPLE.COM
# Then set KrbServiceName to hive
KrbServiceName=hive
# The realm part of the of the Hive Server 2 Kerberos service principal.
# For example if the service principal name of you Hive Server 2 is:
# hive/myhs2.mydomain.com@EXAMPLE.COM
# Then set KrbRealm to EXAMPLE.COM
KrbRealm=EDL.DEV.BASF.COM
# Set to 1 to enable SSL. Set to 0 to disable.
SSL=1
# Set to 1 to enable two-way SSL. Set to 0 to disable. You must enable SSL in order to
# use two-way SSL.
TwoWaySSL=0
# The file containing the client certificate in PEM format. This is required when using two-way SSL.
ClientCert=/Users/jknulst/Documents/Customers/BASF/Knox/gateway-identity.pem
# The client private key. This is used for two-way SSL authentication.
ClientPrivateKey=
# The password for the client private key. Password is only required for password protected
# client private key.
ClientPrivateKeyPassword=
So there is an option:
# Set the UID with the user name to use to access Hive when using AuthMech 2 to 8.
UID=
but it states specifically that it can't be used in tandem with Kerberos auth. So the question is: when connecting Tableau to Hiveserver2, can you pass the desired end-user to Hiveserver2 using the HWX ODBC Hive driver, to get impersonation similar to using 'hive.server2.proxy.user' on a JDBC connect string?
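For reference, the beeline behaviour I am referring to looks roughly like this (host, realm and proxy user are just examples):
# host, realm and proxy user below are examples
beeline -u "jdbc:hive2://hs2-host.example.com:10000/default;principal=hive/hs2-host.example.com@EXAMPLE.COM;hive.server2.proxy.user=alice"
With that switch the session authenticates with the caller's Kerberos credentials, but Ranger policies and audits are evaluated for 'alice'.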
... View more
Labels:
- Apache Hive
- Apache Knox
10-15-2018
12:06 PM
@dvillarreal Thanks for this, very useful! Changing the principal on the beeline connect string to "principal=HTTP/_HOST@SUPPORT.COM" is something I forgot when implementing this Hiveserver2 access pattern.
... View more
10-12-2018
02:44 PM
This is complex. I believe your problem is that you need to forward the traffic to/from the KDC to your Mac. You can do this with SSH tunnelling. That alone is not enough though, since SSH port forwarding is only fit for TCP traffic and KDC traffic is UDP by default.
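One possible way around that (a rough, untested sketch; host names, realm and local port are examples) is to force the Kerberos client libraries onto TCP and then tunnel that TCP port over SSH:
# Untested sketch: host names, realm and local port are examples
ssh -L 8888:kdc.example.com:88 user@jumphost.example.com
# and in /etc/krb5.conf on the Mac:
#   [libdefaults]
#     udp_preference_limit = 1                  # makes libkrb5 prefer TCP over UDP
#   [realms]
#     EXAMPLE.COM = { kdc = localhost:8888 }    # point the realm at the local end of the tunnel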
... View more
10-11-2018
08:13 AM
I have the same problem on Nifi 1.5 and would be very interested in the solution to get Nifi's En(De)crypt processor to work with PGP. In the meantime I turned to another solution: using the ExecuteStreamCommand processor and outsourcing the decryption to the CLI, which is verified to work. Just be aware that you have to import the public and private keys into the /home/nifi/.gnupg folder of the nifi user, since that is the one executing the stream command. So you might have to run these commands (on every Nifi node!) first:
gpg --import < pub_keys_armor.pgp
gpg --import < priv_key_armor.pgp
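For illustration, the command configured on ExecuteStreamCommand could then boil down to something like this (a sketch only; the passphrase file location is an example and depends on your key setup, and GnuPG 2.1+ may additionally need --pinentry-mode loopback):
# sketch: passphrase file location is an example
gpg --batch --yes --passphrase-file /home/nifi/.gnupg/passphrase.txt --decrypt
ExecuteStreamCommand pipes the flowfile content to stdin and reads the decrypted result back from stdout.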
... View more
09-03-2018
07:46 PM
I have an HDF-3.1.2.0 deployment managed by Ambari 2.6.2.2. Now the requirement is to have Atlas extract Nifi metadata. That Atlas preferably has to be managed by the same Ambari currently running HDF. The problem is that all docs and HCC posts seem to handle the opposite (installing HDF on an Ambari already running HDP) and not this particular use case (installing HDP on an Ambari already running HDF). So far Ambari does seem to support it, I can register HDP-2.6.5.0 as a new version: But when I click the fly-out at "Install On" and choose the current cluster, Ambari takes me back to the versions screen after a short while, where the HDP stack is then NOT appearing: There is no error, no warning, nowhere, not even in the ambari-server logs. Is this possible with Ambari 2.6.2.2? Am I missing a step? What is so different in this scenario (first HDF, then HDP) from installing HDF on top of HDP? I really hope that I don't need to start over completely with this Ambari, install HDP first (for Atlas) and then put HDF back on it (this would erase all my HDF settings). Any advice greatly appreciated.
... View more
06-19-2018
07:42 AM
Hi, I wonder if there is a way to have the Kafka verifiable consumer consume from only 1 partition. The standard console consumer now has a parameter, --partition 1, to support that, but the verifiable consumer doesn't. Can you achieve the same by playing with the consumer config options?
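For reference, this is the console consumer behaviour I mean (broker, port and topic are just examples):
# broker, port and topic are examples
bin/kafka-console-consumer.sh --bootstrap-server broker1:6667 --topic my_topic --partition 1 --offset earliest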
... View more
Labels:
- Apache Kafka
06-19-2018
07:36 AM
Hi, I run a verifiable consumer on HDF-3.1.1 but it never exits.
echo -e "max.poll.records=1\nenable.auto.commit=true\nauto.commit.interval.ms=1" > /tmp/consumer.config && /usr/hdf/3.1.1.0-35/kafka/bin/kafka-verifiable-consumer.sh --broker-list rjk-hdf-m:6667,rjk-hdf-s-01:6667,rjk-hdf-s-02:6667 --topic truck_speed_events_only_avro_keyed_non_transactional --reset-policy earliest --consumer.config /tmp/consumer.config --group-id test_group --verbose --max-messages 10
The output is according to expectations (at first):
{"timestamp":1529393105586,"name":"startup_complete"}
{"timestamp":1529393105824,"name":"partitions_revoked","partitions":[]}
{"timestamp":1529393108933,"name":"partitions_assigned","partitions":[{"topic":"truck_speed_events_only_avro_keyed_non_transactional","partition":2},{"topic":"truck_speed_events_only_avro_keyed_non_transactional","partition":1},{"topic":"truck_speed_events_only_avro_keyed_non_transactional","partition":0}]}
{"timestamp":1529393109005,"name":"record_data","key":"95","value":"\u0001\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0004\u0000\u0000\u0000\u0001.2018-05-28 19:43:05.689�����X\"truck_speed_event�\u00010\u001ANadeem Asghar\u0006:Saint Louis to Chicago Route2�\u0001","topic":"truck_speed_events_only_avro_keyed_non_transactional","partition":1,"offset":0}
{"timestamp":1529393109008,"name":"records_consumed","count":1,"partitions":[{"topic":"truck_speed_events_only_avro_keyed_non_transactional","partition":1,"count":1,"minOffset":0,"maxOffset":0}]}
{"timestamp":1529393109024,"name":"offsets_committed","offsets":[{"topic":"truck_speed_events_only_avro_keyed_non_transactional","partition":1,"offset":1}],"success":true}
{"timestamp":1529393109032,"name":"record_data","key":"65","value":"\u0001\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0004\u0000\u0000\u0000\u0001.2018-05-28 19:43:05.734�����X\"truck_speed_event�\u00014\u0016Don Hilborn\u00006Saint Louis to Tulsa Route2�\u0001","topic":"truck_speed_events_only_avro_keyed_non_transactional","partition":1,"offset":1}
{"timestamp":1529393109032,"name":"records_consumed","count":1,"partitions":[{"topic":"truck_speed_events_only_avro_keyed_non_transactional","partition":1,"count":1,"minOffset":1,"maxOffset":1}]}
{"timestamp":1529393109039,"name":"offsets_committed","offsets":[{"topic":"truck_speed_events_only_avro_keyed_non_transactional","partition":1,"offset":2}],"success":true}
ETC. ETC. ETC.
but after the parameterized 10 messages have been consumed, the util just keeps running forever with the following screen output:
{"timestamp":1529393109144,"name":"records_consumed","count":1,"partitions":[]}
{"timestamp":1529393109144,"name":"offsets_committed","offsets":[],"success":true}
{"timestamp":1529393109144,"name":"records_consumed","count":1,"partitions":[]}
{"timestamp":1529393109144,"name":"offsets_committed","offsets":[],"success":true}
{"timestamp":1529393109144,"name":"records_consumed","count":1,"partitions":[]}
{"timestamp":1529393109144,"name":"offsets_committed","offsets":[],"success":true}
{"timestamp":1529393109145,"name":"records_consumed","count":1,"partitions":[]}
Why doesn't the verifiable consumer just stop after --max-messages is reached? In the source code of the verifiable consumer (source) there is a method that should break the while loop:
private boolean isFinished() { return hasMessageLimit() && consumedMessages >= maxMessages; }
Thanks
... View more
Labels:
- Apache Kafka
05-01-2018
09:56 PM
Great article!
... View more
04-19-2018
11:39 AM
Hi, this means you have to set up your browser to forward your Kerberos ticket to the Schema Registry server to authenticate. As listed here, I always use Firefox for this. Schema Registry currently does not support the basic user:password authentication you were trying to use.
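For Firefox that boils down to a couple of about:config preferences along these lines (the domain is just an example):
# about:config preferences; the domain is an example
network.negotiate-auth.trusted-uris = .example.com
network.negotiate-auth.delegation-uris = .example.com
The second entry is only needed if the ticket also has to be delegated onward.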
... View more
03-27-2018
02:58 PM
Hi, I am looking for a way to get only the very last row of data from a Druid cube into a Superset dashboard. The data set is a time series and in the dashboard I want to present the very last state, so the row with the highest timestamp ('ts2'). On the Druid side I made sure the data is not aggregated (by including the ts2 field as a dimension field). My Druid datasource has the following fields: So I thought that if I set up the query like this (granularity ALL, all fields in the NOT GROUPED BY box): it should return the atomic, non-aggregated rows, and if I sort by the timestamp 'ts2' I should be able to return only the last row of data: But it doesn't. Looking at the Druid query definition Superset creates out of this, it makes sense: {
"queryType": "timeseries",
"intervals": "2018-03-20T14:43:23+00:00/2018-03-27T14:43:23+00:00",
"granularity": "all",
"postAggregations": [],
"aggregations": [
{
"type": "count",
"name": "count"
}
],
"dataSource": "webmetricsIIts2"
} None of the dimension columns I requested in the NOT GROUPED BY input box are in the query definition. Anyone got an idea how to do this?
... View more
03-21-2018
08:49 AM
@abu ameen Please mark the question as answered if sufficiently answered
... View more
03-14-2018
06:39 PM
Hi, it depends on which JSON files you mean:
- The unit-of-work JSONs (incoming and enriched sensor events) are always 'stored' on Kafka topics (they are not meant to be there forever, it is just intermediate storage in between stream processing steps)
- The configuration JSONs of Metron are stored in Zookeeper
... View more
03-13-2018
07:32 PM
@Matt Burgess I don't get the syntax, but it works like a charm! Thank you
... View more
03-13-2018
06:13 PM
1 Kudo
@Rahul Soni Thanks for your answer but I really don't want to hardcode all of my 50+ JSON fields
... View more
03-13-2018
01:24 PM
@Shu Maybe, one more thing: I would prefer not to do any hardcoding; since the JSON has many keys I am specifically looking for a dynamic way. I came across JOLT functions like =toLower / =join and was kind of hoping for =replace, but that one seems to be missing.
... View more
03-13-2018
12:17 PM
Hi, I have a requirement for transforming JSON on Nifi which seems simple, but which I haven't been able to solve. Input JSON: {
"agent-submit-time" : -1,
"agent-end-time" : 123445,
"agent-name" : "Marie Bayer-Smith"
} Desired: {
"agent_submit_time" : -1,
"agent_end_time" : 123445,
"agent_name" : "Marie Bayer-Smith"
}
- I don't want to use the ReplaceText processor, since replacing "-" with "_" might impact values too
- I need this to be able to infer an AVRO schema on the incoming JSON records (AVRO does not like the dashes at all)
- Since I already use a Jolt processor for another transformation of the JSON, it makes sense to include this in the same processor to prevent unnecessary Nifi overhead.
I think I need the JoltTransformJSON processor for this, as it is very powerful (but the syntax evades me), but I am open to other options too.
... View more
Labels:
- Apache NiFi
01-17-2018
05:29 PM
Hi Pravin, in my case this setting on Ambari > Yarn > Config > Scheduler was vital:
yarn.scheduler.capacity.root.acl_submit_applications= (<--- there is a space there as the value!)
The explanation can be read here: https://community.hortonworks.com/content/supportkb/49101/capacity-scheduler-users-can-submit-to-any-queue-1.html
Basically, since this baseline authorization is not disabled by default, anyone can just submit any job to Yarn. Also, since all child queues inherit this ACL from the root parent, submitting apps is very tolerant if you don't restrict it. After I had done this, I finally started to see my Yarn Ranger policies kicking in, and also many DENIED results on SUBMIT_APP in the Ranger audit with Access Enforcer "yarn-acl".
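As a sketch (the 'default' queue and the user list are just examples), the relevant Capacity Scheduler properties then look something like this:
# sketch only: queue name and user list are examples
yarn.scheduler.capacity.root.acl_submit_applications=        <-- a single space as value, so nothing is inherited from root
yarn.scheduler.capacity.root.default.acl_submit_applications=alice,bob
With root locked down like this, only the users (or groups) explicitly listed on a child queue can submit.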
... View more
01-04-2018
09:25 AM
Updated. Maybe you need a github account
... View more
01-04-2018
08:53 AM
Please have a look at the documentation. For instance here : https://github.com/apache/metron/tree/master/metron-platform/metron-parsers and here: https://github.com/apache/metron/tree/master/metron-platform/metron-enrichment https://github.com/apache/metron/tree/master/metron-platform/metron-indexing Indexing to ES is supported in all Metron versions. If you want the latest and greatest go for HCP-1.4 / Metron-0.4.1 ( https://docs.hortonworks.com/HDPDocuments/HCP1/HCP-1.4.0/index.html )
... View more
01-04-2018
08:45 AM
It could be that the setting at Ambari > Metron > Indexing > "Indexing Max Pending" would limit indexing output. Is that one set to 10.000.000 by any chance? This setting limits the total number of in-flight tuples (read/consumed from Kafka input topic but not acked yet) for the topology. In this case, the tuples are not acked for 1 or both outputs (HDFS / ES) and thus not marked as 'no longer in-flight'.
... View more
01-04-2018
08:34 AM
@Kartik Batheja Please mark the question as answered if sufficiently answered
... View more
01-04-2018
08:33 AM
@Kartik Batheja Kartik, there is some news around this. In the newest Hortonworks HCP version (HCP-1.4.0 / Metron 0.4.1) there is support for Elastic Search / Kibana 5.6.2. More details here : https://docs.hortonworks.com/HDPDocuments/HCP1/HCP-1.4.0/bk_release-notes/content/ch01s06.html
... View more