Member since
05-22-2019
70
Posts
24
Kudos Received
8
Solutions
My Accepted Solutions
| Title | Views | Posted |
| --- | --- | --- |
|  | 1157 | 12-12-2018 09:05 PM |
|  | 1178 | 10-30-2018 06:48 PM |
|  | 1641 | 08-23-2018 11:17 PM |
|  | 7360 | 10-07-2016 07:54 PM |
|  | 1970 | 08-18-2016 05:55 PM |
01-15-2019
12:45 AM
If the file is one big text string without newline characters, you could treat it as one line and parse it with Python (see https://community.hortonworks.com/content/kbentry/155544/how-to-troubleshoot-hive-udtf-functions.html), or define it as a single string column and then parse it with the JSON functions and normalize the array. You would have to make sure the data wasn't too big in any one file, and also consider input splits to ensure each whole file got read as one record. In any case, it is not very scalable. If the file does contain newline characters, you are pretty much stuck: the JSON serde is based on the text serde, so each newline is treated as a new record. In that case you are going to have to preprocess, maybe with Python, or with Spark or PySpark if you need more scale.
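A minimal sketch of the single-string route, assuming each file holds exactly one JSON document with no newlines (the table, path, and field names here are hypothetical):

-- Whole file lands in one row because the text serde only splits on newlines
CREATE EXTERNAL TABLE raw_json (json_doc string)
STORED AS TEXTFILE
LOCATION '/data/rawjson/';

-- Pull fields out with the built-in JSON path functions;
-- array elements can be addressed by index, e.g. '$.items[0].id'
SELECT get_json_object(json_doc, '$.id')          AS id,
       get_json_object(json_doc, '$.items[0].id') AS first_item_id
FROM raw_json;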
01-14-2019
05:55 PM
Hmm, this is pretty simple JSON (the format with one complete document per line of the text file is correct), and it pretty much just worked for me (Sandbox 2.6.0). Also, this serde is usually available by default and doesn't require you to add any extra libs; you can see from "at org.apache.hive.hcatalog.data.JsonSerDe.deserialize(JsonSerDe.java:172)" that Hive IS finding the serde jar. I would try creating the data file again, uploading it, and creating an external table, then test.

# Place the two rows below in a text file and upload into HDFS under /data/testjson
{ "id": 1, "nm": "Edward the Elder", "cty": "United Kingdom", "hse": "House of Wessex", "yrs": "899-925" }
{ "id": 2, "nm": "Athelstan", "cty": "United Kingdom", "hse": "House of Wessex", "yrs": "925-940" }

CREATE EXTERNAL TABLE IF NOT EXISTS TestJson (id int, nm varchar(30), cty varchar(30), hse varchar(30), yrs varchar(20))
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION '/data/testjson/';

CREATE TABLE IF NOT EXISTS TestJsonInt (id int, nm varchar(30), cty varchar(30), hse varchar(30), yrs varchar(20))
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE;

insert overwrite table TestJsonInt select * from TestJson;
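With the sample file in place, a quick check that both the external and internal tables read correctly (assuming the two rows above loaded cleanly):

select * from TestJson;
select count(*) from TestJsonInt; -- should return 2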
01-11-2019
07:27 PM
Need a bit more information: the schema, a sample of the JSON file, and the stack trace. The standard Hive serde should be able to read most common JSON, but I need more detail to tell.
12-19-2018
07:11 PM
You can do this in numerous ways.

1. With Hive, you could:
A. Use the built-in JSON functions with some conditional logic (if, isnull, etc.) and create a superset (ugly!)
B. Define the table as a string and build a Java UDF
C. Define the table as a string and use a Python transform: https://community.hortonworks.com/articles/72414/how-to-create-a-custom-udf-for-hive-using-python.html

As an aside, the built-in JSON serde is pretty basic. I have used the alternative at https://github.com/rcongiu/Hive-JSON-Serde, which can do name mapping and some other cool stuff, but I don't think even with it you could write a table definition that would nicely handle this case.

2. With Spark or PySpark, you can pretty much do whatever you want.

If you are comfortable with it, I would recommend pre-processing the data with Spark into a common format. Or, if you are more comfortable with Hive, use the TRANSFORM statement with a Python script. A sketch of option 1A follows below.
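To illustrate option 1A, here is a hypothetical sketch (the raw_events table and the field names are invented) that folds two variant layouts into one superset using the built-in functions:

-- raw_events(line string): one JSON document per row
-- Some documents carry the id under $.user.id, others under $.uid
SELECT coalesce(get_json_object(line, '$.user.id'),
                get_json_object(line, '$.uid'))    AS user_id,
       if(get_json_object(line, '$.type') is null,
          'unknown',
          get_json_object(line, '$.type'))         AS event_type
FROM raw_events;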
12-12-2018
09:05 PM
You might look at the table definition (show create table x) and verify the prefix on the table location. Hive stores the whole filespec, including the protocol, as in the example below.
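For example (table name and paths hypothetical), the LOCATION line in the output carries the full filespec, and the table can be repointed if the protocol or host part is stale:

SHOW CREATE TABLE x;
-- ... look for a line like:
-- LOCATION 'hdfs://old-namenode:8020/apps/hive/warehouse/x'

-- Repoint the table if the prefix is wrong:
ALTER TABLE x SET LOCATION 'hdfs://new-namenode:8020/apps/hive/warehouse/x';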
11-19-2018
05:04 PM
You can get the URL from the Hive Summary page in Ambari. For a Kerberized cluster, you will need to call kinit to get a ticket before launching the Python program. The format of the connection string for Kerberos is something like the following; the actual user principal (and authentication) will be taken from the Kerberos ticket. Test the connection string with Beeline to make sure it works. beeline -u "jdbc:hive2://myhs2.foo:10000/default;principal=hive/myhs2.foo@MYKERBREALM;auth=kerberos"
10-30-2018
06:48 PM
Well, hive-env.sh is just a shell script, so you could do some bash magic to see whether the user is in a group, something like the following (I am sure someone could do it more simply!). Note that this assumes your groups are synced with Linux using sssd, Centrify, etc. NOTE: While not officially deprecated, using the hive CLI command is discouraged in favor of Beeline; the Hive CLI doesn't take advantage of Ranger security, and in HDP 3.x it is being rewritten as a wrapper around HiveServer2.

# List the current user's groups, e.g. "jdoe : hadoop admin"
G=$(groups "$USER")
# Split the output on commas/spaces into an array
IFS=', ' read -r -a mygroups <<< "$G"
found=0
searchGroup="admin"
# Look for an exact, whole-line match of the group name
if printf '%s\n' "${mygroups[@]}" | grep -xq "$searchGroup"; then
  found=1
  # Logic to allow Hive here.
fi
echo "$found"
08-23-2018
11:17 PM
After some research and help, I found that I had incorrectly set the nifi.remote.input.host property. It should be set as follows in the Advanced nifi-properties: nifi.remote.input.host={{nifi_node_host}} Each node should have a different value for nifi.remote.input.host; it is the hostname the node advertises for site-to-site (S2S) communications. If you set it the same on all nodes, they all advertise the same hostname, and all the data goes to the same host. You still have to set the other multi-threading parameters, such as "Maximum Timer Driven Thread Count" in the controller settings and "Concurrent Tasks" in the appropriate processors, but this gets multiple nodes to listen for the RPG requests.
08-23-2018
11:13 PM
Did some additional research, and the key to this is setting the following in the Advanced nifi-properties: nifi.remote.input.host={{nifi_node_host}} The documentation is a little weak, but this property determines the hostname a node advertises for RPG requests. Setting it to this variable substitutes each host's own name in its own nifi.properties file, and you are good to go. You still have to set the levels of concurrency for the overall canvas and the processors, of course, but this gets all the nodes to listen for remote requests.
08-21-2018
01:19 AM
Trying to get a NiFi site-to-site RPG (looping back to a single cluster) to balance across nodes. The real flow sends out table names for processing across the nodes, but I put together a simple test flow. It is still only going to one node after setting the usual suspects among the properties. Attachment 1 shows the test flow; the RPG URL is set to node xxx22, one of the worker nodes. Attachment 2 shows the settings for the test input port (Concurrent Tasks 7, Batch Settings count 1). Attachment 3 shows the Concurrent Tasks of the receiving processor (not sure this matters, but I was desperate!). Attachment 4 shows the provenance of the receiving processor, with all files going to node xxx28. Anyone see an obvious (or not so obvious!) configuration error here? Thanks!
Labels:
- Apache NiFi