Member since
05-22-2019
70
Posts
24
Kudos Received
8
Solutions
My Accepted Solutions
| Title | Views | Posted |
| --- | --- | --- |
|  | 1157 | 12-12-2018 09:05 PM |
|  | 1178 | 10-30-2018 06:48 PM |
|  | 1641 | 08-23-2018 11:17 PM |
|  | 7360 | 10-07-2016 07:54 PM |
|  | 1970 | 08-18-2016 05:55 PM |
01-15-2019
12:45 AM
If the file is one big text string without newline characters, you could treat it as one line and parse it with Python (see https://community.hortonworks.com/content/kbentry/155544/how-to-troubleshoot-hive-udtf-functions.html), or define it as a single string column and then parse it with the JSON functions and normalize the array. You would have to make sure the data wasn't too big in any one file, and also consider input splits to ensure each whole file got read as one record. In any case, it is not very scalable. If the file does contain newline characters, you are pretty much stuck: the JSON serde is based on the text serde, so each newline is treated as a new record. In that case you are going to have to preprocess, maybe with Python, or with Spark or PySpark if you need more scale.
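A minimal sketch of the single-string route, assuming each file holds exactly one JSON document with no newlines (the table, path, and field names here are hypothetical):

-- Whole file lands in one row because the text serde only splits on newlines
CREATE EXTERNAL TABLE raw_json (json_doc string)
STORED AS TEXTFILE
LOCATION '/data/rawjson/';

-- Pull fields out with the built-in JSON path functions;
-- array elements can be addressed by index, e.g. '$.items[0].id'
SELECT get_json_object(json_doc, '$.id')          AS id,
       get_json_object(json_doc, '$.items[0].id') AS first_item_id
FROM raw_json;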
01-14-2019
05:55 PM
Hmm, this is pretty simple JSON (the format with one complete document per line of the text file is correct), and it pretty much just worked for me (Sandbox 2.6.0). Also, this serde is usually available by default and doesn't require you to add any extra libs; you can see from "at org.apache.hive.hcatalog.data.JsonSerDe.deserialize(JsonSerDe.java:172)" that Hive IS finding the serde jar. I would try creating the data file again, uploading it, and creating an external table, then test.

# Place the two rows below in a text file and upload into HDFS under /data/testjson
{ "id": 1, "nm": "Edward the Elder", "cty": "United Kingdom", "hse": "House of Wessex", "yrs": "899-925" }
{ "id": 2, "nm": "Athelstan", "cty": "United Kingdom", "hse": "House of Wessex", "yrs": "925-940" }

CREATE EXTERNAL TABLE IF NOT EXISTS TestJson (id int, nm varchar(30), cty varchar(30), hse varchar(30), yrs varchar(20))
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION '/data/testjson/';

CREATE TABLE IF NOT EXISTS TestJsonInt (id int, nm varchar(30), cty varchar(30), hse varchar(30), yrs varchar(20))
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE;

insert overwrite table TestJsonInt select * from TestJson;
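With the sample file in place, a quick check that both the external and internal tables read correctly (assuming the two rows above loaded cleanly):

select * from TestJson;
select count(*) from TestJsonInt; -- should return 2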
01-11-2019
07:27 PM
Need a bit more information: the schema, a sample of the JSON file, and the stack trace. The standard Hive serde should be able to read most common JSON, but I need more detail to tell.
12-19-2018
07:11 PM
You can do this in numerous ways.

1. With Hive, you could:
A. Use the built-in JSON functions with some conditional logic (if, isnull, etc.) and create a superset (ugly!)
B. Define the table as a string and build a Java UDF
C. Define the table as a string and use a Python transform: https://community.hortonworks.com/articles/72414/how-to-create-a-custom-udf-for-hive-using-python.html

As an aside, the built-in JSON serde is pretty basic. I have used the alternative at https://github.com/rcongiu/Hive-JSON-Serde, which can do name mapping and some other cool stuff, but I don't think even with it you could write a table definition that would nicely handle this case.

2. With Spark or PySpark, you can pretty much do whatever you want.

If you are comfortable with it, I would recommend pre-processing the data with Spark into a common format. Or, if you are more comfortable with Hive, use the TRANSFORM statement with a Python script. A sketch of option 1A follows below.
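To illustrate option 1A, here is a hypothetical sketch (the raw_events table and the field names are invented) that folds two variant layouts into one superset using the built-in functions:

-- raw_events(line string): one JSON document per row
-- Some documents carry the id under $.user.id, others under $.uid
SELECT coalesce(get_json_object(line, '$.user.id'),
                get_json_object(line, '$.uid'))    AS user_id,
       if(get_json_object(line, '$.type') is null,
          'unknown',
          get_json_object(line, '$.type'))         AS event_type
FROM raw_events;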
12-12-2018
09:05 PM
You might look at the table definition (show create table x) and verify the prefix on the table location. Hive stores the whole filespec, including the protocol, as in the example below.
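For example (table name and paths hypothetical), the LOCATION line in the output carries the full filespec, and the table can be repointed if the protocol or host part is stale:

SHOW CREATE TABLE x;
-- ... look for a line like:
-- LOCATION 'hdfs://old-namenode:8020/apps/hive/warehouse/x'

-- Repoint the table if the prefix is wrong:
ALTER TABLE x SET LOCATION 'hdfs://new-namenode:8020/apps/hive/warehouse/x';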
11-19-2018
05:04 PM
You can get the URL from the Hive Summary page in Ambari. For a Kerberized cluster, you will need to call kinit to get a ticket before launching the Python program. The format of the connection string for Kerberos is something like the following; the actual user principal (and authentication) will be taken from the Kerberos ticket. Test the connection string with Beeline to make sure it works. beeline -u "jdbc:hive2://myhs2.foo:10000/default;principal=hive/myhs2.foo@MYKERBREALM;auth=kerberos"
10-30-2018
06:48 PM
Well, hive-env.sh is just a shell script, so you could do some bash magic to see whether the user is in a group, something like the following (I am sure someone could do it more simply!). Note that this assumes your groups are synced with Linux using sssd, Centrify, etc. NOTE: While not officially deprecated, using the hive CLI command is discouraged in favor of Beeline; the Hive CLI doesn't take advantage of Ranger security, and in HDP 3.x it is being rewritten as a wrapper around HiveServer2.

# List the current user's groups, e.g. "jdoe : hadoop admin"
G=$(groups "$USER")
# Split the output on commas/spaces into an array
IFS=', ' read -r -a mygroups <<< "$G"
found=0
searchGroup="admin"
# Look for an exact, whole-line match of the group name
if printf '%s\n' "${mygroups[@]}" | grep -xq "$searchGroup"; then
  found=1
  # Logic to allow Hive here.
fi
echo "$found"
08-23-2018
11:17 PM
After some research and help, I found that I had incorrectly set the nifi.remote.input.host property. It should be set as follows in the Advanced nifi-properties: nifi.remote.input.host={{nifi_node_host}} Each node should have a different value for nifi.remote.input.host; it is the hostname the node advertises for site-to-site (S2S) communications. If you set it the same on all nodes, they all advertise the same hostname, and all the data goes to the same host. You still have to set the other multi-threading parameters, such as "Maximum Timer Driven Thread Count" in the controller settings and "Concurrent Tasks" in the appropriate processors, but this gets multiple nodes to listen for the RPG requests.
08-23-2018
11:13 PM
Did some additional research, and the key to this is setting the following in the Advanced nifi-properties: nifi.remote.input.host={{nifi_node_host}} The documentation is a little weak, but this property determines the hostname a node advertises for RPG requests. Setting it to this variable substitutes each host's own name in its own nifi.properties file, and you are good to go. You still have to set the levels of concurrency for the overall canvas and the processors, of course, but this gets all the nodes to listen for remote requests.
08-21-2018
01:19 AM
Trying to get a NiFi site-to-site RPG (looping back to a single cluster) to balance across nodes. The real flow sends out table names for processing across the nodes, but I put together a simple test flow. It is still only going to one node after setting the usual suspects among the properties. Attachment 1 shows the test flow; the RPG URL is set to node xxx22, one of the worker nodes. Attachment 2 shows the settings for the test input port (Concurrent Tasks 7, Batch Settings count 1). Attachment 3 shows the Concurrent Tasks of the receiving processor (not sure this matters, but I was desperate!). Attachment 4 shows the provenance of the receiving processor, with all files going to node xxx28. Anyone see an obvious (or not so obvious!) configuration error here? Thanks!
Labels:
- Apache NiFi