Member since: 09-24-2015
Posts: 816
Kudos Received: 488
Solutions: 189
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 2581 | 12-25-2018 10:42 PM |
 | 11861 | 10-09-2018 03:52 AM |
 | 4139 | 02-23-2018 11:46 PM |
 | 1798 | 09-02-2017 01:49 AM |
 | 2117 | 06-21-2017 12:06 AM |
02-14-2017 07:41 AM
This NPE shows that the .tableinfo file does not exist. As suggested by Predrag below, running the repair tool multiple times might solve the problem (of course, sometimes it might not, and we have to fix the issue manually). Another thing: please check whether any file/directory exists for this 'prod:testj' table (via hdfs commands). I suspect that the entire table directory disappeared (as I mentioned earlier, someone might have removed this directory by mistake). In that case, another solution is to drop the table, recreate it, and re-populate the data.
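For reference, a minimal sketch of the checks, assuming the default HDP hbase.rootdir of /apps/hbase/data (adjust the paths for your cluster):
$ hdfs dfs -ls /apps/hbase/data/data/prod/testj            # does anything remain for the table?
$ hdfs dfs -ls /apps/hbase/data/data/prod/testj/.tabledesc # the .tableinfo file should live here
On HBase 1.x, hbck's -fixTableOrphans option attempts to re-create a missing .tableinfo:
$ hbase hbck -fixTableOrphans prod:testj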
02-24-2017 05:05 AM
+1 for a nice article! I had to add "library(ggplot2)" in steps 4 and 6, which provides the ggplot function.
02-11-2017 04:51 PM
I was able to stop the services and the ambari-agent on the node, and then delete the node. I installed the deleted services on another node. Thank you @Jay SenSharma
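For anyone following along, a rough sketch of the steps (host, cluster name, and credentials are placeholders):
$ ambari-agent stop   # on the node being removed, after stopping its components in the Ambari UI
The host can also be deleted via Ambari's REST API instead of the UI:
$ curl -u admin:admin -H 'X-Requested-By: ambari' -X DELETE http://ambari-host:8080/api/v1/clusters/mycluster/hosts/node1.example.com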
02-08-2017 08:27 AM
Oh, great! That worked. I used "ambari-server setup-security" and opted not to persist the master key. Thanks!
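In case it helps others, this is roughly the invocation (the exact interactive prompts vary by Ambari version):
$ ambari-server setup-security
# choose the option to encrypt passwords / set up the master key,
# and answer "n" when asked whether to persist the master key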
02-01-2017 10:21 PM
As the Jira says, stmk is rarely used, but the version of ZooKeeper currently shipped in HDP-2.x is 3.4.6, so yes, 3.4.7 or higher, which fixes the bug, should be added soon.
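A quick way to confirm which ZooKeeper version a node is running is the "stat" four-letter word (hostname is a placeholder, 2181 is the default client port):
$ echo stat | nc zk-host 2181 | head -1
# the first line reports something like "Zookeeper version: 3.4.6-..."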
12-18-2016 06:48 AM
You need a really recent FreeIPA to support --maxlife=0 (https://git.fedorahosted.org/cgit/freeipa.git/commit/?id=d2cb9ed327ee4003598d5e45d80ab7918b89eeed). If you are on supported RedHat or CentOS then you probably don't have it, unless you rolled your own. You can find out by checking the krbPasswordExpiration attribute of the user; it shouldn't be there. In that case you can try to set it (http://www.therebel.eu/2015/08/setting-password-expiry-in-ipa/) or update your password policy to a lifetime of, say, 10 years or so (don't go beyond 2038).
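A hedged sketch of both checks (the user name "alice" is a placeholder):
$ ipa user-show alice --all --raw | grep -i krbpasswordexpiration   # should print nothing if no expiry is set
$ ipa pwpolicy-mod --maxlife=3650   # alternatively, raise the global policy's max lifetime to ~10 years (value is in days)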
10-05-2016 07:38 PM
@Gobi Subramani I would suggest that you download and install HDF (NiFi). It can handle creating the data flow for you. Here's an example of it collecting logs. Instead of writing to an event bus you could use the PutHDFS processor and it would write to HDFS for you. There isn't a lot of trickery to get the date/folder to work; you just need ${now()} in place of the folder name to get the schema you are looking for. If you look around there are lots of walkthroughs and templates. I have included a picture of a simple flow that would likely solve your issue.
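As an illustration, a hedged sketch of how the PutHDFS "Directory" property could use the NiFi Expression Language to create one folder per day (the base path is a placeholder):
Directory: /data/logs/${now():format('yyyy-MM-dd')}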
09-27-2016 09:54 AM
2 Kudos
Attachment: hcc-58591.zip
Hive RegexSerDe can be used to extract columns from the input file using regular expressions. It's used only to deserialize data, while data serialization is not supported (and obviously not needed). The initial motivation to create such a SerDe was to process Apache web logs. There are two classes available:
org.apache.hadoop.hive.contrib.serde2.RegexSerDe, introduced in Hive-0.4 by HIVE-662, and
org.apache.hadoop.hive.serde2.RegexSerDe, a built-in class introduced in Hive-0.10 by HIVE-1719.
The former is kept to facilitate easier migration for legacy apps, while the latter is recommended for new apps.
The SerDe works by matching columns in the table definition with regex groups defined and captured by the regular expression. A regex group is defined by parentheses "(...)" inside the regex. Note that this is one of the common mistakes made by beginners, who spend time creating great regular expressions but displace or fail to mark the regex groups.
The new, built-in version supports the following primitive column types: TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, STRING, BOOLEAN and DECIMAL, in contrast to the contrib version, which supported only the STRING column type. The number of columns in the table definition must match the number of regex groups; otherwise a warning is printed and the table is not populated. On individual lines, if a row matches the regex but has fewer groups than expected, the missing groups and table fields will be NULL. If a row matches the regex but has more groups than expected, the additional groups are ignored. If a row doesn't match the regex at all, then all fields will be NULL. The regex is provided as a required SerDe property called "input.regex".
Another supported property is "input.regex.case.insensitive", which can be "true" or "false" (the default), while "output.format.string", supported by the contrib version, is not supported any more.
As an example, consider a tab-separated text input file composed of 5 fields: id int, city_org string, city_en string, country string, ppl float. We'd like to create a table using only 3 of those 5 fields, namely id, city_org, and ppl, meaning that we'd like to ignore the 3rd and 4th columns. (Of course we could do the same using a view, but for the sake of the discussion let's do it using RegexSerDe.) After uploading the input file:
$ hdfs dfs -mkdir -p hive/serde/regex
$ hdfs dfs -put allcities.utf8.tsv hive/serde/regex
we can define our table as:
hive> CREATE EXTERNAL TABLE citiesr1 (id int, city_org string, ppl float) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' WITH SERDEPROPERTIES ('input.regex'='^(\\d+)\\t([^\\t]*)\\t\\S+\\t\\S+\\t(\\d++.\\d++).*') LOCATION '/user/it1/hive/serde/regex';
Note that the regex contains 3 regex groups capturing the first, second and fifth fields on each line, corresponding to the 3 table columns:
(\\d+), the leading integer id composed of 1 or more digits,
([^\\t]*), a string of anything except tabs, positioned between the 2nd and 3rd delimiting tabs. If we knew the column contained no spaces we could also use "\\S+"; in our example this is not the case (however, we are making such an assumption about the 3rd and the 4th fields), and
(\\d++.\\d++), a float with at least 1 digit before and after the decimal point.
Input sample (the files used in the examples are available in the attachment):
110 La Coruña Corunna Spain 0.37
112 Cádiz Cadiz Spain 0.4
120 Köln Cologne Germany 0.97
hive> select * from citiesr1 where id>100 and id<121;
110 La Coruña 0.37
112 Cádiz 0.4
120 Köln 0.97
Now, let's consider a case when some fields are missing in the input file, and we attempt to read it using the same regex used for the table above:
$ hdfs dfs -mkdir -p hive/serde/regex2
$ hdfs dfs -put allcities-flds-missing.utf8.tsv hive/serde/regex2
hive> CREATE EXTERNAL TABLE citiesr2 (id int, city_org string, ppl float) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' WITH SERDEPROPERTIES ('input.regex'='^(\\d+)\\t([^\\t]*)\\t\\S+\\t\\S+\\t(\\d++.\\d++).*') LOCATION '/user/it1/hive/serde/regex2';
Input sample:
2<tab>大阪<tab>Osaka<tab><tab>
31<tab>Якутск<tab>Yakutsk<tab>Russia
121<tab>München<tab>Munich<tab><tab>1.2
On lines 1 and 3 we have 5 fields, but some are empty, while on the second line we have only 4 fields and 3 tabs. If we attempt to read the file using the regex given for table citiesr1, we'll end up with all NULLs on these 3 lines because the regex doesn't match them.
To rectify the problem we can change the regex slightly to allow for such cases:
hive> CREATE EXTERNAL TABLE citiesr3 (id int, city_org string, ppl float) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' WITH SERDEPROPERTIES ('input.regex'='^(\\d+)\\t([^\\t]*)\\t[^\\t]*\\t[^\\t]*[\\t]*(.*)') LOCATION '/user/it1/hive/serde/regex2';
The first 2 groups are unchanged; however, we have replaced both "\\S+" patterns for the unused columns with [^\\t]*, made the last delimiting tab optional, and the last group is now "(.*)", meaning everything after the last tab, including the empty string. With these changes, the above 3 lines become:
hive> select * from citiesr3 where id in (2, 31, 121);
2 大阪 NULL
31 Якутск NULL
121 München 1.2
The real power of RegexSerDe is that it can operate not only on delimiter boundaries, as shown above, but also inside individual columns. Besides processing web logs and extracting desired fields and patterns from the input file, another common use case of RegexSerDe is reading files with multi-character field delimiters, because "FIELDS TERMINATED BY" doesn't support them. (However, since Hive-0.14 there is also a contributed MultiDelimitSerDe which supports multi-char delimiters.)
Note: All tests were done on an HDP-2.4.0 cluster running Hive-1.2.1.
Related question: regex pattern for hive regex serde
01-05-2017 01:54 PM
Hi Predrag, we are using MultiDelimitSerDe, which as far as I understand is built on top of LazySimpleSerDe, and it looks like the serialization.encoding parameter does not have any effect. The file encoding is ISO-8859 text; whatever encoding I place into SERDEPROPERTIES has no effect. Do you know what the issue might be? We are using Hortonworks HDP 2.5.0.0 and the table DDL is as follows:
CREATE EXTERNAL TABLE IF NOT EXISTS INVOICES_1 (
list of columns)
PARTITIONED BY (
columns)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
WITH SERDEPROPERTIES ( "field.delim"="||", "serialization.encoding"="ISO8859_1")
LOCATION 'file/location'
tblproperties("skip.header.line.count"="1");
Regards, dalibor
09-08-2016 06:30 AM
The more nodes in a ZK ensemble (quorum), the faster the reads but the slower the writes. That's because a read can be served by any node, while a write is not acknowledged before a majority of nodes have persisted it, so every added node adds write latency. On top of that, early versions of Kafka (0.8.2 and older) keep Kafka offsets in ZK. Therefore, as already suggested by @mqureshi, it's best to start by creating a dedicated ZK for Kafka (I'd go for 3 nodes) and keep the 5-node ZK for everything else. Beefing up the number of ZK nodes to 7 or more is a resounding no. Regarding the installation and management of the new Kafka ZK, it's pretty straightforward to install it manually; just follow the steps in one of the "Non-Ambari cluster installation guides" like this one. You can also try to create a cluster composed of only Kafka and ZK and manage it by its own Ambari instance.
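For reference, a minimal sketch of a zoo.cfg for the dedicated 3-node Kafka ensemble (hostnames and paths are placeholders):
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
server.1=kafka-zk1.example.com:2888:3888
server.2=kafka-zk2.example.com:2888:3888
server.3=kafka-zk3.example.com:2888:3888
Each node also needs a myid file in dataDir containing its server number (1, 2, or 3).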