Member since: 09-24-2015
Posts: 816
Kudos Received: 488
Solutions: 189
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 3124 | 12-25-2018 10:42 PM |
|  | 14035 | 10-09-2018 03:52 AM |
|  | 4700 | 02-23-2018 11:46 PM |
|  | 2420 | 09-02-2017 01:49 AM |
|  | 2837 | 06-21-2017 12:06 AM |
10-05-2016
10:36 AM
Did you copy /etc/spark/conf/? Also, create a spark user and copy its headless keytab. You can find all the details here.
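A rough sketch of the steps, assuming an HDP-style layout; the host names, keytab path, group, and principal name are assumptions, not from the original answer:
$ scp -r <node-with-spark-client>:/etc/spark/conf /etc/spark/
$ useradd spark
$ scp <master-node>:/etc/security/keytabs/spark.headless.keytab /etc/security/keytabs/
$ chown spark:hadoop /etc/security/keytabs/spark.headless.keytab
$ su - spark -c "kinit -kt /etc/security/keytabs/spark.headless.keytab <spark-principal>"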
10-05-2016
08:22 AM
You need an Oozie coordinator and a workflow with 2 actions (if I understand your conditions correctly). Set the coordinator frequency to "0/30 * * * *" to run the workflow on every full hour and 30 minutes after every hour. In your coordinator, set a property named, for example, DATE and give it a value like the one below (see here for details):
${coord:formatTime(coord:nominalTime(), 'yyyy-MM-dd')}
In your workflow.xml create 2 actions: the first will be an fs action to create that directory, for example <mkdir path='${nameNode}/user/user1/data/${wf:conf("DATE")}' />, and the second will be a hive or hive2 action to "copy the data from the table and put it in to that generated table"; a sketch of such a coordinator is given below. [For another coordinator example see this, and click the "Previous page" links to find examples of the property file and a workflow with multiple actions, including hive actions.] And finally, input-events and datasets are used only if the condition for the coordinator to start is the availability of a new dataset: uri-template is used to define a dataset, and output-events refers to the coordinator output. You can find more details here: datasets, coordinator concepts, and an example of a coordinator based on input-events. If you run your coordinator purely by time frequency you don't need them.
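A minimal sketch of how such a coordinator could pass DATE to the workflow; the app name, the start/end/workflowAppPath parameters, and the schema version are placeholders, not from the original answer:
<coordinator-app name="half-hourly-coord" frequency="0/30 * * * *"
                 start="${start}" end="${end}" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
  <action>
    <workflow>
      <app-path>${workflowAppPath}</app-path>
      <configuration>
        <property>
          <name>DATE</name>
          <value>${coord:formatTime(coord:nominalTime(), 'yyyy-MM-dd')}</value>
        </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>
Inside the workflow the fs action can then refer to the property either as ${DATE} or as ${wf:conf("DATE")} when building the mkdir path.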
10-03-2016
10:40 AM
@Sundar Lakshmanan If "Yess", can you please accept Ayub's answer to help us manage answered questions. Tnx!
10-01-2016
03:18 AM
If you are not using Kerberos, open bin/kafka-console-consumer.sh in an editor and comment out the Kerberos-related "export" command in the script. In my related article, at the bottom, you can find a sed command that comments out those commands in all Kafka scripts.
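As a sketch, assuming an HDP-style install where the scripts enable Kerberos through an export of KAFKA_CLIENT_KERBEROS_PARAMS (the path and variable name below are assumptions; adjust them to your layout):
$ cd /usr/hdp/current/kafka-broker/bin
# comment out the Kerberos-related export line in every Kafka script
$ sed -i 's/^\( *export KAFKA_CLIENT_KERBEROS_PARAMS\)/#\1/' *.sh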
09-30-2016
03:43 AM
1 Kudo
The Hive default encoding is UTF-8, and therefore setting serialization.encoding to UTF-8 on a UTF-8 file is unnecessary. However, if you are facing trouble, there is a high probability that your input file uses another character set; in that case set 'serialization.encoding' to the encoding of the input file. A quick search shows that the default charset of SQL Server is ISO-8859-1 (alias latin1), so you can try 'serialization.encoding'='ISO-8859-1' (see the sketch below). For more examples see my recent article on Hive charsets.
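A minimal sketch; mytable is a placeholder for your existing table defined over the SQL Server export:
hive> ALTER TABLE mytable SET SERDEPROPERTIES ('serialization.encoding'='ISO-8859-1');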
09-30-2016
02:20 AM
It seems that totmiles is double, while the 3rd field in table 'riskfactor' is BIGINT. As the error says, "Target field must be of HCat type {DOUBLE}".
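One way to resolve the mismatch is to make the types agree, for example by declaring the 3rd column as DOUBLE; a rough sketch, where the column list is an assumption and only the 3rd column matters here:
hive> DROP TABLE riskfactor;
hive> CREATE TABLE riskfactor (driverid string, events bigint, totmiles double, riskfactor float);
Alternatively, cast the value being stored to a bigint on the producing side before writing into the table.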
09-29-2016
10:21 AM
Well, you said you had only 2 columns 🙂 For more columns you can either change the regex, or try MultiDelimitSerDe if you are on Hive-0.14 or newer. By the way, inspired by your question I wrote an article about RegexSerDe.
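For reference, a minimal sketch of the MultiDelimitSerDe option (Hive 0.14+); the table name, columns, and the "||" delimiter are made up:
hive> CREATE TABLE t3 (c1 string, c2 string, c3 string) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe' WITH SERDEPROPERTIES ('field.delim'='||') STORED AS TEXTFILE;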
09-27-2016
09:54 AM
2 Kudos
hcc-58591.zip
Hive RegexSerDe can be used to extract columns from the input file using regular expressions. It is used only to deserialize data; data serialization is not supported (and obviously not needed). The initial motivation to create such a SerDe was to process Apache web logs. There are two classes available:
org.apache.hadoop.hive.contrib.serde2.RegexSerDe, introduced in Hive-0.4 by HIVE-662, and
org.apache.hadoop.hive.serde2.RegexSerDe, a built-in class introduced in Hive-0.10 by HIVE-1719.
The former is kept to facilitate easier migration of legacy apps, while the latter is recommended for new apps.
The SerDe works by matching columns in the table definition with regex groups defined and captured by the regular expression. A regex group is defined by parentheses "(...)" inside the regex. Note that this is one of the common mistakes made by beginners, who spend time creating great regular expressions but then displace or fail to mark the regex groups.
The new, built-in version supports the following primitive column types: TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, STRING, BOOLEAN and DECIMAL, in contrast to the "contrib" version, which supported only the STRING column type. The number of columns in the table definition and the number of regex groups must match; otherwise a warning is printed and the table is not populated. On individual lines, if a row matches the regex but has fewer groups than expected, the missing groups and table fields will be NULL. If a row matches the regex but has more groups than expected, the additional groups are simply ignored. If a row doesn't match the regex at all, then all fields will be NULL.
The regex is provided as a required SerDe property called "input.regex". Another supported property is "input.regex.case.insensitive", which can be "true" or "false" (default), while "output.format.string", supported by the contrib version, is not supported any more.
As an example, consider a tab-separated text input file composed of 5 fields: id int, city_org string, city_en string, country string, ppl float. We'd like to create a table using only 3 of those 5 fields, namely id, city_org, and ppl, meaning that we'd like to ignore the 3rd and 4th columns. (Of course we could do the same using a view, but for the sake of the discussion let's do it using RegexSerDe.) We can define our table as:
$ hdfs dfs -mkdir -p hive/serde/regex
$ hdfs dfs -put allcities.utf8.tsv hive/serde/regex
hive> CREATE EXTERNAL TABLE citiesr1 (id int, city_org string, ppl float) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' WITH SERDEPROPERTIES ('input.regex'='^(\\d+)\\t([^\\t]*)\\t\\S+\\t\\S+\\t(\\d++.\\d++).*') LOCATION '/user/it1/hive/serde/regex';
Note that the regex contains 3 regex groups capturing the first, second and fifth field on each line, corresponding to the 3 table columns:
(\\d+), the leading integer id composed of 1 or more digits,
([^\\t]*), a string of anything except tabs, positioned between the 1st and 2nd delimiting tabs. If we knew that the column contained no spaces we could also use "\\S+"; in our example this is not the case (however, we do make that assumption about the 3rd and 4th fields), and
(\\d++.\\d++), a float with at least 1 digit before and after the decimal point.
Input sample (files used in the examples are available in the attachment):
110 La Coruña Corunna Spain 0.37
112 Cádiz Cadiz Spain 0.4
120 Köln Cologne Germany 0.97
hive> select * from citiesr1 where id>100 and id<121;
110 La Coruña 0.37
112 Cádiz 0.4
120 Köln 0.97
Now, let's consider a case when some fields are missing in the input file, and we attempt to read it using the same regex
used for the table above:
$ hdfs dfs -mkdir -p hive/serde/regex2
$ hdfs dfs -put allcities-flds-missing.utf8.tsv hive/serde/regex2
hive> CREATE EXTERNAL TABLE citiesr2 (id int, city_org string, ppl float) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' WITH SERDEPROPERTIES ('input.regex'='^(\\d+)\\t([^\\t]*)\\t\\S+\\t\\S+\\t(\\d++.\\d++).*') LOCATION '/user/it1/hive/serde/regex2';
Input sample:
2<tab>大阪<tab>Osaka<tab><tab>
31<tab>Якутск<tab>Yakutsk<tab>Russia
121<tab>München<tab>Munich<tab><tab>1.2
On lines 1 and 3 we have 5 fields, but some are empty, while on the second line we have only 4 fields and 3 tabs. If we attempt to read the file using the regex given for table citiesr1, we'll end up with all NULLs on these 3 lines because the regex doesn't match them. To rectify the problem we can change the regex slightly to allow for such cases:
hive> CREATE EXTERNAL TABLE citiesr3 (id int, city_org string, ppl float) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' WITH SERDEPROPERTIES ('input.regex'='^(\\d+)\\t([^\\t]*)\\t[^\\t]*\\t[^\\t]*[\\t]*(.*)') LOCATION '/user/it1/hive/serde/regex2';
The first 2 groups are unchanged; however, we have replaced both "\\S+" patterns for the unused columns with [^\\t]*, the last delimiting tab is now optional, and the last group is now set to "(.*)", meaning everything after the last tab including the empty string. With these changes, the above 3 lines become:
hive> select * from citiesr3 where id in (2, 31, 121);
2 大阪 NULL
31 Якутск NULL
121 München 1.2
The real power of RegexSerDe is that it can operate not only on delimiter boundaries, as shown above, but also inside individual columns. Besides processing web logs and extracting desired fields and patterns from the input file, another common use case of RegexSerDe is to read files with multi-character field delimiters, because "FIELDS TERMINATED BY" doesn't support them. (However, since Hive-0.14 there is also a contributed MultiDelimitSerDe which supports multi-character delimiters.)
Note: All tests were done on an HDP-2.4.0 cluster running Hive-1.2.1.
Related questions: regex pattern for hive regex serde
09-27-2016
06:02 AM
5 Kudos
hcc-58548.zip
Until recently, Hive could only read and write UTF-8 text files; no other character sets were supported, forcing people to convert their possibly huge and/or numerous input files to UTF-8 using "iconv" or another such utility, which can be cumbersome (for example, iconv supports only files smaller than 16G) and time-consuming.
However, since Hive-0.14, conversion from and to other character sets is supported by way of LazySimpleSerDe. The SerDe can be specified explicitly, for example:
hive> CREATE TABLE mytable(...) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' WITH SERDEPROPERTIES("serialization.encoding"='SJIS');
or by declaring the table in the usual way, omitting the SerDe but specifying the field delimiter, and then doing:
hive> ALTER TABLE mytable SET SERDEPROPERTIES ('serialization.encoding'='SJIS');
where, instead of "SJIS", one can insert the encoding of the input file (note that character-set self-detection is not supported), for example one of the ISO-8859-x encodings for Indo-European languages, SJIS for Japanese, GBK for Chinese, and so on.
Using SJIS and ISO-8859-1 Character Sets
SJIS has been widely used in Japan, especially on feature mobile phones, because it results in smaller text files. ISO-8859-x encodings have been used in Europe, and they also result in smaller text files.
[Files used in examples are available in the attachment.] Let's create an external table on an SJIS file, with fields: id, city name in the native language, city name in English, country, and population (in millions).
$ hdfs dfs -mkdir -p hive/sjis/cj
$ hdfs dfs -put cjcities.sjis hive/sjis/cj
hive> CREATE EXTERNAL TABLE city_sjis(id int, city_org string, city_en string, cnt string, ppl float) row format delimited fields terminated by ',' location '/user/it1/hive/sjis/cj';
Hive attempts to interpret SJIS as UTF-8, and that results in a garbled print-out:
hive> select * from city_sjis limit 1;
1����TokyoJapan33.8
After setting the character set, the print-out is correct. Note that SJIS also supports the Greek alphabet and Russian Cyrillic.
hive> ALTER TABLE city_sjis SET SERDEPROPERTIES ('serialization.encoding'='SJIS');
hive> select * from city_sjis;
1 東京 Tokyo Japan 33.8
2 大阪 Osaka Japan 16.7
11 北京 Beijing China 13.2
12 廣州 Guangzhou China 15.3
21 Αθηνα Athens Greece 3.7
31 Якутск Yakutsk Russia 0.6
Conditions expressed in non-ASCII characters also work:
hive> select * from city_sjis where city_org like '%京';
1 東京 Tokyo Japan 33.8
11 北京 Beijing China 13.2
Similarly, for ISO-8859-1:
$ hdfs dfs -mkdir -p hive/sjis/eu
$ hdfs dfs -put eucities.iso-8859-1 hive/sjis/eu
hive> CREATE EXTERNAL TABLE city_isolatin(id int, city_org string, city_en string, cnt string, ppl float) row format delimited fields terminated by ',' location '/user/it1/hive/sjis/eu';
hive> ALTER TABLE city_isolatin SET SERDEPROPERTIES ('serialization.encoding'='ISO-8859-1');
hive> select * from city_isolatin;
110 La Coruña Corunna Spain 0.37
112 Cádiz Cadiz Spain 0.4
120 Köln Cologne Germany 0.97
121 München Munich Germany 1.2
130 Tårnby Tarnby Danmark 0.04
140 Tønsberg Tonsberg Norway 0.05
150 Besançon Bisanz France 0.12
hive> select * from city_isolatin where city_org like '%å%';
130 Tårnby Tarnby Danmark 0.04
Note that we can also use 'latin1', a well-known alias of 'ISO-8859-1' (or 'IBM819', 'CP819'). We can now consolidate both tables into a single, UTF-8 one:
hive> create table city as select * from city_sjis;
hive> insert into city select * from city_isolatin;
hive> select * from city where id=1 or id=110;
1 東京 Tokyo Japan 33.8
110 La Coruña Corunna Spain 0.37
The above examples show how to read files from other encodings; however, one can also write files to other encodings by setting 'serialization.encoding' on the newly created table and doing an "INSERT INTO" that table.
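A minimal sketch; the table name city_sjis_out is made up, and city is the consolidated UTF-8 table created above:
hive> CREATE TABLE city_sjis_out (id int, city_org string, city_en string, cnt string, ppl float) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
hive> ALTER TABLE city_sjis_out SET SERDEPROPERTIES ('serialization.encoding'='SJIS');
hive> INSERT INTO city_sjis_out SELECT * FROM city;
The text files under the new table's location will then be encoded in SJIS.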
About UTF-16
Conversion from/to UTF-16 works as well; however, the behavior might be somewhat unexpected. Writing to a table with encoding set to 'UTF-16' creates a text file in which every line begins with the so-called BOM (byte order mark) character U+FEFF (2 bytes).
This is non-standard behavior, since a BOM is usually placed only once, at the very beginning of the file. Writing to a table with encoding set to 'UTF-16LE' or 'UTF-16BE' (Little/Big Endian) creates a text file with no BOMs, which is the standard behavior. In either case, delimiters such as Hive's default Ctrl-A, comma, or tab are represented in 2 bytes, while the newline LF character is represented in only 1 byte (0x0A) instead of its 2-byte UTF-16 equivalent 0x000A. Keep this in mind when attempting to read a UTF-16 file in Hive. If your file doesn't satisfy the above conditions, you can consider converting it to UTF-8, for example using iconv with the "-f UTF-16" or "-f UTF-16LE" option, or convert/create your input files programmatically so that they satisfy the above conditions.
Related articles: Hive table with UTF-16 data
Note: All tests were done on an HDP-2.4.0 cluster running Hive-1.2.1.
09-22-2016
03:17 AM
DROP the current table (files on HDFS are not affected for external tables), and create a new one with the same name pointing to your S3 location.
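A minimal sketch; the table name, columns, and S3 path are placeholders (dropping an EXTERNAL table does not delete the underlying files):
hive> DROP TABLE mytable;
hive> CREATE EXTERNAL TABLE mytable (id int, name string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION 's3a://mybucket/path/';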