Member since 09-24-2015 · 816 Posts · 488 Kudos Received · 189 Solutions
09-02-2017 03:27 AM
I just tried this and it also works on HDP-2.6.0, and I believe on other 2.6.x versions. Instead of the jar in the article I used the latest version at http://repo.hortonworks.com/content/repositories/releases/com/hortonworks/spark-llap/1.0.0.2.5.5.5-2/spark-llap-1.0.0.2.5.5.5-2-assembly.jar. Regarding the copy targets, it's enough to copy the assembly jar only to /usr/hdp/current/spark-client/lib on nodes where that directory already exists. It could probably also be placed on HDFS under /hdp, but I haven't tried that.
08-23-2017 12:48 AM
The same problem here @Dongjoon Hyun: the cluster is not connected to the Internet, and browsing http://repo.hortonworks.com/content/groups/public/com/hortonworks/spark/spark-llap-assembly_2.11/1.1.3-2.1/ returns no jars, only the POM for Maven.
04-01-2017 12:35 AM · 3 Kudos
Cloudbreak is a popular, easy-to-use HDP component for cluster deployment on various cloud environments, including Azure, AWS, OpenStack, and GCP. This article shows how to create an Azure application for Cloudbreak using the Azure CLI. Note: To do this, you need the "Owner" role on your Azure subscription; "Developer" and other roles are not enough.
Download and install the Azure CLI using the instructions provided at https://docs.microsoft.com/en-us/cli/azure/install-azure-cli. CLI versions are available for Windows, macOS, and Linux.
Type "az" to make sure the CLI is available and in your command path. Log in to your Azure account in your web browser, and then also log in from your command line:
az login
To sign in, use a web browser to open the page https://aka.ms/devicelogin and enter the code HPBCSXTPJ to authenticate.
Follow the instructions on the web page. When done, you will see confirmation on the command line that your login was successful. Run the following command. You can freely choose the values to enter here, including dummy URIs: the identifier URI and the homepage are never used on Azure, but they are required. Also make sure that the identifier URI is unique within your subscription. So, instead of "mycbdapp" you may choose a more descriptive name.
az ad app create --identifier-uris http://mycbdapp.com --display-name mycbdapp --homepage http://mycbdapp.com
Ignore the output of this command, including its appId; that's not the one we need! Choose your password, and run the following command:
az ad sp create-for-rbac --name "mycbdapp" --password "mytopsecretpassword" --role Owner
{
"appId": "c19a48f3-492f-a87b-ac4a-b1d8e456f14e",
"displayName": "mycbdapp",
"name": "http://mycbdapp",
"password": "mytopsecretpassword",
"tenant": "891fd956-21c9-4c40-bfa7-ab88c1d8364c"
}
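The fields that Cloudbreak needs are a subset of this JSON output. A small sketch (using Python only as an illustration; the values below are the sample output from above) of pulling them out programmatically:

```python
import json

# Sample output of "az ad sp create-for-rbac", copied from the example above
sp_output = '''{
  "appId": "c19a48f3-492f-a87b-ac4a-b1d8e456f14e",
  "displayName": "mycbdapp",
  "name": "http://mycbdapp",
  "password": "mytopsecretpassword",
  "tenant": "891fd956-21c9-4c40-bfa7-ab88c1d8364c"
}'''

sp = json.loads(sp_output)
# These three values, plus your subscription ID, go into the Cloudbreak credential form
print("App ID:   ", sp["appId"])
print("Password: ", sp["password"])
print("Tenant ID:", sp["tenant"])
```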
Now log in to your Cloudbreak instance, select "manage credentials", then "+ create credential", and on the "Configure credential" page select Azure and fill in the form as shown on the screenshot. Use the appId, password, and tenant ID from the output above. Add your Azure subscription ID, and paste the public key of the SSH key pair you created before (this will be used to provide SSH access to the cluster machines for the "cloudbreak" user). Then proceed by providing the other settings, and enjoy HDP on Cloudbreak!
02-24-2017 05:05 AM
+1 for a nice article! I had to add "library(ggplot2)" in steps 4 and 6, which provides the ggplot function.
12-10-2016 06:23 AM
Hi @Ron Ballard, the list of encodings depends on what's installed on your machines. You can list them using, for example, "iconv -l". CP1252 is installed on the Hortonworks Sandbox and has two other aliases:
$ iconv -l | grep 1252
CP1252 MS-ANSI WINDOWS-1252
Regarding your second question, the answer is yes, and that's exactly what I'm doing above: "city" uses the default UTF-8 encoding and city_sjis is in SJIS:
create table city as select * from city_sjis;
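The same alias relationship can be checked outside iconv; a small Python sketch (just an illustration, independent of Hive):

```python
import codecs

# 'windows-1252' and 'cp1252' resolve to the same codec
assert codecs.lookup("windows-1252").name == codecs.lookup("cp1252").name

# 0xE9 is 'é' in CP1252; in UTF-8 the same character takes two bytes
assert b"\xe9".decode("cp1252") == "é"
assert "é".encode("utf-8") == b"\xc3\xa9"
print("cp1252 checks passed")
```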
09-27-2016 09:54 AM · 2 Kudos
hcc-58591.zip
Hive RegexSerDe can be used to extract columns from the input file using regular expressions. It is used only to deserialize data; data serialization is not supported (and obviously not needed). The initial motivation to create such a SerDe was to process Apache web logs. There are two classes available:
org.apache.hadoop.hive.contrib.serde2.RegexSerDe, introduced in Hive-0.4 by HIVE-662, and
org.apache.hadoop.hive.serde2.RegexSerDe, a built-in class introduced in Hive-0.10 by HIVE-1719. The former is kept to facilitate easier migration for legacy apps, while the latter is recommended for new apps.
The SerDe works by matching columns in the table definition with regex groups defined and captured by the regular expression. A regex group is defined by parentheses "(...)" inside the regex. Note that this is one of the common mistakes made by beginners, who spend time creating great regular expressions but misplace or fail to mark regex groups.
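As an illustration of that point (a Python sketch, not Hive itself — Hive uses Java regexes, but group capture works the same way), only the parenthesized parts of a pattern become extractable groups:

```python
import re

line = "120\tKöln\tCologne\tGermany\t0.97"

# Without parentheses nothing is captured, even though the pattern matches
assert re.match(r"^\d+\t[^\t]*", line).groups() == ()

# With parentheses, each "(...)" becomes one extractable group
m = re.match(r"^(\d+)\t([^\t]*)", line)
assert m.groups() == ("120", "Köln")
print(m.groups())
```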
The new, built-in version supports the following primitive column types: TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, STRING, BOOLEAN, and DECIMAL, in contrast to the "contrib" version, which supported only the STRING column type. The number of columns in the table definition must match the number of regex groups; otherwise a warning is printed and the table is not populated.
On individual lines, if a row matches the regex but yields fewer groups than expected, the missing groups and table fields will be NULL. If a row matches the regex but yields more groups than expected, the additional groups are just ignored. If a row doesn't match the regex at all, then all fields will be NULL. The regex is provided as a required SerDe property called "input.regex".
Another supported property is "input.regex.case.insensitive", which can be "true" or "false" (default), while "output.format.string", supported by the contrib version, is not supported any more.
As an example, consider a tab-separated text input file composed of 5 fields: id int, city_org string, city_en string, country string, ppl float, and suppose we'd like to create a table using only 3 of those 5 fields, namely id, city_org, and ppl, meaning that we'd like to ignore the 3rd and 4th columns. (Of course we could do the same using a view, but for the sake of the discussion let's do it using RegexSerDe.) We can define our table as:
$ hdfs dfs -mkdir -p hive/serde/regex
$ hdfs dfs -put allcities.utf8.tsv hive/serde/regex
hive> CREATE EXTERNAL TABLE citiesr1 (id int, city_org string, ppl float) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' WITH SERDEPROPERTIES ('input.regex'='^(\\d+)\\t([^\\t]*)\\t\\S+\\t\\S+\\t(\\d++.\\d++).*') LOCATION '/user/it1/hive/serde/regex';
Note that the regex contains 3 regex groups capturing the first, second, and fifth fields on each line, corresponding to the 3 table columns:
(\\d+), the leading integer id composed of 1 or more digits,
([^\\t]*), a string of anything except tabs, positioned between the 2nd and 3rd delimiting tabs. If we knew that the column contained no spaces we could also use "\\S+", but in our example this is not the case (however, we do make such an assumption about the 3rd and 4th fields), and
(\\d++.\\d++), a float with at least 1 digit before and after the decimal point.
Input sample (the files used in the examples are available in the attachment):
110 La Coruña Corunna Spain 0.37
112 Cádiz Cadiz Spain 0.4
120 Köln Cologne Germany 0.97
hive> select * from citiesr1 where id>100 and id<121;
110 La Coruña 0.37
112 Cádiz 0.4
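The group capture can be sanity-checked outside Hive. A rough Python equivalent of the regex above (Hive uses Java's possessive "\d++", which plain "\d+" approximates; it behaves the same on these inputs):

```python
import re

# Approximation of the citiesr1 SerDe regex; Java's possessive "\d++"
# is replaced with plain "\d+", which matches the same lines here
pattern = re.compile(r"^(\d+)\t([^\t]*)\t\S+\t\S+\t(\d+.\d+).*")

line = "110\tLa Coruña\tCorunna\tSpain\t0.37"
m = pattern.match(line)
assert m.groups() == ("110", "La Coruña", "0.37")

# A line with missing middle fields doesn't match at all -> all columns NULL
bad = "2\t大阪\tOsaka\t\t"
assert pattern.match(bad) is None
print(m.groups())
```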
120 Köln 0.97
Now, let's consider a case where some fields are missing in the input file, and we attempt to read it using the same regex used for the table above:
$ hdfs dfs -mkdir -p hive/serde/regex2
$ hdfs dfs -put allcities-flds-missing.utf8.tsv hive/serde/regex2
hive> CREATE EXTERNAL TABLE citiesr2 (id int, city_org string, ppl float) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' WITH SERDEPROPERTIES ('input.regex'='^(\\d+)\\t([^\\t]*)\\t\\S+\\t\\S+\\t(\\d++.\\d++).*') LOCATION '/user/it1/hive/serde/regex2';
Input sample:
2<tab>大阪<tab>Osaka<tab><tab>
31<tab>Якутск<tab>Yakutsk<tab>Russia
121<tab>München<tab>Munich<tab><tab>1.2
On lines 1 and 3 we have 5 fields, but some are empty, while on the second line we have only 4 fields and 3 tabs. If we attempt to read the file using the regex given for table citiesr1, we'll end up with all NULLs on these 3 lines because the regex doesn't match them.
To rectify the problem we can change the regex slightly to allow for such cases:
hive> CREATE EXTERNAL TABLE citiesr3 (id int, city_org string, ppl float) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' WITH SERDEPROPERTIES ('input.regex'='^(\\d+)\\t([^\\t]*)\\t[^\\t]*\\t[^\\t]*[\\t]*(.*)') LOCATION '/user/it1/hive/serde/regex2';
The first 2 groups are unchanged; however, we have replaced both "\\S+" patterns for the unused columns with "[^\\t]*", made the last delimiting tab optional, and set the last group to "(.*)", meaning everything after the last tab, including the empty string. With these changes, the above 3 lines become:
hive> select * from citiesr3 where id in (2, 31, 121);
2 大阪 NULL
31 Якутск NULL
121 München 1.2
The real power of RegexSerDe is that it can operate not only on delimiter boundaries, as shown above, but also inside individual columns. Besides processing web logs and extracting desired fields and patterns from the input file, another common use case for RegexSerDe is reading files with multi-character field delimiters, because "FIELDS TERMINATED BY" doesn't support them. (However, since Hive-0.14 there is also a contributed MultiDelimitSerDe which supports multi-char delimiters.) Note: All tests were done on an HDP-2.4.0 cluster running Hive-1.2.1. Related question: regex pattern for hive regex serde
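The more lenient regex from table citiesr3 can be sanity-checked the same way (a Python sketch; this pattern contains no Java-only constructs, so it is used verbatim):

```python
import re

# The lenient citiesr3 SerDe regex, unchanged
lenient = re.compile(r"^(\d+)\t([^\t]*)\t[^\t]*\t[^\t]*[\t]*(.*)")

lines = [
    "2\t大阪\tOsaka\t\t",
    "31\tЯкутск\tYakutsk\tRussia",
    "121\tMünchen\tMunich\t\t1.2",
]
# Every line now matches; an empty last group becomes NULL
# after Hive casts it to float
results = [lenient.match(line).groups() for line in lines]
for r in results:
    print(r)
```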
09-27-2016 06:02 AM · 5 Kudos
hcc-58548.zip
Until recently, Hive could only read and write UTF-8 text files; no other character sets were supported, forcing people to convert their possibly huge and/or numerous input files to UTF-8 using "iconv" or a similar utility, which can be cumbersome (for example, iconv supports only files smaller than 16G) and time-consuming. However, since Hive-0.14, by way of LazySimpleSerDe, conversion from and to other character sets is supported. The SerDe can be specified, for example, by:
hive> CREATE TABLE mytable(...) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' WITH SERDEPROPERTIES("serialization.encoding"='SJIS');
or by declaring the table in a usual way, omitting the SerDe, specifying the field delimiter, and then doing:
hive> ALTER TABLE mytable SET SERDEPROPERTIES ('serialization.encoding'='SJIS');
where, instead of "SJIS", one can insert the encoding of the input file (note that character-set auto-detection is not supported), for example one of the ISO-8859-x encodings for Indo-European languages, SJIS for Japanese, GBK for Chinese, and so on.
Using SJIS and ISO-8859-1 Character Sets
SJIS has been widely used in Japan, especially on feature mobile phones, because it results in smaller text files. The ISO-8859-x encodings have been used in Europe, and they also result in smaller text files. [Files used in the examples are available in the attachment.] Let's create an external table on an SJIS file, with fields: id, city name in the native language, city name in English, country, and population (in millions).
$ hdfs dfs -mkdir -p hive/sjis/cj
$ hdfs dfs -put cjcities.sjis hive/sjis/cj
hive> CREATE EXTERNAL TABLE city_sjis(id int, city_org string, city_en string, cnt string, ppl float) row format delimited fields terminated by ',' location '/user/it1/hive/sjis/cj';
Hive attempts to interpret SJIS as UTF-8, and that results in a garbled print-out:
hive> select * from city_sjis limit 1;
1����TokyoJapan33.8
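This garbling can be reproduced outside Hive (a Python sketch, just for illustration): bytes written as Shift-JIS are not valid UTF-8, so decoding them as UTF-8 fails, or yields replacement characters under lenient decoding:

```python
# "東京" encoded as Shift-JIS is not valid UTF-8
sjis_bytes = "東京".encode("shift_jis")

try:
    sjis_bytes.decode("utf-8")
except UnicodeDecodeError as e:
    print("not valid UTF-8:", e.reason)

# Lenient decoding yields the replacement characters seen above,
# one per undecodable byte
garbled = sjis_bytes.decode("utf-8", errors="replace")
print(garbled)
```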
After setting the character set, the print-out is correct. Note that SJIS also supports the Greek alphabet and Russian Cyrillic.
hive> ALTER TABLE city_sjis SET SERDEPROPERTIES ('serialization.encoding'='SJIS');
hive> select * from city_sjis;
1 東京 Tokyo Japan 33.8
2 大阪 Osaka Japan 16.7
11 北京 Beijing China 13.2
12 廣州 Guangzhou China 15.3
21 Αθηνα Athens Greece 3.7
31 Якутск Yakutsk Russia 0.6
Conditions expressed in non-ASCII characters also work:
hive> select * from city_sjis where city_org like '%京';
1 東京 Tokyo Japan 33.8
11 北京 Beijing China 13.2
Similarly, for ISO-8859-1:
$ hdfs dfs -mkdir -p hive/sjis/eu
$ hdfs dfs -put eucities.iso-8859-1 hive/sjis/eu
hive> CREATE EXTERNAL TABLE city_isolatin(id int, city_org string, city_en string, cnt string, ppl float) row format delimited fields terminated by ',' location '/user/it1/hive/sjis/eu';
hive> ALTER TABLE city_isolatin SET SERDEPROPERTIES ('serialization.encoding'='ISO-8859-1');
hive> select * from city_isolatin;
110 La Coruña Corunna Spain 0.37
112 Cádiz Cadiz Spain 0.4
120 Köln Cologne Germany 0.97
121 München Munich Germany 1.2
130 Tårnby Tarnby Danmark 0.04
140 Tønsberg Tonsberg Norway 0.05
150 Besançon Bisanz France 0.12
hive> select * from city_isolatin where city_org like '%å%';
130 Tårnby Tarnby Danmark 0.04
Note that we can also use 'latin1', a well-known alias of 'ISO-8859-1' (as are 'IBM819' and 'CP819'). We can now consolidate both tables into a single, UTF-8 one:
hive> create table city as select * from city_sjis;
hive> insert into city select * from city_isolatin;
hive> select * from city where id=1 or id=110;
1 東京 Tokyo Japan 33.8
110 La Coruña Corunna Spain 0.37
The above examples show how to read files in other encodings; however, one can also write files in other encodings, by setting 'serialization.encoding' on the newly created table and doing an "INSERT INTO" that table.
About UTF-16
Conversion from/to UTF-16 works as well; however, the behavior might be somewhat unexpected. Writing to a table with the encoding set to 'UTF-16' creates a text file in which every line begins with the so-called BOM (byte order mark) character U+FEFF (2 bytes). This is non-standard behavior, since the BOM is usually placed only once, at the very beginning of the file. Writing to a table with the encoding set to 'UTF-16LE' or 'UTF-16BE' (little/big endian) creates a text file with no BOMs, which is the standard behavior. In either case, delimiters such as Hive's default Ctrl-A, or comma or tab, are represented in 2 bytes; however, the newline LF character is represented in only 1 byte (0x0A), instead of the 2-byte UTF-16 equivalent 0x000A. Keep this in mind when attempting to read a UTF-16 file in Hive. If your file doesn't satisfy the above conditions, you can consider converting it to UTF-8 using, for example, iconv with the "-f UTF-16" or "-f UTF-16LE" options, or convert/create the input files programmatically to satisfy the above conditions. Related article: Hive table with UTF-16 data. Note: All tests were done on an HDP-2.4.0 cluster running Hive-1.2.1.
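The BOM difference described above is easy to observe outside Hive (a Python sketch; each `encode` call mimics one line being encoded separately):

```python
line = "東京"

with_bom = line.encode("utf-16")     # BOM + 2 bytes per character
no_bom   = line.encode("utf-16-le")  # no BOM, little-endian

assert with_bom[:2] in (b"\xff\xfe", b"\xfe\xff")  # BOM is present
assert len(with_bom) == len(no_bom) + 2            # exactly one extra BOM

# A standards-conforming UTF-16 newline is 2 bytes, unlike the
# single 0x0A byte described above:
assert "\n".encode("utf-16-le") == b"\x0a\x00"
print("UTF-16 BOM checks passed")
```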
09-07-2016 06:13 AM
Interesting, so the JIRA removed the "empty regions are not merged away" clause. If so, I'd not enable normalization on pre-split tables.
09-06-2016 09:21 AM
In your example, 2 zero-size regions were merged, while the logic page says: '"empty" regions (less than 1MB, with the previous note) are not merged away. This is by design, to prevent normalization from undoing the pre-splitting of a table.' Can you kindly explain why?
05-19-2016 04:04 AM
Hi @Stephen Redmond, sorry I missed your comment. No, I haven't done tests with compression. I'll let you know if I find something. Also, you can file a question on HCC, copying your comment, to get wider attention. Thanks.