Member since 09-24-2015
816 Posts
488 Kudos Received
189 Solutions
08-08-2022
02:45 AM
Hello @hbasetest. You wish to enable the Normalizer at the cluster level irrespective of the table-level setting, i.e. regardless of whether NORMALIZATION_ENABLED is true or false on a table. As far as I know, we still require the table-level setting to be enabled. Having said that, if you open a new post on this topic using the steps shared by @VidyaSargur, our fellow Community Gurus can get back to you sooner than via a comment on an article written in 2016.
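For reference, the usual two-step route looks roughly like this in the HBase shell (the table name "my_table" is only a placeholder):
hbase> normalizer_switch true
hbase> alter 'my_table', {NORMALIZATION_ENABLED => 'true'}
hbase> normalize
As far as I know, normalizer_switch toggles the cluster-wide master switch while NORMALIZATION_ENABLED opts an individual table in, and both need to be on for the normalizer to act on that table.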
12-04-2020
01:27 AM
@sshimpi I am getting this failure when I run gradle clean build:
The 'sonar-runner' plugin has been deprecated and is scheduled to be removed in Gradle 3.0. please use the official plugin from SonarQube (http://docs.sonarqube.org/display/SONAR/Analyzing+with+Gradle).
:clean
:buildInfo
:compileJava
FAILURE: Build failed with an exception.
* What went wrong:
Could not resolve all dependencies for configuration ':compile'.
> Could not resolve com.sequenceiq:ambari-client20:2.0.1.
  Required by:
      com.sequenceiq:ambari-shell:0.1.DEV
   > Could not resolve com.sequenceiq:ambari-client20:2.0.1.
      > Could not get resource 'http://maven.sequenceiq.com/snapshots/com/sequenceiq/ambari-client20/2.0.1/ambari-client20-2.0.1.pom'.
         > Could not GET 'http://maven.sequenceiq.com/snapshots/com/sequenceiq/ambari-client20/2.0.1/ambari-client20-2.0.1.pom'.
            > maven.sequenceiq.com: Name or service not known
   > Could not resolve com.sequenceiq:ambari-client20:2.0.1.
      > Could not get resource 'http://maven.sequenceiq.com/release/com/sequenceiq/ambari-client20/2.0.1/ambari-client20-2.0.1.pom'.
         > Could not GET 'http://maven.sequenceiq.com/release/com/sequenceiq/ambari-client20/2.0.1/ambari-client20-2.0.1.pom'.
            > maven.sequenceiq.com
   > Could not resolve com.sequenceiq:ambari-client20:2.0.1.
      > Could not get resource 'http://maven.sequenceiq.com/releases/com/sequenceiq/ambari-client20/2.0.1/ambari-client20-2.0.1.pom'.
         > Could not GET 'http://maven.sequenceiq.com/releases/com/sequenceiq/ambari-client20/2.0.1/ambari-client20-2.0.1.pom'.
            > maven.sequenceiq.com
* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output.
BUILD FAILED
Total time: 11.708 secs
Please help!
12-21-2017
08:53 AM
Also, I've tried spark-llap on HDP-2.6.2.0 with Spark 1.6.3 and http://repo.hortonworks.com/content/repositories/releases/com/hortonworks/spark-llap/1.0.0.2.5.5.5-2/spark-llap-1.0.0.2.5.5.5-2-assembly.jar, but unfortunately, when I tried to execute a simple "select count" query in beeline, I got the following error messages:
0: jdbc:hive2://node-05:10015/default> select count(*) from ods_order.cc_customer;
Error: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], output=[_c0#56L])
+- TungstenExchange SinglePartition, None
+- TungstenAggregate(key=[], functions=[(count(1),mode=Partial,isDistinct=false)], output=[count#59L])
+- Scan LlapRelation(org.apache.spark.sql.hive.llap.LlapContext@690c5838,Map(table -> ods_order.cc_customer, url -> jdbc:hive2://node-01.hdp.wiseda.com.cn:10500))[] (state=,code=0)
The log messages from the thriftserver are in the attached thriftserver-err-msg.txt.
05-31-2019
08:38 AM
Hello, my environment is HDP 3.0, Spark 2.3.1 on Scala 2.11, Hive 3.0, with Kerberos enabled. I followed the steps mentioned above, connected to the Spark Thrift Server, and executed the SQL: explain select * from tb1. The physical plan I got shows HiveTableScan / HiveTableRelation with org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe instead of LlapRelation, so it seems that LLAP does not work. P.S. I am using the package spark-llap_2-11-1.0.2.1-assembly.jar.
04-01-2017
12:35 AM
3 Kudos
Cloudbreak is a popular, easy-to-use HDP component for cluster deployment on various cloud environments including Azure, AWS, OpenStack and GCP. This article shows how to create an Azure application for Cloudbreak using the Azure CLI. Note: To do this, you need the "Owner" role on your Azure subscription; "Developer" and other roles are not enough.
Download and install the Azure CLI using the instructions provided here: https://docs.microsoft.com/en-us/cli/azure/install-azure-cli. CLI versions are available for Windows, macOS and Linux.
Type "az" to make sure the CLI is available and in your command path. Login to your Azure account in your web browser, and then also login from your command line: az login
To sign in, use a web browser to open the page https://aka.ms/devicelogin and enter the code HPBCSXTPJ to authenticate.
Follow the instructions on the web page. When done, you will see confirmation on the command line that your login was successful. Then run the following command. You can freely choose the values to enter here, including dummy URIs: the identifier URI and the homepage are never used on Azure, but they are required. Also make sure the identifier URI is unique within your subscription, so instead of "mycbdapp" you may want to choose a more descriptive name.
URIs are dummy, never used, but required:
az ad app create --identifier-uris http://mycbdapp.com --display-name mycbdapp --homepage http://mycbdapp.com
Ignore the output of this command, including the appId; that's not the one we need! Choose your password, and run the following command:
az ad sp create-for-rbac --name "mycbdapp" --password "mytopsecretpassword" --role Owner
{
"appId": "c19a48f3-492f-a87b-ac4a-b1d8e456f14e",
"displayName": "mycbdapp",
"name": "http://mycbdapp",
"password": "mytopsecretpassword",
"tenant": "891fd956-21c9-4c40-bfa7-ab88c1d8364c"
}
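If you also need your subscription ID and tenant ID for the Cloudbreak credential form below, one way to look them up is the following sketch (assuming Azure CLI 2.x; the --query expression is just JMESPath and can be adjusted):
az account show --query "{subscriptionId:id, tenantId:tenantId}" --output json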
Now log in to your Cloudbreak instance, select "manage credentials", then "+ create credential", and on the "Configure credential" page select Azure and fill in the form as shown on the screenshot. Use the appId, password, and tenant ID from the output above. Add your Azure subscription ID, and paste the public key of the SSH key pair you created before (this will be used to provide SSH access to the cluster machines for the "cloudbreak" user). Then proceed by providing the other settings, and enjoy HDP on Cloudbreak!
02-24-2017
05:05 AM
+1 for a nice article! I had to add "library(ggplot2)" in steps 4 and 6, which provides the ggplot function.
09-27-2016
09:54 AM
2 Kudos
hcc-58591.zip
Hive RegexSerDe can be used to extract columns from the input file using regular expressions. It is used only to deserialize data; data serialization is not supported (and obviously not needed). The initial motivation to create such a SerDe was to process Apache web logs. There are two classes available:
org.apache.hadoop.hive.contrib.serde2.RegexSerDe, introduced in Hive 0.4 by HIVE-662, and
org.apache.hadoop.hive.serde2.RegexSerDe, a built-in class introduced in Hive 0.10 by HIVE-1719.
The former is kept to facilitate easier migration for legacy apps, while the latter is recommended for new apps.
The SerDe works by matching columns in the table definition with regex groups defined and captured by the regular expression. A regex group is defined by parentheses "(...)" inside the regex. Note that this is one of the common mistakes made by beginners, who spend time creating great regular expressions but then misplace or fail to mark the regex groups.
The new, built-in version supports the following primitive column types: TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, STRING, BOOLEAN and DECIMAL, in contrast to the contrib version, which supported only the STRING column type. The number of columns in the table definition and the number of regex groups must match, otherwise a warning is printed and the table is not populated. On individual lines, if a row matches the regex but has fewer than the expected groups, the missing groups and table fields will be NULL. If a row matches the regex but has more than the expected groups, the additional groups are simply ignored. If a row doesn't match the regex at all, then all fields will be NULL. The regex is provided as a required SerDe property called "input.regex".
Another supported property is "input.regex.case.insensitive", which can be "true" or "false" (default), while "output.format.string", supported by the contrib version, is not supported any more. As an example, consider a tab-separated text input file composed of 5 fields: id int, city_org string, city_en string, country string, ppl float, and suppose we'd like to create a table using only 3 of those 5 fields, namely id, city_org, and ppl, meaning that we'd like to ignore the 3rd and 4th columns. (Of course we can do the same using a view, but for the sake of the discussion let's do it using RegexSerDe.) We can define our table as:
$ hdfs dfs -mkdir -p hive/serde/regex
$ hdfs dfs -put allcities.utf8.tsv hive/serde/regex
hive> CREATE EXTERNAL TABLE citiesr1 (id int, city_org string, ppl float) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' WITH SERDEPROPERTIES ('input.regex'='^(\\d+)\\t([^\\t]*)\\t\\S+\\t\\S+\\t(\\d++.\\d++).*') LOCATION '/user/it1/hive/serde/regex';
Note that the regex contains 3 regex groups capturing the first, second and fifth field on each line, corresponding to the 3 table columns:
(\\d+), the leading integer id composed of 1 or more digits,
([^\\t]*), a string: everything except tab, positioned between the 2nd and 3rd delimiting tabs. If we knew that the column contains no spaces we could also use "\\S+"; in our example this is not the case (however, we are making such an assumption about the 3rd and 4th fields), and
(\\d++.\\d++), a float with at least 1 digit before and after the decimal point.
Input sample (files used in the examples are available in the attachment):
110 La Coruña Corunna Spain 0.37
112 Cádiz Cadiz Spain 0.4
120 Köln Cologne Germany 0.97
hive> select * from citiesr1 where id>100 and id<121;
110 La Coruña 0.37
112 Cádiz 0.4
120 Köln 0.97
Now, let's consider a case when some fields are missing in the input file, and we attempt to read it using the same regex used for the table above:
$ hdfs dfs -mkdir -p hive/serde/regex2
$ hdfs dfs -put allcities-flds-missing.utf8.tsv hive/serde/regex2
hive> CREATE EXTERNAL TABLE citiesr2 (id int, city_org string, ppl float) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' WITH SERDEPROPERTIES ('input.regex'='^(\\d+)\\t([^\\t]*)\\t\\S+\\t\\S+\\t(\\d++.\\d++).*') LOCATION '/user/it1/hive/serde/regex2';
Input sample:
2<tab>大阪<tab>Osaka<tab><tab>
31<tab>Якутск<tab>Yakutsk<tab>Russia
121<tab>München<tab>Munich<tab><tab>1.2
On lines 1 and 3 we have 5 fields, but some are empty, while on the second line we have only 4 fields and 3 tabs. If we attempt to read the file using the regex given for table citiesr1, we'll end up with all NULLs on these 3 lines because the regex doesn't match them. To rectify the problem we can change the regex slightly to allow for such cases:
hive> CREATE EXTERNAL TABLE citiesr3 (id int, city_org string, ppl float) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' WITH SERDEPROPERTIES ('input.regex'='^(\\d+)\\t([^\\t]*)\\t[^\\t]*\\t[^\\t]*[\\t]*(.*)') LOCATION '/user/it1/hive/serde/regex2';
The first 2 groups are unchanged, however we have replaced both "\\S+" patterns for the unused columns with [^\\t]*, the last delimiting tab is now optional, and the last group is now "(.*)", meaning everything after the last tab, including the empty string. With these changes, the above 3 lines become:
hive> select * from citiesr3 where id in (2, 31, 121);
2 大阪 NULL
31 Якутск NULL
121 München 1.2
The real power of RegexSerDe is that it can operate not only on delimiter boundaries, as shown above, but also inside individual columns. Besides processing web logs and extracting desired fields and patterns from the input file, another common use case of RegexSerDe is to read files with multi-character field delimiters, because "FIELDS TERMINATED BY" doesn't support them. (However, since Hive 0.14 there is also a contributed MultiDelimitSerDe which supports multi-char delimiters; a sketch is shown below.)
Note: All tests were done on an HDP-2.4.0 cluster running Hive-1.2.1.
Related questions: regex pattern for hive regex serde
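For comparison with the RegexSerDe examples above, a minimal MultiDelimitSerDe sketch (the table name, location and the "||" delimiter are hypothetical; in this Hive version the class lives in hive-contrib):
hive> CREATE EXTERNAL TABLE cities_md (id int, city_org string, ppl float) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe' WITH SERDEPROPERTIES ('field.delim'='||') LOCATION '/user/it1/hive/serde/multidelim';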
01-05-2017
01:54 PM
Hi Predrag, we are using MultiDelimitSerDe, which as far as I understand is built on top of LazySimpleSerDe, and it looks like the serialization.encoding parameter does not have any effect. The file encoding is ISO-8859 text; whatever encoding I put into SERDEPROPERTIES has no effect. Do you know what might be the issue? We are using Hortonworks HDP 2.5.0.0 and the table DDL is as follows:
CREATE EXTERNAL TABLE IF NOT EXISTS INVOICES_1 (
list of columns)
PARTITIONED BY (
columns)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
WITH SERDEPROPERTIES ( "field.delim"="||", "serialization.encoding"="ISO8859_1")
LOCATION 'file/location'
tblproperties("skip.header.line.count"="1"); Regards, dalibor
05-19-2016
04:04 AM
Hi @Stephen Redmond, sorry I missed your comment. No, I haven't done tests with compression; I'll let you know if I find something. Also, you can file a question on HCC, copying your comment, to get wider attention. Thanks.
02-10-2016
11:12 AM
5 Kudos
I just completed my first Express Upgrade (EU) using Ambari 2.2.0, from HDP-2.2.8 to HDP-2.3.4, and here are my observations and the issues I encountered. The cluster has 12 nodes, 2 masters and 10 workers, with NameNode HA and ResourceManager HA configured, running on RHEL 6.5 using Java 7. Installed Hadoop components: HDFS, MR2, YARN, Hive, Tez, HBase, Pig, Sqoop, Oozie, ZooKeeper, and Ambari Metrics. About 2 weeks before this EU, the cluster was upgraded from HDP-2.1.10 and Ambari-1.7.1. Please use this only as a reference: depending on cluster settings and previous history (previous upgrade or fresh install), the issues will differ, and the problems I had should by no means be considered representative or expected during every EU.
It's a good idea to back up all cluster-supporting databases in advance; in my case Ambari, the Hive metastore, Oozie and Hue (although Hue cannot be upgraded by Ambari). There is no need to prepare or download an HDP.repo file in advance: Ambari will create the file, now called HDP-2.3.4.repo, and distribute it to all nodes. The upgrade consists of registering a new HDP version, installing that new version on all nodes, and after that starting the upgrade.
After starting the upgrade, Ambari found that we could also do a Rolling Upgrade by enabling yarn.timeline-service.recovery.enabled (currently false), but instead we decided to do the Express Upgrade (EU). There was only one warning for EU: that some *-env.sh files will be overwritten. That was fine, however I backed up all those files for easier comparison with the new files after the upgrade.
The upgrade started well, and everything was looking great: ZooKeeper, HDFS NameNodes and DataNodes, and ResourceManagers were all successfully upgraded and restarted. And then, when it looked like it would be effortless on my part, there was the first setback: the NodeManagers, all 6 of them, could not start after the upgrade. Before starting the upgrade I had chosen to ignore all failures (on both master and worker components), so I decided to keep going and fix the NMs later. Upgrade and restart of MapReduce2 and HBase was successful, and then the upgrade wizard tried to run service checks on the components upgraded up to that point. As expected, ZK, HDFS and HBase were successful, but the YARN and MR2 tests failed.
At that point I decided to see whether I could fix the NMs. A cool feature of EU is that one can pause the upgrade at any time, inspect the Ambari dashboard, do manual fixes and resume the EU when ready. Back to the failed NMs: the wizard log was just saying (for every NM) that it cannot find it in the list of started NMs, which was not very helpful. So I checked the log on one of the NMs; it was saying:
2016-02-05 13:16:52,503 FATAL nodemanager.NodeManager (NodeManager.java:initAndStartNodeManager(540)) - Error starting NodeManager
org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 2 missing files; e.g.: /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/000035.sst
And indeed, in that directory I had 000040.sst but sadly no 000035.sst. I realized that this is my yarn.nodemanager.recovery.dir, and because YARN NM recovery was enabled, the NM tried to recover its state to the one before it was stopped. All our jobs were stopped and we didn't care about recovering NM states, so after backing up the directory I decided to delete all files in it and try to start the NM manually (a sketch of those steps follows below). Luckily, that worked! The command to start a NM manually, as done by Ambari, run as the yarn user:
$ ulimit -c unlimited; export HADOOP_LIBEXEC_DIR=/usr/hdp/current/hadoop-client/libexec && /usr/hdp/current/hadoop-yarn-nodemanager/sbin/yarn-daemon.sh --config /usr/hdp/current/hadoop-client/conf start nodemanager
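Roughly what that clean-up looked like (the path is my yarn.nodemanager.recovery.dir from the error above; the yarn:hadoop ownership is an assumption, check your own environment before copying this):
$ mv /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state.bak
$ mkdir -p /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state
$ chown yarn:hadoop /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state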
After that the EU was smooth: it upgraded Hive, Oozie, Pig, Sqoop and Tez and passed all service checks. At the very end one can finalize the upgrade, or "Finalize later". I decided to finalize later and inspect the cluster. I noticed that the ZKFCs were still running on the old version 2.2.8 and tried to restart HDFS, hoping that the ZKFCs would be started using the new version. They weren't, and on top of that I couldn't start the NNs! I realized that because the HDFS upgrade was not finalized I needed the "-rollingUpgrade started" flag, so I started the NNs manually, as the hdfs user (note: this is only required if you want to restart the NNs before finalizing the upgrade):
$ ulimit -c unlimited; /usr/hdp/current/hadoop-client/sbin/hadoop-daemon.sh --config /usr/hdp/current/hadoop-client/conf start namenode -rollingUpgrade started
After finalizing the upgrade and restarting HDFS, everything was running on the new HDP version. In addition, I did the following checks to make sure the old version is not used any more:
hdp-select status | grep 2\.2\.8 — returns nothing
ls -l /usr/hdp/current | grep 2\.2\.8 — returns nothing
ps -ef | grep java | grep 2\.2\.8 — returns nothing, or only processes not related to HDP
After finalizing the upgrade, the Oozie service check was failing. I realized that the Oozie share lib in HDFS is now in /user/oozie/share/lib_20160205182129, where the date/time in the directory name is derived from the time of creation. However, permissions were insufficient: all jars had 644 permissions instead of 755. So, as the hdfs user, I changed the permissions and after that the Oozie service check was all right:
$ hdfs dfs -chmod -R 755 /user/oozie/share/lib_20160205182129
The Pig service check was also failing. I found that pig-env.sh was wrong, still having HCAT_HOME, HIVE_HOME, PIG_CLASSPATH and PIG_OPTS pointing to jars in the now non-existent /usr/lib/hive and /usr/lib/hive-hcatalog directories. I commented out everything, leaving only: JAVA_HOME=/usr/lib/jvm/jre-1.7.0-openjdk.x86_64
HADOOP_HOME=${HADOOP_HOME:-/usr}
if [ -d "/usr/lib/tez" ]; then
PIG_OPTS="$PIG_OPTS -Dmapreduce.framework.name=yarn"
fi
I also fixed templeton.libjars, which got scrambled during the upgrade:
templeton.libjars=/usr/hdp/${hdp.version}/zookeeper/zookeeper.jar,/usr/hdp/${hdp.version}/hive/lib/hive-common.jar
At this point all service checks were successful, and additional tests running Pi, Teragen/Terasort, and simple Hive and Pig jobs were completing without issues. And so, my first EU was over! Despite these minor setbacks it was much faster than doing it all manually. Give it a try when you have a chance.
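For reference, the kind of Pi smoke test I mean (the examples jar path shown is the usual HDP layout; adjust it to your cluster):
$ yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar pi 10 100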