Member since: 10-07-2015
Posts: 21
Kudos Received: 32
Solutions: 3
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 3885 | 03-16-2017 02:27 PM |
| | 3603 | 12-22-2016 10:41 PM |
| | 747 | 10-29-2015 01:39 AM |
10-19-2017
06:34 PM
1 Kudo
Bucketing is a tricky feature. Users sometimes bucket their tables believing it will solve all performance problems, and are subsequently disappointed. Be confident you understand all the implications of bucketing before you use it, and only use it if other approaches do not work. There are two major reasons to use bucketing:

1) The bucket join. If two large tables are "compatible" in terms of identical bucketing keys and compatible numbers of buckets, an optimization can kick in to accelerate the join. All other joins proceed as normal, and since bucketing creates additional files it can harm their performance. Consider this only when you have an extremely expensive join and the problem cannot be addressed any other way.

2) Bucket pruning. Bucket pruning was added in Hive 2.X; if you are using Hive 1.X there is no bucket pruning. First, understand that ORCFile has inline indexes that allow large portions of a file to be skipped if it is known the records cannot match (e.g. in a point lookup). If your table is sorted by customer ID and your point lookup is based on customer ID, minimal I/O is needed to find matching records in ORC and bucketing is not helpful. Bucketing is appropriate when you bucket by one key and sort by another, commonly a timestamp. If your lookup key and sort key do not match, bucket pruning (in Hive 2.X) can help point lookup queries.

Generally speaking, I tell users that bucketing is an optimization of last resort. Usually there are better ways to get what you want.
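To make the terminology concrete, here is a minimal sketch of bucketed-table DDL in the spirit of the customer ID / timestamp example above; the table and column names are invented for illustration:
-- hypothetical table: bucketed by the lookup key, sorted by a timestamp
create table web_events (
  event_time timestamp,
  customer_id bigint,
  url string
)
clustered by (customer_id) sorted by (event_time) into 32 buckets
stored as orc;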
07-11-2017
10:26 PM
1 Kudo
If you are using Hive 2 or later (including Hive LLAP), you no longer need the dummy table; statements like INSERT INTO TABLE test_array SELECT 1, array('a','b'); will work.
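A minimal end-to-end sketch, assuming a table definition that matches the insert above (the column names are my own):
create table test_array (id int, vals array<string>);
insert into table test_array select 1, array('a','b');
select * from test_array;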
07-07-2017
03:36 PM
5 Kudos
Hybrid Procedural SQL On Hadoop (HPL/SQL) is a tool that implements procedural SQL for Hive. Lately, many people have investigated running HPL/SQL on HDP 2.6 and have encountered problems. These instructions tell you how to work around these problems so that you can experiment with HPL/SQL. In HDP 2.6, HPL/SQL is considered a "technical preview" and is not supported by Hortonworks support subscriptions. There are known limitations in HPL/SQL that may make it unsuitable for your needs, so test thoroughly before you decide HPL/SQL is right for you. These instructions require cluster changes which are not appropriate for production clusters and should only be done on development clusters or sandboxes. We'll cover 2 ways of using HPL/SQL:
1. Using HiveServer Interactive (preferred)
2. Using an embedded metastore

With either approach, you need to edit /etc/hive2/conf/hive-env.sh and change line 30 from:
export HIVE_CONF_DIR=/usr/hdp/current/hive-server2-hive2/conf/conf.server
to:
export HIVE_CONF_DIR=${HIVE_CONF_DIR:-/usr/hdp/current/hive-server2-hive2/conf/conf.server}
Again, do not do this on a production cluster. Note that hive-env.sh is overwritten every time you restart HiveServer Interactive, so this modification will need to be repeated before HPL/SQL can be used again.

Option 1 (Preferred): Using HPL/SQL with HiveServer Interactive
First, start HiveServer Interactive through Ambari and edit hive-env.sh as described above. Then place this configuration into /usr/hdp/current/hive-server2-hive2/conf/hplsql-site.xml:
<configuration>
<property>
<name>hplsql.conn.default</name>
<value>hiveconn</value>
</property>
<property>
<name>hplsql.conn.hiveconn</name>
<value>org.apache.hive.jdbc.HiveDriver;jdbc:hive2://ambari.example.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-hive2</value>
</property>
<property>
<name>hplsql.conn.convert.hiveconn</name>
<value>true</value>
</property>
</configuration>
You will need to replace the value of hplsql.conn.hiveconn with the HiveServer2 Interactive JDBC URL shown on the Hive service page in Ambari. Proceed to the Validation Phase below.

Option 2: Using an Embedded Metastore
To use an embedded metastore, HPL/SQL clients need access to the database backing the metastore (e.g. MySQL), so they need a hive-site.xml that contains credentials for that database. Ambari sets up two hive-site.xml files: one without passwords in /etc/hive2/conf, and one with passwords in /etc/hive2/conf/conf.server that is only visible to certain users. You will need the one with credentials.
Because of this security problem, use this approach only if you can't use HiveServer for some reason.
Run these commands to clone the Hive configuration, including passwords:
sudo cp -r /etc/hive2/conf/conf.server conf
sudo chmod -R 755 conf
sudo cp /etc/hive2/2.6.1.0-129/0/hive-env.sh conf
Edit conf/hive-site.xml and change the value of hadoop.security.credential.provider.path to jceks://file/home/vagrant/conf/hive-site.jceks, then point your environment at the cloned configuration:
export HIVE_CONF_DIR=/home/vagrant/conf
(You will need to substitute your actual path here.) Finally, place this configuration in /home/vagrant/conf/hplsql-site.xml (again, substituting your actual path):
<configuration>
<property>
<name>hplsql.conn.default</name>
<value>hiveconn</value>
</property>
<property>
<name>hplsql.conn.hiveconn</name>
<value>org.apache.hive.jdbc.HiveDriver;jdbc:hive2://</value>
</property>
<property>
<name>hplsql.conn.convert.hiveconn</name>
<value>true</value>
</property>
</configuration>
If you decided to look at the embedded metastore route, hopefully you read these instructions and decided the HiveServer Interactive route is a better choice.

Validation Phase:
To confirm your setup, run:
/usr/hdp/current/hive-server2-hive2/bin/hplsql -e 'select "hello world";'
If your setup is correct you will see hello world printed to your console. For more information, HPL/SQL includes excellent documentation (http://www.hplsql.org/doc); consult it for most questions.
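Once hello world works, you can try a small procedural script to see what HPL/SQL adds over plain HiveQL. This is only an illustrative sketch (the file name is arbitrary); see the documentation above for the full language reference:
-- hello_loop.sql
FOR i IN 1..3 LOOP
  PRINT 'iteration ' || i;
END LOOP;
Run it with:
/usr/hdp/current/hive-server2-hive2/bin/hplsql -f hello_loop.sql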
07-01-2017
03:53 PM
2 Kudos
This failure to start HiveServer Interactive / Hive LLAP is due to a known problem in Ambari 2.5 where certain keytab files are not generated if you enable LLAP after your cluster is Kerberized. The Ambari Kerberos wizard generates keytabs for all services that are present at the time the wizard is run. If HiveServer Interactive is not enabled when the wizard runs, certain essential keytabs will not be present when you later try to enable HiveServer Interactive / LLAP, nor are they generated at that time. There are two options for resolving this problem:
1. Regenerate keytabs using the Ambari Kerberos wizard; refer to the Ambari documentation for this process.
2. On all cluster nodes, copy hive.service.keytab to hive.llap.zk.sm.keytab. If your keytabs are stored in the default location:
cp /etc/security/keytabs/hive.service.keytab /etc/security/keytabs/hive.llap.zk.sm.keytab
07-01-2017
03:44 PM
The maximum number of mappers is bounded by the number of splits calculated at split generation time. These settings impact split calculation:
mapreduce.input.fileinputformat.split.minsize
mapreduce.input.fileinputformat.split.maxsize
Splits are then grouped at the Tez layer based on these settings:
tez.grouping.min-size
tez.grouping.max-size
If you want more mappers you can tune all of these settings down. Note that this will not guarantee lower latency, especially on small clusters.
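For example, to push the split and grouping sizes down you could set something like the following in your Hive session before running the query; the 64 MB / 16 MB values are purely illustrative, not recommendations:
set mapreduce.input.fileinputformat.split.maxsize=67108864;
set tez.grouping.min-size=16777216;
set tez.grouping.max-size=67108864;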
07-01-2017
03:39 PM
Frédéric, can you create a new Apache JIRA ticket with your DDL and view definition? HIVE-15970 looks unrelated, given that it applies to multi-insert, which is only used for MERGE.
07-01-2017
03:17 PM
The right way to think about LATERAL VIEW is that it allows a table-generating function (UDTF) to be treated as a table source, so that it can be used like any other table, including selects, joins and more.
LATERAL VIEW is often used with explode, but explode is just one UDTF of many; a full list is available in the documentation.
To take an example:
select tf1.*, tf2.*
from (select 0) t
lateral view explode(map('A',10,'B',20,'C',30)) tf1
lateral view explode(map('A',10,'B',20,'C',30)) tf2;
This results in:
| tf1.key | tf1.value | tf2.key | tf2.value |
|---|---|---|---|
| A | 10 | A | 10 |
| A | 10 | B | 20 |
| A | 10 | C | 30 |
| B | 20 | A | 10 |
(5 rows were truncated)
The thing to see here is that this query is a cross product join between the tables tf1 and tf2. The LATERAL VIEW syntax allowed me to treat them as tables. The original question used "AS" syntax, which automatically maps the generated table's columns to column aliases. In my view it is much more powerful to leave them as tables and use their fully qualified table correlation identifiers.
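For comparison, here is a sketch of the AS form, which maps the generated columns to explicit aliases (k and v are names I chose for illustration):
select tf1.k, tf1.v
from (select 0) t
lateral view explode(map('A',10,'B',20,'C',30)) tf1 as k, v;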
These tables can be used in joins as well:
select tf1.*, tf2.*
from (select 0) t
lateral view explode(map('A',10,'B',20,'C',30)) tf1
lateral view explode(map('A',10,'B',20,'C',30)) tf2 where tf1.key = tf2.key;
Now we get:
| tf1.key | tf1.value | tf2.key | tf2.value |
|---|---|---|---|
| A | 10 | A | 10 |
| B | 20 | B | 20 |
| C | 30 | C | 30 |
06-30-2017
01:29 PM
For those looking for an easy graphical tool, Hive View 2.0 (included with Ambari 2.5 and up) can display table- and column-level stats, and compute them if they are missing. For more info see https://hortonworks.com/blog/3-great-reasons-to-try-hive-view-2-0/ Note that column stats are listed under table stats, and you can see each individual column's statistics there.
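If you would rather compute the same statistics from SQL, a minimal sketch (the table and column names are made up):
analyze table web_events compute statistics;
analyze table web_events compute statistics for columns customer_id, event_time;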
05-15-2017
02:36 PM
If you're looking for a standalone tool to convert CSV to ORC, have a look at https://github.com/cartershanklin/csv-to-orc It's a standalone Java tool that can run anywhere, including off of your Hadoop cluster. It supports custom null strings, row skipping, and basic Hive types (no complex types currently).
03-16-2017
02:27 PM
1 Kudo
Add these repositories to your pom.xml or to your Maven settings to resolve HDP dependencies:
<repository>
<releases>
<enabled>true</enabled>
</releases>
<snapshots>
<enabled>true</enabled>
</snapshots>
<id>hortonworks.extrepo</id>
<name>Hortonworks HDP</name>
<url>http://repo.hortonworks.com/content/repositories/releases</url>
</repository>
<repository>
<releases>
<enabled>true</enabled>
</releases>
<snapshots>
<enabled>true</enabled>
</snapshots>
<id>hortonworks.other</id>
<name>Hortonworks Other Dependencies</name>
<url>http://repo.hortonworks.com/content/groups/public</url>
</repository>
The first of these is for the main HDP artifacts. The second is needed for dependencies like jetty-sslengine.jar:6.1.26.hwx.
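You normally pick up jetty-sslengine transitively from the HDP Hadoop artifacts, but as an illustration a direct declaration would look roughly like this; the groupId is my assumption based on the standard Jetty 6 coordinates, so verify it against the repository:
<dependency>
  <groupId>org.mortbay.jetty</groupId>
  <artifactId>jetty-sslengine</artifactId>
  <version>6.1.26.hwx</version>
</dependency>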
12-22-2016
10:41 PM
1 Kudo
With HDP 2.5 this is supported. This tutorial shows connecting Tableau to Phoenix end-to-end: http://hortonworks.com/hadoop-tutorial/bi-apache-phoenix-odbc/
12-22-2016
10:17 PM
2 Kudos
This error usually indicates you have defined aux jars in hive-site.xml. For now (HDP 2.5 and below), aux jars need to be set in the client rather than as a server-side property when using Hive on Tez. There is an improvement request tracking this.
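For example, one client-side option is to add the jar in your Beeline or Hive CLI session before running the query that needs it (the jar path here is made up):
add jar /tmp/my-custom-udfs.jar;
list jars;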
09-13-2016
08:50 AM
4 Kudos
If you're not using Tez UI to debug your Hive queries, you need to start today. Ambari installs Tez UI by default, but some people need a standalone option. Tez UI is easy to install and can go anywhere, meaning there's no reason for you to wonder why that Hive query won't run faster.

Why You Should Use Tez UI:
Whether in Ambari or standalone, here are some of the things you're missing if you don't use Tez UI:
1. An inventory of all queries executed, with search and filter.
2. The ability to drill into a query and see the actual query text. Great if a BI tool is generating long-running queries.
3. The vertex swimlane. This one is new in HDP 2.5 and helps you quickly pinpoint the long-running parts of your query and optimize them.

How To Install Tez UI Standalone:
The Tez UI installation instructions give the details on installing Tez UI, but it can be a bit hard to put the pieces together. In this how-to we will cover installing Tez UI into Apache httpd on an HDP 2.5 installation, assuming a CentOS base OS. The installation instructions don't change much for different web servers or for app servers like Tomcat.
1. Install Tez and httpd: yum install httpd tez
2. Create a directory for the UI: mkdir /var/www/html/tez-ui
3. Change into this new directory: cd /var/www/html/tez-ui
4. Extract the Tez UI WAR into it: unzip /usr/hdp/current/tez-client/ui/tez-ui-0.7.0.2.5.0.0-1245.war
5. Open /var/www/html/tez-ui/scripts/configs.js in a text editor and modify these values (in the build shown in the example below, the file is config/configs.env and the keys are timeline and rm). Set timelineBaseUrl to your Application Timeline Server (ATS) endpoint, usually something like http://ats.example.com:8188. Set RMWebUrl to your Resource Manager (RM) endpoint, usually something like http://rm.example.com:8088.

Configuring Tez:
You need to tell Tez to log its history to ATS using the logging service class. You should also tell Tez where the UI is located, so it can cross-link to the Tez UI from within the Resource Manager UI. All together, these are needed in your tez-site.xml:
<property>
<description>Enable Tez to use the Timeline Server for History Logging</description>
<name>tez.history.logging.service.class</name>
<value>org.apache.tez.dag.history.logging.ats.ATSHistoryLoggingService</value>
</property>
<property>
<description>URL for where the Tez UI is hosted</description>
<name>tez.tez-ui.history-url.base</name>
<value>http://httpd.example.com:80/tez-ui/</value>
</property>
That's it. Now you can load Tez UI in your browser.

Example:
Here's an example configs.env. Notice that timeline and rm have both been modified from the original:
[vagrant@hdp250 scripts]$ pwd
/var/www/html/tez-ui/config
[vagrant@hdp250 scripts]$ cat configs.env
/**
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
ENV = {
hosts: {
/*
* Timeline Server Address:
* By default TEZ UI looks for timeline server at http://localhost:8188, uncomment and change
* the following value for pointing to a different address.
*/
timeline: "http://hdp250.example.com:8188",
/*
* Resource Manager Address:
* By default RM REST APIs are expected to be at http://localhost:8088, uncomment and change
* the following value to point to a different address.
*/
rm: "http://hdp250.example.com:8088",
/*
* Resource Manager Web Proxy Address:
* Optional - By default, value configured as RM host will be taken as proxy address
* Use this configuration when RM web proxy is configured at a different address than RM.
*/
//rmProxy: "http://localhost:8088",
},
/*
* Time Zone in which dates are displayed in the UI:
* If not set, local time zone will be used.
* Refer http://momentjs.com/timezone/docs/ for valid entries.
*/
//timeZone: "UTC",
/*
* yarnProtocol:
* If specified, this protocol would be used to construct node manager log links.
* Possible values: http, https
* Default value: If not specified, protocol of hosts.rm will be used
*/
//yarnProtocol: "<value>",
};

CORS:
On a multi-node cluster, chances are you will need to enable CORS in ATS to properly view data. If you connect and see an error about CORS not being enabled, see the Apache documentation for the relevant configurations you will need to set; most importantly, you will need to enable yarn.timeline-service.http-cross-origin.enabled (see the sketch at the end of this post).

Note On Secure Clusters:
On secure clusters your client web browser will need to be configured to talk to the Kerberized cluster. Depending on your browser you may need to kinit on your client or use a plugin. A good way to confirm connectivity is to first see if your browser can connect to the Resource Manager UI. If it can, you should also be able to connect to the standalone Tez UI.
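For reference, enabling that property amounts to a yarn-site.xml change along these lines; treat this as a minimal sketch and check the Apache Timeline Server documentation for the other CORS-related properties:
<property>
  <name>yarn.timeline-service.http-cross-origin.enabled</name>
  <value>true</value>
</property>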
09-10-2016
02:49 AM
Here is a sample Maven project that handles all the dependencies and includes instructions on using it and adding your own UDFs: https://github.com/cartershanklin/sample-hiveudf
10-29-2015
01:39 AM
1 Kudo
I've never tried this approach, so think of it as a science experiment. Set up a node label, label the 20 existing hosts, create a queue that defaults to that node label, and submit Sqoop jobs to that queue alone. Your Sqoop jobs will then only run on the existing 20 nodes. You could also go narrower and have only one host do the imports. Be careful, because HDFS usage on that node will become much higher if you don't balance.
10-27-2015
04:52 PM
@amcbarnett@hortonworks.com Can you confirm you really needed the -D settings after you imported your cert into the truststore? The arguments you added are the defaults.