Member since: 07-31-2019
Posts: 346
Kudos Received: 259
Solutions: 62
My Accepted Solutions
Views | Posted |
---|---|
2960 | 08-22-2018 06:02 PM |
1709 | 03-26-2018 11:48 AM |
4258 | 03-15-2018 01:25 PM |
5100 | 03-01-2018 08:13 PM |
1447 | 02-20-2018 01:05 PM |
02-07-2018
04:15 PM
HDP 2.6.4 is not supported on Windows. Linux, especially CentOS 7, is perfect. You can still run some Hadoop experiments on Windows in a VM or in Docker (https://hortonworks.com/tutorial/sandbox-deployment-and-install-guide/). Check out Docker Hub: https://hub.docker.com/r/hortonworks/ambari-server/ https://hub.docker.com/u/hortonworks/
02-06-2018
06:26 PM
Hi @PJ, the honest truth is there is no good reason not to use the ORC format. You can use another format like Parquet, but it won't provide ACID, the LLAP cache, or the same level of performance. I would say the decision is similar to not using indexes in a relational system or not running statistics. ORC is simply best practice for high-performance data warehousing in Hive. Keep in mind that LLAP will allow you to cache raw text files. This may be an option if you have a strict SLA that prevents you from incurring the delay of converting the text files to ORC.
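If you do convert, the usual route is a CREATE TABLE ... STORED AS ORC AS SELECT statement. Here is a minimal sketch that only builds that HiveQL string. The helper name and the table names `sales_text` and `sales_orc` are hypothetical, and you would run the resulting statement through beeline or whatever client you normally use.

```python
def orc_ctas(source_table: str, target_table: str) -> str:
    """Build a HiveQL CTAS statement that copies a table into ORC format."""
    return (
        f"CREATE TABLE {target_table} STORED AS ORC AS "
        f"SELECT * FROM {source_table};"
    )


# Hypothetical table names, for illustration only.
print(orc_ctas("sales_text", "sales_orc"))
```

The helper does no validation or quoting, so treat it as a template rather than something to feed untrusted input.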
08-14-2017
02:43 PM
5 Kudos
Many organizations still ask the question, “Can I run BI (Business Intelligence) workloads on Hadoop?” These workloads range from short, low-latency ad-hoc queries to canned or operational reporting. The primary concerns center on user experience. Will a query take too long to return an answer? How quickly can I change my mind about a report and drill down into other dimensional attributes? For almost 20 years vendors have engineered highly customized solutions to solve these problems. Often these solutions require fine-tuned appliances that tightly integrate hardware and software in order to squeeze out every last drop of performance. The challenges with these solutions are mainly cost and maintenance: they become cost-prohibitive at scale and require large teams to manage and operate. The ideal solution is one that scales affordably but retains the same performance advantages as your appliance. Your analysts should not see the difference between the costly appliance and the more affordable solution. Hadoop is the solution, and this article aims to dispel the myth that BI workloads cannot run on Hadoop by pointing to the solution components.

When I talk to customers, the first thing they say when asking about SQL workloads on Hadoop is that Hive is slow. This has largely to do with both competitors' FUD and the history of Hive. Hive grew up as a batch SQL engine because the early use cases were only concerned with providing SQL access to MapReduce so that users would not need to know Java. Hive was seen as a way to increase the use of a cluster over a larger user base. It really wasn’t until the Hortonworks Stinger initiative that a serious effort was made to make Hive into a faster query tool. The two main focuses of the Stinger effort were the file format (ORC) and the move away from MapReduce to Tez. To be clear, no one runs Hive on MapReduce anymore. If you are, you are doing it wrong. Also, if you are running Hive queries against CSV files or other formats, then you are also doing it wrong. Here is a great primer to bookmark and make sure anyone working on Hive in your organization reads.

Tez certainly did not alleviate the confusion. Tez got Hive in the race but not across the finish line. Tez provided Hive with a more interactive querying experience over large sets of data, but what it did not provide is good query performance for the typical ad-hoc, drill-down querying we see in most BI reporting. Due to the manner in which Tez and YARN spin containers up and down, and because containers are allocated on a per-job basis, there were limiting performance factors as well as concurrency issues.

Hortonworks created LLAP to solve these problems. Many customers are confused by LLAP because they think it is a replacement for Hive. A better way to think about it is to look at Hive as the query tool (the tool that lets you use the SQL language) and LLAP as the resource manager for your query execution. Business users do not need to change anything to use LLAP. You simply connect to the HiveServer2 instance that has LLAP enabled (you can use ODBC, JDBC, or the Hive View) and you are on your way.

The primary design purpose of LLAP was to provide fast performance for ad-hoc querying over semi-large datasets (1TB-10TB) using standard BI tools such as Tableau, Excel, MicroStrategy, or Power BI. In addition to performance, because of the manner in which LLAP manages memory and utilizes Slider, LLAP also provides a high level of concurrency without the cost of container startups.

In summary, you can run ad-hoc queries today on HDP by using Hive with LLAP:
Geisinger Teradata offload: https://www.youtube.com/watch?v=UzgsczrdWbg
Comcast SQL benchmarks: https://www.youtube.com/watch?v=dS1Ke-_hJV0

Your company can now begin offloading workloads from your appliances and running those same queries on HDP. In the next articles I will address the other components for BI workloads: ANSI compliance and OLAP. For more information about Hive, feel free to check out the following book: https://github.com/Apress/practical-hive
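
To make the "simply connect" step a little more concrete, here is a small sketch that builds a JDBC URL for a HiveServer2 instance with LLAP enabled. The `jdbc:hive2://` scheme is the standard Hive JDBC format. The hostname is hypothetical, and the default port of 10500 is an assumption based on common HDP setups for HiveServer2 Interactive, so check the actual host and port in Ambari before relying on it.

```python
def hive_jdbc_url(host: str, port: int = 10500, database: str = "default") -> str:
    """Build a Hive JDBC URL; port 10500 is an assumed default, verify it in Ambari."""
    return f"jdbc:hive2://{host}:{port}/{database}"


# Hypothetical host name, for illustration only.
print(hive_jdbc_url("llap-hs2.example.com"))
```

The resulting URL can be pasted into a BI tool's JDBC connection dialog or passed to beeline with the -u option.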
07-05-2017
05:58 PM
Good to hear. Thanks for the update! Please accept my answer if you feel like it helped. Thanks!
06-28-2017
02:17 PM
Hi @Scott Shaw, thank you so much. We will test this feature.
10-23-2018
04:44 PM
It's better not to change the properties that control statistics usage, such as hive.compute.query.using.stats. They affect how statistics are used for performance optimization and have a tremendous influence on execution plans; the statistics stored also depend on the file format. Changing a statistics property is therefore definitely not a solution. The real reason the count is not working correctly is that the statistics in Hive are out of date, so it returns 0. When a table is first created, its statistics are written with no data rows. After that, whenever data is appended or changed, Hive needs to update the statistics in the metadata. Depending on the circumstances, Hive might not update them in real time. Running the ANALYZE command recomputes the statistics so that the count works correctly.
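For reference, here is a minimal sketch of the ANALYZE statements involved. The helper name and the table name `web_logs` are hypothetical; the helper only builds the HiveQL strings, which you would then run in beeline. Partitioned tables may additionally need a PARTITION clause, which this sketch leaves out.

```python
from typing import List


def analyze_statements(table: str, with_columns: bool = True) -> List[str]:
    """Build HiveQL statements that recompute table (and optionally column) statistics."""
    statements = [f"ANALYZE TABLE {table} COMPUTE STATISTICS;"]
    if with_columns:
        # Column statistics are gathered by a separate statement.
        statements.append(f"ANALYZE TABLE {table} COMPUTE STATISTICS FOR COLUMNS;")
    return statements


# Hypothetical table name, for illustration only.
for statement in analyze_statements("web_logs"):
    print(statement)
```

After the statements have run, a plain SELECT COUNT(*) on the table should reflect the refreshed statistics.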
05-10-2017
01:24 PM
Thanks! I would be interested to learn more when you are ready to announce it.
11-28-2018
12:10 AM
Is there any way to debug the IO cache component to find out why it's not caching?
11-29-2016
08:06 PM
1 Kudo
@Dagmawi Mengistu Is "ambari-server.hostname" actually your Ambari server hostname? Can you try changing it to "*" and then retest? Something like this:
hadoop.proxyuser.ec2-user.groups = *
hadoop.proxyuser.ec2-user.hosts = *
hadoop.proxyuser.admin.groups = *
hadoop.proxyuser.admin.hosts = *
NOTE: If you are running the ambari-server daemon under the root account, then you should also add:
hadoop.proxyuser.root.groups = *
hadoop.proxyuser.root.hosts = *
Also, your error indicates that it is not able to write inside the "/user/admin/hive/job/...." directory, which indicates that you have logged in to the Ambari Hive View as the "admin" user, so you must do the following:
su -l hdfs -c "hdfs dfs -mkdir /user/admin"
su -l hdfs -c "hdfs dfs -chown admin:hdfs /user/admin"