Member since: 07-31-2019
Posts: 346
Kudos Received: 259
Solutions: 62

My Accepted Solutions
Title | Views | Posted
---|---|---
 | 2868 | 08-22-2018 06:02 PM
 | 1662 | 03-26-2018 11:48 AM
 | 4077 | 03-15-2018 01:25 PM
 | 5050 | 03-01-2018 08:13 PM
 | 1415 | 02-20-2018 01:05 PM
11-02-2017
05:29 PM
Hi @Sebastien F Hive has been documented running on 300+ PB of raw storage at Facebook. The largest cluster is 4,500+ nodes at Yahoo. Yahoo Japan was able to run 100,000 queries per hour, and LLAP ran 100 million rows/s per node. Hive on Tez scales to hundreds of PB. LLAP is meant for smaller datasets (1-10 TB), which are typical for standard BI-type workloads. That said, LLAP allows you to use SSD for its cache, so you can extend this to hundreds of TB (if you can afford that much SSD storage). Hope this helps!
09-04-2017
08:27 PM
Hi @Gnanasekaran G, is there an OVERWRITE clause in your statement? You may be running into this issue: https://issues.apache.org/jira/browse/HIVE-4605
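For reference, a minimal sketch of the OVERWRITE form in question (table and column names here are hypothetical):

```sql
-- INSERT OVERWRITE replaces the existing contents of the target
-- table (or partition) rather than appending to it.
INSERT OVERWRITE TABLE sales_summary
SELECT region, SUM(amount)
FROM sales
GROUP BY region;
```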
08-17-2017
03:12 PM
1 Kudo
@Alberto Ramon You can set up multiple HiveServer2 instances and then create a DNS entry that resolves to their IPs. You would then point the Hive View at that DNS name. See this related article on how to load-balance HiveServer2: https://community.hortonworks.com/questions/110277/load-balancing-hiveserver2-over-knox.html. This is a supported configuration.
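As a sketch of what clients would then use, connections go through the shared DNS name rather than any individual instance (the host name below is hypothetical):

```
jdbc:hive2://hs2.example.internal:10000/default
```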
08-17-2017
03:04 PM
@Bala Vignesh N V is correct. create table <new table> as select * from <external_table>; will create a new managed table with the same columns as the external table, populated with the selected rows. The external table has to be created prior to executing the CTAS.
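A minimal sketch of the pattern (table and column names are hypothetical):

```sql
-- The external table must already exist.
CREATE EXTERNAL TABLE ext_logs (id INT, msg STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/logs';

-- CTAS creates a managed table with the same columns
-- and copies the selected rows into it.
CREATE TABLE logs_copy AS SELECT * FROM ext_logs;
```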
08-15-2017
11:40 AM
4 Kudos
@Gopalakrishnan Veeran As a starting point, only Hive will provide you ACID capabilities, so if you want to perform updates, merges, or any other CDC capability, then Hive is where you want to start. A combination of Hive, LLAP, Tez, and ORC will give you the best performance with the best flexibility. LLAP will handle your ad-hoc query patterns by using a shared, distributed cache. For longer-running queries at scale, Hive with Tez has proven most reliable. In addition, Hive is the only SQL-in-Hadoop tool able to run all 99 TPC-DS queries with only trivial syntax changes. This is important when you are migrating from existing RDBMS systems. Though not quite ready for primetime, you may want to take a look at HPL/SQL http://www.hplsql.org/. We plan to begin introducing this into the product in future releases. You are also able to read text files directly with LLAP, which eliminates the need to transform the data to the ORC format, which can be time-consuming for large files.
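As a sketch of the ACID starting point mentioned above: in HDP-era Hive, a table must be declared as ORC, bucketed, and transactional before updates or merges will work (the table and column names here are hypothetical):

```sql
-- ACID operations require ORC storage, bucketing (in this era of
-- Hive), and the transactional table property.
CREATE TABLE customer_dim (id INT, name STRING)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');
```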
08-14-2017
02:43 PM
5 Kudos
Many organizations still ask the question, "Can I run BI (Business Intelligence) workloads on Hadoop?" These workloads range from short, low-latency ad-hoc queries to canned or operational reporting. The primary concerns center around user experience. Will a query take too long to return an answer? How quickly can I change my mind with a report and drill down into other dimensional attributes? For almost 20 years vendors have engineered highly customized solutions to solve these problems. Many of these solutions require fine-tuned appliances that tightly integrate hardware and software in order to squeeze out every last drop of performance. The challenges with these solutions are mainly cost and maintenance: they become cost-prohibitive at scale and require large teams to manage and operate. The ideal solution is one that scales affordably but retains the same performance advantages as your appliance. Your analysts should not see the difference between the costly appliance and the more affordable solution. Hadoop is that solution, and this article aims to dispel the myth that BI workloads cannot run on Hadoop by pointing to the solution components.

When I talk to customers, the first thing they say when asking about SQL workloads on Hadoop is that Hive is slow. This is largely due to competitor FUD as well as the history of Hive. Hive grew up as a batch SQL engine because the early use cases were only concerned with providing SQL access to MapReduce so that users would not need to know Java. Hive was seen as a way to open up a cluster to a larger user base. It really wasn't until the Hortonworks Stinger initiative that a serious effort was made to turn Hive into a faster query tool. The two main focuses of the Stinger effort were the file format (ORC) and moving away from MapReduce to Tez. To be clear, no one runs Hive on MapReduce anymore. If you are, you are doing it wrong. Also, if you are running Hive queries against CSV files or other raw formats, you are also doing it wrong. Here is a great primer to bookmark and make sure anyone working on Hive in your organization reads.

Tez certainly did not alleviate the confusion. Tez got Hive into the race but not across the finish line. Tez provided Hive with a more interactive querying experience over large sets of data, but what it did not provide is good query performance for the typical ad-hoc, drill-down querying we see in most BI reporting. Due to the manner in which Tez and YARN spin containers up and down, and how containers are allocated on a per-job basis, there were limiting performance factors as well as concurrency issues. Hortonworks created LLAP to solve these problems.

Many customers are confused by LLAP because they think it is a replacement for Hive. A better way to think about it is to view Hive as the query tool (the tool allowing you to use the SQL language) and LLAP as the resource manager for your query execution. Business users do not need to change anything to use LLAP. You simply connect to a HiveServer2 instance (via ODBC, JDBC, or the Hive View) that has LLAP enabled, and you are on your way.

The primary design purpose of LLAP was to provide fast performance for ad-hoc querying over semi-large datasets (1 TB-10 TB) using standard BI tools such as Tableau, Excel, MicroStrategy, or Power BI. In addition to performance, because of the manner in which LLAP manages memory and utilizes Slider, LLAP also provides a high level of concurrency without the cost of container startups.

In summary, you can run ad-hoc queries today on HDP by using Hive with LLAP:

Geisinger Teradata offload: https://www.youtube.com/watch?v=UzgsczrdWbg
Comcast SQL benchmarks: https://www.youtube.com/watch?v=dS1Ke-_hJV0

Your company can now begin offloading workloads from your appliances and running those same queries on HDP. In the next articles I will address the other components for BI workloads: ANSI compliance and OLAP. For more information on Hive, feel free to check out the following book: https://github.com/Apress/practical-hive
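To make the "don't query CSV directly" point above concrete, here is a minimal sketch of a one-time conversion of a text-backed table into ORC (table and column names are hypothetical):

```sql
-- Raw CSV data registered as an external text table.
CREATE EXTERNAL TABLE events_csv (ts STRING, user_id INT, action STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/landing/events';

-- One-time conversion into ORC, the format Hive with Tez/LLAP
-- is optimized for; all BI queries should then hit this table.
CREATE TABLE events STORED AS ORC AS SELECT * FROM events_csv;
```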
07-06-2017
12:03 PM
@Frank Welsch the LLAP blog links back to your HCC article.
07-05-2017
05:58 PM
Good to hear, and thanks for the update! Please accept my answer if you feel it helped.
07-05-2017
05:02 PM
Hi @sai saiedfar, are you able to select any other values? Maybe try a different browser and see if it allows you to select it.
07-03-2017
04:45 PM
1 Kudo
Hi @Abhijeet Rajput, prior to HDP 2.6 you'll need to use the solution outlined in #2. HDP 2.6 includes Hive MERGE, so you can now create a staging table and execute a MERGE statement against an ACID-enabled table. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Merge
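A minimal sketch of the staging-table MERGE pattern (table and column names are hypothetical; the target must be an ACID-enabled table):

```sql
-- Upsert changes from a staging table into an ACID target table.
MERGE INTO customers AS t
USING customers_staging AS s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET name = s.name, email = s.email
WHEN NOT MATCHED THEN INSERT VALUES (s.id, s.name, s.email);
```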