Member since: 02-28-2016
Posts: 9
Kudos Received: 12
Solutions: 0
03-27-2016
04:29 PM
2 Kudos
Hi, I have a single-node Hadoop cluster with Hive installed, and a Hive database with some tables stored in HDFS. Now I want to run SQL queries on those Hive tables using Spark SQL. Has anyone already done this? What is the process to achieve it? Do we need to create the tables again in Spark, or can we access the Hive tables directly with Spark SQL? I'm trying to find an article about this, but it always seems that we need to create the tables again with Spark SQL and load the data into them again, and I don't understand why, since we already have all of this in Hive!
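To make the question concrete, this is roughly what I imagine "accessing the Hive tables directly" would look like from the Spark 1.x shell, assuming Spark is built with Hive support and hive-site.xml (pointing at the existing metastore) is on Spark's classpath; the database and table names here are just placeholders:

```scala
// Minimal sketch, assuming hive-site.xml for the existing metastore is on
// Spark's classpath. "my_hive_db" and "my_table" are placeholder names.
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)   // sc is the SparkContext from spark-shell

// No CREATE TABLE and no reloading: the existing Hive metastore is reused.
hiveContext.sql("USE my_hive_db")
val result = hiveContext.sql("SELECT * FROM my_table LIMIT 10")
result.show()
```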
Labels:
- Apache Hadoop
- Apache Hive
- Apache Spark
- HDFS
03-12-2016
04:33 PM
1 Kudo
Thanks for your help. I just don't understand why the link you shared, https://github.com/databricks/spark-perf, is needed. Can you explain? In the first step I install Hadoop, then I install Hive and create the schema. Then I can use Spark SQL to execute queries against the Hive schema, right? So why is that link necessary? Thanks again!
03-06-2016
12:59 AM
2 Kudos
I want to execute the TPC-H queries with Spark to test Spark performance. I have already read a lot about this subject, but I still have some doubts. The main one is this: I have already generated the files for each TPC-H table, but where do we store these tables? Where are we supposed to create the database schema, so that we can access that database with Spark SQL?

More details: from what I have learned so far, Spark SQL enables Spark to access a database and execute SQL queries without the need for Hive, right? So, if I want to use Spark SQL to execute the TPC-H queries, already having the files for each TPC-H table, where do I create the database schema with those table files? Is it necessary to create it in Hive? Can't it be in Spark SQL? I have already seen a lot of studies where people store the TPC-H tables in Hive and then execute the TPC-H queries with Spark SQL against those Hive tables. But if we create the database schema in Hive and then access those tables with Spark SQL, aren't we in fact using Hive and not Spark SQL? In terms of performance, are we not really testing HiveQL instead of Spark SQL? The questions may be a little basic, but even after reading a lot about this subject I still have these doubts.
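For example, this is the kind of "Spark SQL only, no Hive" setup I am asking about, written for the Spark 1.x spark-shell (where sc and sqlContext already exist); the HDFS path and the temp-table name are just assumptions on my side:

```scala
// Sketch of registering a TPC-H .tbl file directly in Spark SQL, no Hive metastore.
// Assumes the pipe-delimited dbgen output was copied to HDFS under /tpch/ (assumed path).
import sqlContext.implicits._

// TPC-H nation table layout: nationkey | name | regionkey | comment
case class Nation(n_nationkey: Int, n_name: String, n_regionkey: Int, n_comment: String)

val nation = sc.textFile("hdfs:///tpch/nation.tbl")
  .map(_.split('|'))
  .map(f => Nation(f(0).toInt, f(1), f(2).toInt, f(3)))
  .toDF()

// The table is visible to Spark SQL only; Hive is not involved at any point.
nation.registerTempTable("nation")
sqlContext.sql("SELECT n_name FROM nation ORDER BY n_name").show()
```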
Labels:
- Apache Spark
03-05-2016
07:39 PM
1 Kudo
Because I want to test the TPC-H queries with Spark, not with Hive. But is it really necessary to use Hive as an intermediary to execute queries with Spark?
03-05-2016
05:19 PM
1 Kudo
Thanks again. I read the link, but I still have doubts. Is it really necessary to install Hadoop, then Hive, then create the database schema in Hive and load the data into Hive, and only then use Spark to query the Hive database? Isn't it possible to install Hadoop, load the TPC-H schema and data into Hadoop, and query the Hadoop data with Spark directly? I'm reading a lot of documentation but I still don't understand the best solution for this.
02-28-2016
11:18 PM
1 Kudo
And also, is Hive really necessary? Can't we have only the Hadoop cluster with the table data and execute queries with Spark against Hadoop, without Hive?
02-28-2016
09:35 PM
1 Kudo
Thanks again for your help. So OK, the first step is to set up a Hadoop cluster. But the link you shared, https://github.com/databricks/spark-perf, has a step titled "Running on existing Spark cluster". So if we want to execute some queries with Spark, isn't it possible to create a Spark cluster with 4 nodes and store the tables there, instead of creating a Hadoop cluster?
02-28-2016
01:30 PM
1 Kudo
Thanks for your help. So first I need to install a Hadoop cluster and upload the tables (.tbl files) into Hadoop? And then also create the schema and store the tables in Hive?
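Just so I understand the workflow: is it something like uploading the generated files with `hdfs dfs -put region.tbl /tpch/region/` and then declaring an external Hive table over that location, so Spark can query it? The database name, the HDFS path, and the choice of the small TPC-H region table below are only examples on my side:

```scala
// Sketch for a Spark 1.x shell with Hive support (sqlContext is a HiveContext).
// Assumes region.tbl was uploaded beforehand with: hdfs dfs -put region.tbl /tpch/region/
sqlContext.sql("CREATE DATABASE IF NOT EXISTS tpch")
sqlContext.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS tpch.region (
    r_regionkey INT,
    r_name      STRING,
    r_comment   STRING
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
  STORED AS TEXTFILE
  LOCATION '/tpch/region'
""")

// The data stays where it is in HDFS; Hive only records the schema,
// and Spark SQL can query the table immediately.
sqlContext.sql("SELECT r_name FROM tpch.region").show()
```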
02-28-2016
01:19 PM
2 Kudos
Hi, I'm studying Spark, because I have read some studies about it and it seems amazing for processing large volumes of data. So I was thinking of experimenting with it: generating 100 GB of data with a benchmark like TPC-H and executing the queries with Spark on 2 nodes, but I have some doubts about how to do this. Do I need to install two Hadoop nodes to store the TPC-H tables, and then execute the queries with Spark against HDFS? But how can we create the TPC-H schema and store the tables in HDFS? Is that possible? Or is it not necessary to install Hadoop, and we need to use Hive instead? I'm reading some articles about this but I'm getting a bit confused. Thanks for your attention!
Labels:
- Apache Hadoop
- Apache Spark