Member since: 02-28-2016
Posts: 9 | Kudos Received: 12 | Solutions: 0
03-22-2017 04:27 AM
Hi Artem, I'm currently stuck on a particular use case where I'm trying to access Hive table data using spark.read.jdbc, as shown below:

export SPARK_MAJOR_VERSION=2
spark-shell

import org.apache.spark.sql.{DataFrame, Row, SparkSession}

val connectionProperties = new java.util.Properties()
val hiveQuery = "(SELECT * FROM hive_table LIMIT 10) tmp"
val hiveResult = spark.read.jdbc("jdbc:hive2://hiveServerHostname:10000/hiveDBName;user=hive;password=hive", hiveQuery, connectionProperties).collect()

But when I check the results in hiveResult, it's just empty. Could you please suggest what's going on here? I know we can access Hive tables using a Hive-enabled SparkSession, and I've successfully tried that, but is it possible to run Hive queries and access Hive data using the above method?
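[Editor's note: for reference, a minimal PySpark sketch of the Hive-support route the question mentions as working, rather than the generic JDBC source. It assumes Spark 2.x on a cluster where hive-site.xml is on the classpath, and reuses the hiveDBName/hive_table names from the question.]

from pyspark.sql import SparkSession

# Build a session with Hive support so spark.sql() can see the Hive metastore
spark = (SparkSession.builder
         .appName("hive-read-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Query the table through Spark's own Hive integration,
# bypassing HiveServer2/JDBC entirely
rows = spark.sql("SELECT * FROM hiveDBName.hive_table LIMIT 10").collect()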
02-07-2017 05:07 PM
@Jan J
You don't need to store your data in a database per se; you can store columnar files in S3 or HDFS and load them with Spark's API. This lets you use the full feature set of Spark SQL rather than the HiveQL shim. A sample application would look as follows:

from datetime import datetime

from pyspark import SparkContext
from pyspark.sql import SparkSession

sc = SparkContext()
spark = SparkSession(sc)
# Import Parquet files from HDFS
df_customer = spark.read.parquet("hdfs://customer.parquet")
df_orders = spark.read.parquet("hdfs://orders.parquet")

# Register temp views so the now-structured data can be queried with SQL
# (createOrReplaceTempView supersedes the deprecated registerTempTable in Spark 2.x)
df_customer.createOrReplaceTempView("customer")
df_orders.createOrReplaceTempView("orders")
# Define a helper that runs one query, shows its result, and records the
# runtime so the different filter predicates and joins can be compared
runtimes = []

def runBenchmarkQuery(query, message):
    print("Starting: " + message)
    # start time
    queryStartTime = datetime.now()
    # run the query and show the result
    spark.sql(query).show()
    # end time
    queryStopTime = datetime.now()
    runTime = (queryStopTime - queryStartTime).seconds
    print("Runtime: %s seconds" % runTime)
    runtimes.append(runTime)
    print("Finishing: " + message)
# TPC-H Query 5 (assumes lineitem, supplier, nation, and region have been
# loaded and registered the same way as customer and orders above)
runBenchmarkQuery("""
SELECT
n_name,
sum(l_extendedprice * (1 - l_discount)) as revenue
FROM
customer,
orders,
lineitem,
supplier,
nation,
region
WHERE
c_custkey = o_custkey
AND l_orderkey = o_orderkey
AND l_suppkey = s_suppkey
AND c_nationkey = s_nationkey
AND s_nationkey = n_nationkey
AND n_regionkey = r_regionkey
AND r_name = 'AFRICA'
AND o_orderdate >= cast('1993-01-01' as date)
AND o_orderdate < add_months(cast('1993-01-01' as date), 12)
GROUP BY
n_name
ORDER BY
revenue desc
""", "Run 1")
03-12-2016 04:33 PM | 1 Kudo
Thanks for your help. I just don't understand why this link that you shared is needed: https://github.com/databricks/spark-perf. Can you explain? In the first step I install Hadoop, then install Hive and create the schema. Then I can use Spark SQL to execute queries against the Hive schema, right? So why is that link necessary? Thanks again!