Member since: 02-28-2016
Posts: 9 | Kudos Received: 12 | Solutions: 0
03-22-2017 04:27 AM
Hi Artem, I'm currently stuck on a particular use case where I'm trying to access Hive table data using spark.read.jdbc, as shown below:

export SPARK_MAJOR_VERSION=2
spark-shell

import org.apache.spark.sql.{DataFrame, Row, SparkSession}

val connectionProperties = new java.util.Properties()
val hiveQuery = "(SELECT * FROM hive_table LIMIT 10) tmp"
val hiveResult = spark.read.jdbc("jdbc:hive2://hiveServerHostname:10000/hiveDBName;user=hive;password=hive", hiveQuery, connectionProperties).collect()

But when I check the results in hiveResult, it's just empty. Could you please suggest what's going on here? I know we can access Hive tables using a Hive-enabled SparkSession, and I've successfully tried that, but is it possible to run Hive queries and access Hive data using the above method?
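[Editor's note: for reference, a minimal PySpark sketch of the Hive-support route the question mentions as working, rather than the generic JDBC source. It assumes Spark 2.x on a cluster where hive-site.xml is on the classpath, and reuses the hiveDBName/hive_table names from the question.]

from pyspark.sql import SparkSession

# Build a session with Hive support so spark.sql() can see the Hive metastore
spark = (SparkSession.builder
         .appName("hive-read-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Query the table through Spark's own Hive integration,
# bypassing HiveServer2/JDBC entirely
rows = spark.sql("SELECT * FROM hiveDBName.hive_table LIMIT 10").collect()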
02-07-2017 05:07 PM
@Jan J
You don't need to store your data in a database per se; you can store columnar files in S3 or HDFS and load them with Spark's API. This lets you use the full feature set of Spark SQL rather than the HiveQL shim. A sample application would look as follows:

from datetime import datetime

from pyspark import SparkContext
from pyspark.sql import SparkSession

sc = SparkContext()
spark = SparkSession(sc)
# Import Parquet files from HDFS
df_customer = spark.read.parquet("hdfs://customer.parquet")
df_orders = spark.read.parquet("hdfs://orders.parquet")

# Register temp views so the now-structured data can be queried with SQL
# (createOrReplaceTempView supersedes the deprecated registerTempTable in Spark 2.x)
df_customer.createOrReplaceTempView("customer")
df_orders.createOrReplaceTempView("orders")
# Define a helper that runs one query, shows its result, and records the
# runtime so the different filter predicates and joins can be compared
runtimes = []

def runBenchmarkQuery(query, message):
    print("Starting: " + message)
    # start time
    queryStartTime = datetime.now()
    # run the query and show the result
    spark.sql(query).show()
    # end time
    queryStopTime = datetime.now()
    runTime = (queryStopTime - queryStartTime).seconds
    print("Runtime: %s seconds" % runTime)
    runtimes.append(runTime)
    print("Finishing: " + message)
# TPC-H Query 5 (assumes lineitem, supplier, nation, and region have been
# loaded and registered the same way as customer and orders above)
runBenchmarkQuery("""
SELECT
n_name,
sum(l_extendedprice * (1 - l_discount)) as revenue
FROM
customer,
orders,
lineitem,
supplier,
nation,
region
WHERE
c_custkey = o_custkey
AND l_orderkey = o_orderkey
AND l_suppkey = s_suppkey
AND c_nationkey = s_nationkey
AND s_nationkey = n_nationkey
AND n_regionkey = r_regionkey
AND r_name = 'AFRICA'
AND o_orderdate >= cast('1993-01-01' as date)
AND o_orderdate < add_months(cast('1993-01-01' as date), 12)
GROUP BY
n_name
ORDER BY
revenue desc
""", "Run 1")
03-12-2016 04:33 PM | 1 Kudo
Thanks for your help. I just don't understand why this link that you shared is needed: https://github.com/databricks/spark-perf. Can you explain? In the first step I install Hadoop, then install Hive and create the schema. Then I can use Spark SQL to execute queries against the Hive schema, right? So why is that link necessary? Thanks again!