@Jan J
You don't need to store your data in a database per se; you can store columnar files (e.g. Parquet) in S3 or HDFS and load them with Spark's DataFrame API. That lets you use the full feature set of Spark SQL rather than going through the HiveQL shim. A sample application would look as follows (a sketch for producing the Parquet files in the first place follows the example):

from datetime import datetime

from pyspark import SparkContext
from pyspark.sql import SparkSession
sc = SparkContext()
spark = SparkSession(sc)
# Load the TPC-H tables as Parquet files from HDFS
df_customer = spark.read.parquet("hdfs://customer.parquet")
df_orders = spark.read.parquet("hdfs://orders.parquet")
df_lineitem = spark.read.parquet("hdfs://lineitem.parquet")
df_supplier = spark.read.parquet("hdfs://supplier.parquet")
df_nation = spark.read.parquet("hdfs://nation.parquet")
df_region = spark.read.parquet("hdfs://region.parquet")

# Register temp views so the now-structured data can be queried with SQL
df_customer.createOrReplaceTempView("customer")
df_orders.createOrReplaceTempView("orders")
df_lineitem.createOrReplaceTempView("lineitem")
df_supplier.createOrReplaceTempView("supplier")
df_nation.createOrReplaceTempView("nation")
df_region.createOrReplaceTempView("region")
# Helper that runs a query, shows its result, and records how long it took,
# so runtimes can be compared across different predicates and joins
runtimes = []
def runBenchmarkQuery(query, message):
    print("Starting: " + message)
    # Start time
    queryStartTime = datetime.now()
    # Run the query and show the result
    spark.sql(query).show()
    # End time
    queryStopTime = datetime.now()
    runTime = (queryStopTime - queryStartTime).total_seconds()
    print("Runtime: %s seconds" % runTime)
    runtimes.append(runTime)
    print("Finishing: " + message)
    return
# TPC-H Query 5 (local supplier volume), restricted to the AFRICA region
runBenchmarkQuery("""
SELECT
n_name,
sum(l_extendedprice * (1 - l_discount)) as revenue
FROM
customer,
orders,
lineitem,
supplier,
nation,
region
WHERE
c_custkey = o_custkey
AND l_orderkey = o_orderkey
AND l_suppkey = s_suppkey
AND c_nationkey = s_nationkey
AND s_nationkey = n_nationkey
AND n_regionkey = r_regionkey
AND r_name = 'AFRICA'
AND o_orderdate >= cast('1993-01-01' as date)
    AND o_orderdate < add_months(cast('1993-01-01' as date), 12)
GROUP BY
n_name
ORDER BY
revenue desc
""", "Run 1")