<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Test spark sql performance in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Test-spark-sql-performance/m-p/112567#M21999</link>
    <description>&lt;P&gt;I want to execute the tpch queries with spark to test spark performance, I already read a lot about this subject but I still have some doubts.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;The main doubt is this:&lt;/STRONG&gt; I already have generated the files relative to each tpch table, but now where do we store this tables? Where its suppose to create the database schema? So that we can acess that database with spark sql.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;More details:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;For what I already learn, the Spark SQL enables spark to acess a database and execute SQL queries without the need of Hive, right? So, if I want to use Spark SQL to execute the tpch queries,  already having the files relatives to each tpch table, now, where I create the database schema with that table files? It is necessary to create in Hive? Cant be in Spark SQL?&lt;/P&gt;&lt;P&gt;Because, I already see a lot of studies where people store the tpch tables on hive and then execute the tpch queries with spark sql against that hive tables. But if we create the database schema in hive and then we acess with spark sql that tables, in fact we are using hive and not spark sql, right? In terms of performance, we are not really testing spark sql performance, but hiveql instead?&lt;/P&gt;&lt;P&gt;The questions can be a little basic, but I already read a lot about this subject but I still have that doubts.&lt;/P&gt;</description>
    <pubDate>Sun, 06 Mar 2016 08:59:43 GMT</pubDate>
    <dc:creator>oscaricardo4</dc:creator>
    <dc:date>2016-03-06T08:59:43Z</dc:date>
    <item>
      <title>Test spark sql performance</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Test-spark-sql-performance/m-p/112567#M21999</link>
      <description>&lt;P&gt;I want to execute the tpch queries with spark to test spark performance, I already read a lot about this subject but I still have some doubts.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;The main doubt is this:&lt;/STRONG&gt; I already have generated the files relative to each tpch table, but now where do we store this tables? Where its suppose to create the database schema? So that we can acess that database with spark sql.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;More details:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;For what I already learn, the Spark SQL enables spark to acess a database and execute SQL queries without the need of Hive, right? So, if I want to use Spark SQL to execute the tpch queries,  already having the files relatives to each tpch table, now, where I create the database schema with that table files? It is necessary to create in Hive? Cant be in Spark SQL?&lt;/P&gt;&lt;P&gt;Because, I already see a lot of studies where people store the tpch tables on hive and then execute the tpch queries with spark sql against that hive tables. But if we create the database schema in hive and then we acess with spark sql that tables, in fact we are using hive and not spark sql, right? In terms of performance, we are not really testing spark sql performance, but hiveql instead?&lt;/P&gt;&lt;P&gt;The questions can be a little basic, but I already read a lot about this subject but I still have that doubts.&lt;/P&gt;</description>
      <pubDate>Sun, 06 Mar 2016 08:59:43 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Test-spark-sql-performance/m-p/112567#M21999</guid>
      <dc:creator>oscaricardo4</dc:creator>
      <dc:date>2016-03-06T08:59:43Z</dc:date>
    </item>
    <item>
      <title>Re: Test spark sql performance</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Test-spark-sql-performance/m-p/112568#M22000</link>
      <description>&lt;A rel="user" href="https://community.cloudera.com/users/3093/oscaricardo4.html" nodeid="3093"&gt;@Jan J&lt;/A&gt;&lt;P&gt;I would store in HDFS. I would leverage Hive...See this &lt;A href="http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables" target="_blank"&gt;http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables&lt;/A&gt;&lt;/P&gt;&lt;P&gt;When you use spark to read hive tables then you are using spark features to read data from hive. &lt;/P&gt;&lt;P&gt;In terms of performance, we are not really testing spark sql performance, but hiveql instead?  &lt;/P&gt;&lt;P&gt;You are testing performance of SparkSql feature with Hive&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;All the answers are in the overview &lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="http://spark.apache.org/docs/latest/sql-programming-guide.html#overview" target="_blank"&gt;http://spark.apache.org/docs/latest/sql-programming-guide.html#overview&lt;/A&gt;&lt;/P&gt;&lt;H2&gt;SQL&lt;A href="http://spark.apache.org/docs/latest/sql-programming-guide.html#sql"&gt;&lt;/A&gt;&lt;/H2&gt;&lt;P&gt;One use of Spark SQL is to execute SQL queries written using either a basic SQL syntax or HiveQL. Spark SQL can also be used to read data from an existing Hive installation. For more on how to configure this feature, please refer to the &lt;A href="http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables"&gt;Hive Tables&lt;/A&gt; section. When running SQL from within another programming language the results will be returned as a &lt;A href="http://spark.apache.org/docs/latest/sql-programming-guide.html#DataFrames"&gt;DataFrame&lt;/A&gt;. You can also interact with the SQL interface using the&lt;A href="http://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-spark-sql-cli"&gt;command-line&lt;/A&gt; or over &lt;A href="http://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbcodbc-server"&gt;JDBC/ODBC&lt;/A&gt;.&lt;/P&gt;&lt;H2&gt;DataFrames&lt;A href="http://spark.apache.org/docs/latest/sql-programming-guide.html#dataframes"&gt;&lt;/A&gt;&lt;/H2&gt;&lt;P&gt;A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of &lt;A href="http://spark.apache.org/docs/latest/sql-programming-guide.html#data-sources"&gt;sources&lt;/A&gt; such as: structured data files, tables in Hive, external databases, or existing RDDs.&lt;/P&gt;&lt;P&gt;The DataFrame API is available in &lt;A href="http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame"&gt;Scala&lt;/A&gt;, &lt;A href="http://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/sql/DataFrame.html"&gt;Java&lt;/A&gt;, &lt;A href="http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame"&gt;Python&lt;/A&gt;, and &lt;A href="http://spark.apache.org/docs/latest/api/R/index.html"&gt;R&lt;/A&gt;.&lt;/P&gt;&lt;H2&gt;Datasets&lt;A href="http://spark.apache.org/docs/latest/sql-programming-guide.html#datasets"&gt;&lt;/A&gt;&lt;/H2&gt;&lt;P&gt;A Dataset is a new experimental interface added in Spark 1.6 that tries to provide the benefits of RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL’s optimized execution engine. A Dataset can be &lt;A href="http://spark.apache.org/docs/latest/sql-programming-guide.html#creating-datasets"&gt;constructed&lt;/A&gt; from JVM objects and then manipulated using functional transformations (map, flatMap, filter, etc.).&lt;/P&gt;&lt;P&gt;The unified Dataset API can be used both in &lt;A href="http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset"&gt;Scala&lt;/A&gt; and &lt;A href="http://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/sql/Dataset.html"&gt;Java&lt;/A&gt;. Python does not yet have support for the Dataset API, but due to its dynamic nature many of the benefits are already available (i.e. you can access the field of a row by name naturally &lt;CODE&gt;row.columnName&lt;/CODE&gt;). Full python support will be added in a future release.&lt;/P&gt;</description>
      <pubDate>Sun, 06 Mar 2016 09:28:29 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Test-spark-sql-performance/m-p/112568#M22000</guid>
      <dc:creator>nsabharwal</dc:creator>
      <dc:date>2016-03-06T09:28:29Z</dc:date>
    </item>
    <item>
      <title>Re: Test spark sql performance</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Test-spark-sql-performance/m-p/112569#M22001</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/3093/oscaricardo4.html" nodeid="3093"&gt;@Jan J&lt;/A&gt;
&lt;/P&gt;&lt;P&gt;You don't need to store your data in a database per say; you can store columnar files in S3 or HDFS and upload them using Spark's API. This will allow you to use the features of Spark SQL as opposed to the HiveQL shim. A sample application would look as follows: &lt;/P&gt;&lt;PRE&gt;from pyspark import SparkContext 
from pyspark.sql import SparkSession 
sc = SparkContext() 
spark = SparkSession(sc)


#import parquetfiles from HDFS
df_customer = spark.read.parquet("hdfs://customer.parquet") 
df_orders = spark.read.parquet("hdfs://orders.parquet") 

#Create temp-tables for the purpose of querying our now structured data 
df_customer.registerTempTable("customer") 
df_orders.registerTempTable("orders") 

#Define a function to run our query set and highlight the actual differences amongst the filtered predicates and joins 

# init vars 
runtimes = [] 

def runBenchmarkQuery(query,message):
	print("Starting: " + message);   
	#start time
  	queryStartTime = datetime.now()   
	#run the query and show the result   
	spark.sql(query).show()   
	#end time   
	queryStopTime = datetime.now()   
	runTime = (queryStopTime-queryStartTime).seconds   
	print("Runtime: %s seconds" % (runTime))   
	runtimes.append( runTime )
	print ("Finishing: " + message);   
	return

#TPCH Query 1
runBenchmarkQuery("""
    SELECT
    n_name,
    sum(l_extendedprice * (1 - l_discount)) as revenue
FROM
    customer,
    orders,
    lineitem,
    supplier,
    nation,
    region
WHERE
    c_custkey = o_custkey
    AND l_orderkey = o_orderkey
    AND l_suppkey = s_suppkey
    AND c_nationkey = s_nationkey
    AND s_nationkey = n_nationkey
    AND n_regionkey = r_regionkey
    AND r_name = 'AFRICA'
    AND o_orderdate &amp;gt;= cast('1993-01-01' as date)
    AND o_orderdate &amp;lt; add_months(cast('1993-01-01' as date), '12')
GROUP BY
    n_name
ORDER BY
    revenue desc
""", "Run 1")

&lt;/PRE&gt;</description>
      <pubDate>Wed, 08 Feb 2017 01:07:24 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Test-spark-sql-performance/m-p/112569#M22001</guid>
      <dc:creator>pat_alwell</dc:creator>
      <dc:date>2017-02-08T01:07:24Z</dc:date>
    </item>
  </channel>
</rss>

