This is the first of a series of short articles on using Apache Spark with Hortonworks HDP for beginners. If you’re reading this, you don’t need need me to define what Spark is, there are numerous references on the web that can speak about that its API, being data structure centric and in my opinion one of the most important Open Source projects. The intent of this article is let you know what helped me get started using Spark on the Hortonworks Data Platform. This is not a tutorial.
I’m assuming you have access to a an HDP 2.5.3 cluster or to the Hortonworks Sandbox for HDP 2.5.3 or above. Also, I’m going to assume that you are familiar with SQL, Apache Hive, and using the Linux/Unix Bourne Shell.
The problem I was having, which pushed me to use Spark was that using Hive for data processing was limiting. There were things I was used to using in my past life as an Oracle DBA that were not available. Hive is a fantastic product, but Hive SQL didn’t give me all the bells and whistles to do my job, plus complex Hive SQL statements can be time consuming.
In my case, I needed to summarize allot of time series data and then store that summarized data in a hive table so others could query it using Apache Zeppelin. For the sake of this article, I’ll keep the table layout simple:
The example below will illustrate using the spark command line to summarize data from a hive table and
There are a couple ways to run spark commands, but I prefer using a command line. The command line tool or more precisely the Spark repl is called spark-shell. See http://spark.apache.org/docs/latest/quick-start.html. Another good option is to use Apache zeppelin, but we will use spark-shell.
Starting up the spark-shell is very easy and is executed from the linux shell prompt by typing:
The standard spark-shell is verbose, which you can turn off. Google for how to do this. Executing spark-shell will bring you to the scala> prompt.
From the scala> prompt, the first thing we’ll do is create a data from with all the contents of the txn_detail table. But before executing a piece of SQL we need to define a sql context object
scala> val sqlcontext = new org.apache.spark.sql.SQLContext(sc);
Next, the command below will execute a SQL statement to query all rows from the txn_detail table and put the result set into a Spark dataframe called ‘dataframe_A’.
scala> val dataframe_A = sqlContext.sql(‘’’
Select txn_date, txn_action, txn_value from txn_detail
Now that we have data in a dataframe we can summarize it grouping on either txn_action or txn_date.