This is the first in a series of short articles on using Apache Spark with Hortonworks HDP, aimed at beginners. If you’re reading this, you don’t need me to define what Spark is; there are numerous references on the web that cover its API, its data-structure-centric design, and why it is, in my opinion, one of the most important open source projects. The intent of this article is to share what helped me get started using Spark on the Hortonworks Data Platform. This is not a tutorial.

I’m assuming you have access to an HDP 2.5.3 cluster or to the Hortonworks Sandbox for HDP 2.5.3 or above. I’m also going to assume that you are familiar with SQL, Apache Hive, and the Linux/Unix Bourne shell.

The problem that pushed me toward Spark was that using Hive alone for data processing was limiting. There were things I was used to from my past life as an Oracle DBA that simply weren’t available. Hive is a fantastic product, but Hive SQL didn’t give me all the bells and whistles I needed to do my job, and complex Hive SQL statements can be time-consuming.

In my case, I needed to summarize a lot of time series data and then store the summarized data in a Hive table so others could query it using Apache Zeppelin. For the sake of this article, I’ll keep the table layout simple:

txn_detail

  txn_date   string,
  txn_action string,
  txn_value  double
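For reference, a minimal Hive DDL for a table like this might look as follows. This is only a sketch: I’m assuming a DOUBLE column for the values and default text storage, and your real table may well differ.

create table if not exists txn_detail (
  txn_date   string,
  txn_action string,
  txn_value  double
);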

The example below illustrates using the Spark command line to summarize data from a Hive table and then store the summarized results in a new Hive table.

There are a couple of ways to run Spark commands, but I prefer the command line. The command line tool, or more precisely the Spark REPL, is called spark-shell; see http://spark.apache.org/docs/latest/quick-start.html. Another good option is Apache Zeppelin, but here we will use spark-shell.

Starting up spark-shell is very easy; from the Linux shell prompt, type:

$ spark-shell

The standard spark-shell is verbose, but the logging can be turned down. Executing spark-shell will bring you to the scala> prompt.
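One quick way to quiet things down (assuming Spark 1.4 or later, which HDP 2.5.3 ships) is to lower the log level once you’re at the prompt:

scala> sc.setLogLevel("WARN")   // show only warnings and errors from here on

You can also edit Spark’s log4j.properties under its conf directory to make the change permanent.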

From the scala> prompt, the first thing we’ll do is create a DataFrame with the contents of the txn_detail table. Before executing any SQL, we need to define a SQL context object; since we’re reading from a Hive table, a HiveContext (which extends SQLContext) is the one to use:

scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

Next, the command below executes a SQL statement to query all rows from the txn_detail table and puts the result set into a Spark DataFrame called dataframe_A.

scala> val dataframe_A = sqlContext.sql("""
         select txn_date, txn_action, txn_value from txn_detail
       """)
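Before summarizing, it’s worth a quick sanity check that the DataFrame picked up the rows and columns you expect, for example:

scala> dataframe_A.printSchema()

scala> dataframe_A.show(5)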

Now that we have data in a DataFrame, we can summarize it by grouping on either txn_action or txn_date.

Summarize on txn_date

scala> import org.apache.spark.sql.functions._   // brings sum() into scope

scala> dataframe_A.groupBy($"txn_date").agg(sum("txn_value").alias("txn_value")).show()

+----------+---------+
|  txn_date|txn_value|
+----------+---------+
|2015-12-27|     22.0|
|2015-12-28|     74.0|
|2015-11-20|     59.0|
|2015-12-29|     44.0|
|2015-11-21|     98.0|
|2015-11-22|     52.0|
|2015-11-23|     35.0|
|2015-11-24|     31.0|
|2015-11-25|     62.0|
|2015-11-26|     74.0|
|2015-11-27|     14.0|
|2015-09-21|     25.0|
|2015-10-20|     17.0|
|2015-09-22|     14.0|
|2015-11-29|     14.0|
|2015-10-21|     21.0|
|2015-09-23|     54.0|
|2016-12-01|     42.0|
|2015-10-22|     52.0|
|2015-09-24|     73.0|
+----------+---------+

only showing top 20 rows

Summarize on txn_action

scala> dataframe_A.groupBy($"txn_action").agg(sum("txn_value").alias("txn_value")).show()

+----------+---------+
|txn_action|txn_value|
+----------+---------+
|      Open|     11.0|
|     Close|     99.0|
+----------+---------+
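The same pattern extends naturally. For example (a sketch, not part of the original output), you can group on both columns at once and sort the result to make it easier to read:

scala> dataframe_A.groupBy($"txn_date", $"txn_action").agg(sum("txn_value").alias("txn_value")).orderBy($"txn_date").show()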

Let’s store the summarized results by txn_date into a separate DataFrame and then save those results to a Hive table.

Save the result set into a new DataFrame:

scala> val dataframe_B = dataframe_A.groupBy($"txn_date").agg(sum("txn_value").alias("txn_value"))

Register the DataFrame as a temporary table. This will allow us to query it with SQL like any other table:

scala> dataframe_B.registerTempTable("txn_date_temp")
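Once registered, the temporary table can be queried with plain SQL from the same session, for example:

scala> sqlContext.sql("select * from txn_date_temp limit 5").show()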

Create a Hive table and save the data:

scala> sqlContext.sql("""create table hive_txn_data as select * from txn_date_temp""")
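As an aside (a sketch, not the approach used above), the DataFrame writer can also save directly to a Hive table without going through a temporary table, assuming the context has Hive support:

scala> dataframe_B.write.mode("overwrite").saveAsTable("hive_txn_data")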

Now that the data is summarized in the hive_txn_data Hive table, users can query it using Apache Zeppelin or any other tool.

Summary

There are numerous ways to perform this type of work, but Spark is a very efficient way to summarize data and run calculations. In coming articles, I’ll cover other features of Spark.

For additional Hortonworks tutorials check out: http://hortonworks.com/tutorials/
