Community Articles

vvaks · ‎04-27-2016

Data Federation discussions are becoming more and more commonplace as organizations embark on their Big Data Journey. New data platforms like the Hortonworks Connected platform (HDP+HDF), NoSQL, and NewSQL data stores are reducing the cost and difficulty of storing and working with vast volumes of data. This is empowering organizations to leverage and monetize their data like never before. However, legacy data infrastructures still play an important role in the overall technology architecture. To fully realize the power of both the new and the traditional data platforms, it is often necessary to integrate the data. One obvious approach is to simply move the data from where it sits in the existing data platform over to the target data platform. However, in many cases it is desirable to leave the data in place and enable a "Federation" tier to act as a single point of access to data from multiple sources. For details on the concepts and implementation of Data Federation see https://community.hortonworks.com/articles/27387/virtual-integration-of-hadoop-with-external-system.....

This article focuses on how to use SparkSQL to integrate, expose, and accelerate multiple sources of data from a single "Federation Tier". First, it is important to point out that SparkSQL is not a pure Data Federation tool and hence does not have some of the more advanced capabilities generally associated with Data Federation. SparkSQL does not facilitate predicate pushdown to the source system beyond the query that defines what data from the underlying source should be made available through SparkSQL. Also, because it was not designed to be a true "Data Federation" engine, there is no "user friendly" interface to easily set up the external sources, the schemas associated with the target data, or the ingest of the target data. All of this work has to be done through the SparkSQL API and requires relatively advanced knowledge of Spark and of data architecture principles in general. For these reasons, SparkSQL will not be the right solution in every Data Federation scenario. However, what SparkSQL lacks in terms of an "easy button" it makes up for in versatility, relatively low cost, sheer processing potential, and in-memory capabilities.
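The "query that defines what data from the underlying source should be made available" can be sketched with Spark's JDBC data source, where a parenthesised subquery passed as the `dbtable` option is executed by the source database. This is a minimal sketch, not the article's own code: the helper names, the connection URL, the credentials, and the `orders` query are all hypothetical, and `spark` is assumed to be an existing session with the matching JDBC driver on its classpath.

```python
def jdbc_options(url, query, user, password, alias="src"):
    """Build options for Spark's JDBC data source (hypothetical helper).

    Wrapping the query as a parenthesised subquery in `dbtable` makes the
    source database run it, so only its result set is pulled into Spark.
    """
    return {
        "url": url,
        "dbtable": "({}) AS {}".format(query, alias),
        "user": user,
        "password": password,
    }


def load_source(spark, options):
    """Read the remote result set as a DataFrame using an existing session."""
    return spark.read.format("jdbc").options(**options).load()


# Hypothetical source system and query, for illustration only.
opts = jdbc_options(
    "jdbc:postgresql://dbhost:5432/sales",
    "SELECT id, amount FROM orders WHERE year = 2016",
    "report_user",
    "secret",
)
```

Calling `load_source(spark, opts)` would then return a DataFrame over just the rows the subquery selects. Some databases do not accept `AS` before a table alias, so the alias syntax may need adjusting for the source in question.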

SparkSQL exposes most of its capabilities via the DataFrame API and the SQL context. Data can be ingested into Spark's native data structure (RDD) from an RDBMS, from HDFS (supports Hive/HBase/Phoenix), and generally from any source that has an API that Spark can access (HTTP/JDBC/ODBC/NoSQL/Cloud Storage). The DataFrame allows the definition of a schema and then the application of that schema to the RDD containing the target data. Once the data has been transformed into a DataFrame with a schema, it is a single line of code away from becoming what looks exactly like a relational table. That table can then be stored in Hive (assuming a Hive context was created) if it needs to be accessed on a regular basis, or registered as a temp table that will exist only as long as the parent Spark application and its executors (the application can run indefinitely). If enough resources are available and really fast query responses are required, any or all of the tables can be cached and made available in-memory. Assuming a properly tuned infrastructure, and a clear understanding of how and when the data changes, this can make query response times extremely fast. Imagine caching the main fact table and leveraging map joins for the dimension tables.
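The schema, temp table, and caching steps described above might look roughly like the following in PySpark. This sketch assumes Spark 2.x or later, where `SparkSession` takes the place of the SQL and Hive contexts and `createOrReplaceTempView` replaces the older `registerTempTable`. The file layout (`order_id,region,amount`), the function names, and the `orders` table name are hypothetical.

```python
def parse_line(line):
    """Turn one 'order_id,region,amount' text line into a typed tuple."""
    order_id, region, amount = line.split(",")
    return (int(order_id), region.strip(), float(amount))


def register_orders(spark, path, table="orders", cache=True):
    """Load a delimited file as an RDD, apply a schema, and expose it to SQL.

    `spark` is an existing SparkSession; `path` points at the raw text data.
    """
    # Imported here so parse_line stays usable without a Spark installation.
    from pyspark.sql.types import (DoubleType, IntegerType, StringType,
                                   StructField, StructType)

    schema = StructType([
        StructField("order_id", IntegerType(), False),
        StructField("region", StringType(), True),
        StructField("amount", DoubleType(), True),
    ])
    rows = spark.sparkContext.textFile(path).map(parse_line)  # RDD of tuples
    df = spark.createDataFrame(rows, schema)                  # schema applied
    df.createOrReplaceTempView(table)    # now queryable like a relational table
    if cache:
        spark.catalog.cacheTable(table)  # held in memory once first computed
    return df
```

After `register_orders(spark, "/data/orders.csv")`, a query such as `spark.sql("SELECT region, SUM(amount) FROM orders GROUP BY region")` runs against the temp table. For a table that should outlive the application, `df.write.saveAsTable("orders")` on a session created with Hive support is the persistent alternative.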

All of the tables that have been registered can then be made available for access as a JDBC/ODBC data source via the Spark thrift server. The Spark thrift server supports virtually the same API and many of the features supported by the battle-tested Hive thrift server. At this point, OLAP and reporting BI tools can be used to display data from far and wide across the organization's enterprise data architecture.

As stated earlier, SparkSQL is certainly not the right choice in every situation, and its use must be thought out carefully. However, it should be noted that this very design pattern is being used by large traditional software vendors to enhance their existing product sets. One great example of this is SAP Vora, which extends the capabilities of Spark to enable an organization to greatly augment the processing and storage capabilities of HANA by leveraging Spark on Hadoop. There is definitely value in the work that vendors are doing to make SparkSQL more accessible. However, because Spark is open source, it can also be implemented without a capital acquisition cost.

In general, SparkSQL is an excellent option for data processing and data federation. It can greatly improve BI performance and the range of available data. This design pattern is not for the faint of heart, but when implemented properly it can lead to great progress for an organization on the Big Data Journey.

For a working example of using SparkSQL for Data Federation, check out:

https://community.hortonworks.com/content/repo/29883/sparksql-data-federation-demo.html

Using Spark to Virtually Integrate Hadoop with External Systems

Apache Hive

Apache Spark

Apache Zeppelin