Created on 12-16-2016 06:03 PM - edited 08-17-2019 07:22 AM
Apache Spark has ignited an explosion of data exploration on very large data sets. Spark played a big role in making general purpose distributed compute accessible. Anyone with some level of skill in Python, Scala, Java, and now R can sit down and start exploring data at scale. It also democratized data science by offering machine learning as a series of black boxes. Training of artificial intelligence is now possible for those of us who do not have PhDs in statistics and mathematics.
Now Spark SQL is also helping to bring data exploration directly to the business unit. In partnership with Apache Hive, Spark has been enabling users to explore very large data sets using SQL expressions. However, to truly make Spark SQL available for ad hoc access by business analysts using BI tools, fine grain security and governance are necessary. Spark provides strong authentication via Kerberos and wire encryption via SSL. However, up to this point, authorization was only possible via HDFS ACLs. This approach works relatively well when Spark is used as a general purpose compute framework, that is, when Java, Scala, or Python is used to express logic that cannot be encapsulated in a SQL statement. However, when a structured schema with columns and rows is applied, fine grain security becomes a challenge. Data in the same table may belong to two different groups, each with its own regulatory requirements. Data may have regional restrictions, time-based availability restrictions, departmental restrictions, etc.
Currently, Spark does not have a built-in authorization subsystem. It tries to read the data set as instructed and either succeeds or fails based on file system permissions. There is no way to define a pluggable module that contains an instruction set for fine grain authorization. This means that authorization policy enforcement must be performed somewhere outside of Spark. In other words, some other system has to tell Spark that it is not allowed to read the data because it contains a restricted column. At this point there are two likely solutions. The first is to create an authorization subsystem within Spark itself. The second is to configure Spark to read the file system through a daemon that is external to Spark. The second option is particularly attractive because it can provide benefits far beyond just security. Thus, the community created LLAP (Live Long and Process). LLAP is a collection of long-lived daemons that work in tandem with the HDFS DataNode service. LLAP is optional and modular, so it can be turned on or off. At the moment, Apache Hive has the most built-in integration with LLAP. However, the intent of LLAP is to provide benefits to applications running on YARN in general. When enabled, LLAP provides numerous performance benefits:
- Processing offload
- I/O optimization
- Caching
Since the focus of this article is security for Spark, refer to the LLAP Apache wiki for more details on LLAP: https://cwiki.apache.org/confluence/display/Hive/LLAP
With LLAP enabled, Spark reads from HDFS go directly through LLAP. Besides conferring all of the aforementioned benefits on Spark, LLAP is also a natural place to enforce fine grain security policies. The only other capability required is a centralized authorization system. This need is met by Apache Ranger. Apache Ranger provides centralized authorization and audit services for many components that run on YARN or rely on data from HDFS. Ranger allows authoring of security policies for:
- HDFS
- YARN
- Hive (Spark with LLAP)
- HBase
- Kafka
- Storm
- Solr
- Atlas
- Knox
Each of the above services integrates with Ranger via a plugin that pulls the latest security policies, caches them, and then applies them at run time.
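
The pull, cache, and apply cycle described above can be pictured with a minimal sketch. This is plain Python written for illustration, not Ranger's plugin API: the class `CachingPolicyPlugin`, the function `fetch_from_policy_server`, and the policy shape (user mapped to a set of readable columns) are all assumptions made for this example.

```python
import time
from typing import Callable, Dict, Optional, Set

# Hypothetical policy shape: user name -> set of columns that user may read.
Policy = Dict[str, Set[str]]


class CachingPolicyPlugin:
    """Illustrative only: pulls policies, caches them, and enforces them locally."""

    def __init__(self, fetch_policies: Callable[[], Policy],
                 refresh_seconds: float = 30.0) -> None:
        self._fetch = fetch_policies
        self._refresh_seconds = refresh_seconds
        self._cache: Policy = {}
        self._last_refresh: Optional[float] = None

    def _refresh_if_stale(self) -> None:
        now = time.monotonic()
        stale = (self._last_refresh is None
                 or now - self._last_refresh >= self._refresh_seconds)
        if not stale:
            return
        try:
            self._cache = self._fetch()
            self._last_refresh = now
        except Exception:
            # If the policy server is unreachable, keep enforcing the
            # last policies that were successfully cached.
            pass

    def is_allowed(self, user: str, columns: Set[str]) -> bool:
        """Allow the read only if every requested column is permitted."""
        self._refresh_if_stale()
        return columns <= self._cache.get(user, set())


def fetch_from_policy_server() -> Policy:
    # Stand-in for a call to a central policy store (hypothetical data).
    return {"analyst": {"name", "region"}}


plugin = CachingPolicyPlugin(fetch_from_policy_server)
print(plugin.is_allowed("analyst", {"name"}))         # True
print(plugin.is_allowed("analyst", {"name", "ssn"}))  # False
```

Because the decision is made against the local cache, enforcement in this sketch does not need a round trip to the policy store on every request, and it keeps working from the last known policies if the store is briefly unavailable.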
Now that we have defined how fine grain authorization and audit can be applied to Spark, let's review the overall architecture.
Notice that there was no need to create any type of view abstraction over the data. The only action required for fine grain security enforcement is to configure a security policy in Ranger and enable LLAP. Ranger also provides column masking and row filtering capabilities. Masking policy is similar to a column level policy. The main difference is that all columns are returned but the restricted columns contain only asterisks or a hash of the original value.
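
To make the masking behavior concrete, here is a small sketch of the two masking styles mentioned above: replacing a value with asterisks, or with a hash of the original. It is plain Python for illustration only and does not use Ranger or LLAP. The function names, the sample table, and the choice of SHA-256 are assumptions for this example.

```python
import hashlib
from typing import Callable, Dict, List

Row = Dict[str, str]


def mask_with_asterisks(value: str) -> str:
    """Replace every character with an asterisk."""
    return "*" * len(value)


def mask_with_hash(value: str) -> str:
    """Replace the value with a SHA-256 hex digest (assumed hash choice)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def apply_masks(rows: List[Row], masks: Dict[str, Callable[[str], str]]) -> List[Row]:
    """Return all columns, masking only those named in `masks`."""
    return [
        {col: masks[col](val) if col in masks else val for col, val in row.items()}
        for row in rows
    ]


# Hypothetical table and masking policy.
clients = [{"name": "Ann", "ssn": "123-45-6789", "email": "ann@example.com"}]
policy = {"ssn": mask_with_asterisks, "email": mask_with_hash}

masked = apply_masks(clients, policy)
print(masked[0]["name"])  # Ann
print(masked[0]["ssn"])   # ***********
```

Note that every column still appears in the result. Only the contents of the restricted columns change, which is what distinguishes masking from a column level deny.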
Ranger also provides the ability to apply row level security. Using a row level security policy, users can be prevented from seeing some of the rows in a table while still seeing all rows not restricted by policy. Consider a scenario where Financial Managers should only be able to see clients assigned to them. A row level policy from Ranger would instruct Hive to return a query plan that includes a predicate. That predicate filters out all clients not assigned to the Financial Manager trying to access the data. Spark receives the modified query plan and initiates processing, reading data through LLAP. LLAP ensures that the predicate is applied and that the restricted rows are not returned. With such an array of fine grain security capabilities, Spark can now be exposed directly to BI tools via a Thrift Server. Business analysts can now wield the power of Apache Spark.
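
The Financial Manager scenario above can be sketched in a few lines. This is an in-memory Python illustration of combining a policy predicate with the user's own query predicate. It is not how Hive, LLAP, or Ranger implement row filtering internally, and the names `manager_row_filter`, `scan_with_row_filter`, and the `assigned_manager` column are hypothetical.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]
Predicate = Callable[[Row], bool]


def manager_row_filter(user: str) -> Predicate:
    """Hypothetical row level policy: a manager sees only their own clients."""
    return lambda row: row["assigned_manager"] == user


def scan_with_row_filter(rows: List[Row], user: str,
                         query_predicate: Predicate) -> List[Row]:
    """Apply the policy predicate in addition to the user's own predicate."""
    policy_predicate = manager_row_filter(user)
    return [row for row in rows if policy_predicate(row) and query_predicate(row)]


# Hypothetical client table.
clients = [
    {"client": "Acme", "assigned_manager": "alice", "region": "EU"},
    {"client": "Globex", "assigned_manager": "bob", "region": "EU"},
    {"client": "Initech", "assigned_manager": "alice", "region": "US"},
]

# alice asks for every EU client, but only sees the ones assigned to her.
visible = scan_with_row_filter(clients, "alice", lambda row: row["region"] == "EU")
print([row["client"] for row in visible])  # ['Acme']
```

The point of the sketch is that the user's query is left as written. The extra predicate is added on the enforcement side, so the restricted rows never reach the caller.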
In general, LLAP integration has the potential to greatly enhance Spark from both a performance and security perspective. Fine grain security will help to bring the benefits of Spark to the business. Such a development should help to fuel more investment, collection, and exploration of data. If you would like to test out this capability for yourself, check out the following tutorial: