SAS users are very excited about leveraging Hadoop with their existing deployments. This article covers the basic concepts of SAS Access to Hadoop, an add-on product from SAS that can be deployed into existing SAS environments. This product gives a SAS user the ability to leverage HDFS storage, access HiveServer2, and execute HDFS, Pig, Hive, and MapReduce programs inline within a SAS program. Let's start by discussing how to leverage HDFS storage.

1) Leveraging HDFS for flat files using the SAS Filename Statement

The SAS Filename statement allows a SAS programmer to set up a pointer to an inbound or outbound filesystem directory. With Hadoop, the SAS Filename statement can reference an HDFS directory. Once the file reference is established, it can be used within a SAS Data Step on an Infile or File statement. This enables SAS programmers to read and write flat files to and from HDFS inline within their programs.
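As a minimal sketch of this pattern, the following uses the Hadoop access method of the Filename statement to write a small flat file to HDFS and read it back. The JAR and configuration directories, the HDFS path, and the user name are placeholders assumed for illustration, not values from a real deployment. Older SAS releases may need the CFG= option on the Filename statement instead of the configuration environment variable.

```sas
/* Assumed locations of the Hadoop client JARs and cluster config files. */
options set=SAS_HADOOP_JAR_PATH="/opt/sas/hadoop/jars";
options set=SAS_HADOOP_CONFIG_PATH="/opt/sas/hadoop/conf";

/* File reference to a flat file in HDFS (path and user are hypothetical). */
filename hdfsout hadoop '/user/sasdemo/greeting.txt' user='sasdemo';

/* Write two records to the HDFS file with a FILE statement. */
data _null_;
   file hdfsout;
   put 'first record';
   put 'second record';
run;

/* Read the same records back from HDFS with an INFILE statement. */
data work.greeting;
   infile hdfsout truncover;
   input line $char80.;
run;
```

The same fileref serves both directions, so moving an existing Data Step from a local file to HDFS should mostly be a matter of changing the Filename statement.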

2) Leveraging HDFS for SAS Libraries using the Libname Statement

SAS also implemented the SPDE engine on the Libname statement to support using HDFS to store SAS tables or data sets. Once a library reference is established with the Libname statement, SAS programmers can use this libref on a Data or Set statement within a SAS Data Step, or as input to a SAS procedure. There are minor limitations to using this method compared with a standard file system for SAS libraries. The SAS documentation describes these limitations.
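A minimal sketch of an SPDE library stored in HDFS might look like the following. The directory paths and the libref name are assumptions made for illustration. SASHELP.CLASS is the sample table that ships with SAS.

```sas
/* Assumed locations of the Hadoop client JARs and cluster config files. */
options set=SAS_HADOOP_JAR_PATH="/opt/sas/hadoop/jars";
options set=SAS_HADOOP_CONFIG_PATH="/opt/sas/hadoop/conf";

/* SPDE library whose primary path is an HDFS directory (hypothetical path). */
libname hdfslib spde '/user/sasdemo/spde' hdfshost=default;

/* Write a SAS data set into HDFS through the libref. */
data hdfslib.class;
   set sashelp.class;
run;

/* Use the HDFS-resident data set as input to a SAS procedure. */
proc means data=hdfslib.class;
   var height weight;
run;
```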

3) Accessing HiveServer2 directly

SAS has implemented a Libname engine to set up a SAS library reference to HiveServer2. It is intended mostly for read access to Hive tables. Once a SAS library reference has been established (this leverages a JDBC connection), SAS programmers can use HiveServer2 tables within their SAS programs, as input to a SET statement or on the DATA= option of a SAS procedure. SAS has also implemented a dynamic in-database push-down capability that lets standard statistical procedures such as Proc Summary, Proc Means, and Proc Freq run against HiveServer2. This capability generates a complex HiveQL statement on the user's behalf and sends it to HiveServer2 for execution, so a significant portion of the math takes place in Hadoop.
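The sketch below illustrates this flow under some assumptions: the host name, port, schema, and the table web_logs with its column status_code are hypothetical, and SAS_HADOOP_JAR_PATH and SAS_HADOOP_CONFIG_PATH are assumed to be set already. The SASTRACE system option writes the SQL that SAS sends to the database to the log, which is one way to check what was pushed down to Hive.

```sas
/* Echo the SQL that SAS passes to the database into the SAS log. */
options sastrace=',,,d' sastraceloc=saslog nostsuffix;

/* Library reference to HiveServer2 (server, port, and schema are placeholders). */
libname hive hadoop server="hs2.example.com" port=10000 schema=default
        user=&sysuserid;

/* A standard procedure run against a Hive table (hypothetical table and column). */
proc freq data=hive.web_logs;
   tables status_code;
run;

/* The same table used as input to a SET statement in a Data Step. */
data work.errors;
   set hive.web_logs;
   where status_code >= 500;
run;
```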

4) Executing HDFS, Pig, Hive, and MapReduce inline within a SAS program

SAS created Proc Hadoop, a procedure available with this product, to enable SAS programmers to execute, inline within a SAS program, any HDFS, Pig, Hive, or MapReduce script or program that has been created outside of SAS.
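As a rough sketch, a Proc Hadoop step that runs two HDFS commands and then submits an existing Pig script could look like this. All file paths, the user name, and the script name are hypothetical, and the authentication options a given cluster needs (for example PASSWORD=) may differ. MapReduce jobs are submitted in a similar way through the procedure's MAPREDUCE statement.

```sas
/* Fileref to a Hadoop configuration file (hypothetical path). */
filename cfg '/opt/sas/hadoop/conf/cluster-config.xml';

/* Fileref to a Pig script written outside of SAS (hypothetical path). */
filename pigcode '/home/sasdemo/scripts/wordcount.pig';

proc hadoop options=cfg username='sasdemo' verbose;
   /* Create an HDFS directory and copy a local file into it. */
   hdfs mkdir='/user/sasdemo/input';
   hdfs copyfromlocal='/home/sasdemo/data/words.txt'
        out='/user/sasdemo/input/words.txt';

   /* Submit the Pig script to the cluster. */
   pig code=pigcode;
run;
```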

I hope you find this information useful as you get started using SAS Access to Hadoop.

Regards,

@MarkLochbihler

Partner Engineering, Hortonworks

Comments

@Mark Lochbihler What if I want to explore a Hadoop cluster with SAS Enterprise Guide? Is SAS Access to Hadoop the only option, or is there another possibility?

What about SAS Data Loader?

Hey Mark, good article!

Thought I'd resurface this by adding a note on point (3) above, for those who want to set up multiple HS2s and load balance as per

http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.3/bk_hadoop-high-availability/content/ch_multi...

And here is a sample LIBNAME syntax that would be used to connect from the SAS application:

libname h2 hadoop URI="jdbc:hive2://<server1>,<server2>,<server3>/default;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2" user=&sysuserid server="dummy";