Member since 09-29-2015 | 155 Posts | 205 Kudos Received | 18 Solutions
09-26-2016
03:04 PM
8 Kudos
SAP HANA Vora is an in-memory processing engine that runs on a Hadoop cluster and is tightly integrated with Spark. It is designed for handling big data. SAP HANA Vora brings OLAP-style capabilities to Hadoop, integrates more deeply with SAP HANA for high-performance enterprise analytics, and delivers contextual insights by combining corporate data in SAP HANA with big data stored in Hadoop systems.
In this multi-part guide, I will show you how to spin up an SAP HANA instance in AWS and a Vora + HDP installation on a second node. We will use Apache Zeppelin to interact with SAP HANA through a Vora interpreter.
In this scenario you can join SAP HANA data with data from other sources such as HDFS and RDBMSs. This is a federated query approach: a "federation" tier acts as a single point of access to data from multiple sources. For details on the concepts of data federation, see "Virtual Integration of Hadoop with External Systems".
SAP HANA Vora enables OLAP analysis of Hadoop data through data hierarchy enhancements in SparkSQL and compiled queries for accelerated processing across nodes. It lets data scientists and developers easily enrich their datasets in Hadoop with other data sources such as RDBMSs, JSON, and text files.
The HDP stack natively supports "federated querying" through the Spark engine (see Using Spark to Virtually Integrate Hadoop with External Systems). With Vora you also get native connectivity to HANA and additional UDFs such as hierarchies.

To spin up HANA and Vora with HDP easily, we will use Amazon Web Services (AWS). The HANA instance can run on either Amazon or Microsoft, but the Vora + HDP instance is only available on Amazon, so for simplicity we will use Amazon for both. In a future article I will write a walkthrough on installing SAP Vora with HDP. The official install doc is the SAP_HANA_Vora_Installation_Admin_Guide.

First we need to spin up a HANA instance. Register for the SAP Cloud Appliance Library (CAL), the free service for managing your SAP solutions in the public cloud. Make sure you have an account there before proceeding with this tutorial.

Once you register and sign in:
On the left, click SOLUTIONS to see the systems available for use.
Search for "developer" in the search box to find the HANA developer edition. Choose "SAP HANA Vora, 1.2, developer edition".
Once you’ve found the instance through the search, you need to “activate” it. Activating an instance connects it to your account on Amazon AWS.
After the solution is activated, the link next to it should change to Create Instance. Click the “Create Instance” link on this solution to start the setup wizard.

The wizard takes you through a few simple steps, outlined below, and then your instance will be up and running:
Choose your account, select your region, and enter a name and a password for your instance. This is the “simple” setup and requires only those few items to generate your instance.
Configure the schedule for the virtual machine. This option lets you define a specific date when the machine will shut down, or a schedule for when it should be running. The virtual machine will suspend on the date you set. Click Next when you have set a run schedule or a suspend date.
After the process of creating the VM starts, you will be prompted to download your “Key Pair”. Make sure to download the "pem" file; you will need it to ssh to the created instance.

It will take about 10-25 minutes for your VM to start. You can see your instance status by clicking on the INSTANCE tab of the Cloud Appliance Library main screen.

Next, let's spin up the Vora instance from the SAP Cloud Appliance Library:
On the left, click SOLUTIONS to see the systems available for use.
Search for "developer" in the search box to find the HANA Vora 1.2, developer edition.
Walk through the wizard to spin up the Vora instance. Make sure to select the same AWS region as the SAP HANA instance, because the two systems need to communicate and you don't want to cross geographic boundaries.
Remember the master password; I used the same one as for the HANA installation.
It is important that you click Download and store the file with the private key. You will use it to connect to the instance’s host with an ssh client.

Once your instance of SAP HANA Vora is fully activated, you can see it among your CAL instances with Active status. You can also see the two instances in your AWS account.

In the next article, Part 2, we will explore how to configure SAP HANA Vora HDP Ambari.

References:
https://community.hortonworks.com/articles/27387/virtual-integration-of-hadoop-with-external-system.html
https://community.hortonworks.com/content/kbentry/29928/using-spark-to-virtually-integrate-hadoop-with-ext.html
http://help.sap.com/Download/Multimedia/hana_vora/SAP_HANA_Vora_Installation_Admin_Guide_en.pdf
http://go.sap.com/developer/tutorials/hana-setup-cloud.html
http://help.sap.com/hana_vora_re
http://go.sap.com/developer/tutorials/vora-setup-cloud.html
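Connecting to an instance with the downloaded key pair could look like the sketch below. The key file name, the login user, and the address are placeholders picked for illustration (they are not from the CAL wizard), so check the instance details in CAL or the AWS console for the real values.

```shell
# Placeholders (assumptions): replace with your own key file, login user, and address.
KEY_FILE="${KEY_FILE:-vora-instance.pem}"        # hypothetical file name
SSH_USER="${SSH_USER:-ec2-user}"                 # assumed login user
INSTANCE_HOST="${INSTANCE_HOST:-203.0.113.10}"   # documentation-only address

# Restrict the key's permissions (ssh rejects keys readable by others), then connect.
connect_to_instance() {
  chmod 400 "$KEY_FILE"
  ssh -i "$KEY_FILE" "$SSH_USER@$INSTANCE_HOST"
}

# Only try to connect when the key file has actually been downloaded.
if [ -f "$KEY_FILE" ]; then
  connect_to_instance
else
  echo "Key file $KEY_FILE not found; download it from the Cloud Appliance Library first."
fi
```

The same function works for both the HANA and the Vora instance if you set KEY_FILE and INSTANCE_HOST before calling it.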
08-24-2016
01:39 PM
@Alexander, is there a full list of these HDI scripts available? If not, how did you discover the ones above?
08-23-2016
04:16 PM
It seems the new version of the Sandbox does not have R pre-installed. It's an easy installation procedure:
sudo yum install -y epel-release
sudo yum update -y
sudo yum install -y R
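To confirm the install worked, a quick check like the following can help. It is only a sketch: check_tool is a helper name made up for illustration, and all it does is look for the R and Rscript commands on the PATH.

```shell
# Report whether a command is available on the PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: installed"
  else
    echo "$1: missing"
  fi
}

# Both commands should be reported as installed after the yum steps above.
check_tool R
check_tool Rscript
```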
08-23-2016
04:15 PM
It seems the new version of the Sandbox does not have R pre-installed. It's an easy installation procedure:
sudo yum install -y epel-release
sudo yum update -y
sudo yum install -y R
06-14-2016
05:56 PM
For the full guide to the Ambari quick start on Vagrant, follow the Apache doc: https://cwiki.apache.org/confluence/display/AMBARI/Quick+Start+Guide
04-01-2016
10:15 PM
Awesome write-up, @Ancil McBarnett!
04-01-2016
06:08 PM
1 Kudo
@eorgadn You should wrap the geoDistance functions as Hive UDFs; that will be a lot friendlier for most people who want to use them in Hive.
04-01-2016
05:12 PM
11 Kudos
The HDP Sandbox comes pre-installed with SparkR and R.
First, let's set up RStudio on the HDP Sandbox.
To download and install RStudio Server, open a terminal window and run the following commands for the 64-bit version:
wget https://download2.rstudio.org/rstudio-server-rhel-0.99.893-x86_64.rpm
sudo yum install --nogpgcheck rstudio-server-rhel-0.99.893-x86_64.rpm
sudo rstudio-server verify-installation
sudo rstudio-server stop
sudo rstudio-server start
Now set up a local user on the HDP Sandbox to access RStudio (passwd takes the user name and then prompts for the new password):
useradd alex
passwd alex
Next, launch a web browser and point it to: http://sandbox.hortonworks.com:8787/
Log in using the local account created above.
Next, let's initialize SparkR in "yarn-client" mode:
Sys.setenv(SPARK_HOME="/usr/hdp/current/spark-client/")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"),"R","lib"),.libPaths()))
library(SparkR)
sc <- SparkR::sparkR.init(master = "yarn-client")
Run the code in RStudio; you should get the following output:
>
>
> Sys.setenv(SPARK_HOME="/usr/hdp/current/spark-client/")
>
> .libPaths(c(file.path(Sys.getenv("SPARK_HOME"),"R","lib"),.libPaths()))
>
> library(SparkR)
Attaching package: ‘SparkR’
The following objects are masked from ‘package:stats’:
cov, filter, lag, na.omit, predict, sd, var
The following objects are masked from ‘package:base’:
colnames, colnames<-, intersect, rank, rbind, sample, subset, summary, table, transform
> sc <- SparkR::sparkR.init(master = "yarn-client")
Launching java with spark-submit command /usr/hdp/current/spark-client//bin/spark-submit sparkr-shell /tmp/RtmpVvKWS8/backend_port38582cab538c
16/04/01 16:28:00 INFO SparkContext: Running Spark version 1.6.0
16/04/01 16:28:01 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/04/01 16:28:01 INFO SecurityManager: Changing view acls to: azeltov
16/04/01 16:28:01 INFO SecurityManager: Changing modify acls to: azeltov
16/04/01 16:28:01 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(azeltov); users with modify permissions: Set(azeltov)
16/04/01 16:28:02 INFO Utils: Successfully started service 'sparkDriver' on port 51539.
16/04/01 16:28:02 INFO Slf4jLogger: Slf4jLogger started
16/04/01 16:28:02 INFO Remoting: Starting remoting
16/04/01 16:28:03 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@000.000.000.000:54056]
16/04/01 16:28:03 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 54056.
16/04/01 16:28:03 INFO SparkEnv: Registering MapOutputTracker
16/04/01 16:28:03 INFO SparkEnv: Registering BlockManagerMaster
16/04/01 16:28:03 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-533d709d-f345-4d26-8046-0ae68a009d13
16/04/01 16:28:03 INFO MemoryStore: MemoryStore started with capacity 511.5 MB
16/04/01 16:28:03 INFO SparkEnv: Registering OutputCommitCoordinator
16/04/01 16:28:03 INFO Server: jetty-8.y.z-SNAPSHOT
16/04/01 16:28:03 INFO AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
16/04/01 16:28:03 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/04/01 16:28:03 INFO SparkUI: Started SparkUI at http://000.000.000.000:4040
spark.yarn.driver.memoryOverhead is set but does not apply in client mode.
16/04/01 16:28:04 INFO TimelineClientImpl: Timeline service address: http://sandbox.hortonworks.com:8188/ws/v1/timeline/
16/04/01 16:28:04 INFO RMProxy: Connecting to ResourceManager at sandbox.hortonworks.com/1000.000.000.000:8050
16/04/01 16:28:05 WARN DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
16/04/01 16:28:05 INFO Client: Requesting a new application from cluster with 1 NodeManagers
16/04/01 16:28:05 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (5120 MB per container)
16/04/01 16:28:05 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
16/04/01 16:28:05 INFO Client: Setting up container launch context for our AM
16/04/01 16:28:05 INFO Client: Setting up the launch environment for our AM container
16/04/01 16:28:05 INFO Client: Using the spark assembly jar on HDFS because you are using HDP, defaultSparkAssembly:hdfs://sandbox.hortonworks.com:8020/hdp/apps/2.4.0.0-169/spark/spark-hdp-assembly.jar
16/04/01 16:28:05 INFO Client: Preparing resources for our AM container
16/04/01 16:28:05 INFO Client: Using the spark assembly jar on HDFS because you are using HDP, defaultSparkAssembly:hdfs://sandbox.hortonworks.com:8020/hdp/apps/2.4.0.0-169/spark/spark-hdp-assembly.jar
16/04/01 16:28:05 INFO Client: Source and destination file systems are the same. Not copying hdfs://sandbox.hortonworks.com:8020/hdp/apps/2.4.0.0-169/spark/spark-hdp-assembly.jar
16/04/01 16:28:06 INFO Client: Uploading resource file:/tmp/spark-974da35e-9ded-485a-9a13-0fd9a020dbf0/__spark_conf__5714702140855539619.zip -> hdfs://sandbox.hortonworks.com:8020/user/azeltov/.sparkStaging/application_1459455786854_0007/__spark_conf__5714702140855539619.zip
16/04/01 16:28:06 INFO SecurityManager: Changing view acls to: azeltov
16/04/01 16:28:06 INFO SecurityManager: Changing modify acls to: azeltov
16/04/01 16:28:06 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(azeltov); users with modify permissions: Set(azeltov)
16/04/01 16:28:06 INFO Client: Submitting application 7 to ResourceManager
16/04/01 16:28:06 INFO YarnClientImpl: Submitted application application_1459455786854_0007
16/04/01 16:28:06 INFO SchedulerExtensionServices: Starting Yarn extension services with app application_1459455786854_0007 and attemptId None
16/04/01 16:28:07 INFO Client: Application report for application_1459455786854_0007 (state: ACCEPTED)
16/04/01 16:28:07 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1459528086330
final status: UNDEFINED
tracking URL: http://sandbox.hortonworks.com:8088/proxy/application_1459455786854_0007/
user: azeltov
16/04/01 16:28:08 INFO Client: Application report for application_1459455786854_0007 (state: ACCEPTED)
16/04/01 16:28:09 INFO Client: Application report for application_1459455786854_0007 (state: ACCEPTED)
16/04/01 16:28:10 INFO Client: Application report for application_1459455786854_0007 (state: ACCEPTED)
16/04/01 16:28:11 INFO Client: Application report for application_1459455786854_0007 (state: ACCEPTED)
16/04/01 16:28:12 INFO YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as NettyRpcEndpointRef(null)
16/04/01 16:28:12 INFO YarnClientSchedulerBackend: Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> sandbox.hortonworks.com, PROXY_URI_BASES -> http://sandbox.hortonworks.com:8088/proxy/application_1459455786854_0007), /proxy/application_1459455786854_0007
16/04/01 16:28:12 INFO JettyUtils: Adding filter: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
16/04/01 16:28:12 INFO Client: Application report for application_1459455786854_0007 (state: RUNNING)
16/04/01 16:28:12 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host:000.000.000.000
ApplicationMaster RPC port: 0
queue: default
start time: 1459528086330
final status: UNDEFINED
tracking URL: http://sandbox.hortonworks.com:8088/proxy/application_1459455786854_0007/
user: azeltov
16/04/01 16:28:12 INFO YarnClientSchedulerBackend: Application application_1459455786854_0007 has started running.
16/04/01 16:28:12 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 35844.
16/04/01 16:28:12 INFO NettyBlockTransferService: Server created on 35844
16/04/01 16:28:12 INFO BlockManagerMaster: Trying to register BlockManager
16/04/01 16:28:12 INFO BlockManagerMasterEndpoint: Registering block manager 000.000.000.000:35844 with 511.5 MB RAM, BlockManagerId(driver,000.000.000.000, 35844)
16/04/01 16:28:12 INFO BlockManagerMaster: Registered BlockManager
16/04/01 16:28:12 INFO EventLoggingListener: Logging events to hdfs:///spark-history/application_1459455786854_0007
16/04/01 16:28:18 INFO YarnClientSchedulerBackend: Registered executor NettyRpcEndpointRef(null) (sandbox.hortonworks.com:43282) with ID 1
16/04/01 16:28:18 INFO BlockManagerMasterEndpoint: Registering block manager sandbox.hortonworks.com:58360 with 511.5 MB RAM, BlockManagerId(1, sandbox.hortonworks.com, 58360)
16/04/01 16:28:33 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after waiting maxRegisteredResourcesWaitingTime: 30000(ms)
Validate that the SparkUI has started by looking for these messages in your output:
16/04/01 16:28:03 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/04/01 16:28:03 INFO SparkUI: Started SparkUI at http://000.000.000.000:4040
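If the console output has been saved to a file, the SparkUI address can be pulled out with grep instead of scrolling through the log. This is a minimal sketch: spark_ui_url is a hypothetical helper name, and the sample line is copied from the output above.

```shell
# Print the SparkUI address found in Spark log text read from stdin.
spark_ui_url() {
  grep -o 'Started SparkUI at http[^ ]*' | sed 's/^Started SparkUI at //'
}

# Demonstrate on a sample log line copied from the output above.
printf '%s\n' '16/04/01 16:28:03 INFO SparkUI: Started SparkUI at http://000.000.000.000:4040' | spark_ui_url
```

With a saved log you would feed the file in on stdin; the function prints nothing when the message is absent, which suggests the UI did not start.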
Next, let's try running some SparkR code:
sqlContext <- sparkRSQL.init(sc)
path <- file.path("file:///usr/hdp/2.4.0.0-169/spark/examples/src/main/resources/people.json")
peopleDF <- jsonFile(sqlContext, path)
printSchema(peopleDF)
You should get this output:
> printSchema(peopleDF)
root
|-- age: long (nullable = true)
|-- name: string (nullable = true)
Now let's try some basic DataFrame analysis:
# Register this DataFrame as a table.
registerTempTable(peopleDF, "people")
teenagers <- sql(sqlContext, "SELECT name FROM people WHERE age >= 13 AND age <= 19")
teenagersLocalDF <- collect(teenagers)
print(teenagersLocalDF)
04-01-2016
02:21 PM
11 Kudos
First, let's create a sample file in S3:
In the AWS Console, go to S3, create a bucket “S3Demo”, and pick your region.
Upload the file manually using the Upload button (example file name used later in Scala: S3HDPTEST.csv).
In the HDP 2.4.0 Sandbox:
Download the AWS SDK for Java from https://aws.amazon.com/sdk-for-java/ and upload it to the hadoop directory. You should see aws-java-sdk-1.10.65.jar in /usr/hdp/2.4.0.0-169/hadoop/:
[root@sandbox bin]# ll /usr/hdp/2.4.0.0-169/hadoop/
total 242692
-rw-r--r-- 1 root root 32380018 2016-03-31 22:02 aws-java-sdk-1.10.65.jar
drwxr-xr-x 2 root root 4096 2016-02-29 18:05 bin
drwxr-xr-x 2 root root 12288 2016-02-29 17:49 client
lrwxrwxrwx 1 root root 25 2016-03-31 21:08 conf -> /etc/hadoop/2.4.0.0-169/0
drwxr-xr-x 2 root root 4096 2016-02-29 17:46 etc
-rw-r--r-- 1 root root 17366 2016-02-10 06:44 hadoop-annotations-2.7.1.2.4.0.0-169.jar
lrwxrwxrwx 1 root root 40 2016-02-29 17:46 hadoop-annotations.jar -> hadoop-annotations-2.7.1.2.4.0.0-169.jar
-rw-r--r-- 1 root root 71534 2016-02-10 06:44 hadoop-auth-2.7.1.2.4.0.0-169.jar
lrwxrwxrwx 1 root root 33 2016-02-29 17:46 hadoop-auth.jar -> hadoop-auth-2.7.1.2.4.0.0-169.jar
-rw-r--r-- 1 root root 103049 2016-02-10 06:44 hadoop-aws-2.7.1.2.4.0.0-169.jar
lrwxrwxrwx 1 root root 32 2016-02-29 17:46 hadoop-aws.jar -> hadoop-aws-2.7.1.2.4.0.0-169.jar
-rw-r--r-- 1 root root 138488 2016-02-10 06:44 hadoop-azure-2.7.1.2.4.0.0-169.jar
lrwxrwxrwx 1 root root 34 2016-02-29 17:46 hadoop-azure.jar -> hadoop-azure-2.7.1.2.4.0.0-169.jar
-rw-r--r-- 1 root root 3469432 2016-02-10 06:44 hadoop-common-2.7.1.2.4.0.0-169.jar
-rw-r--r-- 1 root root 1903274 2016-02-10 06:44 hadoop-common-2.7.1.2.4.0.0-169-tests.jar
lrwxrwxrwx 1 root root 35 2016-02-29 17:46 hadoop-common.jar -> hadoop-common-2.7.1.2.4.0.0-169.jar
lrwxrwxrwx 1 root root 41 2016-02-29 17:46 hadoop-common-tests.jar -> hadoop-common-2.7.1.2.4.0.0-169-tests.jar
-rw-r--r-- 1 root root 159484 2016-02-10 06:44 hadoop-nfs-2.7.1.2.4.0.0-169.jar
lrwxrwxrwx 1 root root 32 2016-02-29 17:46 hadoop-nfs.jar -> hadoop-nfs-2.7.1.2.4.0.0-169.jar
drwxr-xr-x 5 root root 4096 2016-03-31 20:27 lib
drwxr-xr-x 2 root root 4096 2016-02-29 17:46 libexec
drwxr-xr-x 3 root root 4096 2016-02-29 17:46 man
-rw-r--r-- 1 root root 210216729 2016-02-10 06:44 mapreduce.tar.gz
drwxr-xr-x 2 root root 4096 2016-02-29 17:46 sbin
Change directory to spark/bin:
[root@sandbox bin]# cd /usr/hdp/2.4.0.0-169/spark/bin
Start the Spark Scala shell with the right AWS jar dependencies:
./spark-shell --master yarn-client --jars /usr/hdp/2.4.0.0-169/hadoop/hadoop-aws-2.7.1.2.4.0.0-169.jar,/usr/hdp/2.4.0.0-169/hadoop/hadoop-auth.jar,/usr/hdp/2.4.0.0-169/hadoop/aws-java-sdk-1.10.65.jar --driver-memory 512m --executor-memory 512m
Now some Scala code to configure the AWS keys in hadoopConf. The file is read through an s3n:// URL, so the settings go under the fs.s3n.* properties:
val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3n.awsAccessKeyId", "xxxxxxx")
hadoopConf.set("fs.s3n.awsSecretAccessKey", "xxxxxxx")
Now read the file from the S3 bucket (replace s3hdptest with the name of your own bucket):
val myLines = sc.textFile("s3n://s3hdptest/S3HDPTEST.csv")
// Count the lines in the file and print the result
println(myLines.count())
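Before going through Spark, it can be worth checking from the command line that the bucket is reachable with the same credentials. The sketch below assumes the hadoop client is on the PATH and that the s3n connector is used, which reads the fs.s3n.* properties; list_bucket is a hypothetical helper name, and the bucket and keys are the placeholders from the Scala example.

```shell
# Placeholders (assumptions): substitute your own bucket and credentials.
BUCKET="${BUCKET:-s3hdptest}"
AWS_ACCESS_KEY="${AWS_ACCESS_KEY:-xxxxxxx}"
AWS_SECRET_KEY="${AWS_SECRET_KEY:-xxxxxxx}"

# List the bucket through the s3n connector, passing the keys as -D generic options.
list_bucket() {
  hadoop fs \
    -Dfs.s3n.awsAccessKeyId="$AWS_ACCESS_KEY" \
    -Dfs.s3n.awsSecretAccessKey="$AWS_SECRET_KEY" \
    -ls "s3n://$BUCKET/"
}

# Run the listing only where the hadoop client is installed (for example on the sandbox).
if command -v hadoop >/dev/null 2>&1; then
  list_bucket
else
  echo "hadoop client not found; run this on the sandbox."
fi
```

Keys passed on the command line are visible in the process list, so this is best kept to a throwaway sandbox.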
03-18-2016
07:04 PM
1 Kudo
@Neeraj Sabharwal was able to get past the dependencies issue with R 3.2.3 by explicitly specifying the repo: install.packages("evaluate", repos="http://xyz.xxx.abc.edu")