About azeltov

azeltov · ‎03-09-2017

@vshukla @Bikas

azeltov · ‎02-17-2017

I ended up looking the livy.log file for the actual error message. My user that i was trying to run as did not have a home directory in HDFS. So created a home dir with proper permissions and got it working.

azeltov · ‎02-02-2017

Create a HANA table for demo purposes first CREATE COLUMN TABLE "CODEJAMMER"."STORE_ADDRESS" ( ID bigint not null primary key , "STREETNUMBER" INTEGER CS_INT, "STREET" NVARCHAR(200), "LOCALITY" NVARCHAR(200), "STATE" NVARCHAR(200), "COUNTRY" NVARCHAR(200)) UNLOAD PRIORITY 5 AUTO MERGE ; insert into "CODEJAMMER"."STORE_ADDRESS" (ID,STREETNUMBER,STREET,LOCALITY,STATE,COUNTRY) values(1,555,'Madison Ave','New York','NY','America'); insert into "CODEJAMMER"."STORE_ADDRESS" (ID,STREETNUMBER,STREET,LOCALITY,STATE,COUNTRY) values(2,95,'Morten Street','New York','NY','USA'); insert into "CODEJAMMER"."STORE_ADDRESS" (ID,STREETNUMBER,STREET,LOCALITY,STATE,COUNTRY) values(3,2395,'Broadway Street','New York','NY','USA'); Configure the SAP JDBC Driver in Spark config to where the driver is saved on ALL the nodes . My example: /tmp/ngdbc.jar You can also load the ngdbc.jar using the zeppelin depency if you dont want to system wide access: z.reset() // clean up previously added artifact and repository // add artifact from filesystem z.load("/tmp/ngdbc.jar") Check out zeppelin docs for zeppelin dependency loading details: https://zeppelin.apache.org/docs/latest/interpreter/spark.html#3-dynamic-dependency-loading-via-sparkdep-interpreter Lets test it out ( the notebook is uploaded to github, see below): Notebook: https://raw.githubusercontent.com/zeltovhorton/sap_hana_demo/master/SparkToHana.json Zeppelin Notebook Viewer: https://www.zeppelinhub.com/viewer/notebooks/aHR0cHM6Ly9yYXcuZ2l0aHVidXNlcmNvbnRlbnQuY29tL3plbHRvdmhvcnRvbi9zYXBfaGFuYV9kZW1vL21hc3Rlci9TcGFya1RvSGFuYS5qc29u Code : %spark val url="jdbc:sap://54.234.139.2:30015/?currentschema=XXX" val prop = new java.util.Properties prop.setProperty("user","") prop.setProperty("password","") prop.setProperty("driver","com.sap.db.jdbc.Driver") //hana table val store_address = sqlContext.read.jdbc(url,"STORE_ADDRESS",prop) store_address.registerTempTable("store_address") <strong> </strong> %spark //hive tables val sales = sqlContext.sql("select storekey, productkey, salesamount from atlas_factsales limit 10") sales.registerTempTable("sales") %spark sqlContext.sql("select s.salesamount, s.productkey, a.state, a.country from sales s inner join store_address a where a.id in (1)").show() +-----------+----------+-----+-------+ |salesamount|productkey|state|country| +-----------+----------+-----+-------+ | 307.26| 177| NY|America| | 1490.0| 2180| NY|America| | 2299.9| 2329| NY|America| | 413.512| 1360| NY|America| | 6990.0| 193| NY|America| | 11184.3| 1412| NY|America| %sql select s.salesamount, s.productkey, a.state, a.country from sales s inner join store_address a where a.id in (1)

azeltov · ‎02-02-2017

Prerequisites: * SAP HANA - Instructions to setup a Cloud HANA on AWS or Azure https://community.hortonworks.com/content/kbentry/58427/getting-started-with-sap-hana-and-vora-with-hdp-us.html * HDP 2.5.x We will use Spark shell, scala code and data frames to access HANA using JDBC driver. Start the spark shell with ngdbc.jar driver. spark-shell --master yarn-client --jars /tmp/ngdbc.jar scala> val url="jdbc:sap://xxxx:30015/?currentschema=CODEJAMMER" url: String = jdbc:sap://xxxx:30015/?currentschema=CODEJAMMER scala> | val prop = new java.util.Properties prop: java.util.Properties = {} scala> prop.setProperty("user","xxxx") res1: Object = null scala> prop.setProperty("password","xxxx") res2: Object = null scala> prop.setProperty("driver","com.sap.db.jdbc.Driver") res3: Object = null scala> scala> val emp_address = sqlContext.read.jdbc(url,"EMPLOYEE_ADDRESS",prop) emp_address: org.apache.spark.sql.DataFrame = [ID: bigint, STREETNUMBER: int, STREET: string, LOCALITY: string, STATE: string, COUNTRY: string] scala> emp_address.show 17/02/02 20:17:19 INFO SparkContext: Starting job: show at <console>:32 ..... 17/02/02 20:17:23 INFO DAGScheduler: Job 0 finished: show at <console>:32, took 4.586219 s +---+------------+---------------+--------+-----+-------+ | ID|STREETNUMBER| STREET|LOCALITY|STATE|COUNTRY| +---+------------+---------------+--------+-----+-------+ | 1| 555| Madison Ave|New York| NY|America| | 2| 95| Morten Street|New York| NY| USA| | 3| 2395|Broadway Street|New York| NY| USA| +---+------------+---------------+--------+-----+-------+ Gotchas: If you see this error: org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: com.sap.db.jdbc.topology.Host The issue is resolved with the latest SPS12+ driver. I had to upgrade my sap driver.

azeltov · ‎02-02-2017

Prerequisites: * NIFI 1.0+ * SAP HANA - Instructions to setup a Cloud HANA on AWS or Azure: https://community.hortonworks.com/content/kbentry/58427/getting-started-with-sap-hana-and-vora-with-hdp-us.html SETUP: HANA (Source Database) In this setup we will create a table in HANA table. First follow this HCC article "Demo data in SAP Vora Using Eclipse HANA Modelling tools - Part 3" You will need to download Eclipse Neon - Eclipse IDE for Java Developers to connect to the SAP HANA that we setup in Part 1 . After you setup eclipse we will need to configure Eclipse to install HANA Modelling tools that will allow us to connect to SAP HANA and execute sql scripts to setup demo data that we will use from SAP Vora. After you establish connection to your HANA system. Run this code: DROP TABLE "CODEJAMMER"."EMPLOYEE_ADDRESS"; CREATE COLUMN TABLE "CODEJAMMER"."EMPLOYEE_ADDRESS" ( ID bigint not null primary key generated by default as IDENTITY, "STREETNUMBER" INTEGER CS_INT, "STREET" NVARCHAR(200), "LOCALITY" NVARCHAR(200), "STATE" NVARCHAR(200), "COUNTRY" NVARCHAR(200)) UNLOAD PRIORITY 5 AUTO MERGE ; insert into "CODEJAMMER"."EMPLOYEE_ADDRESS" (STREETNUMBER,STREET,LOCALITY,STATE,COUNTRY) values(555,'Madison Ave','New York','NY','America'); insert into "CODEJAMMER"."EMPLOYEE_ADDRESS" (STREETNUMBER,STREET,LOCALITY,STATE,COUNTRY) values(95,'Morten Street','New York','NY','USA'); SELECT * FROM "CODEJAMMER"."EMPLOYEE_ADDRESS"; Now lets setup the NIFI workflow. Final result will look like this: Nifi Setup : This is a simple NIFI setup, the queryDatabase table processor is only available as part of default processors from version 0.6 of Nifi. Drop an instance of the QueryDatabaseTable processor on your canvas. Right click to configure and fill out the Required fields , plus you click on the error to setup the DB connection pool, see settings below: The limitation with this processor as it is not a true CDC and relies on one column. If the data is reloaded into the column with older data the data will not be replicated into HDFS or any other destination. This processor does not rely on Transactional logs or redo logs . Next configure the putHDFS processor configure the Hadoop Core-site.xml and hdfs-site.xml and destination HDFS directory Now lets start all the processors. Validate that you got data by checking provenance in your processor. You can also check what is the max id column state. The last auto increment ID will be displayed . Right click on QueryDatabaseTable processor and select View State: Testing CDC Now insert a new record in HANA and validate that the record is in JSON format in HDFS insert into "CODEJAMMER"."EMPLOYEE_ADDRESS" (STREETNUMBER,STREET,LOCALITY,STATE,COUNTRY) values(2395,'Broadway Street','New York','NY','USA'); Looking into HDFS we see the new JSON record:

azeltov · ‎01-20-2017

Nice works @slachterman !!!

azeltov · ‎12-08-2016

Nice Job! I did a similar article with a zeppelin + livy + AD/LDAP in case you want to check out livy steps: https://community.hortonworks.com/articles/65449/ow-to-setup-a-multi-user-active-directory-backed-z.html

azeltov · ‎11-22-2016

@Sebastian can you take a screenshot or copy and paste your interpreter settings pls?

azeltov · ‎11-15-2016

Pre-Requisites This is a continuation of of "How to setup a multi user (Active Directory backed) zeppelin integrated with ldap and using Livy Rest server" You should have a running HDP cluster, Zeppelin with Livy, Ranger. Tested on HDP 2.5. I will use 3 user accounts for demoing security: hadoopadmin, sales1, hr1. You can substitute accordingly to your environment. These users are synced from an LDAP/AD in my use case Spark and HDFS security We will Configure Ranger policies to: Protect /sales HDFS dir - so only sales group has access to it Protect sales hive table - so only sales group has access to it Create /sales dir in HDFS as hadoopadmin Create /sales dir in HDFS as hadoopadmin sudo -u hadoopadmin hadoop fs -mkdir /sales sudo -u hadoopadmin hadoop fs -chown sales:hadoop /sales Now login as sales1 and attempt to access it before adding any Ranger HDFS policy sudo su - sales1 hdfs dfs -ls /sales It fails with authorization error: Permission denied: user=sales1, access=READ_EXECUTE, inode="/sales":hadoopadmin:hdfs:d--------- Lets create a policy in ranger that will allow the user now access /sales directory Login into Ranger UI e.g. at http://RANGER_HOST_PUBLIC_IP:6080/index.html as admin/admin In Ranger, click on 'Audit' to open the Audits page and filter by below. Service Type: HDFS User: sales1 Notice that Ranger captured the access attempt and since there is currently no policy to allow the access, it was "Denied" Create an HDFS Policy in Ranger, follow below steps: On the 'Access Manager' tab click HDFS > (clustername)_hadoop Click 'Add New Policy' button to create a new one allowing sales group users access to /sales dir: Policy Name: sales dir Resource Path: /sales Group: sales Permissions : Execute Read Write Add Wait 30s for policy to take effect Now try accessing the dir again as sales1 and now there is no error seen hdfs dfs -ls /sales In Ranger, click on 'Audit' to open the Audits page and filter by below: Service Type: HDFS User: sales1 Notice that Ranger captured the access attempt and since this time there is a policy to allow the access, it was Allowed Now lets copy test data for sales user - run in command line as 'sales1' user <code>wget https://raw.githubusercontent.com/roberthryniewicz/datasets/master/airline-dataset/flights/flights.csv -O /tmp/flights.csv # remove existing copies of dataset from HDFS hadoop fs -rm -r -f /sales/airflightsdelays # create directory on HDFS hadoop fs -mkdir /sales/airflightsdelays # put data into HDFS hadoop fs -put /tmp/flights.csv /sales/airflightsdelays/ hadoop fs -cat /sales/airflightsdelays/flights.csv | head You should see an output similar to : <code>[sales1@az1secure0 ~]$ wget https://raw.githubusercontent.com/roberthryniewicz/datasets/master/airline-dataset/flights/flights.csv -O /tmp/flights.csv --2016-11-15 16:28:24-- https://raw.githubusercontent.com/roberthryniewicz/datasets/master/airline-dataset/flights/flights.csv Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.52.133 Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.52.133|:443... connected. HTTP request sent, awaiting response... 200 OK Length: 9719582 (9.3M) [text/plain] Saving to: ‘/tmp/flights.csv’ 100%[=========================================================================================>] 9,719,582 8.44MB/s in 1.1s 2016-11-15 16:28:26 (8.44 MB/s) - ‘/tmp/flights.csv’ saved [9719582/9719582] [sales1@az1secure0 ~]$ # remove existing copies of dataset from HDFS [sales1@az1secure0 ~]$ hadoop fs -rm -r -f /sales/airflightsdelays 16/11/15 16:28:28 INFO fs.TrashPolicyDefault: Moved: 'hdfs://az1secure0.field.hortonworks.com:8020/sales/airflightsdelays' to trash at: hdfs://az1secure0.field.hortonworks.com:8020/user/sales1/.Trash/Current/sales/airflightsdelays [sales1@az1secure0 ~]$ [sales1@az1secure0 ~]$ # create directory on HDFS [sales1@az1secure0 ~]$ hadoop fs -mkdir /sales/airflightsdelays [sales1@az1secure0 ~]$ [sales1@az1secure0 ~]$ # put data into HDFS [sales1@az1secure0 ~]$ hadoop fs -put /tmp/flights.csv /sales/airflightsdelays/ [sales1@az1secure0 ~]$ [sales1@az1secure0 ~]$ hadoop fs -cat /sales/airflightsdelays/flights.csv | head Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay 2008,1,3,4,2003,1955,2211,2225,WN,335,N712SW,128,150,116,-14,8,IAD,TPA,810,4,8,0,,0,NA,NA,NA,NA,NA 2008,1,3,4,754,735,1002,1000,WN,3231,N772SW,128,145,113,2,19,IAD,TPA,810,5,10,0,,0,NA,NA,NA,NA,NA 2008,1,3,4,628,620,804,750,WN,448,N428WN,96,90,76,14,8,IND,BWI,515,3,17,0,,0,NA,NA,NA,NA,NA 2008,1,3,4,926,930,1054,1100,WN,1746,N612SW,88,90,78,-6,-4,IND,BWI,515,3,7,0,,0,NA,NA,NA,NA,NA 2008,1,3,4,1829,1755,1959,1925,WN,3920,N464WN,90,90,77,34,34,IND,BWI,515,3,10,0,,0,2,0,0,0,32 2008,1,3,4,1940,1915,2121,2110,WN,378,N726SW,101,115,87,11,25,IND,JAX,688,4,10,0,,0,NA,NA,NA,NA,NA 2008,1,3,4,1937,1830,2037,1940,WN,509,N763SW,240,250,230,57,67,IND,LAS,1591,3,7,0,,0,10,0,0,0,47 2008,1,3,4,1039,1040,1132,1150,WN,535,N428WN,233,250,219,-18,-1,IND,LAS,1591,7,7,0,,0,NA,NA,NA,NA,NA 2008,1,3,4,617,615,652,650,WN,11,N689SW,95,95,70,2,2,IND,MCI,451,6,19,0,,0,NA,NA,NA,NA,NA cat: Unable to write to output stream. Now login as sales1 user in Zeppelin. Make sure livy is configured. <code>%livy.spark val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("/sales/airflightsdelays/") // Read all flights df.printSchema // Show a subset of columns with "select" df.select("UniqueCarrier", "FlightNum", "DepDelay", "ArrDelay", "Distance").show Your output should look like this: <code>df: org.apache.spark.sql.DataFrame = [Year: int, Month: int, DayofMonth: int, DayOfWeek: int, DepTime: string, CRSDepTime: int, ArrTime: string, CRSArrTime: int, UniqueCarrier: string, FlightNum: int, TailNum: string, ActualElapsedTime: string, CRSElapsedTime: int, AirTime: string, ArrDelay: string, DepDelay: string, Origin: string, Dest: string, Distance: int, TaxiIn: string, TaxiOut: string, Cancelled: int, CancellationCode: string, Diverted: int, CarrierDelay: string, WeatherDelay: string, NASDelay: string, SecurityDelay: string, LateAircraftDelay: string] +-------------+---------+--------+--------+--------+ |UniqueCarrier|FlightNum|DepDelay|ArrDelay|Distance| +-------------+---------+--------+--------+--------+ | WN| 335| 8| -14| 810| | WN| 3231| 19| 2| 810| | WN| 448| 8| 14| 515| | WN| 1746| -4| -6| 515| .... Now open a new browser for another user session hr1. Create a new notebook as hr userr and re-run similar steps <code>%livy.spark val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("/sales/airflightsdelays/") // Read all flights df.printSchema // Show a subset of columns with "select" df.select("UniqueCarrier", "FlightNum", "DepDelay", "ArrDelay", "Distance").show Your output should look like this: <code>org.apache.hadoop.security.AccessControlException: Permission denied: user=hr2, access=EXECUTE, inode="/sales/airflightsdelays":hadoopadmin:hdfs:d--------- at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319) Lets login to YARN Resource Manager UI to confirm that the zeppelin sessions ran as the end user sales1 and hr2 : Stay tuned for part 2 where we will show how to Protect sales hive table - so only sales group has access to it

azeltov · ‎11-15-2016

You must use Livy integration for the user tokens to be passed.

Online	Offline
Last Visited	‎08-14-2019 06:45 PM

Member Since	‎09-29-2015 01:18 AM
Last Visited	‎08-14-2019 06:45 PM
Posts	155
Kudos received	171

Cloudera Community

Re: LivyServer exception

Re: Ranger Dynamic query rewrite available for hiv...

Re: HDP 2.5 + Zeppelin 0.6 + LDAP : Interpreters a...

Re: How to import External Libraries for Livy Inte...

Re: Is Zeppelin in HDP 2.5 support multi-tenancy o...

Re: Dataframe Insert into ORC table is slow compar...

Re: LivyServer exception

Federated SparkSQL using SAP HANA and HDP Hive usi...

Using Spark 1.6.x dataframes to access SAP HANA ...

Basic CDC on SAP HANA using Apache NIFI

Re: Enabling TLS for the Zeppelin UI

Re: How to configure Zeppelin for Active Directory...

Re: Using Hive through JDBCInterpreter in Zeppelin

Securing Spark with Ranger using Zeppelin and Liv...

Re: Is Zeppelin in HDP 2.5 support multi-tenancy o...