Member since 09-18-2015
3274 Posts
1159 Kudos Received
426 Solutions
01-18-2016
03:20 PM
3 Kudos
Original post Use case: access control on the table customer, excluding the SSN column. User hive is allowed to see only the name column; access to the SSN column is restricted. Column-level security like this can be set up in a couple of clicks in the Ranger UI.
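A quick way to verify the policy from beeline (a sketch only; the JDBC URL is a placeholder and the exact Ranger error text varies by version):
beeline -u "jdbc:hive2://<hiveserver2-host>:10000" -n hive
-- allowed: the name column is covered by the Ranger policy
SELECT name FROM customer LIMIT 5;
-- denied: SSN is excluded from the policy, so HiveServer2 returns a Ranger authorization error similar to
-- Permission denied: user [hive] does not have [SELECT] privilege on [default/customer/ssn]
SELECT ssn FROM customer LIMIT 5;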
... View more
Labels:
01-17-2016
08:32 PM
8 Kudos
User demouser needs access to /landingzone.
The HDFS superuser created a directory called /landingzone, and user demouser does not have access to it. We will use Ranger to grant and control that access.
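A minimal sketch of what this looks like from the command line (paths from this post; the exact permission-denied message depends on your cluster):
# before any Ranger policy exists, demouser is denied
sudo -u demouser hdfs dfs -ls /landingzone
# ls: Permission denied: user=demouser, access=READ_EXECUTE, inode="/landingzone"
# after adding an HDFS policy in Ranger (resource path /landingzone, user demouser, Read/Write/Execute), the same commands succeed
sudo -u demouser hdfs dfs -put data.csv /landingzone
sudo -u demouser hdfs dfs -ls /landingzone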
... View more
Labels:
01-16-2016
01:59 AM
5 Kudos
Original 1) Setup Azure account 2) Setup CloudBreak account
Very important steps (applies to Azure only):
Create a test network in Azure before you start creating Cloudbreak credentials.
On your local machine, run the following and accept the default values:
openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout azuretest.key -out azuretest.pem
You will see the 2 files listed below.
-rw-r--r-- 1 nsabharwal staff 1346 May 7 17:00 azuretest.pem --> We need this file to create credentials in Cloudbreak.
-rw-r--r-- 1 nsabharwal staff 1679 May 7 17:00 azuretest.key --> We need this to log in to the host after cluster deployment.
chmod 400 azuretest.key --> otherwise, you will receive a bad-permissions error, for example: ssh -i azuretest.key ubuntu@<server>
Very important: check your openssl version, and if it is the latest version, run the following and use azuretest_login.key to log in:
openssl rsa -in azuretest.key -out azuretest_login.key
hw11326:jumk nsabharwal$ openssl version
OpenSSL 0.9.8zc 15 Oct 2014
The latest version of openssl creates a .key starting with -----BEGIN PRIVATE KEY-----
Old openssl creates keys starting with -----BEGIN RSA PRIVATE KEY----- (we need this).
Log in to the Cloudbreak portal and create an Azure credential. Once you fill in the information and hit create credentials, you will get a file from Cloudbreak that needs to be uploaded into the Azure portal. I saved it as azuretest.cert
Log in to the Azure portal (switch to classic mode in case you are using the new portal), click Settings --> Manage Certificates, then upload it at the bottom of the screen.
There are 2 more actions in the CloudBreak window:
1) Create a template. You can change the instance type & volume type as per your setup.
2) Create a blueprint. You can grab sample blueprints here. (You may have to format the blueprint in case there is any issue.)
Once all this is done, you are all set to deploy the cluster: select the credential and hit create cluster.
Create cluster window
Handy commands to log in to docker. Log in to your host:
ssh -i azuretest.key ubuntu@fqdn
"New announcement: Just found out that the user needs to be cloudbreak instead of ubuntu"
ssh -i azuretest.key cloudbreak@fqdn
Once you are in the shell:
sudo su -
docker ps
docker exec -it <container id> bash
[root@azuretest ~]# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
f493922cd629 sequenceiq/docker-consul-watch-plugn:1.7.0-consul "/start.sh" 2 hours ago Up 2 hours consul-watch
100e7c0b6d3d sequenceiq/ambari:2.0.0-consul "/start-agent" 2 hours ago Up 2 hours ambari-agent
d05b85859031 sequenceiq/consul:v0.4.1.ptr "/bin/start -adverti 2 hours ago Up 2 hours consul
[root@test~]# docker exec -it 100e7c0b6d3d bash
bash-4.1# docker commands
Happy Hadooping!!!!
Note: For the latest information and changes, please see https://github.com/sequenceiq/cloudbreak
Hadoop
Cloud Computing
Big Data
... View more
Labels:
01-07-2016
02:23 AM
3 Kudos
Original Post: Hive and Google Cloud Storage
Google's Cloud Platform provides the infrastructure to perform MapReduce data analysis using open source software such as Hadoop with Hive and Pig. Google Compute Engine provides the compute power, and Cloud Storage is used to store the input and output of the MapReduce jobs.
HDP deployment using CloudBreak
Before we deploy HDP in GCE, we need to set up accounts in GCE and CloudBreak. Sign up for a free trial account at https://cloud.google.com/free-trial/
Step 1) Log in to your Google dashboard and then click Create a project. For example, I created a project called hadoop-install.
Step 2) Create credentials. Click Create new Client ID and then choose Service account. Click Okay, got it and it will download a JSON key (we won't be using this file). You will see Client ID, Email address and Certificate fingerprints in the same window after downloading the JSON key. There will be an option to Generate new P12 key.
Step 3) Enable the API. Search for google compute and click Google Compute Engine. You will see an option to Enable API that you need to click.
For HDP deployment, you will need the Project ID, Email address and P12 key file. GCE setup is complete, so let's move on to the CloudBreak setup.
Sign up for a CloudBreak account at https://accounts.sequenceiq.com/ and log in at https://cloudbreak.sequenceiq.com/
Once you are logged into the Cloudbreak UI, set up the GCP credentials. You will need the project ID and the Email address from the Credentials tab. We will be creating credentials, a template and a blueprint for the HDP deployment; this is a one-time process.
Credentials: under manage credentials, choose GCP.
Name – credential name
Description – as you like
Project ID – hadoop-install (get this value from the Google dashboard)
Service Account Email Address – "Email address" under Service account on the Credentials tab of the Google dashboard
Service Account Key – upload the P12 key file that you renamed hadoop.p12
SSH public key – Mac users can copy the content of id_rsa.pub; Windows users need to get this from PuTTY (google search – putty public ssh keys)
Template: the next step is to manage resources (create a template).
Name – template name
Description – as you like
Instance Type – choose as per your requirement (I chose n1-standard-2 for this test)
Volume Type – Magnetic/SSD
Attached volumes per instance – 1 for this test
Volume Size – 100GB (increase this value as per your requirement)
Blueprint: you can download the blueprint from here. Copy the content and paste it into the create blueprint window. I am saving the blueprint as hivegoogle. In case you receive a blueprint error while creating it in CloudBreak, you can use jsonvalidate to validate/format the blueprint.
Cluster Deployment
Select your credentials and click create cluster.
Clustername: name your cluster
Region: choose the region to deploy the cluster
Network: choose the network
Blueprint: choose the blueprint created above, hivegoogle
Hostgroup configuration: cbgateway, master and slave – I am using minviable-gcp, but you can choose the template of your own choice.
Click "create and start cluster". You can see the progress in the Event history.
Verify the Google Cloud related settings and provide the project ID and Google Cloud service account email; you can find these details in the Google dashboard. Verify tez.aux.uris and make sure to copy the gcs connector to that location.
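The Google Cloud settings mentioned above map to Hadoop configuration properties. A minimal sketch of what they typically look like in core-site.xml (set via Ambari; the property names come from the GCS connector, the values below are placeholders for this example):
<property>
  <name>fs.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
</property>
<property>
  <name>fs.gs.project.id</name>
  <value>hadoop-install</value>
</property>
<property>
  <name>google.cloud.auth.service.account.email</name>
  <value>your-service-account@developer.gserviceaccount.com</value>
</property>
<property>
  <name>google.cloud.auth.service.account.keyfile</name>
  <value>/path/to/hadoop.p12</value>
</property>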
I have covered the copy process in the environment setup section below. Let's set up the environment before running hdfs and hive commands. We need hadoop.p12 and the gcs connector on all the nodes.
Copy hadoop.p12 to the location defined in the Ambari parameter google.cloud.auth.service.account.keyfile. You can upload hadoop.p12 to Dropbox and wget it, or copy it from your localhost. Cloudbreak uses docker containers to deploy the cluster, so we need to copy the file from the local desktop to the VM instance and then copy it into the container.
First, from the localhost to the VM instance (the external IP can be found in the Google dashboard under VM Instances):
HW11326:.ssh nsabharwal$ scp ~/Downloads/hadoop.p12 cloudbreak@130.211.184.135:/tmp
hadoop.p12 100% 2572 2.5KB/s 00:00
HW11326:.ssh nsabharwal$
Log in to the VM instance and download the GCS connector:
HW11326:.ssh nsabharwal$ ssh location
[hdfs@hdpgcp-1-1435537523061 ~]$ wget https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar
--2015-06-28 21:05:59-- https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar
Resolving storage.googleapis.com... 74.125.201.128, 2607:f8b0:4001:c01::80
Connecting to storage.googleapis.com|74.125.201.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2494559 (2.4M) [application/java-archive]
Saving to: `gcs-connector-latest-hadoop2.jar'
100%[==========>] 2,494,559 7.30M/s in 0.3s
2015-06-28 21:05:59 (7.30 MB/s) - `gcs-connector-latest-hadoop2.jar' saved [2494559/2494559]
Copy the connector to the HDFS location:
[hdfs@hdpgcp-1-1435537523061 ~]$ hdfs dfs -put gcs-connector-latest-hadoop2.jar /apps/tez/aux-jars/
[hdfs@hdpgcp-1-1435537523061 ~]$
Let's create a storage bucket called hivetest in Google Storage. Log in to your Google Compute Engine account and click Storage.
HDFS test
We need to copy the connector into the hadoop-client location, otherwise you will hit the error "Google FileSystem not found":
cp gcs-connector-latest-hadoop2.jar /usr/hdp/current/hadoop-client/lib/
[hdfs@hdpgcp-1-1435537523061 ~]$ hdfs dfs -ls gs://hivetest/
15/06/28 21:15:32 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.4.0-hadoop2
15/06/28 21:15:33 WARN gcs.GoogleHadoopFileSystemBase: No working directory configured, using default: 'gs://hivetest/'
Found 3 items
drwx------ - hdfs hdfs 0 2015-06-28 15:29 gs://hivetest/ns
drwx------ - hdfs hdfs 0 2015-06-28 12:44 gs://hivetest/test
drwx------ - hdfs hdfs 0 2015-06-28 15:30 gs://hivetest/tmp
[hdfs@hdpgcp-1-1435537523061 ~]$
Hive test
bash-4.1# su - hive
[hive@hdpgcptest-1-1435590069329 ~]$ hive
hive> create table testns ( info string) location 'gs://hivetest/testns';
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask.
MetaException(message:java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found)
hive>
To avoid the above error, we have to copy the gcs connector onto all the nodes under hive-client:
cp /tmp/gcs-connector-latest-hadoop2.jar /usr/hdp/current/hive-client/lib
Let's run the following Apache Hive test.
Data set: http://seanlahman.com/files/database/lahman591-csv.zip
We are writing to gs://hivetest.
hive> create table batting (col_value STRING) location 'gs://hivetest/batting';
OK
Time taken: 1.518 seconds
Run the following command to verify the location, 'gs://hivetest/batting':
hive> show create table batting;
OK
CREATE TABLE `batting`(
`col_value` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'gs://hivetest/batting'
TBLPROPERTIES ( 'transient_lastDdlTime'='1435766262')
Time taken: 0.981 seconds, Fetched: 12 row(s)
hive> select count(1) from batting;
Upload Batting.csv to the table location, then:
hive> drop table batting;
You will notice that Batting.csv is deleted from the storage, as it was a locally managed table. In the case of an external table, Batting.csv would not be removed from the storage bucket.
In case you want to test MR using Hive:
hive> add jar /usr/hdp/current/hive-client/lib/gcs-connector-latest-hadoop2.jar;
Added [/usr/hdp/current/hive-client/lib/gcs-connector-latest-hadoop2.jar] to class path
Added resources: [/usr/hdp/current/hive-client/lib/gcs-connector-latest-hadoop2.jar]
hive> select count(1) from batting;
Query ID = hive_20150702095454_c17ae70f-b77e-4599-87e6-022d9bb9a00d
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number>
In order to set a constant number of reducers: set mapreduce.job.reduces=<number>
Starting Job = job_1435841827745_0003, Tracking URL = http://hdpgcptest-1-1435590069329.node.dc1.consul:8088/proxy/application_1435841827745_0003/
Kill Command = /usr/hdp/2.2.6.0-2800/hadoop/bin/hadoop job -kill job_1435841827745_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2015-07-02 09:54:33,468 Stage-1 map = 0%, reduce = 0%
2015-07-02 09:54:42,947 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.2 sec
2015-07-02 09:54:51,719 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.6 sec
MapReduce Total cumulative CPU time: 4 seconds 600 msec
Ended Job = job_1435841827745_0003
MapReduce Jobs Launched: Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 4.6 sec HDFS Read: 187 HDFS Write: 6 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 600 msec
OK
95196
Time taken: 29.855 seconds, Fetched: 1 row(s)
hive>
SparkSQL
First, copy the gcs connector to spark-historyserver to avoid "Caused by: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found":
export SPARK_CLASSPATH=/usr/hdp/current/spark-historyserver/lib/gcs-connector-latest-hadoop2.jar
I am following this article for the Spark test.
scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@140dcdc5
scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS batting ( col_value STRING) location 'gs://hivetest/batting' ")
scala> sqlContext.sql("select count(*) from batting").collect().foreach(println)
15/07/01 15:38:42 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 0 is 187 bytes
15/07/01 15:38:42 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 286 ms on hdpgcptest-2-1435590069361.node.dc1.consul (1/1)
15/07/01 15:38:42 INFO YarnClientClusterScheduler: Removed TaskSet 1.0, whose tasks have all completed, from pool
15/07/01 15:38:42 INFO DAGScheduler: Stage 1 (collect at SparkPlan.scala:84) finished in 0.295 s
[95196]
15/07/01 15:38:42 INFO DAGScheduler: Job 0 finished: collect at SparkPlan.scala:84, took 8.872396 s
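Following up on the managed vs. external table note in the Hive test above, a minimal sketch of the external-table variant (the table name batting_ext is just an example):
hive> create external table batting_ext (col_value STRING) location 'gs://hivetest/batting';
hive> drop table batting_ext;
Dropping the external table removes only the Hive metadata; Batting.csv stays in gs://hivetest/batting.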
... View more
Labels:
12-24-2015
03:41 PM
4 Kudos
Part 2
Linkedin Post
Extending Blog 2 to look for Star Wars tweets
Searching for Yoda, Love & Hate
Let's look at the tweets/data containing the word YODA among the Star Wars tweets.
Keyword: LOVE in STARWARS
Source: Giphy
Keyword: HATE
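For reference, the same keyword searches can be run directly against the Solr tweets collection set up in Part 1 (a sketch; the field name text_t is an assumption, adjust it to whatever field your schema indexes the tweet text into):
curl "http://localhost:8983/solr/tweets/select?q=text_t:yoda&rows=10&wt=json"
curl "http://localhost:8983/solr/tweets/select?q=text_t:love&rows=10&wt=json"
curl "http://localhost:8983/solr/tweets/select?q=text_t:hate&rows=10&wt=json"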
Happy Hadooping!!!
... View more
Labels:
12-24-2015
03:34 PM
7 Kudos
Part 1
Linkedin Post Part 1 - In case you missed it, the Introduction to Apache NiFi.
Assumption - HDP and NiFi installation is in place. You can use the HDP Sandbox if you don't have a cluster. NiFi installation - you can follow Blog 1.
The end goal of this tutorial is to display tweets related to particular search terms. For example: my Twitter id is allaboutbdata, and the following screenshot shows a tweet sent on Twitter and the same tweet in HDFS/Hive and Solr. The whole setup was done using NiFi.
Demo
Install HDP search:
yum install -y lucidworks-hdpsearch
Create a user directory in HDFS & change permissions:
sudo -u hdfs hadoop fs -mkdir /user/solr
sudo -u hdfs hadoop fs -chown solr /user/solr
chown -R solr:solr /opt/lucidworks-hdpsearch/solr
Setup Solr:
su solr
cd /opt/lucidworks-hdpsearch/solr/server/solr-webapp/webapp/banana/app/dashboards/
mv default.json default.json.orig
wget https://raw.githubusercontent.com/abajwa-hw/ambari-nifi-service/master/demofiles/default.json
Important: you must change the hostname if you are not using the HDP sandbox; at line number 740, add <str>EEE MMM d HH:mm:ss Z yyyy</str> for the tweet timestamp:
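A sketch of where that format line typically goes in solrconfig.xml, inside the ParseDateFieldUpdateProcessorFactory section (the existing entries around it may differ in your file):
<processor class="solr.ParseDateFieldUpdateProcessorFactory">
  <arr name="format">
    <str>EEE MMM d HH:mm:ss Z yyyy</str>
    <!-- keep the date formats already listed here -->
  </arr>
</processor>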
vi /opt/lucidworks-hdpsearch/solr/server/solr/configsets/data_driven_schema_configs/conf/solrconfig.xml
Start Solr in cloud mode and create a collection called tweets:
export JAVA_HOME=/usr/jdk64/jdk1.8.0_60/jre/
/opt/lucidworks-hdpsearch/solr/bin/solr start -c -z localhost:2181
/opt/lucidworks-hdpsearch/solr/bin/solr create -c tweets -d data_driven_schema_configs -s 1 -rf 1
Download the Twitter NiFi template from here. Import the template by clicking the 3rd icon from the left as shown below. Browse and import the xml file that you downloaded. Click the X on the extreme right-hand side at the top to close the popup.
Now, let's load the template. Click the 7th icon from the left and drag it onto the canvas.
Now, let's configure the Twitter template. Set up a Twitter developer account to create an app. Once done, you need the following information for the GetTwitter processor. Start the flow.
Source
Happy Hadooping!!
... View more
Labels:
12-24-2015
03:31 PM
7 Kudos
Linkedin Post Apache NiFi is an easy to use, powerful, and reliable system to process and distribute data. Apache NiFi is based on technology previously called “Niagara Files” that was in development and used at scale within the NSA for the last 8 years and was made available to the Apache Software Foundation through the NSA Technology Transfer Program. Some of the use cases include, but are not limited to:
Big Data Ingest – Offers a simple, reliable and secure way to collect data streams.
IoAT Optimization – Allows organizations to overcome real world constraints such as limited or expensive bandwidth while ensuring data quality and reliability.
Compliance – Enables organizations to understand everything that happens to data in motion from its creation to its final resting place, which is particularly important for regulated industries that must retain and report on chain of custody.
Digital Security – Helps organizations collect large volumes of data from many sources and prioritize which data is brought back for analysis first, a critical capability given the time sensitivity of identifying security breaches. Source
Demo
Installation: download, untar or unzip the package and modify conf/nifi.properties. I added the nifi host and changed the port from 8080 to 9080. Alternatively, deploy the NiFi Ambari service by using this.
NiFi UI: http://nifihost:9080/nifi/
We are going to work on 3 use cases. Part 1 focuses on a very basic use case.
1) Copy files from the local filesystem into HDFS
Processor - remember this word, because we will be playing with tons of processors while working on the use cases. You will "drag" a Processor onto the canvas: filter by "getfile" and click Add, then search "hdfs" for the put processor. Now we have GetFile and PutHDFS on the canvas. Right-click on a processor to see all the options.
In this case, I am copying the data from /landing into HDFS /sourcedata. Right-click on the GetFile processor and it will give you the configuration option. The input directory is /landing and, in my case, I am keeping Keep Source File set to false.
Now, let's configure PutHDFS. Add the complete locations of core-site.xml and hdfs-site.xml as shown below. You can label the processor as you like by clicking Settings, and also enable the failure and success relationships.
Now, let's set up the relationship between Get and Put: drag the arrow with the + sign to PutHDFS.
The following screenshot is from my demo environment.
Happy Hadooping!!!
... View more
Labels:
12-23-2015
11:07 AM
5 Kudos
Original post A web-based notebook that enables interactive data analytics.
You can make beautiful data-driven, interactive and collaborative documents with SQL, Scala and more. In a few words: "It's a really cool tool to interact with data" - HDFS, Hive, Spark, Kylin, Flink. This is from the latest HDP Sandbox. Continue to Blog 3 on NiFi.
Let's analyze the Star Wars data.
Hive Demo
Table definition and top 10 users based on tweet count
Top 10 users who used the word "love" in #starwars
Word "hate" used in #starwars
Word "yoda" used in #starwars
You can see the tweet sent by my id in the Zeppelin output.
Spark
I used this for the sentiment analysis. Replace %hive with %sql (assuming that you have set up Zeppelin correctly).
Links: Zeppelin, Hortonworks and Zeppelin
Happy Hadooping!!!
... View more
Labels:
12-12-2015
01:29 PM
4 Kudos
Download connector: http://hortonworks.com/hdp/addons/
**** Extract tar file ****
**** Copy jar into sqoop-client/lib ****
cp *.jar /usr/hdp/current/sqoop-client/lib/
**** Create tables in Teradata ****
We will be exporting data from the HDFS location /tmp/test into Teradata.
**** Sqoop ****
sqoop export --connect jdbc:teradata://teradatahost/Database=DBName --connection-manager org.apache.sqoop.teradata.TeradataConnManager --username user --password passwd --table test --export-dir /tmp/test/ --batch
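For the "Create tables in Teradata" step, a hypothetical target table sketch (the column layout below is an example only; it must match the fields in the files under /tmp/test):
CREATE TABLE DBName.test (
  id INTEGER,
  name VARCHAR(100)
);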
... View more
Labels: