Member since 09-18-2015
3274 Posts
1159 Kudos Received
426 Solutions
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 2619 | 11-01-2016 05:43 PM |
| | 8740 | 11-01-2016 05:36 PM |
| | 4925 | 07-01-2016 03:20 PM |
| | 8266 | 05-25-2016 11:36 AM |
| | 4432 | 05-24-2016 05:27 PM |
01-10-2016
03:08 AM
2 Kudos
@Cary Walker Here is the list of supported operating systems: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.4-Win/bk_HDP_Install_Win/content/ref-9bdea823-d29d-47f2-9434-86d5460b9aa9.1.html
... View more
01-09-2016
01:06 AM
1 Kudo
@Sunile Manjee Interesting question 🙂 jackson-core provides the core Jackson abstractions and the basic JSON streaming API implementation: http://mvnrepository.com/artifact/com.fasterxml.jackson.core/jackson-core/2.2.3
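To illustrate what jackson-core gives you on its own (no data binding), here is a minimal streaming-API sketch; the input JSON and class name are made up for the example:

```java
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;

public class StreamingDemo {
    public static void main(String[] args) throws Exception {
        String json = "{\"name\":\"hdp\",\"nodes\":3}"; // example input
        JsonFactory factory = new JsonFactory();
        try (JsonParser parser = factory.createParser(json)) {
            // Walk the token stream directly instead of binding to objects.
            while (parser.nextToken() != null) {
                if (parser.getCurrentToken() == JsonToken.FIELD_NAME) {
                    String field = parser.getCurrentName();
                    parser.nextToken(); // advance to the field's value
                    System.out.println(field + " = " + parser.getText());
                }
            }
        }
    }
}
```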
... View more
01-07-2016
01:31 PM
@John Smith Log in to the Sandbox environment from PuTTY or a Mac shell, then grab Ant and Subversion from the official download pages (a sketch of the Ant install follows below): http://ant.apache.org/srcdownload.cgi https://subversion.apache.org/packages.html
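A minimal sketch of installing a binary Ant distribution on the Sandbox; the version and mirror path here are illustrative, so pick the current build from the download page above:

```bash
# Download and unpack a binary Ant distribution (version is illustrative).
cd /opt
wget http://archive.apache.org/dist/ant/binaries/apache-ant-1.9.4-bin.tar.gz
tar xzf apache-ant-1.9.4-bin.tar.gz

# Put ant on the PATH for this session.
export ANT_HOME=/opt/apache-ant-1.9.4
export PATH=$ANT_HOME/bin:$PATH
ant -version
```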
... View more
01-07-2016
12:57 PM
@sivasaravanakumar k Found this easy example: http://stackoverflow.com/questions/17716386/any-on... Use a Hive query like SELECT b.blogID, count(b.name) FROM comments a LATERAL VIEW json_tuple(a.value, 'blogID', 'name') b AS blogID, name GROUP BY b.blogID; json_tuple flattens each JSON document into columns, and the aggregated result can then easily be fed to D3.js (see the sketch below).
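To make the flow concrete, a small sketch assuming a hypothetical comments table whose value column holds one JSON document per row:

```sql
-- Hypothetical source table: one JSON document per row.
CREATE TABLE comments (value STRING);

-- Example rows in comments.value:
--   {"blogID": "8dfgdf7fds", "name": "Ann"}
--   {"blogID": "8dfgdf7fds", "name": "Bob"}

-- json_tuple extracts the fields; GROUP BY yields one
-- (blogID, comment_count) pair per blog, ready to serialize for D3.js.
SELECT b.blogID, count(b.name) AS comment_count
FROM comments a
LATERAL VIEW json_tuple(a.value, 'blogID', 'name') b AS blogID, name
GROUP BY b.blogID;
```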
... View more
01-07-2016
03:28 AM
3 Kudos
@Kuldeep Kulkarni Disk alerts: https://community.hortonworks.com/articles/4870/ch... https://github.com/apache/ambari/tree/branch-2.1/a... We can take the AMS source code and customize it per the use case.
... View more
01-07-2016
02:41 AM
@Gokul Devaraj Did it help?
... View more
01-07-2016
02:31 AM
@sindhu seenivasan You have to upgrade the memory in your machine to get "100% free" working. Give this a try: http://hortonworks.com/blog/building-hadoop-vm-quickly-ambari-vagrant/
... View more
01-07-2016
02:23 AM
3 Kudos
Original Post: Hive and Google Cloud Storage

Google's Cloud Platform provides the infrastructure to perform MapReduce data analysis using open source software such as Hadoop with Hive and Pig. Google Compute Engine provides the compute power, and Cloud Storage stores the input and output of the MapReduce jobs.

HDP deployment using Cloudbreak

Before we deploy HDP in GCE, we need to set up accounts in GCE and Cloudbreak. Sign up for a free trial account at https://cloud.google.com/free-trial/

Step 1) Log in to your Google dashboard and click Create a project. For example, I created a project called hadoop-install.

Step 2) Create credentials. Click Create new Client ID and choose Service account. Click "Okay, got it" and it will download a JSON key (we won't be using this file). After the download, you will see the Client ID, Email address, and Certificate fingerprints in the same window, along with an option to Generate new P12 key.

Step 3) Enable the API. From the default API view, search for "google compute" and click Google Compute Engine, then click the Enable API option. These are the APIs that I have in enabled status.

For HDP deployment, you will need the Project ID, Email address, and P12 key file. GCE setup is complete, so let's move on to the Cloudbreak setup.

Sign up for a Cloudbreak account at https://accounts.sequenceiq.com/ and log in at https://cloudbreak.sequenceiq.com/

Once you are logged into the Cloudbreak UI, set up your GCP credentials. You will need the Project ID and the Email address from the Credentials tab. My Cloudbreak UI looks like the following. We will create credentials, a template, and a blueprint for the HDP deployment; this is a one-time process.

Credentials: Under manage credentials, choose GCP.
Name – Credential name
Description – As you like
Project ID – hadoop-install (get this value from the Google dashboard)
Service Account Email Address – "Email address" under Service account on the Credentials tab of the Google dashboard
Service Account Key – Upload the file that you renamed to hadoop.p12
SSH public key – Mac users can copy the contents of id_rsa.pub. Windows users need to get this from PuTTY (google search – putty public ssh keys)

Template: The next step is to manage resources (create a template).
Name – Template name
Description – As you like
Instance-Type – Choose as per your requirement (I chose n1-standard-2 for this test)
Volume Type – Magnetic/SSD
Attached volumes per instance – 1 for this test
Volume Size – 100GB (increase this value as per your requirement)

Blueprint: You can download the blueprint from here. Copy the content and paste it into the create blueprint window. I am saving the blueprint as hivegoogle. If you receive a blueprint error while creating the blueprint in Cloudbreak, you can use jsonvalidate to validate/format the blueprint.

Cluster Deployment
Select your credentials and click create cluster.
Clustername: Name your cluster
Region: Choose a region to deploy the cluster
Network: Choose a network
Blueprint: Choose the blueprint created earlier, hivegoogle
Hostgroup configuration: cbgateway, master, and slave – I am using minviable-gcp, but you can choose a template of your own choice.
Click "create and start cluster". You can watch the progress in the Event history. The final snapshot of the cluster looks like this:

Verify the Google Cloud related settings and provide project.id & the Google Cloud service email; you can find these details in the Google dashboard. Verify tez.aux.uris and make sure to copy the GCS connector to that location.
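For reference, these settings typically live in core-site.xml. A minimal sketch using the property names from Google's GCS connector; the email and key-path values are placeholders to replace with your own:

```xml
<!-- Sketch: GCS connector settings in core-site.xml (adjust values to your project). -->
<property>
  <name>fs.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
</property>
<property>
  <name>fs.gs.project.id</name>
  <value>hadoop-install</value> <!-- your GCE project ID -->
</property>
<property>
  <name>google.cloud.auth.service.account.email</name>
  <value>your-service-account@developer.gserviceaccount.com</value> <!-- placeholder -->
</property>
<property>
  <name>google.cloud.auth.service.account.keyfile</name>
  <value>/tmp/hadoop.p12</value> <!-- path to the P12 key on each node -->
</property>
```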
I have covered the copy process in the environment setup section below. Let's set up the environment before running HDFS and Hive commands. We need hadoop.p12 and the GCS connector on all the nodes (see the sketch below for one way to push both files out).
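Since every node needs both files, a minimal distribution sketch, assuming passwordless SSH as the cloudbreak user, a hypothetical host list, and both files already present locally:

```bash
# Hypothetical node names; substitute your instances' external IPs or hostnames.
for host in node1 node2 node3; do
  scp /tmp/hadoop.p12 "cloudbreak@${host}:/tmp/"
  scp /tmp/gcs-connector-latest-hadoop2.jar "cloudbreak@${host}:/tmp/"
done
```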
Copy hadoop.p12 to the path defined in the Ambari parameter google.cloud.auth.service.account.keyfile. You can upload hadoop.p12 to Dropbox and do a wget, or you can copy it from your localhost. Cloudbreak uses Docker containers to deploy the cluster, so we need to copy the file from the local desktop to the VM instance and then copy it into the container.

First, from the localhost to the VM instance (the external IP can be found in the Google dashboard under VM Instances):

```
HW11326:.ssh nsabharwal$ scp ~/Downloads/hadoop.p12 cloudbreak@130.211.184.135:/tmp
hadoop.p12                                          100% 2572     2.5KB/s   00:00
HW11326:.ssh nsabharwal$
```

Log in to the VM instance and download the GCS connector:

```
HW11326:.ssh nsabharwal$ ssh location
[hdfs@hdpgcp-1-1435537523061 ~]$ wget https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar
--2015-06-28 21:05:59-- https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar
Resolving storage.googleapis.com... 74.125.201.128, 2607:f8b0:4001:c01::80
Connecting to storage.googleapis.com|74.125.201.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2494559 (2.4M) [application/java-archive]
Saving to: `gcs-connector-latest-hadoop2.jar'
100%[==========================================>] 2,494,559   7.30M/s   in 0.3s
2015-06-28 21:05:59 (7.30 MB/s) - `gcs-connector-latest-hadoop2.jar' saved [2494559/2494559]
```

Copy the connector to the HDFS location:

```
[hdfs@hdpgcp-1-1435537523061 ~]$ hdfs dfs -put gcs-connector-latest-hadoop2.jar /apps/tez/aux-jars/
```

Let's create a storage bucket called hivetest in Google Storage: log in to your Google Compute Engine account and click Storage.

HDFS test

We need to copy the connector into the hadoop-client location, otherwise you will hit the error "Google FileSystem not found":

```
cp gcs-connector-latest-hadoop2.jar /usr/hdp/current/hadoop-client/lib/
```

```
[hdfs@hdpgcp-1-1435537523061 ~]$ hdfs dfs -ls gs://hivetest/
15/06/28 21:15:32 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.4.0-hadoop2
15/06/28 21:15:33 WARN gcs.GoogleHadoopFileSystemBase: No working directory configured, using default: 'gs://hivetest/'
Found 3 items
drwx------   - hdfs hdfs          0 2015-06-28 15:29 gs://hivetest/ns
drwx------   - hdfs hdfs          0 2015-06-28 12:44 gs://hivetest/test
drwx------   - hdfs hdfs          0 2015-06-28 15:30 gs://hivetest/tmp
```

Hive test

```
bash-4.1# su - hive
[hive@hdpgcptest-1-1435590069329 ~]$ hive
hive> create table testns ( info string) location 'gs://hivetest/testns';
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask.
MetaException(message:java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found)
```

To avoid the above error, we have to copy the GCS connector to all the nodes under hive-client:

```
cp /tmp/gcs-connector-latest-hadoop2.jar /usr/hdp/current/hive-client/lib
```

Let's run the following Apache Hive test. Data set: http://seanlahman.com/files/database/lahman591-csv.zip

We are writing to gs://hivetest:

```
hive> create table batting (col_value STRING) location 'gs://hivetest/batting';
OK
Time taken: 1.518 seconds
```

Run the following command to verify the location, 'gs://hivetest/batting':

```
hive> show create table batting;
OK
CREATE TABLE `batting`(
  `col_value` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'gs://hivetest/batting'
TBLPROPERTIES (
  'transient_lastDdlTime'='1435766262')
Time taken: 0.981 seconds, Fetched: 12 row(s)
```

Upload Batting.csv to the bucket, query the table, and then drop it:

```
hive> select count(1) from batting;
hive> drop table batting;
```

You will notice that Batting.csv is deleted from the storage, as it was a managed table. In the case of an external table, Batting.csv won't be removed from the storage bucket.

In case you want to test MR using Hive:

```
hive> add jar /usr/hdp/current/hive-client/lib/gcs-connector-latest-hadoop2.jar;
Added [/usr/hdp/current/hive-client/lib/gcs-connector-latest-hadoop2.jar] to class path
Added resources: [/usr/hdp/current/hive-client/lib/gcs-connector-latest-hadoop2.jar]
hive> select count(1) from batting;
Query ID = hive_20150702095454_c17ae70f-b77e-4599-87e6-022d9bb9a00d
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1435841827745_0003, Tracking URL = http://hdpgcptest-1-1435590069329.node.dc1.consul:8088/proxy/application_1435841827745_0003/
Kill Command = /usr/hdp/2.2.6.0-2800/hadoop/bin/hadoop job -kill job_1435841827745_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2015-07-02 09:54:33,468 Stage-1 map = 0%, reduce = 0%
2015-07-02 09:54:42,947 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.2 sec
2015-07-02 09:54:51,719 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.6 sec
MapReduce Total cumulative CPU time: 4 seconds 600 msec
Ended Job = job_1435841827745_0003
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 4.6 sec   HDFS Read: 187 HDFS Write: 6 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 600 msec
OK
95196
Time taken: 29.855 seconds, Fetched: 1 row(s)
```

SparkSQL

First, copy the GCS connector to the spark-historyserver lib directory and add it to SPARK_CLASSPATH to avoid "Caused by: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found":

```
export SPARK_CLASSPATH=/usr/hdp/current/spark-historyserver/lib/gcs-connector-latest-hadoop2.jar
```

I am following this article for the Spark test:

```
scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@140dcdc5

scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS batting ( col_value STRING) location 'gs://hivetest/batting' ")

scala> sqlContext.sql("select count(*) from batting").collect().foreach(println)
15/07/01 15:38:42 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 0 is 187 bytes
15/07/01 15:38:42 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 286 ms on hdpgcptest-2-1435590069361.node.dc1.consul (1/1)
15/07/01 15:38:42 INFO YarnClientClusterScheduler: Removed TaskSet 1.0, whose tasks have all completed, from pool
15/07/01 15:38:42 INFO DAGScheduler: Stage 1 (collect at SparkPlan.scala:84) finished in 0.295 s
[95196]
15/07/01 15:38:42 INFO DAGScheduler: Job 0 finished: collect at SparkPlan.scala:84, took 8.872396 s
```
... View more
01-06-2016
09:57 PM
@Ram D Yes, based on this: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-... Assign Node Labels to Cluster Nodes. Use the following command format to add or replace node label assignments on cluster nodes:

```
yarn rmadmin -replaceLabelsOnNode "<node1>:<port>=<label1> <node2>:<port>=<label2>"
```

For example, the following commands assign node label "x" to "node-1.example.com" and node label "y" to "node-2.example.com":

```
sudo su yarn
yarn rmadmin -replaceLabelsOnNode "node-1.example.com=x node-2.example.com=y"
```
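For completeness, a sketch of the full flow on recent Hadoop 2.x releases, assuming node labels are already enabled in yarn-site.xml (yarn.node-labels.enabled=true); the labels must be registered with the cluster before they can be assigned:

```bash
# Register the labels with the ResourceManager first (one-time).
yarn rmadmin -addToClusterNodeLabels "x,y"

# Assign the labels to specific nodes.
yarn rmadmin -replaceLabelsOnNode "node-1.example.com=x node-2.example.com=y"

# Verify the cluster's node labels.
yarn cluster --list-node-labels
```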
... View more
01-06-2016
09:53 PM
2 Kudos
@vijaya inturi You can use this to write from an existing table into HDFS:

```
INSERT OVERWRITE DIRECTORY '/path/to/output/dir' SELECT * FROM table;
```
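By default the output uses Hive's \001 (Ctrl-A) field delimiter; if you need a friendlier format, Hive 0.11 and later also accept a ROW FORMAT clause. A sketch, assuming a hypothetical employees table:

```sql
-- Sketch: write comma-delimited output (requires Hive 0.11 or later).
INSERT OVERWRITE DIRECTORY '/tmp/employees_export'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
SELECT * FROM employees;
```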
... View more