Member since: 06-05-2019
Posts: 117
Kudos Received: 127
Solutions: 11
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 390 | 12-17-2016 08:30 PM |
| | 285 | 08-08-2016 07:20 PM |
| | 674 | 08-08-2016 03:13 PM |
| | 511 | 08-04-2016 02:49 PM |
| | 611 | 08-03-2016 06:29 PM |
05-16-2018
11:20 PM
7 Kudos
In order to debug DLM pairing, you'll need the following prerequisite: root access to the DPS VM.

Problem statement - have you received an error when pairing a cluster? Follow these step-by-step instructions to access the DLM log and gain granular log information that will help you debug:

1) Run the "sudo docker ps" command to get the container ID for "dlm-app". In the image above, the container ID for "dlm-app" is "83d879e9a45e".

2) Once you have the container ID, run the following command: "sudo docker exec -it 83d879e9a45e /bin/tailf /usr/dlm-app/logs/application.log". This will give you insight into the DPS-DLM application; in the example above you'll see "ERROR". The error will be written to the log once you click "pair" in the DLM UI. Using the information from the log, you'll be able to troubleshoot your issue.
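For reference, a minimal shell sketch of both steps together - the container ID is just the example from this article, so substitute whatever ID "sudo docker ps" returns on your DPS VM:

#Step 1 - list the running containers and note the CONTAINER ID for dlm-app
sudo docker ps | grep dlm-app
#Step 2 - tail the DLM application log inside that container (replace the example ID with yours)
sudo docker exec -it 83d879e9a45e /bin/tailf /usr/dlm-app/logs/application.log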
- Find more articles tagged with:
- dlm
- dps
- Issue Resolution
- pair
- setup
04-03-2018
07:50 PM
4 Kudos
Reading and writing files to a MapR cluster (version 6) is simple using the standard PutFile or GetFile processors, utilizing the MapR NFS. If you've searched high and low on how to do this, you've likely read articles and GitHub projects specifying steps. I've tried these steps without success, meaning what's out there is either too complicated or too outdated to solve NiFi reading/writing to MapR. You don't need to re-compile the HDFS processors with the MapR dependencies; just follow the steps below:

1) Install the MapR client on each NiFi node

#Install syslinux (for rpm install)
sudo yum install syslinux
#Download the RPM for your OS http://package.mapr.com/releases/v6.0.0/redhat/
rpm -Uvh mapr-client-6.0.0.20171109191718.GA-1.x86_64.rpm
#Configure the mapr client connecting with the cldb
/opt/mapr/server/configure.sh -c -N ryancicak.com -C cicakmapr0.field.hortonworks.com:7222 -genkeys -secure
#Once you have the same users/groups on your OS (as MapR), you will be able to use maprlogin password (allowing you to login with a Kerberos ticket)
#Prove that you can access the MapR FS
hadoop fs -ls /
2) Mount the MapR FS on each NiFi node:

sudo mount -o hard,nolock cicakmapr0.field.hortonworks.com:/mapr /mapr

*This will allow you to access the MapR FS on the mount point /mapr/yourclustername.com/location

3) Use the PutFile and GetFile processors, referencing the /mapr directory on your NiFi nodes.

*Following steps 1-3 allows you to quickly read/write to MapR using NiFi.
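Before pointing PutFile/GetFile at the mount, it's worth a quick sanity check of the NFS mount - a minimal sketch, assuming the cluster name ryancicak.com from the configure.sh step above and a writable /tmp volume on the MapR FS:

#Confirm the NFS mount is present
mount | grep /mapr
#Write and read a test file through the same mount point NiFi will use
echo "nifi mapr test" > /mapr/ryancicak.com/tmp/nifi_test.txt
cat /mapr/ryancicak.com/tmp/nifi_test.txt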
- Find more articles tagged with:
- HDFS
- How-ToTutorial
- mapr
- NiFi
- nifi-processor
- putfile
- solutions
03-22-2018
07:07 PM
1 Kudo
Step-by-step instructions on how-to install a CRAN package from a local repo - without an internet connection.
I'll be installing the package called "tidyr". In order to fully install the package, I need to first download tidyr and all of its dependencies. To do this, I use the CRAN PACKAGES file:
https://cran.r-project.org/src/contrib/PACKAGES
1) Finding "Package: tidyr", I can see the dependencies (imports):
Imports: dplyr (>= 0.7.0), glue, magrittr, purrr, Rcpp, rlang, stringi,
tibble, tidyselect
and from them, I can build a list of all the imported packages I need:
assertthat_0.2.0.tar.gz BH_1.66.0-1.tar.gz bindr_0.1.1.tar.gz bindrcpp_0.2.tar.gz cli_1.0.0.tar.gz crayon_1.3.4.tar.gz dplyr_0.7.4.tar.gz glue_1.2.0.tar.gz magrittr_1.5.tar.gz pillar_1.2.1.tar.gz plogr_0.1-1.tar.gz purrr_0.2.4.tar.gz Rcpp_0.12.16.tar.gz rlang_0.2.0.tar.gz stringi_1.1.7.tar.gz tibble_1.4.2.tar.gz tidyr_0.8.0.tar.gz tidyselect_0.2.4.tar.gz utf8_1.1.3.tar.gz R6_2.2.2.tar.gz pkgconfig_2.0.1.tar.gz
2) Download the packages from https://cran.r-project.org/src/contrib/
3) Create a directory to emulate the CRAN repo, in my example I created /tmp/ryantester/src/contrib
4) Create a PACKAGES file within /tmp/ryantester/src/contrib - since this tutorial covers tidyr, I'll include the necessary packages
Package: R6
Version: 2.2.2
Depends: R (>= 3.0)
Suggests: knitr, microbenchmark, pryr, testthat, ggplot2, scales
License: MIT + file LICENSE
NeedsCompilation: no
Package: pkgconfig
Version: 2.0.1
Imports: utils
Suggests: covr, testthat, disposables (>= 1.0.3)
License: MIT + file LICENSE
NeedsCompilation: no
Package: bindr
Version: 0.1.1
Suggests: testthat
License: MIT + file LICENSE
NeedsCompilation: no
Package: bindrcpp
Version: 0.2
Imports: Rcpp, bindr
LinkingTo: Rcpp, plogr
Suggests: testthat
License: MIT + file LICENSE
NeedsCompilation: yes
Package: plogr
Version: 0.1-1
Suggests: Rcpp
License: MIT + file LICENSE
NeedsCompilation: no
Package: BH
Version: 1.66.0-1
License: BSL-1.0
NeedsCompilation: no
Package: dplyr
Version: 0.7.4
Depends: R (>= 3.1.2)
Imports: assertthat, bindrcpp (>= 0.2), glue (>= 1.1.1), magrittr,
methods, pkgconfig, rlang (>= 0.1.2), R6, Rcpp (>= 0.12.7),
tibble (>= 1.3.1), utils
LinkingTo: Rcpp (>= 0.12.0), BH (>= 1.58.0-1), bindrcpp, plogr
Suggests: bit64, covr, dbplyr, dtplyr, DBI, ggplot2, hms, knitr, Lahman
(>= 3.0-1), mgcv, microbenchmark, nycflights13, rmarkdown,
RMySQL, RPostgreSQL, RSQLite, testthat, withr
License: MIT + file LICENSE
NeedsCompilation: yes
Package: utf8
Version: 1.1.3
Depends: R (>= 2.10)
Suggests: corpus, knitr, rmarkdown, testthat
License: Apache License (== 2.0) | file LICENSE
NeedsCompilation: yes
Package: assertthat
Version: 0.2.0
Imports: tools
Suggests: testthat
License: GPL-3
NeedsCompilation: no
Package: cli
Version: 1.0.0
Depends: R (>= 2.10)
Imports: assertthat, crayon, methods
Suggests: covr, mockery, testthat, withr
License: MIT + file LICENSE
NeedsCompilation: no
Package: crayon
Version: 1.3.4
Imports: grDevices, methods, utils
Suggests: mockery, rstudioapi, testthat, withr
License: MIT + file LICENSE
NeedsCompilation: no
Package: pillar
Version: 1.2.1
Imports: cli (>= 1.0.0), crayon (>= 1.3.4), methods, rlang (>= 0.2.0),
utf8 (>= 1.1.3)
Suggests: knitr (>= 1.19), lubridate, testthat (>= 2.0.0)
License: GPL-3
NeedsCompilation: no
Package: tidyselect
Version: 0.2.4
Depends: R (>= 3.1)
Imports: glue, purrr, rlang (>= 0.2.0), Rcpp (>= 0.12.0)
LinkingTo: Rcpp (>= 0.12.0),
Suggests: covr, dplyr, testthat
License: GPL-3
NeedsCompilation: yes
Package: tibble
Version: 1.4.2
Depends: R (>= 3.1.0)
Imports: cli, crayon, methods, pillar (>= 1.1.0), rlang, utils
Suggests: covr, dplyr, import, knitr (>= 1.5.32), microbenchmark,
mockr, nycflights13, rmarkdown, testthat, withr
License: MIT + file LICENSE
NeedsCompilation: yes
Package: stringi
Version: 1.1.7
Depends: R (>= 2.14)
Imports: tools, utils, stats
License: file LICENSE
License_is_FOSS: yes
NeedsCompilation: yes
Package: rlang
Version: 0.2.0
Depends: R (>= 3.1.0)
Suggests: crayon, knitr, methods, pillar, rmarkdown (>= 0.2.65),
testthat, covr
License: GPL-3
NeedsCompilation: yes
Package: Rcpp
Version: 0.12.16
Depends: R (>= 3.0.0)
Imports: methods, utils
Suggests: RUnit, inline, rbenchmark, knitr, rmarkdown, pinp, pkgKitten
(>= 0.1.2)
License: GPL (>= 2)
NeedsCompilation: yes
Package: purrr
Version: 0.2.4
Depends: R (>= 3.1)
Imports: magrittr (>= 1.5), rlang (>= 0.1), tibble
Suggests: covr, dplyr (>= 0.4.3), knitr, rmarkdown, testthat
License: GPL-3 | file LICENSE
NeedsCompilation: yes
Package: magrittr
Version: 1.5
Suggests: testthat, knitr
License: MIT + file LICENSE
NeedsCompilation: no
Package: glue
Version: 1.2.0
Depends: R (>= 3.1)
Imports: methods
Suggests: testthat, covr, magrittr, crayon, knitr, rmarkdown, DBI,
RSQLite, R.utils, forcats, microbenchmark, rprintf, stringr,
ggplot2
License: MIT + file LICENSE
NeedsCompilation: yes
Package: tidyr
Version: 0.8.0
Depends: R (>= 3.2)
Imports: dplyr (>= 0.7.0), glue, magrittr, purrr, Rcpp, rlang, stringi,
tibble, tidyselect
LinkingTo: Rcpp
Suggests: covr, gapminder, knitr, rmarkdown, testthat
License: MIT + file LICENSE
NeedsCompilation: yes
5) Move the downloaded packages from steps 1 and 2 into /tmp/ryantester/src/contrib.

6) The final step is to install, pointing to your local repo (in our case, /tmp/ryantester):

install.packages('tidyr', repos = "file:///tmp/ryantester")
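A minimal shell sketch of steps 3-6, assuming the downloaded tarballs are sitting in ~/Downloads and that Rscript is on the PATH:

#Create the local repo layout and copy the downloaded source packages into it
mkdir -p /tmp/ryantester/src/contrib
cp ~/Downloads/*.tar.gz /tmp/ryantester/src/contrib/
#Install tidyr (and its dependencies) from the local repo - no internet connection needed
Rscript -e 'install.packages("tidyr", repos = "file:///tmp/ryantester")'

If R is already installed on the machine, tools::write_PACKAGES("/tmp/ryantester/src/contrib") will generate the PACKAGES index for you instead of writing it by hand.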
- Find more articles tagged with:
- cran
- How-ToTutorial
- Installation
- offline
- R
- Sandbox & Learning
10-25-2017
07:37 PM
7 Kudos
Installing the Alarm Fatigue Demo via Cloudbreak:
There are multiple ways to deploy the Alarm Fatigue Demo via Cloudbreak. Below are four options:
1) Deploy via the Cloudbreak UI
a) Log in to https://cbdtest.field.hortonworks.com
b) Select your credentials - if your credentials don't exist, create them under "Manage Credentials"
c) Once your credentials are selected, click "Create Cluster"
d) Make up a cluster name, choose the Availability Zone (SE), and then click "Setup Network and Security"
e) "fieldcloud-openstack-network" should be selected; click "Choose Blueprint"
f) Select the blueprint called "alarm_fatigue_v2": Host Group 1 (select Ambari Server, alarm-fatigue-demo and pre-install-java8), Host Group 2 (select pre-install-java8), Host Group 3 (select pre-install-java8)
g) Click "Review and Launch"
h) Click "Create and start cluster" (after clicking, the deployment via Cloudbreak will likely take 30-50 minutes - go get a coffee)
2) Deploy via Bash Script (specifying configuration file)
Create file .deploy.config with the following
Version=0.5
CloudBreakServer=https://cbdtest.field.hortonworks.com
CloudBreakIdentityServer=http://cbdtest.field.hortonworks.com:8089
CloudBreakUser=admin@example.com
CloudBreakPassword=yourpassword
CloudBreakCredentials=
CloudBreakClusterName=alarmfatigue-auto
CloudBreakTemplate=openstack-m3-xlarge
CloudBreakRegion=RegionOne
CloudBreakSecurityGroup=openstack-connected-platform-demo-all-services-port-v3
CloudBreakNetwork=fieldcloud-openstack-network
CloudBreakAvailabilityZone=SE
Change the highlighted values to match your environment.
Then execute the following:
wget -O - https://raw.githubusercontent.com/ryancicak/northcentral_hackathon/master/CloudBreakArtifacts/cloudbreak-cmd/deployer.sh | bash
3) Deploy via Bash Script, inputting configurations while prompted. Just execute wget -O - https://raw.githubusercontent.com/ryancicak/northcentral_hackathon/master/CloudBreakArtifacts/cloudbreak-cmd/deployer.sh | bash and fill out the information as prompted.
4) Deploy via Jenkins
All four options will deploy, install, configure, and run all necessary services, including "Alarm Fatigue Demo Control".
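As a minimal end-to-end sketch of option 2 (the values are the sample ones above - the user, password, and cluster name are placeholders you'd replace, and it assumes deployer.sh picks up .deploy.config from the working directory as described):

#Write the configuration file the deployer script reads
cat > .deploy.config <<'EOF'
Version=0.5
CloudBreakServer=https://cbdtest.field.hortonworks.com
CloudBreakIdentityServer=http://cbdtest.field.hortonworks.com:8089
CloudBreakUser=admin@example.com
CloudBreakPassword=yourpassword
CloudBreakCredentials=
CloudBreakClusterName=alarmfatigue-auto
CloudBreakTemplate=openstack-m3-xlarge
CloudBreakRegion=RegionOne
CloudBreakSecurityGroup=openstack-connected-platform-demo-all-services-port-v3
CloudBreakNetwork=fieldcloud-openstack-network
CloudBreakAvailabilityZone=SE
EOF
#Fetch and run the deployer script against that config
wget -O - https://raw.githubusercontent.com/ryancicak/northcentral_hackathon/master/CloudBreakArtifacts/cloudbreak-cmd/deployer.sh | bash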
- Find more articles tagged with:
- Cloudbreak
- demo
- hdf
- hdp-2.3.4
- How-ToTutorial
- solutions
- streaming
10-25-2017
04:09 PM
Repo Description

Quickly spin up an end-to-end Alarm Fatigue Demo via Cloudbreak. All services (including the "Alarm Fatigue Demo Control") will be installed, configured, and running after the Cloudbreak blueprint/recipe executes. Watch the YouTube installation of the Alarm Fatigue Demo: https://www.youtube.com/watch?v=Kilnu-YOCcc&feature=youtu.be

The Alarm Fatigue Demo consists of a custom Ambari service called "Alarm Fatigue Demo Control", which generates patient vitals every 5 seconds for 4 devices (4 patients). NiFi pulls the vitals (tailing the log file), stores all vitals in Hive, enriches the data (from Hive), and writes the enriched data to Kafka. Streaming Analytics Manager then picks up the enriched patient information (with vitals) in real time, stores the enriched data in HDFS, aggregates the vitals every 1 minute, stores the aggregates in Druid cubes, and finally runs rules (pulseRate > 100) that send notification(s) to the doctor - reducing alarm fatigue. One device will consistently throw high pulse rates - the Hive table "device" has a column problemPercentage between 0.0 and 1.0 (0-100%), and device GUID ec93da97-08c6-43c4-a0a6-cb689723cf19 will throw a high pulse rate (greater than 100) 100% of the time.

HDP services used: HDFS, YARN, MapReduce2, Tez, Hive (ACID), Zookeeper, Atlas, Cloudbreak, Ambari
HDF services used: Kafka, Druid, NiFi, Schema Registry, Streaming Analytics Manager

What is "Alarm Fatigue"?
Alarm fatigue or alert fatigue occurs when one is exposed to a large number of frequent alarms (alerts) and consequently becomes desensitized to them. Desensitization can lead to longer response times or to missing important alarms. There were 138 preventable deaths between 2010 and 2015 caused by alarm fatigue.
(https://en.wikipedia.org/wiki/Alarm_fatigue)
How can Alarm Fatigue be reduced? Instead of only sounding an alarm that is heard by the closest nurse or doctor, a notification should be sent to the proper doctor/nurse, containing a severity level and an acknowledgement.
What will HDP/HDF do to reduce Alarm Fatigue? It all starts on the edge device - the various sensors in a hospital room (blood pressure, pulse rate monitor, respiratory rate monitor, thermometer, etc.). For this use case, we will assume our target hospital contains sensors with active connections to Raspberry Pi device(s). The Raspberry Pi device gathers logs from the sensors, so we install MiNiFi and tail the logs. MiNiFi then communicates bi-directionally with a centralized NiFi instance located at the hospital. The custom service "Alarm Fatigue Demo Control" emulates a Raspberry Pi running MiNiFi and collecting data from the sensors.

High-Level Architecture:

Repo Info
Github Repo URL: https://github.com/ryancicak/northcentral_hackathon.git
Github account name: ryancicak
Repo name: northcentral_hackathon.git
- Find more articles tagged with:
- ambari-extensions
- Cloudbreak
- Data Ingestion & Streaming
- demo
09-22-2017
02:15 PM
Hi @Michael Vogt,

To greatly simplify regular expressions for fixed-width files, you can use the Grok language. The "ExtractGrok" processor can be used to pull out fixed-length values, for example: https://groups.google.com/forum/#!topic/logstash-users/7FETqn3PB1M

Using the following data:

Time Sequence Source Destination Action Data ---------- -------- ------------ ------------ ------------------------------ ------------------------------------------------------------------------------------ Start 2013/04/29 00:01:34 > 00:01:34 Yosemite Daily Rollover ? 02:18:56 02185130 Yosemite bioWatch Trak Alert WS Failed Return=Serial Not Found. ? 02:19:03 Yosemite AlertNotify ERROR: Conversion from string "" to type 'Date' is not valid. * 02:19:03 Yosemite AlertNotify Failed Serial=L1234567890 Setting=AUTOREPORT

I want to be able to get the Time, Sequence, Source, Destination, Action and Data from the fixed-length data above. Writing regular expressions can be difficult, which is why Grok was created as a simplification. I built the following workflow:

1) GetFile - fetch the file (with the data above)
2) SplitText - split the file into 1 flowfile per line
3) ExtractGrok - use a Grok expression to pull out Time (grok.time attribute), Sequence (grok.sequence attribute), Source (grok.source attribute), Destination (grok.destination attribute), Action (grok.action attribute) and Data (grok.data attribute)

My Grok pattern:

(?<severity>.{1}) (?<time>.{8}) (?<sequence>.{8}) (?<source>.{12}) (?<destination>.{12}) (?<action>.{30}) %{GREEDYDATA:data}

If you look at the data above, there are a total of 6 lines, of which 5 match my Grok pattern. I likely wouldn't want to collect the unmatched flowfiles, because there will always be an unmatched pattern if the file contains "---------- -------- ------------ ------------ ------------------------------ ------------------------------------------------------------------------------------ Start 2013/04/29 00:01:34". The Grok pattern file is attached - I used one I found on Google that has a bunch of pre-defined regular expressions. Grok will output my attributes as I define them in my Grok expression, where each FlowFile will associate a group with my specified attribute:
03-16-2017
11:59 PM
In Azure, if I have an ExpressRoute in place, do you recommend a cache-only DNS server in the Hadoop VNet?
03-10-2017
12:42 AM
Hi @Raj B Can you validate that you're connecting to the Hive metastore URI (hive.metastore.uris)? Do you have Kerberos enabled?
03-08-2017
06:19 PM
10 Kudos
Many technologists ask, "What is the easiest way to import data into Hive?" It's true, you can write a Sqoop job (making sure YARN containers are tuned properly in regards to the number of mappers, the size of the mappers, and the sort allocated memory). And you'll deal with a two-step process of moving the RDBMS tables into plain text and then creating an ORC table with an INSERT INTO statement... I don't call this easy. What if there were a tool where you could simply drag-and-drop processes (processors) onto a canvas and connect them together, creating a workflow? There is a tool for this, called NiFi! Before we go any further, I'll assume you know and understand NiFi - what a processor is, scheduling a processor, what a connection (relationship) is, the different relationship types, etc.

Small-Medium RDBMS tables

This article will cover small to medium sized RDBMS tables. This means I can run a single select * from myrdbmstable; without returning millions of rows (where we're taxing the RDBMS). I'll write a second article on how to use the GenerateTableFetch processor, which generates select queries that fetch "pages" of rows from a table. Using pages of rows from a table distributes the select queries amongst multiple NiFi nodes, similar to what Sqoop does with the number of mappers, where each mapper pages through results.

Prerequisites for Hive

In order to stream data into Hive, we'll utilize Hive's transactional capabilities, which require the following:

1) Enable ACID in Hive
2) Use bucketing on your Hive table
3) Store the Hive table as ORC
4) Set the following property on your Hive table: TBLPROPERTIES ("transactional"="true")

1-4 will be followed below in Step 2.

Step 1 - Query an RDBMS table using the QueryDatabaseTable processor

As described above, choose a small-medium sized RDBMS table (we'll tackle large database tables in another article).

a) Add the QueryDatabaseTable processor to your canvas. You'll see we need to choose a Database Connection Pooling Service (which we'll define below), add a table name, and finally create a successful relationship.
b) Create an RDBMS Connection Pooling Service - right click on the processor and go to Configure.
c) Under the "Properties" tab, click the property "Database Connection Pooling Service" and click "Create new service..." on the drop-down.
d) Choose the "DBCPConnectionPool" and click Create.
e) Click on the arrow to go to configure the DBCPConnectionPool.
f) Click on the pencil to edit the DBCPConnectionPool.
g) Change the name in Settings to something that is easily identifiable for your database connection pool.
h) Finally, go to Properties and define your connection. I'm creating a connection for MySQL, but the general rule of thumb is that if a JDBC driver exists, you'll be able to connect (for example, I wrote this article to connect to Teradata from within NiFi).

Database Connection URL: jdbc:mysql://localhost:3306/hortonworks
Database Driver Class Name: com.mysql.jdbc.Driver
Database Driver Location(s): /Users/rcicak/Desktop/mysql-connector-java-5.0.8-bin.jar

Click Apply - and you've created your RDBMS connection pool.

i) Enable your connection pool.
j) Define a table name - in our case we'll choose "people3", where the table people3 is described in MySQL as the following. *Note: don't forget the Maximum-value Columns property; QueryDatabaseTable will keep track of the last row that was fetched.

mysql> DESCRIBE people3;
+-------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+--------------+------+-----+---------+-------+
| id | datetime | NO | PRI | NULL | |
| name | varchar(255) | YES | | NULL | |
| age | int(11) | YES | | NULL | |
+-------+--------------+------+-----+---------+-------+
3 rows in set (0.00 sec)
k) Change the schedule of the QueryDatabaseTable processor to run every 10 minutes (or whatever your requirement calls for), so the select statement is not being executed multiple times per second.

Step 2 - Stream the RDBMS rows into Hive using the PutHiveStreaming processor

a) Add the PutHiveStreaming processor to your canvas.
b) Create a successful relationship between QueryDatabaseTable (from Step 1) and PutHiveStreaming (Step 2).
c) The caution went away on QueryDatabaseTable after we added the relationship; now we'll need to resolve the caution on the PutHiveStreaming processor.
d) Right click on the PutHiveStreaming processor and choose Configure. Go to the "Properties" tab and add the following Hive configurations:

Hive Metastore URI: thrift://cregion-hdpmaster2.field.hortonworks.com:9083
Database Name: default
Table Name: people3

*Adjust the "Transactions per Batch" according to your SLA requirements.

e) Verify your Hive table "people3" exists - as explained in the Hive prerequisites above, you'll need ACID enabled, the table stored as ORC, table properties set to transactional = true, and bucketing in order for PutHiveStreaming to work:

create table people3 (id timestamp, name varchar(255), age int) CLUSTERED BY(id) INTO 3 BUCKETS STORED AS ORC tblproperties("transactional"="true");

You've successfully imported your RDBMS table into Hive with two processors - wasn't that easy?
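If you'd rather create the target table from a shell instead of a Hive view, here's a minimal sketch - the HiveServer2 hostname is a placeholder (point the JDBC URL at your own HiveServer2, and add Kerberos/LDAP options if your cluster requires them):

#Create the ACID-ready, bucketed, ORC-backed table that PutHiveStreaming writes to
beeline -u "jdbc:hive2://your-hiveserver2-host:10000/default" -e "
create table if not exists people3 (id timestamp, name varchar(255), age int)
CLUSTERED BY(id) INTO 3 BUCKETS
STORED AS ORC
tblproperties('transactional'='true');"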
- Find more articles tagged with:
- Data Ingestion & Streaming
- hive-streaming
- How-ToTutorial
- NiFi
- rdbms
- Sqoop
03-02-2017
03:23 PM
Hi Sunile, As we discussed yesterday, I found this while installing HDP 2.5.3 using Ambari 2.4.2. Looking further into this, RHEL 7.3 comes with snappy 1.1.0-3.el7 installed, while HDP 2.5.3 needs snappy 1.0.5-1.el6.x86_64. I spun up a RHEL 7.3 instance and ran the following command, showing snappy 1.1.0-3.el7 came pre-installed: As Jay posted - looking at the latest documentation for Ambari 2.4.2, I found this problem in "Resolving Cluster Deployment Problems" - there should be a bug fix for RHEL 7 (so we don't rely on a RHEL 6 dependency): https://docs.hortonworks.com/HDPDocuments/Ambari-2.4.2.0/bk_ambari-troubleshooting/content/resolving_cluster_install_and_configuration_problems.html - What do you think?
01-27-2017
11:58 PM
2 Kudos
Tez View comes pre-deployed in Ambari as part of the Ambari User Views. The view contains useful textual and graphical analysis for Hive queries when Hive is using Tez as the execution engine. Hortonworks Data Platform (HDP) offers two execution engines for Hive: 1) Tez 2) MapReduce. Tez is the default execution engine of Hive; therefore, out of the box, Tez View provides essential insight into Hive queries. This article will cover some of the useful features of Tez View to analyze/debug a Hive query.

Prerequisites

The videos below are using Ambari 2.4.1.0 with Hortonworks Data Platform 2.5.0. Ambari is pre-loaded with Tez View 0.7.0.2.5.0.0-22.

Executing a Hive Query

It all starts with executing a Hive query. This article will cover TPC-DS query 98 for analysis.

select i_item_desc
,i_category
,i_class
,i_current_price
,i_item_id
,sum(ss_ext_sales_price) as itemrevenue
,sum(ss_ext_sales_price)*100/sum(sum(ss_ext_sales_price)) over
(partition by i_class) as revenueratio
from
store_sales
,item
,date_dim
where
store_sales.ss_item_sk = item.i_item_sk
and i_category in ('Jewelry', 'Sports', 'Books')
and store_sales.ss_sold_date_sk = date_dim.d_date_sk
and d_date between cast('2001-01-12' as date)
and (cast('2001-02-11' as date))
group by
i_item_id
,i_item_desc
,i_category
,i_class
,i_current_price
order by
i_category
,i_class
,i_item_id
,i_item_desc
Query 98 is being executed within Hive View of Ambari:
When the query is executed, the execution engine Tez is creating vertices (mappers and reducers) to provide results. Analyzing a Hive Query using Tez View Then we access Tez View of Ambari – Tez creates DAGs (Directed Acyclic Graphs) that relate to both Hive and Pig. In our case, we choose the DAG for query 98 from the DAG Name column .
There are many statistics on the DAG Details tab that opens by default, such as Application ID (relating to the YARN application in which the Tez job ran), the submitter (who executed the query), Status (Failed, Succeeded, Running), Progress bar (% of completion), Start Time, End Time, and Duration. Next we select the Graphical View tab, which represents the DAG, where each green vertex stands for Hive table(s). The mappers connected to the table(s) are extracting the rows from the tables. Reducers represent table joins and other running SQL functionality.
Hover over a vertex to view the Tez class at each task. To view the details of a vertex, simply select it. Lastly, select the Vertex Swimlane tab, which shows the total runtime of each vertex (mappers and reducers). As demonstrated above, Tez View can be helpful when analyzing or debugging Hive queries.
- Find more articles tagged with:
- ambari-views
- Debugging
- Hive
- How-ToTutorial
- Sandbox & Learning
- tez
12-23-2016
08:20 PM
2 Kudos
Hi @rudra prasad biswas Can you validate you have hbase-site.xml, core-site.xml and hdfs-site.xml in the classpath? Also what version of HDP are you running?
12-17-2016
08:50 PM
2 Kudos
Hi @Mark Melenchenko Take a look at the latest documentation (HDP 2.5.3 for spark streaming) https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.3/bk_spark-component-guide/content/using-spark-streaming.html
12-17-2016
08:30 PM
1 Kudo
Hi @Sherry Noah Can you try: spark-submit \
--class SparkSamplePackage.SparkSampleClass \
--master yarn-cluster \
--num-executors 2 \
--driver-memory 1g \
--executor-memory 1g \
--executor-cores 1 \
--files /usr/hdp/current/spark-client/conf/hive-site.xml \
--jars /usr/hdp/current/spark-client/lib/datanucleus-api-jdo-3.2.6.jar,/usr/hdp/current/spark-client/lib/datanucleus-rdbms-3.2.9.jar,/usr/hdp/current/spark-client/lib/datanucleus-core-3.2.10.jar,target/SparkSample-1.0-SNAPSHOT.jar
12-17-2016
08:17 PM
Hi @Sundara Palanki What version of HDP are you running - and what version of Hue are you running? What are you using Hue for? Have you tried the Hive View in Ambari?
12-17-2016
08:14 PM
Hi @Nick Pileggi Can you verify the user trying to read the file has the Decrypt EEK permission in Ranger KMS? You can use my article here as a reference. Is your cluster Kerberized?
12-17-2016
08:09 PM
2 Kudos
Hi @Raffi Abberbock SSSD will sync AD users onto the local OS, so the LDAP/AD users look like local users on the OS. SSSD is recommended when Kerberizing your cluster (found in the documentation) and is necessary so that LDAP/AD users can have secured YARN containers. Knox can be used to authenticate against LDAP/AD, so your end user won't need to go directly to something like HiveServer2; instead they can be given a Knox URL, where Knox will know the location of the HiveServer2 and also authenticate against LDAP/AD.
12-17-2016
07:58 PM
2 Kudos
Hi @Alexander Brown I've had success transforming the XML to JSON and then converting the JSON into Avro, where the JSON will need to adhere to the Avro schema. You can generate an Avro schema from your existing JSON by running the InferAvroSchema processor. For the XML to JSON XSLT, I'd use https://www.bjelic.net/2012/08/01/coding/convert-xml-to-json-using-xslt/#code and not the example posted (as I noted in the comments).
12-17-2016
07:50 PM
2 Kudos
Hi @rudra prasad biswas,
Great questions - don't think about column families / column qualifiers, because Phoenix will interact with HBase to automatically do all this for you. Instead, simply create your table: CREATE TABLE mytable (id integer, first_name varchar, last_name varchar CONSTRAINT my_pk PRIMARY KEY (id)); Phoenix will create this table structure on top of HBase (automatically creating column families and qualifiers). Phoenix also has the concept of dynamic columns, where you are able to upsert columns at runtime - take a look at this documentation. If you'd like to see how Phoenix is using HBase to create column families and column qualifiers, I'd recommend looking at the audit log in Ranger to see how the column families and column qualifiers are being created.
11-30-2016
01:18 AM
5 Kudos
Prerequisites

1) The Ambari Infra service is installed -> Ranger will use Ambari Infra's SolrCloud for Ranger Audit
2) MySQL is installed and running (I'll use Hive's Metastore MySQL instance; MySQL is one of the many DB options)

Installing Apache Ranger using Ambari Infra (SolrCloud) for Ranger Audit

1) Find the location of mysql-connector-java.jar (assume /usr/share/java/mysql-connector-java.jar) and run the following command on the Ambari server:

sudo ambari-server setup --jdbc-db=mysql --jdbc-driver=/usr/share/java/mysql-connector-java.jar

2) In Ambari, click Add Service.
3) Choose Ranger and click Next.
4) Choose "I have met all the requirements above." and click Proceed (this was done in #1 above).
5) Assign master(s) for "Ranger Usersync" and "Ranger Admin" and click Next.
6) Assign Slaves and Clients - since we did not install Apache Atlas, Ranger TagSync is not required - and click Next.
7) Customize Services -> Ranger Audit: click "OFF" to enable SolrCloud. Before clicking: After clicking:
8) Customize Services -> Ranger Admin: enter the "Ranger DB host" (the DB you chose - in my case, MySQL) and a password in "Ranger DB password" for the user rangeradmin. *Ranger will automatically add the user "rangeradmin". Add the proper credentials for a DB user that has administrator privileges (this administrator will create the rangeradmin user and the Ranger tables). To create an administrator user in MySQL (*Note: rcicak2.field.hortonworks.com is the server where Ranger is being installed):

CREATE USER 'ryan'@'rcicak2.field.hortonworks.com' IDENTIFIED BY 'lebronjamesisawesome';
GRANT ALL PRIVILEGES ON *.* TO 'ryan'@'rcicak2.field.hortonworks.com' WITH GRANT OPTION;

Click Next.

9) Review -> Click Deploy. *Install, Start and Test will show you the progress of Ranger installing.
10) Choose Ranger in Ambari.
11) Choose "Configs" and "Ranger Plugin" and select the services you'd like Ranger to authorize (you'll need to restart the service after saving changes).
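For reference, a minimal shell sketch of creating the MySQL administrator user from step 8, run on the MySQL host (the user name, password, and Ranger host are the example values above - substitute your own):

#Create the administrator user that Ranger's setup uses to create the rangeradmin user and the Ranger tables
mysql -u root -p -e "
CREATE USER 'ryan'@'rcicak2.field.hortonworks.com' IDENTIFIED BY 'lebronjamesisawesome';
GRANT ALL PRIVILEGES ON *.* TO 'ryan'@'rcicak2.field.hortonworks.com' WITH GRANT OPTION;
FLUSH PRIVILEGES;"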
11-22-2016
12:57 PM
11 Kudos
The goal of this article is to ingest log data from multiple servers (running MiNiFi) that push log data to a NiFi cluster. The NiFi cluster will listen for the log data on an input port and route it to an HDFS directory (determined by the host name). This article assumes you are using Ambari for NiFi installation/administration.

1) Set the nifi.remote.input.host and nifi.remote.input.socket.port of the NiFi cluster

Reason: Allows the NiFi cluster to use an Input Port (MiNiFi will push to a Remote Process Group). *NiFi will listen using the Input Port.

a) In Ambari, go to NiFi
b) Choose the Configs tab
c) Choose the Advanced nifi-properties
d) Set nifi.remote.input.host to your NiFi hostname and nifi.remote.input.socket.port to 10000
e) Restart NiFi using Ambari

2) In NiFi - create a flow for incoming log data (listening [input port] for MiNiFi data)

Reason: Listen for incoming log data and route the log data to an HDFS directory.

a) On the NiFi flow canvas, drag-and-drop an Input Port (name your Input Port - I named mine "listen_for_minifi_logs")
b) Drag-and-drop the RouteOnAttribute processor
c) Create a connection between the Input Port and RouteOnAttribute
d) Configure RouteOnAttribute - Properties (removing the red caution), adding two properties (one property per server you're installing MiNiFi on). I have two servers - rcicak0.field.hortonworks.com and rcicak1.field.hortonworks.com; the incoming log data (flowfile) will contain an attribute called "host_name", and we'll route the flowfile depending on the host_name property
e) Drag-and-drop three PutHDFS processors and create a connection using hostname_rcicak0, hostname_rcicak1 and unmatched
f) Each PutHDFS processor will have a different HDFS directory - configure properties for each PutHDFS processor: /tmp/rcicak0/, /tmp/rcicak1/ and /tmp/unmatched
g) Configure the HDFS directory - Properties (adding a core-site.xml and the directory, depending on the connection)
h) Start the processors - at this point, the NiFi flow is ready to receive log data from MiNiFi

3) Set up MiNiFi on at least one server

Reason: MiNiFi needs to push the log data to a remote process group and delete the log file.

a) Download MiNiFi (from http://hortonworks.com/downloads/) on each of the servers (that contain log data)
b) Unzip minifi-0.0.1-bin.zip to a directory
c) Complete step 4 below before continuing to d
d) Running with an account that has read/write permission to the log data directory (to read the file and delete the file), run location/minifi-0.0.1/bin/minifi.sh start (where "location" is the directory you unzipped to in step b)

4) Using a process group in NiFi, create a MiNiFi flow (pushing log data to a remote process group)

Reason: Push the log data to a remote process group and delete the log file.

a) Create a process group (call the group "minifi_flow")
b) Go into the process group "minifi_flow"
c) Drag-and-drop the GetFile processor
d) Configure the GetFile processor - Properties (IMPORTANT: any file matching the file filter's regular expression under the input directory [and the recursive subdirectories when set to true] will be deleted once the file is stored in MiNiFi's content repository). In the example above, the file filter looks for hdfs-audit.log archives - in this case the archive suffix is a date
e) Drag-and-drop the UpdateAttribute processor and create a successful connection between GetFile and UpdateAttribute
f) Configure the UpdateAttribute processor - Properties, setting the host_name attribute using the NiFi Expression Language to get the hostname
g) Drag-and-drop a remote process group - use the nifi.remote.input.host from above for the URL. Wait for the connection to establish before continuing to h
h) Add a connection between UpdateAttribute and the Remote Process Group - under "To Input" choose listen_for_minifi_logs
i) Select all processors and relationships to create a template (download the template's XML file)
j) Use the minifi-toolkit (https://www.apache.org/dyn/closer.lua?path=/nifi/minifi/0.0.1/minifi-toolkit-0.0.1-bin.zip) and run the config.sh tool: "config.sh transform theminifi_flow_template.xml config.yml" - this converts the template XML to a YML file that will be read by MiNiFi
k) Copy the config.yml file into the minifi-0.0.1/conf directory on each of the MiNiFi servers (if you already have your MiNiFi agent started, restart the agent)
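A minimal shell sketch of steps 4j-4k, run on a MiNiFi host - the toolkit download URL mirrors the Apache path linked above, and the template filename and install directory are assumptions based on steps 3b and 4i:

#Download and unpack the minifi-toolkit
wget https://archive.apache.org/dist/nifi/minifi/0.0.1/minifi-toolkit-0.0.1-bin.zip
unzip minifi-toolkit-0.0.1-bin.zip
#Transform the NiFi template (downloaded in step 4i) into MiNiFi's YAML configuration
minifi-toolkit-0.0.1/bin/config.sh transform theminifi_flow_template.xml config.yml
#Place the config and start the MiNiFi agent (restart it if it's already running)
cp config.yml minifi-0.0.1/conf/
minifi-0.0.1/bin/minifi.sh start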
- Find more articles tagged with:
- Data Ingestion & Streaming
- data-ingestion
- How-ToTutorial
- ingestion
- minifi
- NiFi
11-15-2016
12:41 AM
Please use the XSLT code located here - it is more complete than the XSLT code provided
10-07-2016
03:30 PM
1 Kudo
Hi @Sunile Manjee Great question - I just recently tried this out, so the answer is: the tag-based policy "Deny Condition" will absolutely trump the resource-based policy (below is my proof). The table customer_data_flattened has one tag "PII" on the column "c_email_address". Resource-based policy (#16): the users john_doe and jane_doe have ALL permissions in Hive for database = default, table = customer_data_flattened, all columns. Tag-based policy (#13): the user john_doe has a deny condition for the PII tag for all Hive permissions. When I run "SELECT * FROM customer_data_flattened LIMIT 100;" as the user john_doe, the resource-based policy (#16) should give me access, but the tag-based policy (#13) doesn't allow it (notice policy #16 isn't even shown, because #13 denied the request) - policy #11 has to do with a different table.
10-03-2016
05:17 PM
2 Kudos
If you've received the error exitCode=7 after enabling Kerberos, you are hitting this Jira bug. Notice that the bug outlines the issue but does not outline a solution. The good news is the solution is simple, as I'll document below.

Problem: If you've enabled Kerberos through Ambari, you'll get through around 90-95% of the last step, "Start and Test Services", and then receive the error:

16/09/26 23:42:49 INFO mapreduce.Job: Running job: job_1474928865338_0022
16/09/26 23:42:55 INFO mapreduce.Job: Job job_1474928865338_0022 running in uber mode : false
16/09/26 23:42:55 INFO mapreduce.Job: map 0% reduce 0%
16/09/26 23:42:55 INFO mapreduce.Job: Job job_1474928865338_0022 failed with state FAILED due to: Application application_1474928865338_0022 failed 2 times due to AM Container for appattempt_1474928865338_0022_000002 exited with
exitCode: 7
For more detailed output, check application tracking page:
http://master2.fqdn.com:8088/cluster/app/application_1474928865338_0022
Then, click on links to logs of each attempt.Diagnostics: Exception from container-launch.
Container id: container_e05_1474928865338_0022_02_000001
Exit code: 7
Stack trace: ExitCodeException exitCode=7:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:576)
at org.apache.hadoop.util.Shell.run(Shell.java:487)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:753)
at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:371)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:303)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Shell output: main : command provided 1
main : run as user is ambari-qa
main : requested yarn user is ambari-qa
Container exited with a non-zero exit code 7
Failing this attempt. Failing the application.

You'll notice that running "Service Checks" for Tez, MapReduce2, YARN, Pig (any service that involves creating a YARN container) will fail with exitCode=7. This is because the YARN local-dirs likely have the "noexec" flag specified, meaning the binaries that are added to these directories cannot be executed.

Solution: Open /etc/fstab (with the proper permissions) and remove the noexec flag from all mounted drives specified under "local-dirs" in YARN. Then either remount or reboot your machine - problem solved.
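A minimal sketch of checking and fixing the flag without a reboot - /hadoop/yarn/local is only a common default, so use whatever paths your yarn.nodemanager.local-dirs property actually lists:

#Show which filesystem backs the YARN local-dir and which mounts are currently flagged noexec
df -P /hadoop/yarn/local
mount | grep noexec
#After deleting "noexec" from the matching /etc/fstab entry, remount that filesystem in place
MOUNT_POINT=$(df -P /hadoop/yarn/local | awk 'NR==2 {print $6}')
sudo mount -o remount "$MOUNT_POINT"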
- Find more articles tagged with:
- Hadoop Core
09-24-2016
12:24 AM
1 Kudo
You may be in a bind if you need to install HDP on Azure with CentOS 6 or RHEL 6 and only certain services (not everything). By following the steps below, you will be able to use ambari-server to install HDP on any of the supported Hortonworks/Azure VMs.

1) Configure your VMs - use the same VNet for all VMs.

Run the next steps as root or sudo the commands:

2) Update /etc/hosts on all your machines:

vi /etc/hosts
172.1.1.0 master1.jd32j3j3kjdppojdf3349dsfeow0.dx.internal.cloudapp.net
172.1.1.1 master2.jd32j3j3kjdppojdf3349dsfeow0.dx.internal.cloudapp.net
172.1.1.2 master3.jd32j3j3kjdppojdf3349dsfeow0.dx.internal.cloudapp.net
172.1.1.3 worker1.jd32j3j3kjdppojdf3349dsfeow0.dx.internal.cloudapp.net
172.1.1.4 worker2.jd32j3j3kjdppojdf3349dsfeow0.dx.internal.cloudapp.net
172.1.1.5 worker3.jd32j3j3kjdppojdf3349dsfeow0.dx.internal.cloudapp.net

*Use the FQDN (find the FQDN by typing hostname -f). The IP addresses are internal and can be found on eth0 by typing ifconfig.

3) Edit /etc/sudoers.d/waagent so that you don't need to type a password when sudoing

a) Change permissions on /etc/sudoers.d/waagent: chmod 600 /etc/sudoers.d/waagent
b) Update the file, changing "username ALL = (ALL) ALL" to "username ALL = (ALL) NOPASSWD: ALL": vi /etc/sudoers.d/waagent

c) Change permissions on /etc/sudoers.d/waagent: chmod 440 /etc/sudoers.d/waagent

*Change username to the user that you sudo with (the user that will install Ambari).

4) Disable iptables

a) service iptables stop
b) chkconfig iptables off

*If you need iptables enabled, please make the necessary port configuration changes found here.

5) Disable transparent huge pages

a) Run the following in your shell:

cat > /usr/local/sbin/ambari-thp-disable.sh <<-'EOF'
#!/usr/bin/env bash
# disable transparent huge pages: for Hadoop
thp_disable=true
if [ "${thp_disable}" = true ]; then
for path in redhat_transparent_hugepage transparent_hugepage; do
for file in enabled defrag; do
if test -f /sys/kernel/mm/${path}/${file}; then
echo never > /sys/kernel/mm/${path}/${file}
fi
done
done
fi
exit 0
EOF
b) chmod 755 /usr/local/sbin/ambari-thp-disable.sh
c) sh /usr/local/sbin/ambari-thp-disable.sh

*Perform a-c on all hosts to disable transparent huge pages.

6) If you don't have a private key generated (so that the host running ambari-server can use a private key to log in to all the hosts), perform this step:

a) ssh-keygen -t rsa -b 2048 -C "username@master1.jd32j3j3kjdppojdf3349dsfeow0.dx.internal.cloudapp.net"
b) ssh-copy-id -i /locationofgeneratedinaabove/id_rsa.pub username@master1

*Run b above on all hosts; this way you can ssh as the username into all hosts from the ambari-server host without a password.

7) Install the Ambari repo on the server where you'll install Ambari (documentation):

wget -nv http://public-repo-1.hortonworks.com/ambari/centos6/2.x/updates/2.2.2.0/ambari.repo -O /etc/yum.repos.d/ambari.repo

8) Install ambari-server: yum install ambari-server

9) Setup ambari-server: ambari-server setup

*You can use the defaults by pressing ENTER.

10) Start ambari-server: ambari-server start
*This could take a few minutes to start up depending on the speed of your machine.

11) Open your browser and go to the IP address where ambari-server is running: http://ambariipaddress:8080

*Continue with your HDP 2.4.3 installation.
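A small helper sketch for step 6b - the short hostnames are from the /etc/hosts example above and "username" is the same placeholder the article uses:

#Copy the public key to every host so the ambari-server host can ssh without a password
for h in master1 master2 master3 worker1 worker2 worker3; do
  ssh-copy-id -i ~/.ssh/id_rsa.pub username@$h
done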
09-14-2016
03:12 AM
1 Kudo
When utilizing a processor that makes an external API call, is there a way to set a timeout (for example, if the API never responds, wait for 1 minute max)? Or would the processor just continue to wait for a response?
09-14-2016
03:06 AM
1 Kudo
After creating a custom processor (described here), what is the best way to deploy the nar file in a NiFi cluster? Do I need to deploy it to every node in the cluster and then restart one node at a time (rolling restart)? Did anything change with deploying a custom processor to the cluster with NiFi v1.0?
08-29-2016
02:58 PM
Hi @ScipioTheYounger Yes - you are correct: StorageBasedAuthorizationProvider and DefaultHiveMetastoreAuthorizationProvider are the two provided. https://cwiki.apache.org/confluence/display/Hive/Storage+Based+Authorization+in+the+Metastore+Server

DefaultHiveMetastoreAuthorizationProvider = Hive's grant/revoke model
StorageBasedAuthorizationProvider = HDFS permission-based model (which is recommended on the Apache website)

More info on configuring the storage-based model: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.2/bk_Sys_Admin_Guides/content/ref-5422cb60-d1d5-425a-b719-ec7bd03ee5d3.1.html
08-08-2016
09:36 PM
Hi @habeeb siddique Fantastic news! I'm glad it's working! When you made the change, did Ambari tell you to restart each of the services (YARN, Hive, etc.)?
08-08-2016
07:32 PM
1 Kudo
Hi @john doe I recently ran PutKafka and GetKafka in NiFi (connecting to a local VM). I found that adding the FQDN and IP to /etc/hosts made this work for me. For example, if the FQDN is host1.local and the IP is 192.168.4.162, then adding "192.168.4.162 host1.local" to /etc/hosts makes this work.
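A one-line sketch of that /etc/hosts change (the IP and FQDN are the example values above):

#Let the NiFi host resolve the Kafka broker's FQDN
echo "192.168.4.162 host1.local" | sudo tee -a /etc/hosts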