Member since: 06-05-2019
Posts: 128
Kudos Received: 133
Solutions: 11
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 1792 | 12-17-2016 08:30 PM |
| | 1334 | 08-08-2016 07:20 PM |
| | 2375 | 08-08-2016 03:13 PM |
| | 2473 | 08-04-2016 02:49 PM |
| | 2279 | 08-03-2016 06:29 PM |
02-03-2023
12:10 PM
Throughout my seven years working with Cloudera/Hortonworks, I'm always learning new things. One thing I've learned from the Cloudera/Hortonworks merger is how good CDSW/CML is as a product. CML isn't only for data scientists; it's for anyone who needs an IDE (Integrated Development Environment). Coming from a development background with Eclipse and IntelliJ, you become dependent on a solid IDE. In the past, I used IDEs to develop applications, and a build process would eventually deploy them to a runtime environment. This is where CML shines: you can run your applications within CDP at scale, against enormous amounts of data. Anyone using CML is over the moon running their applications and projects in it because of the simplicity, as I'll demonstrate below.

The tagline of CML is "BYOL" (Bring Your Own Libraries), meaning all libraries are welcome, including those from outside the Cloudera ecosystem. This differentiates Cloudera from alternatives such as native Azure, which is highly dependent on all things Microsoft (unless the application owner created something Azure-specific). I'll demonstrate how easy it is to deploy a third-party framework such as Django, "the web framework for perfectionists with deadlines," where the installation and run feel like your local laptop rather than a highly scalable IDE that runs anywhere. Remember that CML runs anywhere: in cloud providers such as Azure, AWS, or GCP, and on-premises. Cloudera abstracts away the complexity. Using CML, we'll go from installing Django to running Django in a matter of minutes.

Step 1: Find the read-only URL where your embedded application (Django) will be served
import os
url = os.environ["CDSW_ENGINE_ID"] + "." + os.environ["CDSW_DOMAIN"]
print("http://read-only-%s" % url)

Step 2: Install Django
!pip install django

Step 3: Create a Django project as instructed here
!django-admin startproject mysite
cd mysite

Step 4: Modify the settings.py file, adding the 'read-only-...' value from Step 1 and localhost
ALLOWED_HOSTS = ['localhost', 'read-only-yourhostnamefromstep1']

Step 5: Run Django
!python manage.py runserver localhost:$CDSW_READONLY_PORT

That's it! To access your Django page, navigate to the URL from Step 1. Nothing specialized for CML; as we say, BYOL!
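If you'd like a quick sanity check that the dev server answers at the read-only URL, a minimal view does the job. The sketch below is illustrative only: the hello view and its route are my own additions (not part of CML or the steps above) and use standard Django APIs.

# mysite/urls.py - a minimal route to confirm the server answers at the
# read-only URL from Step 1. The "hello" view is illustrative only.
from django.contrib import admin
from django.http import HttpResponse
from django.urls import path

def hello(request):
    # Plain-text response, enough to verify routing end to end.
    return HttpResponse("Django is running inside CML")

urlpatterns = [
    path("admin/", admin.site.urls),
    path("", hello),
]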
05-16-2018
11:20 PM
7 Kudos
In order to debug DLM pairing, you'll need the following prerequisite: root access to the DPS VM.

Problem statement: have you received an error when pairing a cluster? Follow these step-by-step instructions to access the DLM log and gain the granular information that will help you debug:
1) Run "sudo docker ps" to get the container ID for "dlm-app". In this example, the container ID for "dlm-app" is "83d879e9a45e".
2) Once you have the container ID, run the following command:
sudo docker exec -it 83d879e9a45e /bin/tailf /usr/dlm-app/logs/application.log
This gives you insight into the DPS-DLM application; in this example you'll see "ERROR". The error is written to the log once you click "Pair" in the DLM UI. Using the information from the log, you'll be able to troubleshoot your issue.
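If you'd rather script the two docker commands, a small Python wrapper like the one below works. It's only a sketch, assuming docker is on the PATH, sudo doesn't prompt for a password, the container name contains "dlm-app", and the log sits at the path shown in step 2.

# Minimal sketch: locate the dlm-app container and print the tail of its
# application log. Assumes docker is available and the log path shown above.
import subprocess

# Find the container ID by name filter (equivalent to scanning "docker ps").
container_id = subprocess.run(
    ["sudo", "docker", "ps", "--filter", "name=dlm-app", "--format", "{{.ID}}"],
    capture_output=True, text=True, check=True,
).stdout.strip()

# Print the last 100 lines of the DLM application log from inside the container.
log_tail = subprocess.run(
    ["sudo", "docker", "exec", container_id,
     "tail", "-n", "100", "/usr/dlm-app/logs/application.log"],
    capture_output=True, text=True, check=True,
).stdout
print(log_tail)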
04-03-2018
07:50 PM
4 Kudos
Reading and writing files to a MapR cluster (version 6) is simple using the standard PutFile or GetFile processors and the MapR NFS. If you've searched high and low on how to do this, you've likely read articles and GitHub projects specifying steps. I've tried those steps without success, meaning what's out there is either too complicated or too outdated to get NiFi reading from and writing to MapR. You don't need to recompile the HDFS processors with the MapR dependencies; just follow the steps below:
1) Install the MapR client on each NiFi node
#Install syslinux (for rpm install)
sudo yum install syslinux
#Download the RPM for your OS http://package.mapr.com/releases/v6.0.0/redhat/
rpm -Uvh mapr-client-6.0.0.20171109191718.GA-1.x86_64.rpm
#Configure the mapr client connecting with the cldb
/opt/mapr/server/configure.sh -c -N ryancicak.com -C cicakmapr0.field.hortonworks.com:7222 -genkeys -secure
#Once you have the same users/groups on your OS (as MapR), you will be able to use maprlogin password (allowing you to login with a Kerberos ticket)
#Prove that you can access the MapR FS
hadoop fs -ls /
2) Mount the MapR FS on each NiFi node
sudo mount -o hard,nolock cicakmapr0.field.hortonworks.com:/mapr /mapr
*This will allow you to access the MapR FS at the mount point /mapr/yourclustername.com/location
3) Use the PutFile and GetFile processors, referencing the /mapr directory on your NiFi nodes
*Following steps 1-3 allows you to quickly read from and write to MapR using NiFi. (A quick sanity check of the mount is sketched below.)
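Since the NFS mount makes the MapR FS look like a local filesystem, a quick read/write check through the mount point confirms that PutFile/GetFile will work against it. This is just a sketch: the path below assumes the cluster name from step 1 (ryancicak.com) and a tmp directory that may not exist in your environment.

# Minimal sketch: confirm the NFS mount behaves like a local filesystem,
# which is exactly why plain PutFile/GetFile work against it.
# The cluster path below is illustrative; substitute your own cluster name.
from pathlib import Path

test_file = Path("/mapr/ryancicak.com/tmp/nifi_mount_check.txt")
test_file.parent.mkdir(parents=True, exist_ok=True)

# Write through the mount, then read it back.
test_file.write_text("written through the MapR NFS mount\n")
print(test_file.read_text())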
03-22-2018
07:07 PM
1 Kudo
Step-by-step instructions on how to install a CRAN package from a local repo, without an internet connection.
I'll be installing the package called "tidyr". In order to fully install the package, I first need to download tidyr and all of its dependencies. To do this, I use the CRAN PACKAGES file:
https://cran.r-project.org/src/contrib/PACKAGES
1) Finding "Package: tidyr", I can see the dependencies (imports):
Imports: dplyr (>= 0.7.0), glue, magrittr, purrr, Rcpp, rlang, stringi,
tibble, tidyselect
and from them, I can build a list of all the imported packages I need:
assertthat_0.2.0.tar.gz BH_1.66.0-1.tar.gz bindr_0.1.1.tar.gz bindrcpp_0.2.tar.gz cli_1.0.0.tar.gz crayon_1.3.4.tar.gz dplyr_0.7.4.tar.gz glue_1.2.0.tar.gz magrittr_1.5.tar.gz pillar_1.2.1.tar.gz plogr_0.1-1.tar.gz purrr_0.2.4.tar.gz Rcpp_0.12.16.tar.gz rlang_0.2.0.tar.gz stringi_1.1.7.tar.gz tibble_1.4.2.tar.gz tidyr_0.8.0.tar.gz tidyselect_0.2.4.tar.gz utf8_1.1.3.tar.gz R6_2.2.2.tar.gz pkgconfig_2.0.1.tar.gz
2) Download the packages from https://cran.r-project.org/src/contrib/
3) Create a directory to emulate the CRAN repo, in my example I created /tmp/ryantester/src/contrib
4) Create a PACKAGES file within /tmp/ryantester/src/contrib - since this tutorial covers tidyr, I'll include the necessary packages
Package: R6
Version: 2.2.2
Depends: R (>= 3.0)
Suggests: knitr, microbenchmark, pryr, testthat, ggplot2, scales
License: MIT + file LICENSE
NeedsCompilation: no
Package: pkgconfig
Version: 2.0.1
Imports: utils
Suggests: covr, testthat, disposables (>= 1.0.3)
License: MIT + file LICENSE
NeedsCompilation: no
Package: bindr
Version: 0.1.1
Suggests: testthat
License: MIT + file LICENSE
NeedsCompilation: no
Package: bindrcpp
Version: 0.2
Imports: Rcpp, bindr
LinkingTo: Rcpp, plogr
Suggests: testthat
License: MIT + file LICENSE
NeedsCompilation: yes
Package: plogr
Version: 0.1-1
Suggests: Rcpp
License: MIT + file LICENSE
NeedsCompilation: no
Package: BH
Version: 1.66.0-1
License: BSL-1.0
NeedsCompilation: no
Package: plogr
Version: 0.1-1
Suggests: Rcpp
License: MIT + file LICENSE
NeedsCompilation: no
Package: dplyr
Version: 0.7.4
Depends: R (>= 3.1.2)
Imports: assertthat, bindrcpp (>= 0.2), glue (>= 1.1.1), magrittr,
methods, pkgconfig, rlang (>= 0.1.2), R6, Rcpp (>= 0.12.7),
tibble (>= 1.3.1), utils
LinkingTo: Rcpp (>= 0.12.0), BH (>= 1.58.0-1), bindrcpp, plogr
Suggests: bit64, covr, dbplyr, dtplyr, DBI, ggplot2, hms, knitr, Lahman
(>= 3.0-1), mgcv, microbenchmark, nycflights13, rmarkdown,
RMySQL, RPostgreSQL, RSQLite, testthat, withr
License: MIT + file LICENSE
NeedsCompilation: yes
Package: utf8
Version: 1.1.3
Depends: R (>= 2.10)
Suggests: corpus, knitr, rmarkdown, testthat
License: Apache License (== 2.0) | file LICENSE
NeedsCompilation: yes
Package: assertthat
Version: 0.2.0
Imports: tools
Suggests: testthat
License: GPL-3
NeedsCompilation: no
Package: cli
Version: 1.0.0
Depends: R (>= 2.10)
Imports: assertthat, crayon, methods
Suggests: covr, mockery, testthat, withr
License: MIT + file LICENSE
NeedsCompilation: no
Package: crayon
Version: 1.3.4
Imports: grDevices, methods, utils
Suggests: mockery, rstudioapi, testthat, withr
License: MIT + file LICENSE
NeedsCompilation: no
Package: pillar
Version: 1.2.1
Imports: cli (>= 1.0.0), crayon (>= 1.3.4), methods, rlang (>= 0.2.0),
utf8 (>= 1.1.3)
Suggests: knitr (>= 1.19), lubridate, testthat (>= 2.0.0)
License: GPL-3
NeedsCompilation: no
Package: tidyselect
Version: 0.2.4
Depends: R (>= 3.1)
Imports: glue, purrr, rlang (>= 0.2.0), Rcpp (>= 0.12.0)
LinkingTo: Rcpp (>= 0.12.0),
Suggests: covr, dplyr, testthat
License: GPL-3
NeedsCompilation: yes
Package: tibble
Version: 1.4.2
Depends: R (>= 3.1.0)
Imports: cli, crayon, methods, pillar (>= 1.1.0), rlang, utils
Suggests: covr, dplyr, import, knitr (>= 1.5.32), microbenchmark,
mockr, nycflights13, rmarkdown, testthat, withr
License: MIT + file LICENSE
NeedsCompilation: yes
Package: stringi
Version: 1.1.7
Depends: R (>= 2.14)
Imports: tools, utils, stats
License: file LICENSE
License_is_FOSS: yes
NeedsCompilation: yes
Package: rlang
Version: 0.2.0
Depends: R (>= 3.1.0)
Suggests: crayon, knitr, methods, pillar, rmarkdown (>= 0.2.65),
testthat, covr
License: GPL-3
NeedsCompilation: yes
Package: Rcpp
Version: 0.12.16
Depends: R (>= 3.0.0)
Imports: methods, utils
Suggests: RUnit, inline, rbenchmark, knitr, rmarkdown, pinp, pkgKitten
(>= 0.1.2)
License: GPL (>= 2)
NeedsCompilation: yes
Package: purrr
Version: 0.2.4
Depends: R (>= 3.1)
Imports: magrittr (>= 1.5), rlang (>= 0.1), tibble
Suggests: covr, dplyr (>= 0.4.3), knitr, rmarkdown, testthat
License: GPL-3 | file LICENSE
NeedsCompilation: yes
Package: magrittr
Version: 1.5
Suggests: testthat, knitr
License: MIT + file LICENSE
NeedsCompilation: no
Package: glue
Version: 1.2.0
Depends: R (>= 3.1)
Imports: methods
Suggests: testthat, covr, magrittr, crayon, knitr, rmarkdown, DBI,
RSQLite, R.utils, forcats, microbenchmark, rprintf, stringr,
ggplot2
License: MIT + file LICENSE
NeedsCompilation: yes
Package: tidyr
Version: 0.8.0
Depends: R (>= 3.2)
Imports: dplyr (>= 0.7.0), glue, magrittr, purrr, Rcpp, rlang, stringi,
tibble, tidyselect
LinkingTo: Rcpp
Suggests: covr, gapminder, knitr, rmarkdown, testthat
License: MIT + file LICENSE
NeedsCompilation: yes
5) Move the packages downloaded in steps 1 and 2 into /tmp/ryantester/src/contrib (a small download sketch follows below)
6) The final step is to install, pointing to your local repo (in our case, /tmp/ryantester):
install.packages('tidyr', repos = "file:///tmp/ryantester")
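If you need to pull the tarballs down on a machine that does have internet access, a small script can assemble the local repo layout for you. This is a sketch with assumptions: it lists only some of the packages above (extend it with the rest), and these exact versions may since have moved from src/contrib to the CRAN Archive, in which case the URL needs adjusting.

# Minimal sketch: fetch the tarballs listed above into the local repo layout
# (/tmp/ryantester/src/contrib). Assumes the exact versions are still served
# from CRAN's src/contrib directory; older versions may have moved to the
# Archive, so adjust the URL if a download fails.
import os
import urllib.request

CRAN = "https://cran.r-project.org/src/contrib/"
REPO_DIR = "/tmp/ryantester/src/contrib"

packages = [
    "tidyr_0.8.0.tar.gz", "dplyr_0.7.4.tar.gz", "glue_1.2.0.tar.gz",
    "magrittr_1.5.tar.gz", "purrr_0.2.4.tar.gz", "Rcpp_0.12.16.tar.gz",
    "rlang_0.2.0.tar.gz", "stringi_1.1.7.tar.gz", "tibble_1.4.2.tar.gz",
    "tidyselect_0.2.4.tar.gz",  # ...plus the remaining dependencies listed above
]

os.makedirs(REPO_DIR, exist_ok=True)
for pkg in packages:
    dest = os.path.join(REPO_DIR, pkg)
    urllib.request.urlretrieve(CRAN + pkg, dest)  # download into the local repo
    print("downloaded", dest)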
10-25-2017
07:37 PM
7 Kudos
Installing the Alarm Fatigue Demo via Cloudbreak:
There are multiple ways to deploy the Alarm Fatigue Demo via Cloudbreak. Below are four options:
1) Deploy via the Cloudbreak UI
a) Log in to https://cbdtest.field.hortonworks.com
b) Select your credentials; if your credentials don't exist, create them under "Manage Credentials"
c) Once your credentials are selected, click "Create Cluster"
d) Make up a cluster name, choose the Availability Zone (SE), and then click "Setup Network and Security"
e) "fieldcloud-openstack-network" should be selected; click "Choose Blueprint"
f) Select the blueprint called "alarm_fatigue_v2":
   Host Group 1 (select Ambari Server, alarm-fatigue-demo, and pre-install-java8)
   Host Group 2 (select pre-install-java8)
   Host Group 3 (select pre-install-java8)
g) Click "Review and Launch"
h) Click "Create and start cluster" (after clicking, the deployment via Cloudbreak will likely take 30-50 minutes, so go get a coffee)
2) Deploy via Bash Script (specifying configuration file)
Create file .deploy.config with the following
Version=0.5
CloudBreakServer=https://cbdtest.field.hortonworks.com
CloudBreakIdentityServer=http://cbdtest.field.hortonworks.com:8089
CloudBreakUser=admin@example.com
CloudBreakPassword=yourpassword
CloudBreakCredentials=
CloudBreakClusterName=alarmfatigue-auto
CloudBreakTemplate=openstack-m3-xlarge
CloudBreakRegion=RegionOne
CloudBreakSecurityGroup=openstack-connected-platform-demo-all-services-port-v3
CloudBreakNetwork=fieldcloud-openstack-network
CloudBreakAvailabilityZone=SE
Change the values above to match your environment
Then execute the following:
wget -O - https://raw.githubusercontent.com/ryancicak/northcentral_hackathon/master/CloudBreakArtifacts/cloudbreak-cmd/deployer.sh | bash
3) Deploy via Bash Script, entering configurations when prompted
Just execute:
wget -O - https://raw.githubusercontent.com/ryancicak/northcentral_hackathon/master/CloudBreakArtifacts/cloudbreak-cmd/deployer.sh | bash
and fill out the information as prompted.
4) Deploy via Jenkins
All four options will deploy, install, configure, and run all necessary services, including "Alarm Fatigue Demo Control". (A small automation sketch for option 2 follows.)
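For option 2, the config file and the wget pipeline can be scripted together. The sketch below is a convenience wrapper built on assumptions (chiefly that the deployer reads .deploy.config from the current working directory, as described above) rather than anything shipped with the demo; replace the password placeholder before running.

# Minimal sketch for option 2: write .deploy.config and run the deployer.
# Assumes the deployer script reads .deploy.config from the current directory,
# as described above; replace the password placeholder with your own.
import subprocess

config = """Version=0.5
CloudBreakServer=https://cbdtest.field.hortonworks.com
CloudBreakIdentityServer=http://cbdtest.field.hortonworks.com:8089
CloudBreakUser=admin@example.com
CloudBreakPassword=yourpassword
CloudBreakCredentials=
CloudBreakClusterName=alarmfatigue-auto
CloudBreakTemplate=openstack-m3-xlarge
CloudBreakRegion=RegionOne
CloudBreakSecurityGroup=openstack-connected-platform-demo-all-services-port-v3
CloudBreakNetwork=fieldcloud-openstack-network
CloudBreakAvailabilityZone=SE
"""

with open(".deploy.config", "w") as f:
    f.write(config)

# Equivalent of: wget -O - <deployer.sh URL> | bash
deployer_url = ("https://raw.githubusercontent.com/ryancicak/northcentral_hackathon/"
                "master/CloudBreakArtifacts/cloudbreak-cmd/deployer.sh")
subprocess.run(f"wget -O - {deployer_url} | bash", shell=True, check=True)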
09-22-2017
02:15 PM
Hi @Michael Vogt,
To greatly simplify regular expressions for fixed-width files, you can use the Grok language. The "ExtractGrok" processor can be used to pull out fixed-length values, for example: https://groups.google.com/forum/#!topic/logstash-users/7FETqn3PB1M
Using the following data:
Time       Sequence Source       Destination  Action                         Data
---------- -------- ------------ ------------ ------------------------------ ------------------------------------------------------------------------------------ Start 2013/04/29 00:01:34
> 00:01:34          Yosemite                  Daily Rollover
? 02:18:56 02185130 Yosemite     bioWatch     Trak Alert WS                  Failed Return=Serial Not Found.
? 02:19:03          Yosemite                  AlertNotify                    ERROR: Conversion from string "" to type 'Date' is not valid.
* 02:19:03          Yosemite                  AlertNotify                    Failed Serial=L1234567890 Setting=AUTOREPORT
I want to be able to get the Time, Sequence, Source, Destination, Action, and Data from the fixed-length data above. Writing regular expressions can be difficult, which is why Grok was created as a simplification. I built the following workflow:
1) GetFile: fetch the file (with the data above)
2) SplitText: split the file into one flowfile per line
3) ExtractGrok: use a Grok expression to pull out Time (grok.time attribute), Sequence (grok.sequence attribute), Source (grok.source attribute), Destination (grok.destination attribute), Action (grok.action attribute), and Data (grok.data attribute)
My Grok pattern:
(?<severity>.{1}) (?<time>.{8}) (?<sequence>.{8}) (?<source>.{12}) (?<destination>.{12}) (?<action>.{30}) %{GREEDYDATA:data}
If you look at the data above, there are a total of six lines, of which five match my Grok pattern. I likely wouldn't want to collect the unmatched flowfiles, because there will always be an unmatched line when the file contains the header/separator row ending in "Start 2013/04/29 00:01:34". The Grok pattern file is attached; I used one I found on Google that has a bunch of pre-defined regular expressions. Grok will output my attributes as I define them in my Grok expression, where each FlowFile associates a group with my specified attribute.
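Under the hood, ExtractGrok is applying a regular expression much like the Grok pattern above. If it helps to see the equivalent logic outside NiFi, here is a small Python sketch (my own translation, not NiFi code) using named groups with the same field widths; the sample values come from the data above.

# Minimal sketch of what the Grok pattern above is doing, expressed as a plain
# Python regex with named groups. Field widths mirror the Grok pattern; the
# sample values are taken from the data above.
import re

PATTERN = re.compile(
    r"(?P<severity>.{1}) (?P<time>.{8}) (?P<sequence>.{8}) (?P<source>.{12}) "
    r"(?P<destination>.{12}) (?P<action>.{30}) (?P<data>.*)"
)

# Build a sample line with the same fixed widths the pattern expects.
fields = ["?", "02:18:56", "02185130", "Yosemite", "bioWatch", "Trak Alert WS"]
widths = [1, 8, 8, 12, 12, 30]
line = " ".join(f.ljust(w) for f, w in zip(fields, widths)) + " Failed Return=Serial Not Found."

match = PATTERN.match(line)
if match:
    # Each named group corresponds to a grok.* attribute in ExtractGrok.
    for field, value in match.groupdict().items():
        print(field, "=", value.strip())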
03-10-2017
12:42 AM
Hi @Raj B, can you validate that you're connecting to the Hive metastore URI (hive.metastore.uris)? Do you have Kerberos enabled?
03-08-2017
06:19 PM
10 Kudos
Many technologists ask, "What is the easiest way to import data into Hive?" It's true, you can write a Sqoop job (making sure YARN containers are tuned properly with regard to the number of mappers, the size of the mappers, and the sort allocated memory). And you'll deal with a two-step process of moving the RDBMS tables into plain text and then creating an ORC table with an INSERT INTO statement... I don't call this easy. What if there were a tool where you could simply drag and drop processes (processors) onto a canvas and connect them together to create a workflow? There is such a tool: NiFi! Before we go any further, I'll assume you know and understand NiFi: what a processor is, how to schedule a processor, what a connection (relationship) is, what the different relationship types are, and so on.

Small-to-medium RDBMS tables
This article covers small to medium sized RDBMS tables. This means I can run a single "select * from myrdbmstable;" without returning millions of rows (where we'd be taxing the RDBMS). I'll write a second article on how to use the GenerateTableFetch processor, which generates select queries that fetch "pages" of rows from a table. Paging through a table distributes the select queries among multiple NiFi nodes, similar to what Sqoop does with its mappers, where each mapper pages through results.

Prerequisites for Hive
In order to stream data into Hive, we'll use Hive's transactional capabilities, which require the following:
1) Enable ACID in Hive
2) Use bucketing on your Hive table
3) Store the Hive table as ORC
4) Set the following property on your Hive table: TBLPROPERTIES ("transactional"="true")
Items 1-4 are covered below in Step 2.

Step 1 - Query an RDBMS table using the QueryDatabaseTable processor
As described above, choose a small-to-medium sized RDBMS table (we'll tackle large database tables in another article).
a) Add the QueryDatabaseTable processor to your canvas. You'll see we need to choose a Database Connection Pooling Service (defined below), add a table name, and finally create a success relationship.
b) Create an RDBMS connection pooling service: right-click on the processor and go to Configure.
c) Under the "Properties" tab, click the "Database Connection Pooling Service" property and click "Create new service..." in the drop-down.
d) Choose "DBCPConnectionPool" and click Create.
e) Click the arrow to go to the DBCPConnectionPool configuration.
f) Click the pencil to edit the DBCPConnectionPool.
g) Change the name in Settings to something that easily identifies your database connection pool.
h) Finally, go to Properties and define your connection. I'm creating a connection for MySQL, but the general rule of thumb is: if a JDBC driver exists, you'll be able to connect (for example, I wrote an article on connecting to Teradata from within NiFi).
Database Connection URL: jdbc:mysql://localhost:3306/hortonworks
Database Driver Class Name: com.mysql.jdbc.Driver
Database Driver Location(s): /Users/rcicak/Desktop/mysql-connector-java-5.0.8-bin.jar
Click Apply, and you've created your RDBMS connection pool.
i) Enable your connection pool.
j) Define a table name. In our case we'll choose "people3", where the table people3 is described in MySQL as follows. *Note: Don't forget the Maximum-value Columns property; QueryDatabaseTable uses it to keep track of the last row that was fetched.
mysql> DESCRIBE people3;
+-------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+--------------+------+-----+---------+-------+
| id | datetime | NO | PRI | NULL | |
| name | varchar(255) | YES | | NULL | |
| age | int(11) | YES | | NULL | |
+-------+--------------+------+-----+---------+-------+
3 rows in set (0.00 sec)
k) Change the schedule of the QueryDatabaseTable processor to run every 10 minutes (or whatever your requirement calls for), so the select statement isn't executed multiple times per second.

Step 2 - Stream the RDBMS rows into Hive using the PutHiveStreaming processor
a) Add the PutHiveStreaming processor to your canvas.
b) Create a success relationship between QueryDatabaseTable (Step 1) and PutHiveStreaming (Step 2).
c) The caution on QueryDatabaseTable went away after we added the relationship; now we need to resolve the caution on the PutHiveStreaming processor.
d) Right-click on the PutHiveStreaming processor, choose Configure, and go to the "Properties" tab, adding the following Hive configurations:
Hive Metastore URI: thrift://cregion-hdpmaster2.field.hortonworks.com:9083
Database Name: default
Table Name: people3
*Adjust "Transactions per Batch" according to your SLA requirements.
e) Verify that your Hive table "people3" exists. As explained in the Hive prerequisites above, you'll need ACID enabled, the table stored as ORC, the transactional=true table property, and bucketing in order for PutHiveStreaming to work:
create table people3 (id timestamp, name varchar(255), age int) CLUSTERED BY(id) INTO 3 BUCKETS STORED AS ORC tblproperties("transactional"="true");
You've successfully imported your RDBMS table into Hive with two processors. Wasn't that easy? (For a sense of what the Maximum-value Columns tracking in step 1(j) is doing, see the sketch below.)
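To make the Maximum-value Columns note concrete, here is a small sketch (plain Python with SQLite, not NiFi code) of the incremental-fetch idea: remember the highest value seen for a column and only pull newer rows on the next run. The table and values are illustrative; the real people3 table above uses a datetime id.

# Minimal sketch (not NiFi code) of the idea behind QueryDatabaseTable's
# Maximum-value Columns: remember the largest value seen for a column and
# only fetch rows beyond it on the next run. Uses an in-memory SQLite table
# as a stand-in for the RDBMS; the values here are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people3 (id INTEGER, name TEXT, age INTEGER)")
conn.executemany("INSERT INTO people3 VALUES (?, ?, ?)",
                 [(1, "Ann", 31), (2, "Bob", 44), (3, "Cal", 27)])

last_max_id = 0  # state that QueryDatabaseTable persists between runs

def incremental_fetch():
    """Fetch only rows with id greater than the last maximum seen."""
    global last_max_id
    rows = conn.execute(
        "SELECT id, name, age FROM people3 WHERE id > ? ORDER BY id",
        (last_max_id,),
    ).fetchall()
    if rows:
        last_max_id = rows[-1][0]  # advance the stored maximum
    return rows

print(incremental_fetch())   # first run returns all three rows
conn.execute("INSERT INTO people3 VALUES (4, 'Dee', 52)")
print(incremental_fetch())   # second run returns only the new row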
03-02-2017
03:23 PM
Hi Sunile, as we discussed yesterday, I found this while installing HDP 2.5.3 using Ambari 2.4.2. Looking further into it, RHEL 7.3 ships with snappy 1.1.0-3.el7, while HDP 2.5.3 needs snappy 1.0.5-1.el6.x86_64. I spun up a RHEL 7.3 instance and confirmed that snappy 1.1.0-3.el7 came pre-installed. As Jay posted, looking at the latest documentation for Ambari 2.4.2, I found this problem under "Resolving Cluster Deployment Problems"; there should be a bug fix for RHEL 7 (so we don't rely on a RHEL 6 dependency): https://docs.hortonworks.com/HDPDocuments/Ambari-2.4.2.0/bk_ambari-troubleshooting/content/resolving_cluster_install_and_configuration_problems.html What do you think?
12-17-2016
08:30 PM
1 Kudo
Hi @Sherry Noah, can you try:
spark-submit \
--class SparkSamplePackage.SparkSampleClass \
--master yarn-cluster \
--num-executors 2 \
--driver-memory 1g \
--executor-memory 1g \
--executor-cores 1 \
--files /usr/hdp/current/spark-client/conf/hive-site.xml \
--jars /usr/hdp/current/spark-client/lib/datanucleus-api-jdo-3.2.6.jar,/usr/hdp/current/spark-client/lib/datanucleus-rdbms-3.2.9.jar,/usr/hdp/current/spark-client/lib/datanucleus-core-3.2.10.jar,target/SparkSample-1.0-SNAPSHOT.jar