Member since: 05-18-2016
71 Posts
39 Kudos Received
6 Solutions
04-17-2017
01:24 AM
This works perfectly with Field Cloud. If you want to run some queries on Phoenix, following this along with the Phoenix and HBase tutorials makes for awesome demoable material.
11-17-2016
03:30 PM
This is a great article. Can we do Atlas tagging of fields in HBase by tagging the external table? And can Ranger policies be applied to that?
10-03-2016
10:36 PM
1 Kudo
This is awesome; with this we can demo clustered NiFi servers to clients instead of a standalone instance.
09-16-2016
12:57 PM
I had questions about the need for the triggers. The main reasons for creating triggers in MySQL are: 1) The triggers set a date/time stamp whenever a row is inserted or updated, and the NiFi processor polls on that column to pull the latest data from the RDBMS into NiFi and generate a flow file, so the date/time field is critical. 2) They also record whether the row was inserted or updated in MySQL, and that state carries over into Hive, so we know the state of the record in the source system. This second field is used only for demo purposes; it is not strictly required.
09-08-2016
03:17 AM
13 Kudos
Prerequisites

1) Download the HDP Sandbox
2) MySQL database (should already be present in the sandbox)
3) NiFi 0.6 or later (download and install a new version of NiFi, or use Ambari to install NiFi in the sandbox)

MySQL setup (Source Database)

In this setup we will create a table in MySQL and a few triggers on it to emulate transactions. These triggers record whether the change was an insert or an update, and set the timestamp on the updated/inserted row. (This is very important, as NiFi will poll this column to extract changes based on the timestamp.)

unix> mysql -u root -p
Enter password: <enter>

mysql> create database test_cdc;
mysql> create user 'test_cdc'@'localhost' identified by 'test_cdc';
mysql> GRANT ALL PRIVILEGES ON *.* TO 'test_cdc'@'%' IDENTIFIED BY 'test_cdc' WITH GRANT OPTION;
mysql> flush privileges;
mysql> exit;

unix> mysql -u test_cdc -p test_cdc
mysql> create table CDC_TEST (
         Column_A int,
         Column_B text,
         Created_date datetime,
         INFORMATION text
       );
Create Triggers in MySQL

mysql> create trigger CDC_insert
       before insert on cdc_test
       for each row
       set NEW.created_date = NOW(),
           NEW.information = 'INSERT';

mysql> create trigger CDC_UPDATE
       before update on cdc_test
       for each row
       set NEW.created_date = NOW(),
           NEW.information = 'UPDATE';
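The trigger logic above can be sanity-checked without a sandbox. Below is a minimal sketch using Python's built-in sqlite3 module; note this is SQLite, not MySQL, so the syntax differs (SQLite has no `set NEW.col` in BEFORE triggers, so AFTER triggers with an UPDATE are used to get the same effect):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cdc_test (
    Column_A INTEGER,
    Column_B TEXT,
    Created_date TEXT,
    INFORMATION TEXT
);

-- SQLite stand-in for the MySQL CDC_insert trigger:
-- stamp the new row and mark it as an INSERT.
CREATE TRIGGER cdc_insert AFTER INSERT ON cdc_test
BEGIN
    UPDATE cdc_test
    SET Created_date = datetime('now'), INFORMATION = 'INSERT'
    WHERE rowid = NEW.rowid;
END;

-- SQLite stand-in for the MySQL CDC_UPDATE trigger:
-- re-stamp the row and mark it as an UPDATE.
-- (The OF clause keeps the trigger from reacting to its own bookkeeping columns.)
CREATE TRIGGER cdc_update AFTER UPDATE OF Column_A, Column_B ON cdc_test
BEGIN
    UPDATE cdc_test
    SET Created_date = datetime('now'), INFORMATION = 'UPDATE'
    WHERE rowid = NEW.rowid;
END;
""")

conn.execute("INSERT INTO cdc_test VALUES (1, 'cdc1', NULL, NULL)")
print(conn.execute("SELECT INFORMATION FROM cdc_test").fetchone()[0])  # INSERT

conn.execute("UPDATE cdc_test SET Column_B = 'cdc1-v2' WHERE Column_A = 1")
print(conn.execute("SELECT INFORMATION FROM cdc_test").fetchone()[0])  # UPDATE
```

Either way, the effect is the same: every insert or update leaves behind a fresh timestamp and a flag saying which operation produced it, which is exactly what the polling processor needs.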
HIVE setup (Destination Database)

In Hive we create an external table with exactly the same structure as the MySQL table; NiFi will capture changes from the source and insert them into this Hive table. Using the Ambari Hive view or the Hive CLI, create the following table in the Hive default database (I used the Hive CLI):

unix> hive
hive> create external table HIVE_TEST_CDC
      ( COLUMN_A int,
        COLUMN_B string,
        CREATED_DATE string,
        INFORMATION string )
      stored as avro
      location '/test-nifi/CDC/';

Note: I am not covering how to create a managed Hive table in ORC format here; that will be covered in a different article.

NiFi Setup

This is a simple NiFi setup. The QueryDatabaseTable processor is only available among the default processors from version 0.6 of NiFi onward.

QueryDatabaseTable Processor Configuration

The configuration is very intuitive. The main things to configure are the DBCPConnectionPool and the Maximum-value Columns property. Set the latter to the date/time-stamp column, i.e. a cumulatively increasing change-management column. This is the only real limitation of the processor: it is not true CDC and relies on a single column. If data is reloaded into the table with older values in that column, it will not be replicated to HDFS or any other destination. The processor does not rely on transaction logs or redo logs the way Attunity or Oracle GoldenGate do; for a complete CDC solution, use Attunity or Oracle GoldenGate.

DBCPConnectionPool Configuration

PutHDFS Processor

Configure the Hadoop core-site.xml and hdfs-site.xml, and the destination HDFS directory, in this case /test-nifi/CDC. Make sure this directory is present in HDFS; otherwise create it using the following command:

unix> hadoop fs -mkdir -p /test-nifi/CDC

Make sure all the processors are running in NiFi.
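The incremental-extraction pattern that QueryDatabaseTable implements, and the single-column limitation described above, can be illustrated in a few lines. This is a hedged sketch using Python with SQLite in place of MySQL; the real processor keeps its maximum-value state inside NiFi, not in application code:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cdc_test (Column_A INTEGER, Created_date TEXT)")

last_max = ""  # processor state: highest Created_date seen so far

def poll(conn):
    """Emulate QueryDatabaseTable: fetch only rows past the stored maximum."""
    global last_max
    rows = conn.execute(
        "SELECT Column_A, Created_date FROM cdc_test "
        "WHERE Created_date > ? ORDER BY Created_date",
        (last_max,)).fetchall()
    if rows:
        last_max = rows[-1][1]  # advance the high-water mark
    return rows

conn.execute("INSERT INTO cdc_test VALUES (1, '2016-09-08 10:00:00')")
conn.execute("INSERT INTO cdc_test VALUES (2, '2016-09-08 10:05:00')")
print(len(poll(conn)))  # 2 -> both new rows are picked up
print(len(poll(conn)))  # 0 -> nothing new since the last poll

# The limitation: a row reloaded with an *older* timestamp is silently missed,
# because it never exceeds the stored maximum.
conn.execute("INSERT INTO cdc_test VALUES (3, '2016-09-08 09:00:00')")
print(len(poll(conn)))  # 0 -> row 3 is never replicated
```

This is why the triggers must always stamp NOW() on every insert and update: the high-water-mark approach only ever moves forward.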
Testing CDC

Run a bunch of insert statements on the MySQL database:

unix> mysql -u test_cdc -p

At the mysql CLI, run the following inserts:

insert into cdc_test values (3, 'cdc3', null, null);
insert into cdc_test values (4, 'cdc3', null, null);
insert into cdc_test values (5, 'cdc3', null, null);
insert into cdc_test values (6, 'cdc3', null, null);
insert into cdc_test values (7, 'cdc3', null, null);
insert into cdc_test values (8, 'cdc3', null, null);
insert into cdc_test values (9, 'cdc3', null, null);
insert into cdc_test values (10, 'cdc3', null, null);
insert into cdc_test values (11, 'cdc3', null, null);
insert into cdc_test values (12, 'cdc3', null, null);
insert into cdc_test values (13, 'cdc3', null, null);
select * from cdc_test;

Then go to Hive using the CLI and check whether the records were transferred over by NiFi:

hive> select * from hive_test_cdc;

Voila!
08-11-2016
03:38 PM
1 Kudo
This is a great article for anyone looking to ingest data quickly and store it in compressed formats. It works very well for POC, testing, and sandbox-type activities. I used something like this and made it production-grade at a client by automating some of the jobs using Oozie. Once the data was loaded, we also had verification scripts that audited what came in and what got dropped, and cleanup scripts that removed all the raw data from HDFS once the data was stored in Hive in compressed, partitioned ORC format. With the advent of NiFi and Spark, it is worth looking at building a NiFi processor in conjunction with Spark jobs to load the data seamlessly into Hive/HBase in compressed formats as it is being loaded.
05-26-2016
07:55 PM
In my previous project, I had to solve a very similar problem with data being ingested through the GoldenGate interface. I see you are doing several hops before the data gets persisted into Hive/HDFS. We had to write several merges before we could put the data into its final state. Maybe Apache NiFi or some other technology can solve this in a much easier way now!