Created on 03-08-2017 06:19 PM - edited 08-17-2019 01:55 PM
Many technologists ask, "What is the easiest way to import data into Hive?" True, you can write a Sqoop job (making sure the YARN containers are tuned properly with regard to the number of mappers, the memory per mapper, and the sort allocation memory). And then you'll deal with a two-step process: moving the RDBMS tables into plain text, then creating an ORC table and populating it with an INSERT INTO statement... I don't call this easy.
What if there were a tool where you could simply drag and drop processes (processors) onto a canvas and connect them together to create a workflow? There is such a tool: NiFi! Before we go any further, I'll assume you know and understand NiFi basics - what a processor is, how to schedule a processor, what a connection (relationship) is, the different relationship types, etc.
Small-Medium RDBMS tables
This article covers small to medium sized RDBMS tables. This means I can run a single
select * from myrdbmstable;
without returning millions of rows (and without taxing the RDBMS). I'll write a second article on how to use the GenerateTableFetch processor, which generates SELECT queries that fetch "pages" of rows from a table. Paging through the rows distributes the SELECT queries among multiple NiFi nodes, similar to what Sqoop does with its number of mappers, where each mapper pages through the results.
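To make the paging idea concrete, here is a sketch of the kind of queries GenerateTableFetch produces. The table name, key column, and page size are placeholders, and the exact LIMIT/OFFSET syntax varies by RDBMS:

```sql
-- Page size of 10000 rows, ordered by a hypothetical key column "id";
-- each query can be handed to a different NiFi node for execution.
SELECT * FROM myrdbmstable ORDER BY id LIMIT 10000 OFFSET 0;
SELECT * FROM myrdbmstable ORDER BY id LIMIT 10000 OFFSET 10000;
SELECT * FROM myrdbmstable ORDER BY id LIMIT 10000 OFFSET 20000;
```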
Prerequisites for Hive
In order to stream data into Hive, we'll utilize Hive's transactional capabilities which require the following:
1) Enable ACID in Hive
2) Use bucketing on your Hive table
3) Store the Hive table as ORC
4) Set the following property on your Hive table TBLPROPERTIES ("transactional"="true")
Items 1-4 are covered below in Step 2.
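Enabling ACID (prerequisite 1) is typically done in hive-site.xml (or via Ambari). The following is a sketch of the commonly required settings - verify the exact property set against your Hive version:

```xml
<!-- Sketch of hive-site.xml settings commonly needed for Hive ACID -->
<property>
  <name>hive.support.concurrency</name>
  <value>true</value>
</property>
<property>
  <name>hive.txn.manager</name>
  <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
</property>
<property>
  <name>hive.compactor.initiator.on</name>
  <value>true</value>
</property>
<property>
  <name>hive.compactor.worker.threads</name>
  <value>1</value>
</property>
<property>
  <name>hive.enforce.bucketing</name>
  <value>true</value>
</property>
<property>
  <name>hive.exec.dynamic.partition.mode</name>
  <value>nonstrict</value>
</property>
```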
Step 1 - Query an RDBMS table using the QueryDatabaseTable processor
As described above, choose a small-medium sized RDBMS table (we'll tackle large database tables in another article).
a) Add the QueryDatabaseTable processor to your canvas
You'll see that we need to choose a Database Connection Pooling Service (which we'll define below), add a table name, and finally create a success relationship
b) Create an RDBMS Connection Pooling Service - Right click on the processor and go to Configure
c) Under the "Properties" tab, click the property "Database Connection Pooling Service" and click "Create new service..." on the drop-down
d) Choose the "DBCPConnectionPool" and click on Create
e) Click on the arrow to go to configure the DBCPConnectionPool
f) Click on the pencil to Edit the DBCPConnectionPool
g) Change the name in Settings - to something that is easily identifiable for your database connection pool
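Under the Properties tab of the DBCPConnectionPool you'll also fill in the connection details. A sketch of what this might look like - the MySQL URL, driver path, and user below are placeholder assumptions, so substitute the values for your own RDBMS:

```text
Database Connection URL      : jdbc:mysql://myrdbmshost:3306/mydb
Database Driver Class Name   : com.mysql.jdbc.Driver
Database Driver Location(s)  : /path/to/mysql-connector-java.jar
Database User                : myuser
Password                     : ********
```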
*Adjust "Transactions per Batch" according to your SLA requirements
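For reference, the PutHiveStreaming processor needs to know where Hive lives and which table to stream into. A sketch of its key properties - the metastore host and the values shown are assumptions for this example:

```text
Hive Metastore URI     : thrift://myhivehost:9083
Database Name          : default
Table Name             : people3
Transactions per Batch : 100
```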
h) Verify your Hive table "people3" exists - as explained in the Hive prerequisites above, PutHiveStreaming requires ACID enabled, the table stored as ORC, the table property transactional set to true, and bucketing
CREATE TABLE people3 (id timestamp, name varchar(255), age int)
CLUSTERED BY (id) INTO 3 BUCKETS
STORED AS ORC
TBLPROPERTIES ("transactional"="true");
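Once the flow is running, you can sanity-check that rows are arriving with a minimal query (assuming the people3 table above):

```sql
-- The row count should grow as PutHiveStreaming commits transactions.
SELECT COUNT(*) FROM people3;
```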
You've successfully imported your RDBMS table into Hive with two processors - wasn't that easy?