Created on 03-08-201702:02 PM - edited 08-17-201901:56 PM
We needed to create a data lake of all the companies data, the first set of data was from SQL Server. So using Apache NiFi 1.1.x I ingested it into Hive / ORC. A few of the smaller constantly changing tables need to stay in SQL Server so we need to be able to join tables in Hive with tables in SQL Server. Fortunately, Microsoft provides a very cool extension to SQL Server called Polybase that let's us build external tables pointing to Hadoop. Once those tables are referenced they act like regular tables. So now all the companies data including other data sources loaded into Hadoop Hive ORC tables can be queried and joined. And it's fast!
Step 1: Apache NiFi Magic
QueryDatabaseTable: One for each table picking a sequence id primary key the tables have. Could also do timestamp.
ConvertAVROtoORC: Point to /etc/hive/conf/hive-site.xml
PutHDFS: Store in a separate HDFS, write this down as we need it for polybase.
ReplaceText (GenerateHiveDDL): Builds create Hive table string automagically. You can do this manually.
PutHiveQL: Runs the Hive table creation DDL. You can do this manually.
For my example, I had six tables to do. So I just copied that set of processors and made 5 copies and changed table names and HDFS directories. That's all folks.
Step 2: Prepare Polybase
Change the yarn-site.xml
file on the SQL Server machine to point to the HDP 2.5.
Step 3: Run DDL Necessary for Polybase Access to Hadoop
CREATE EXTERNAL DATA SOURCE [HDP2]
WITH( TYPE = HADOOP, LOCATION = 'hdfs://hadoopserver:8020')
CREATE EXTERNAL FILE FORMAT ORC WITH (
FORMAT_TYPE =ORC );
( [myid] int NULL,
[yourid] int NULL,
[theirid] varchar(64) NULL,
[somedata] varchar(255) NULL, [somedata] int NULL)
DATA_SOURCE =HDP2, FILE_FORMAT =ORC );
2b. Create an external data source pointing to your HDFS on HDP
2c. Create an external file format like ORC that your tables use. I recommend ORC.
2d. Create your external tables pointing to their HDFS directories containing ORC files.
Step 4: Polybase Federated Query
SELECT TOP (1000) [id], hs.[id], hs.[name], c.[description]
FROM [database].[dbo].[MyHiveTable] hs,
WHERE c.id = hs.id
ORDER BY c.name desc
You don't get much easier than that. Looks like a regular table, acts like a regular table, queries like a regular table. Users won't know or care where the data is. They don't have to know you have 100 petabytes of data sitting in a massive Hortonworks Data Platform.