Created 05-19-2018 01:22 PM
Infer Schema and Create Table In Hive from NIFI Node based on Input file ex: CSV or MYSQL
Case 1: For example I have a CSV file IRIS.csv which has headers in it (Folder contains 100 IRIS<NUMBER>.csv) need to ingest all these files (APPEND) as one table in HIVE.
Currently I create a table in HIVE manually. However, I need to create table in HIVE from NIFI flow itself. So that I can parameterize the flow and ingest varieties of schema data later.
Case 2: Similarly how can I ingest Excel Data and inferschema and create table in HIVE dynamically
Case 3: I have excel where there are more than 1 Table and they have different schema, how can I use NIFI to detect table and identify schema and create table in HIVE from NIFI.
Case 4: In a folder there are multiple csv files having same schema and different schema. I want to run a flow where it can group all same schema csv and create hive table and append those data in hive.
Apologies for long post let me know if I need to post each item as separate post.
Created 05-19-2018 03:24 PM
To generate hive.ddl dynamically in NiFi we need to Use ConvertAvroToORC processor and this processor generates hive.ddl statement dynamically based on the contents of flowfile and adds as attribute to the flowfile.
Use ReplaceText processor to change the flowfile content as hive.ddl statment then use PutHiveQL processor to execute hive.ddl statement, by following this method we are going to create hive tables dynamically.
(or)
you can write a script to generate schema dynamically based on the contents of flowfile.
Case 2:
Use ConvertExcelToCSV processor then use ConvertRecord processor to convert Csvdata to Avro.
Flow:
1.ListFile //stores the state and lists file incrementally 2.FetchFile //fetches files 3.ConvertExcelTocsv //converts excel format data to csv 4.ConvertRecord //reader as csvreader and writer as avrosetwriter 5.ConvertAvroToORC 6.UpdateAttribute //to change the filename to unique i.e ${UUID()} 7.PutHDFS //HDFS directory location to store data 8.ReplaceText //use ${hive.ddl} value as replacement value and change replacement strategy as always replace 9.PutHiveQL //configure enable hive connection pool controller service
Case 1:
You don't need to use the 3rd processor from the above flow and use all the other processors.
Case3:
Use ConvertExcelTocsv processor and this processor converts each sheet in excel as new flowfile.
If you are having same sheet names then use RouteOnAttribute Processor to detect and route all the filenames to one relationship, add new property in RouteOnAttribute processor by using NiFi expression language.
Flow:
Use the case2 flow and add RouteOnAttribute Processor to route after 3.ConvertExcelToCSV
Case4:
If all the same group csv files having atleast same name then use RouteOnAttribute Processor to detect all the same group filenames to one relationship.
Example:
i'm having 100 filenames starts with nifi_pull_1.csv,nifi_pull_2.csv,....nifi_pull_100.csv, to detect all the files in the same group.
Add new property in RouteOnAttributeProcessor as
nifi_pull_tables
${filename:startsWith("nifi_pull")} //check the filename and route all the filenames starts with nifi_pull to this relation.
References:
ConvertExcelToCSVProcessor
Convert Record processor
Create hive.ddl dynamically
NiFi expression language
https://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html
RouteOnAttribute
-
If the Answer addressed your question, Click on Accept button below to accept the answer, That would be great help to Community users to find solution quickly for these kind of issues.
Created 05-21-2018 06:45 AM
@Shu
Thanks for the detailed answer. I have few other questions relating to the above.
As you had mentioned that use route based on file name. However, I have a folder with recursive folders where there are multiple CSV files and I am not sure of all the file names. In that case is there a way I can create dynamic route based on file names found during fetch and not manual entry.
Also in some cases for ex: In 2011 the csv file had 3 cols and 2017 it became 4 cols. 2 things with this
a. How can I detect schema of all files which has same file name and say there were 9 files with 3 cols and 1 file with 4 cols. In this case we need to create table with 4 cols and add all files 9+1 but 3 col files will have 1 col empty based on which is from 4 Col file.
b. When appending new files how can I make sure that it matches the existing schema of the table. But with a little change where the column might be named "Username" in hive but in file it can be "user-name" kind of dictionary mapping advanced loading it should not fail inserting the new files
With regards to excel file here is the file that is taken as example.
1. How do we ingest files which has multi table with different headers in the same sheet.
2. Is there a way we can create a table detection processor - so that it auto detects tables in excel outputs each table in the excel sheet as independent table for hive input.
3. Is there a way where we can provide configuration like below to extract all tables in all excels and sheets that match the criteria and load as table in some DB.
A4::C16 |