Created 07-20-2020 06:15 PM
Hello,
I am working on a project where I need to run multiple HiveQL statements in a sequence, then store the output for each of those HiveQL statement in a variable and then inserting those output variables in a SQL Database - using NiFi processors.
Also Can I use SelectHiveQL processor for running multiple Hive Queries in a sequence?
Ideally what should be the NiFi Flow for the above work?
Please let me know as early as possible.
Thanks!
Created 07-23-2020 02:30 AM
Hi....yes you can use selecthiveql to run multiple hiveql statements. I would suggest putting the sql statements in a file / files and use list file and putfile processors before selecthiveql....Data is stored in flowfile content so you do not need to have a variable or anything. Finally you can use PutDatabaseRecord to write it to sql db.
Ideal flow with no transformations :
ListFile->FetchFIle->SelectHiveql->putdatabaserecord
Hope this helps.
Created on 07-23-2020 10:00 AM - edited 07-23-2020 10:01 AM
Thank You Sagar for your inputs!
I am thinking on the following lines regarding the NiFi Flow:
GetFile (To fetch input file from a specific directory) >> PutHDFS (Write Input File data to HDFS) >> PutHiveQL (Through "LOAD PATH INFILE..." load the data into Hive) >> SelectHiveQL (Execute the Semicolon delimited Hive QL's as Input & output in Avro format) >> ConvertAvroToJson (Direct Mapping from Avro field to JSON) >> SplitJson (Splits a JSON File into multiple separate FlowFiles) >> EvaluateJSONPath (Evaluate JSON Path expressions against content of a FlowFile & to assign the results to FlowFile attributes) >> ConvertJSONToSQL (Converts a JSON Formatted FlowFile into INSERT SQL Statement) >> PutSQL (Execute the SQL INSERT statement to actually insert the data into the Destination SQL Database)
Please can you validate if the above flow will work ?
Also I have a another question as follows:
If I select multiple column values from a Hive Table in a HiveQL (e.g. "Select name, age, grade from student table" ) and retriving say 10 records for this query and further I want to insert these records / values into destination SQL Database table then what should the flow look like?
Can I use the flow I described above to solve this requirement.
Kindly share your thoughts / inputs.
Thanks & Regards in advance !
Created 07-24-2020 12:32 AM
You can use the flow you have in mind. But, i would suggest some optimization for your flow here:
PutHDFS & PutHiveQL -> I suggest you create an external table on HDFS so you can skip the puthiveql
Rest of your flow is to make SQL statements -> As i suggested before, you can look at PutDatabaseRecord since it has a recordreader(in your case AvroReader) and you can mention what type of sql query needs to be generated. This will help you with all the conversions and make the flow much much faster.
The process to retrieve selection columns you can still use the same flow, just change the query you retrieve from getfile.
Hope this helps.
Created 07-28-2020 09:50 AM
Hi Sagar,
Thanks a lot again for your inputs. From performance optimization perspective your ideas worked perfectly. PutDatabaseRecord worked like a charm ! The solution is working now.
Thank You.
Created 07-23-2020 02:30 AM
Hi....yes you can use selecthiveql to run multiple hiveql statements. I would suggest putting the sql statements in a file / files and use list file and putfile processors before selecthiveql....Data is stored in flowfile content so you do not need to have a variable or anything. Finally you can use PutDatabaseRecord to write it to sql db.
Ideal flow with no transformations :
ListFile->FetchFIle->SelectHiveql->putdatabaserecord
Hope this helps.
Created on 07-23-2020 10:00 AM - edited 07-23-2020 10:01 AM
Thank You Sagar for your inputs!
I am thinking on the following lines regarding the NiFi Flow:
GetFile (To fetch input file from a specific directory) >> PutHDFS (Write Input File data to HDFS) >> PutHiveQL (Through "LOAD PATH INFILE..." load the data into Hive) >> SelectHiveQL (Execute the Semicolon delimited Hive QL's as Input & output in Avro format) >> ConvertAvroToJson (Direct Mapping from Avro field to JSON) >> SplitJson (Splits a JSON File into multiple separate FlowFiles) >> EvaluateJSONPath (Evaluate JSON Path expressions against content of a FlowFile & to assign the results to FlowFile attributes) >> ConvertJSONToSQL (Converts a JSON Formatted FlowFile into INSERT SQL Statement) >> PutSQL (Execute the SQL INSERT statement to actually insert the data into the Destination SQL Database)
Please can you validate if the above flow will work ?
Also I have a another question as follows:
If I select multiple column values from a Hive Table in a HiveQL (e.g. "Select name, age, grade from student table" ) and retriving say 10 records for this query and further I want to insert these records / values into destination SQL Database table then what should the flow look like?
Can I use the flow I described above to solve this requirement.
Kindly share your thoughts / inputs.
Thanks & Regards in advance !
Created 07-24-2020 12:32 AM
You can use the flow you have in mind. But, i would suggest some optimization for your flow here:
PutHDFS & PutHiveQL -> I suggest you create an external table on HDFS so you can skip the puthiveql
Rest of your flow is to make SQL statements -> As i suggested before, you can look at PutDatabaseRecord since it has a recordreader(in your case AvroReader) and you can mention what type of sql query needs to be generated. This will help you with all the conversions and make the flow much much faster.
The process to retrieve selection columns you can still use the same flow, just change the query you retrieve from getfile.
Hope this helps.
Created 07-28-2020 09:50 AM
Hi Sagar,
Thanks a lot again for your inputs. From performance optimization perspective your ideas worked perfectly. PutDatabaseRecord worked like a charm ! The solution is working now.
Thank You.