Created 01-27-2017 06:23 PM
I want to use NiFi to ingest CSV files (GetFile) into Hive (PutHiveQL).
Since PutHiveQL needs a SQL statement, how can I generate it?
A solution would be GetFile -> InferAvroSchema -> ConvertCSVToAvro -> ConvertAvroToJSON -> ConvertJSONToSQL -> PutHiveQL.
This seems overly complex and resource-consuming. Any suggestions?
Created 01-27-2017 10:11 PM
If the CSV doesn't need any work done to it and you just want to put a Hive table over the CSV(s), you can use the following flow:
GetFile -> PutHDFS -> ReplaceText -> PutHiveQL
GetFile: Retrieves the CSV file(s). You could also use ListFile -> FetchFile for this.
PutHDFS: Puts the CSV file(s) onto the HDFS file system.
ReplaceText: Replace the content of the flow file with a HiveQL DDL statement, such as "LOAD DATA INPATH ..." or "CREATE TABLE IF NOT EXISTS ..." (see the sketch after this list).
PutHiveQL: Execute the DDL command.
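For example, the ReplaceText Replacement Value could look like one of the following. This is a minimal sketch: the table name (myTable), column definitions, and LOCATION path are assumptions for illustration, and referencing the absolute.hdfs.path attribute (written by PutHDFS) together with ${filename} is one common way to point at the file just landed:

    -- Option 1: one-time DDL that lays an external table over the HDFS directory
    -- (table name, columns, and path are assumptions for illustration)
    CREATE EXTERNAL TABLE IF NOT EXISTS myTable (col1 STRING, col2 STRING, col3 INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/data/csv_landing';

    -- Option 2: load each file PutHDFS wrote into an existing table,
    -- using attributes set earlier in the flow
    LOAD DATA INPATH '${absolute.hdfs.path}/${filename}' INTO TABLE myTable;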
Alternatively, if you want to insert each row individually (note: this is slower), and you know the schema of the CSV file, you could use the following flow:
GetFile -> SplitText -> ExtractText -> ReplaceText -> PutHiveQL
GetFile: Retrieves the CSV file(s). You could also use ListFile -> FetchFile for this.
SplitText: Split the CSV file into one line/row per flow file.
ExtractText: Extract each column value into an attribute. There is an example of this in the Working_With_CSV template.
ReplaceText: Replace the content of the flow file with a HiveQL statement, using NiFi Expression Language to insert the column values, e.g. a Replacement Value of "INSERT INTO myTable VALUES ('${col1}', '${col2}', ${col3})". Note the use of quotes around columns whose values are string literals. You could also use JDBC parameters and flow file attributes: your Replacement Value would be "INSERT INTO myTable VALUES (?, ?, ?)" and you would need attributes supplying the JDBC types and values for your columns; see the PutHiveQL documentation for more details. A sketch of both variants follows this list.
PutHiveQL: Execute the INSERT command(s).
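As a sketch of the per-row configuration, assuming a three-column CSV with no quoted/embedded commas (the dynamic property name, regex, and table/column names below are illustrative, not taken from the template): in ExtractText, add a dynamic property such as csv = ([^,]+),([^,]+),([^,]+); the capture groups become attributes csv.1, csv.2, and csv.3. The ReplaceText Replacement Value would then be one of:

    -- Literal-value variant (quotes around string columns):
    INSERT INTO myTable VALUES ('${csv.1}', '${csv.2}', ${csv.3})

    -- Parameterized variant: supply hiveql.args.N.type / hiveql.args.N.value
    -- attributes (JDBC type codes, e.g. 12 = VARCHAR, 4 = INTEGER) as
    -- described in the PutHiveQL documentation:
    INSERT INTO myTable VALUES (?, ?, ?)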
If instead you need the data in a different format (Avro, JSON, ORC, etc.), then your flow will be more complex, as in your example above. NiFi is highly modular, so although a task like getting CSV into Hive sounds "simple", it actually comprises a number of smaller operations (conversions, input/output, etc.), and thus there may be several processors in your flow. Your example illustrates this modularity in terms of what format(s) the processors expect: if you want to auto-generate the SQL (versus hand-crafting it with ReplaceText), then ConvertJSONToSQL is your option, but it requires JSON, and there is no ConvertCSVToJSON processor at present, so you need the additional conversion processors. There is a Jira case to add the ability to do arbitrary format/type conversions, which would avoid the need for multiple conversion processors in a chain (as you have above).
Created 01-05-2018 08:34 PM
There is no example in the "Working_With_CSV" template of how to extract each individual field into attributes.
Created on 07-27-2023 07:42 AM - edited 07-27-2023 07:46 AM
Is this solution a better fit for streaming than PutHive3QL, for about 10 GB per day?