We have N files, where N is not fixed; with the help of a naming convention we group several files together as one file type.
For example, the 4 files below, with the structures shown, are all treated as one file type:
File1: Col1 (mandatory, fixed position), Col2, Col3, Col4
File2: Col1 (mandatory, fixed position), Col2, Col3, Col4, Col5
File3: Col1 (mandatory, fixed position), Col2, Col5, Col4, Col3
File4: Col1 (mandatory, fixed position), Col2, Col3, Col6
That is, Col1 (the primary-key column) will always come first, but the remaining columns can change position, appear, or be missing in different incoming files.
We need a solution that can pull all files of one file type in one go and load them all into a single target Hive table.
We can create the Hive table upfront with all the possible columns, i.e. Col1 through Col6.
Our process should write all files of the same file type into one target file or one target Hive table.
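A minimal sketch of that upfront table, created through Spark's Hive support (the database and table names `staging.one_type_target` and the STRING column types are assumptions for illustration; the column list comes from the example above):

```scala
import org.apache.spark.sql.SparkSession

object CreateTarget {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CreateOneTypeTarget")
      .enableHiveSupport() // required so spark.sql talks to the Hive metastore
      .getOrCreate()

    // Target table holds the superset of all columns ever seen for this
    // file type; files missing a column will load NULL into it.
    spark.sql("""
      CREATE TABLE IF NOT EXISTS staging.one_type_target (
        Col1 STRING,
        Col2 STRING,
        Col3 STRING,
        Col4 STRING,
        Col5 STRING,
        Col6 STRING
      )
      STORED AS PARQUET
    """)

    spark.stop()
  }
}
```

The same DDL could equally be run directly in the Hive CLI or Beeline; going through Spark just keeps it in the same toolchain as the load job.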
Implementation options:
Option 1: Merge all the files of one file type into a single target file in a local directory, containing all the columns and data, and then load that file into the Hive table, or place it in HDFS as a file that can be read via a Hive external table.
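Since the files differ in column order and presence, a plain concatenation will not work; the merge has to align columns by header name. A minimal sketch in plain Scala, assuming simple comma-separated files with a header row (file paths and the `mergeFiles` name are illustrative):

```scala
import java.io.PrintWriter
import scala.io.Source

object MergeByHeader {
  // Merge CSV files that share a leading key column (Col1) but differ in
  // the order and presence of the remaining columns. Columns missing
  // from a given file are written as empty strings (NULL once in Hive).
  def mergeFiles(inputPaths: Seq[String], outputPath: String): Unit = {
    // Parse each file into (header, rows).
    val parsed = inputPaths.map { path =>
      val lines = Source.fromFile(path).getLines().toList
      val header = lines.head.split(",").map(_.trim).toSeq
      val rows   = lines.tail.map(_.split(",", -1).map(_.trim).toSeq)
      (header, rows)
    }

    // Superset of all columns seen, keeping the key column first.
    val key     = parsed.head._1.head
    val allCols = key +: parsed.flatMap(_._1).distinct.filterNot(_ == key)

    val out = new PrintWriter(outputPath)
    out.println(allCols.mkString(","))
    for ((header, rows) <- parsed; row <- rows) {
      // Look each target column up by name; absent columns become "".
      val byName = header.zip(row).toMap
      out.println(allCols.map(c => byName.getOrElse(c, "")).mkString(","))
    }
    out.close()
  }
}
```

The resulting single file has one consistent header, so it can be pushed to HDFS and read by a Hive external table, or loaded with LOAD DATA. Note this sketch does not handle quoted fields containing commas; a real CSV parser would be needed for that.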
Option 2: Pull all the files of one file type into Spark, place them in RDDs/DataFrames, and then create one target DataFrame that keeps a copy in Spark and also pushes the data into the Hive table. We are not sure how to achieve this with NiFi; we have achieved it via Scala scripting. If anyone can help us convert the Scala script into a NiFi flow, that would be a great help as well.
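For reference, the Spark approach can be sketched roughly as below (this is not necessarily the script we used; paths and table names are placeholders, and `unionByName` with `allowMissingColumns` requires Spark 3.1 or later):

```scala
import org.apache.spark.sql.SparkSession

object LoadOneFileType {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("LoadOneFileType")
      .enableHiveSupport()
      .getOrCreate()

    // All files of the same file type, e.g. matched by a naming convention.
    val paths = Seq("/data/type1/File1.csv", "/data/type1/File2.csv",
                    "/data/type1/File3.csv", "/data/type1/File4.csv")

    // header=true lets Spark pick up each file's own column names/order.
    val dfs = paths.map(p => spark.read.option("header", "true").csv(p))

    // unionByName aligns columns by name across files;
    // allowMissingColumns fills columns absent from a file with NULL.
    val merged = dfs.reduce((a, b) => a.unionByName(b, allowMissingColumns = true))

    // Project to the target table's layout. This assumes every target
    // column appears in at least one input file; otherwise add the
    // missing ones explicitly with lit(null) before selecting.
    val target = merged.select("Col1", "Col2", "Col3", "Col4", "Col5", "Col6")

    target.cache() // keep a copy in Spark, as described above
    target.write.mode("append").insertInto("staging.one_type_target")

    spark.stop()
  }
}
```

In NiFi terms, the closest equivalent would be a flow that lists/fetches the files, normalizes each record set against the superset schema (e.g. with record-based processors and an explicit schema), and writes to Hive, but we have not built that yet.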
The ideal option is to use NiFi, which pulls all the files of one file type and, on the fly, does the transformation and loads the result into a Hive table, or into HDFS as a file which can then be read via a Hive external table.