I have started working with NiFi. I am working on a use case to load data into Hive. I get a CSV file and then I use SplitText to split the incoming flow-file into multiple flow-files(split record by record). Then I use ConvertToAvro to convert the split CSV file into an AVRO file. After that, I put the AVRO files into a directory in HDFS and I trigger the "LOAD DATA" command using ReplaceText + PutHiveQL processor.
I'm splitting the file record by record because to get the partition value(since LOAD DATA doesn't support dynamic partitioning). The flow looks like this:
GetFile (CSV) --- SplitText (split line count :1 and header line count : 1) --- ExtractText (Use RegEx to get partition fields' values and assign to attribute) --- ConvertToAvro (Specifying the Schema) --- PutHDFS (Writing to a HDFS location) --- ReplaceText (LOAD DATA cmd with partition info) --- PutHiveQL
The thing is, since I'm splitting the CSV file into each record at a time, it generates too many avro files. For ex, if the CSV file has 100 records, it creates 100 AVRO files. Since I want to get the partition values, I have to split them by one record at a time. I want to know is there any way, we can achieve this thing without splitting record by record. I mean like batching it. I'm quite new to this so I am unable to crack this yet. Help me with this.
PS: Do suggest me if there is any alternate approach to achieve this use case.
You can try using puthdfs to push the csv to hdfs , in a tmp location. then you can insert overwrite the csv directly into an orc table using puthiveql processor.
I went with Load data command because, insert command takes time. Especially when we talk about thousands and ten-thousands of records coming in, it takes hours to insert them. Load is simple and fast since it just copies the files from HDFS into Hive's warehouse.
Here is my answer from the corresponding StackOverflow question:
Are you looking to group the Avro records based on the partitions' values, one Avro file per unique value? Or do you only need the partitions' values for some number of LOAD DATA commands (and use a single Avro file with all the records)?
If the former, then you'd likely need a custom processor or ExecuteScript, since you'd need to parse, group/aggregate, and convert all in one step (i.e. for one CSV document). If the latter, then you can rearrange your flow into:
GetFile -> ConvertCSVToAvro -> PutHDFS -> ConvertAvroToJSON -> SplitJson -> EvaluateJsonPath -> ReplaceText -> PutHiveQL
This flow puts the entire CSV file (as a single Avro file) into HDFS, then afterwards it does the split (after converting to JSON since we don't have an EvaluateAvroPath processor), gets the partition value(s), and generates the Hive DDL statements (LOAD DATA).