I created a flow that dumps the contents of a NetFlow file into a CSV flow file and then uses the LookupRecord processor to match the values from one of the CSV columns against HBase. There are about 1.5 million records in that CSV flow file, and matching every line against my HBase table takes a very long time. I'm matching on the rowKey in HBase. My goal is to process all 1.5 million records in under 5 minutes.
Method 1 (current implementation - the flow is below): Create a single flow file with all the data and pass it to LookupRecord. LookupRecord uses a CSVReader to parse it, matches the records against the table, and writes the results back to a CSV flow file using a CSVRecordSetWriter. This is somewhat faster than Method 2, taking about 15 minutes to process everything.
Method 2: Split the CSV into one line per flow file and use a ConvertRecord processor to convert each to JSON. Send each individual JSON flow file to LookupRecord. This creates 1.5 million flow files that HBase has to process one at a time, which, from what I can see, is not optimal.
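For reference, the kind of batching I'd hope to get is sketched below in plain Python. This is only an illustration of the idea (one round trip per chunk of row keys instead of one per record); `fetch_rows` is a hypothetical stand-in for a client call that accepts a list of row keys in a single RPC (for example, happybase's `Table.rows()`), not an actual NiFi or HBase API.

```python
def chunked(seq, size):
    """Yield successive chunks of `seq` with at most `size` items each."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

def batch_lookup(keys, fetch_rows, chunk_size=1000):
    """Resolve row keys in batches instead of one round trip per key.

    `fetch_rows` is a hypothetical callable that takes a list of row keys
    and returns a dict of {key: row_data} in one request.
    """
    results = {}
    for chunk in chunked(keys, chunk_size):
        # One RPC per chunk of 1000 keys, not one per record.
        results.update(fetch_rows(chunk))
    return results

# Example with an in-memory stand-in for the HBase table:
table = {f"key{i}": {"cf:val": i} for i in range(5000)}
resolved = batch_lookup(list(table), lambda ks: {k: table[k] for k in ks})
```

With 1.5 million records, cutting the lookups from 1.5 million round trips to ~1,500 batched requests is the sort of reduction I'm after.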
Do you have any suggestions for optimizing this process? Is there a way to increase the throughput from NiFi to HBase, or to improve the speed of HBase itself?