I can't find a way to use NiFi to handle data drift with tables in Hive.
I've used StreamSets at other companies, and within its pipelines StreamSets can manage the Hive metadata for the tables created from the raw data. For example, if I am ETLing data from a flat file that has three fields (Name, ID, DOB) and a fourth, FavoriteColor, gets added, StreamSets will automatically add that column to the Hive table I laid on top of the data during the ETL process. In NiFi, it appears I have to change the table manually.
Any ideas on how to do the same thing in NiFi?
Hello. Currently, when the schema of the data NiFi handles changes relative to the downstream Hive table, NiFi does not send any fields that are not reflected in the downstream schema. So if the upstream data changes by gaining a simple column, NiFi is unaffected and the flow to Hive should continue. If you want the new column reflected, you need to update the schema out of band, either directly in Hive or by establishing a flow that automates pushing schema updates to Hive.
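As a rough illustration of what such an out-of-band step might compute, here is a minimal Python sketch. It is not a NiFi feature or API. The function name `add_columns_ddl`, the table name `people`, and the idea that you already have the Hive column names and the incoming field types in hand are all assumptions for the example. It only builds a HiveQL `ALTER TABLE ... ADD COLUMNS` statement for fields the table does not have yet; running that statement against Hive is left to whatever client you use.

```python
def add_columns_ddl(table, hive_columns, incoming_fields):
    """Build an ALTER TABLE statement for fields missing from the Hive table.

    table           -- Hive table name (hypothetical, e.g. "people")
    hive_columns    -- column names currently defined on the table
    incoming_fields -- dict of field name -> Hive type seen in the incoming data
    Returns the statement as a string, or None if nothing is missing.
    """
    # Hive treats column names case-insensitively, so compare in lower case.
    existing = {name.lower() for name in hive_columns}
    missing = [
        (name.lower(), hive_type)
        for name, hive_type in incoming_fields.items()
        if name.lower() not in existing
    ]
    if not missing:
        return None
    cols = ", ".join(f"`{name}` {hive_type}" for name, hive_type in missing)
    return f"ALTER TABLE {table} ADD COLUMNS ({cols})"


if __name__ == "__main__":
    ddl = add_columns_ddl(
        "people",
        ["name", "id", "dob"],
        {"Name": "STRING", "ID": "INT", "DOB": "DATE", "FavoriteColor": "STRING"},
    )
    print(ddl)
```

With the Name/ID/DOB example from the question, only FavoriteColor is missing, so the statement adds that single column.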
It would be a fine feature request to add the ability to optionally align the schema of the flowing data with the Hive table schema automatically, so that as new columns arrive we can send them. It would be good to hear the general thinking on this as it relates to changes such as type changes, columns being removed, how many new columns would be considered odd, etc. There are definitely some problematic aspects to this idea, but for the safe cases it could be helpful. This would be good to discuss with the Hive team, as it would be specific to the NiFi/Hive integration.
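To make the "safe cases" question a bit more concrete, here is a small Python sketch of one way the drift could be classified before deciding whether to automate anything. This is only an assumption about how one might draw the line, not an existing NiFi or Hive behavior: the function name `classify_drift` is hypothetical, and the `max_new_columns` threshold of 5 is an arbitrary placeholder for "how many new columns would be considered odd".

```python
def classify_drift(hive_schema, incoming_schema, max_new_columns=5):
    """Compare two {column: type} mappings and report how they differ.

    Only purely additive drift, within the (arbitrary) max_new_columns
    limit, is flagged as safe to automate in this sketch.
    """
    # Normalize case so "ID"/"id" and "int"/"INT" compare as equal.
    hive = {k.lower(): v.upper() for k, v in hive_schema.items()}
    incoming = {k.lower(): v.upper() for k, v in incoming_schema.items()}

    added = sorted(set(incoming) - set(hive))
    removed = sorted(set(hive) - set(incoming))
    retyped = sorted(
        col for col in set(hive) & set(incoming) if hive[col] != incoming[col]
    )

    safe = (
        bool(added)
        and not removed
        and not retyped
        and len(added) <= max_new_columns
    )
    return {
        "added": added,
        "removed": removed,
        "retyped": retyped,
        "safe_to_automate": safe,
    }


if __name__ == "__main__":
    table = {"name": "STRING", "id": "INT", "dob": "DATE"}
    data = {"Name": "STRING", "ID": "INT", "DOB": "DATE", "FavoriteColor": "STRING"}
    print(classify_drift(table, data))
```

A type change or a dropped column makes the result unsafe here, which leaves those cases to a person rather than to an automated flow.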
NiFi in general has always handled such cases easily, and with the record processors we can automatically evolve with the schemas, in a Schema Registry compliant manner, without you having to change code, config, etc. to leverage that.
Thanks for the response. Super helpful! Do you have any documentation or an example I can leverage to try to recreate some automated process to handle data drift?
There are a lot of great blogs/docs on the NiFi record readers/writers and associated processors. For the Hive part specifically, we'll want someone more familiar with the necessary Hive metastore calls to help give guidance.