I've been searching for information on this for a long time without finding anything, and I'm starting to think it can't be done when the .parquet files are in Azure Data Lake Storage.
I have a folder with subfolders in Azure Data Lake Storage. These subfolders contain many .parquet files, and I want to dump the data from those files into a database table.
If I'm interpreting the code that generates them correctly, the .parquet files have several kinds of columns:
Some whose names correspond exactly to the column names in the table.
Others whose names are lowercase in the .parquet while they are uppercase in the table.
Others whose names are not in the table at all. I don't need these and don't want to load their data into the table.
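To make the three cases concrete, here is a small sketch that partitions column names the way I described. All the column names in it are hypothetical, purely for illustration:

```python
# Hypothetical column names, only to illustrate the three cases above.
parquet_cols = ["ID", "customer_name", "extra_debug_col"]
table_cols = ["ID", "CUSTOMER_NAME", "CREATED_AT"]

table_lower = {c.lower(): c for c in table_cols}

# Case 1: names that match the table exactly.
exact = [c for c in parquet_cols if c in table_cols]
# Case 2: names that match a table column only when case is ignored.
case_only = [c for c in parquet_cols
             if c not in table_cols and c.lower() in table_lower]
# Case 3: names with no counterpart in the table at all.
extra = [c for c in parquet_cols
         if c not in table_cols and c.lower() not in table_lower]

print(exact)      # → ['ID']
print(case_only)  # → ['customer_name']
print(extra)      # → ['extra_debug_col']
```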
I manage to get the .parquet files out of Azure Data Lake Storage using the ListAzureDataLakeStorage + FetchAzureDataLakeStorage combination. Then I pass them directly to a PutDatabaseRecord processor to load them into the DB table.
As the "Record Reader" property of PutDatabaseRecord I use a ParquetReader. This reader doesn't have much configuration, just one property called "Avro Read Compatibility", which I have set to true.
Given the similarities and differences I described above between the .parquet columns and the columns of the target table, I have configured the following PutDatabaseRecord properties like this:
"Translate Field Names" -> true
"Unmatched Field Behavior" -> Ignore Unmatched Fields
"Unmatched Column Behavior" -> Ignore Unmatched Columns
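My understanding of "Translate Field Names" (and I may be wrong about the exact rule) is that PutDatabaseRecord normalizes both the record field names and the table column names, something like uppercasing and dropping underscores, so that a lowercase parquet field can still match an uppercase table column. A rough sketch of that idea, not NiFi's actual code:

```python
def normalize(name: str) -> str:
    # My guess at the kind of translation applied: uppercase the name
    # and drop underscores. The real NiFi logic may differ.
    return name.upper().replace("_", "")

# Hypothetical names: a lowercase parquet field vs an uppercase DB column.
print(normalize("customer_name"))                                # → CUSTOMERNAME
print(normalize("customer_name") == normalize("CUSTOMER_NAME"))  # → True
```

If that is roughly what the processor does, my case-2 columns (lowercase in the .parquet, uppercase in the table) should match, which is why I enabled the property.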
With all this, I think PutDatabaseRecord is configured correctly, but when it executes it gives me the following error:
"Session could not be processed due to Failed to process StandardFlowFileRecord due to java.lang.NullPointerException: Name is null."
Which "Name" is NiFi referring to? As far as I know there is no column in the .parquet called "Name", and even if there were, with this configuration I'd expect it to work. What am I doing wrong?
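One thing I've thought about checking: since the ParquetReader exposes an "Avro Read Compatibility" property, I assume it converts the parquet schema to Avro internally, and the Avro spec only allows names matching `[A-Za-z_][A-Za-z0-9_]*`. Maybe a field name that is empty or contains an odd character makes that conversion fail with the NullPointerException about a null name. A quick stdlib check (the field names here are hypothetical):

```python
import re

# Valid Avro names per the Avro specification.
AVRO_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

# Hypothetical parquet field names, just for illustration.
fields = ["ID", "customer_name", "años", "2col", ""]

# Any name here would be rejected by a strict Avro schema.
bad = [f for f in fields if not AVRO_NAME.match(f)]
print(bad)  # → ['años', '2col', '']
```

I haven't confirmed this is the cause; it's just the only lead I have for where a "Name is null" could come from.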
The only thing I can think of is that the ListAzureDataLakeStorage + FetchAzureDataLakeStorage combination doesn't deliver the data in a form that PutDatabaseRecord understands. I'm not sure how to put it, but I suspect it doesn't do the same thing a FetchParquet would: FetchParquet seems to read a .parquet file and put it into the FlowFile as records. But I can't use FetchParquet, because it only works with ListHDFS or ListFile.
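If the problem really were the content coming out of FetchAzureDataLakeStorage, one cheap sanity check would be to dump a fetched FlowFile to disk and confirm it still looks like a real parquet file: parquet files start and end with the 4-byte magic `PAR1`. A minimal stdlib check (the file path in the comment is hypothetical):

```python
def looks_like_parquet(data: bytes) -> bool:
    # Parquet files begin and end with the 4-byte magic "PAR1".
    return len(data) > 8 and data[:4] == b"PAR1" and data[-4:] == b"PAR1"

# In practice I'd read the bytes of a FlowFile dumped out of the flow, e.g.:
#   data = open("/tmp/flowfile.bin", "rb").read()   # hypothetical path
data = b"PAR1" + b"\x00" * 16 + b"PAR1"  # fake payload, just to demo
print(looks_like_parquet(data))  # → True
```

If the fetched content passed this check, I'd at least know the bytes themselves are intact and the issue is in how the reader interprets them.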