How can I load .parquet data from Azure Data Lake Storage into a Microsoft SQL Server database using NiFi?

I have been searching for a long time without finding an answer, and I am starting to think this cannot be done when the .parquet files are in Azure Data Lake Storage.

I have a folder with subfolders in Azure Data Lake Storage, and these subfolders contain many .parquet files. I want to load the data from these files into a database table.

If I am reading the code that generates them correctly, the .parquet files contain three kinds of columns:
Columns whose names match the table's column names exactly.
Columns whose names are lowercase in the .parquet files but uppercase in the table.
Columns that do not exist in the table. I do not need these and do not want to load their data.

I can get the .parquet files out of Azure Data Lake Storage with ListAzureDataLakeStorage followed by FetchAzureDataLakeStorage. I then pass them directly to a PutDatabaseRecord processor to load them into the table.

For the PutDatabaseRecord "Record Reader" property I use a ParquetReader. This reader has very little to configure: its only property is "Avro Read Compatibility", which I have set to true.

Because of the similarities and differences between the .parquet columns and the table columns described above, I have set the following PutDatabaseRecord properties:

"Translate Field Names" -> true
"Unmatched Field Behavior" -> Ignore Unmatched Fields
"Unmatched Column Behavior" -> Ignore Unmatched Columns

 

With all this I believe PutDatabaseRecord is configured correctly, but when it runs I get the following error:
"Session could not be processed due to Failed to process StandardFlowFileRecord due to java.lang.NullPointerException: Name is null."

Which "Name" is NiFi referring to? As far as I know, no column in the .parquet files is called "Name", and even if one were, I would expect my configuration to handle it. What am I doing wrong?
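To make the intent of the three PutDatabaseRecord settings listed above concrete, here is a minimal Python sketch of the matching behaviour I expect from them. It is not NiFi's code: the function names, the sample field and column names, and the normalization rule (upper-casing and dropping underscores) are assumptions made for illustration only.

```python
def normalize(name, translate):
    """Assumed rule: ignore case and underscores when translation is on."""
    return name.upper().replace("_", "") if translate else name


def match_fields(record_fields, table_columns, translate=True):
    """Return (field -> column mapping, unmatched fields, unmatched columns)."""
    # Index the table columns by their normalized form.
    columns_by_key = {normalize(c, translate): c for c in table_columns}

    mapping = {}
    unmatched_fields = []
    for field in record_fields:
        column = columns_by_key.get(normalize(field, translate))
        if column is None:
            # "Ignore Unmatched Fields": the field is skipped, not an error.
            unmatched_fields.append(field)
        else:
            mapping[field] = column

    # "Ignore Unmatched Columns": columns with no field are left out.
    matched = set(mapping.values())
    unmatched_columns = [c for c in table_columns if c not in matched]
    return mapping, unmatched_fields, unmatched_columns


if __name__ == "__main__":
    # Hypothetical names, chosen only to cover the three cases in the post.
    fields = ["ID", "customer_name", "name"]
    columns = ["ID", "CUSTOMER_NAME", "CREATED_AT"]
    mapping, extra_fields, extra_columns = match_fields(fields, columns)
    print(mapping)
    print(extra_fields)
    print(extra_columns)
```

Under these assumptions a .parquet field called "name" with no matching table column would simply be skipped, which is why I do not expect a column name to be the cause of the error.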

 

My only guess is that the ListAzureDataLakeStorage + FetchAzureDataLakeStorage combination does not deliver the data in a form that PutDatabaseRecord understands. I suspect it does not do what FetchParquet does, which appears to read a .parquet file and write its contents to the FlowFile as records. However, I cannot use FetchParquet because it can only be used with ListHDFS or ListFile.

1 ACCEPTED SOLUTION

In the end, the closest match to my error was this issue: https://issues.apache.org/jira/browse/NIFI-7817. It appears to be a bug in the creation of the ParquetReader. That makes sense, because it would affect any processor that uses a ParquetReader, and in my case the FlowFiles never even entered the processor that used it.

I was using NiFi 1.12.1. After switching to version 1.13.2, the "Name is null" error no longer occurs and the FlowFiles now enter the processor. The download page (https://nifi.apache.org/download.html) links to the Release Notes and the Migration Guidance, which describe what has been fixed since earlier versions and which processors need care when migrating.

I hope this helps someone.

However, even though the data now reaches the processor, I still get an error. It is a different one, so I will raise it in a separate post.
