
Joining Data-Files Using NiFi (Like Spark RDD)

Hello,

We are trying to use NiFi to ingest 3 data files and then join them based on certain values.

This would be very straightforward with a Spark RDD/DataFrame; I am wondering whether I can do this using NiFi as well.


Master Guru

@Briejsh Jaggi

Join them in what way? Binary concatenation? Tar? Something else? This is likely something that can easily be done using NiFi's MergeContent processor.

Sorry, I meant like a SQL join, as you would do with a Spark RDD.

Hello Mat,

Just confirming whether joining is an option in NiFi, or whether we should go with Spark for this use case.

@Briejsh Jaggi

Yes, NiFi can join/merge files based on values.

What specifically is the use case?

We are getting 3 sets of files, and we want to run a number of business validations before ingesting or processing them further. Those validations require SQL joins between the 3 files.

We know how to solve this with Spark SQL, but we are not sure how to build the solution with NiFi.
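To make the requirement concrete, here is a minimal sketch of this kind of SQL-join validation across three files. It is not Spark SQL and not a NiFi flow: it uses Python's built-in `sqlite3` module as a stand-in SQL engine, and the file contents, the table names (`customers`, `orders`, `payments`), and the validation rules are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical sample contents standing in for the three incoming files.
CUSTOMERS_CSV = "customer_id,name\n1,Ann\n2,Bob\n"
ORDERS_CSV = "order_id,customer_id,amount\n10,1,25.0\n11,2,40.0\n12,3,15.0\n"
PAYMENTS_CSV = "order_id,paid\n10,25.0\n11,30.0\n"


def load_csv(conn, table, text):
    """Load CSV text into a new table whose columns are the CSV header."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    cols = ", ".join(header)
    marks = ", ".join("?" for _ in header)
    conn.execute(f"CREATE TABLE {table} ({cols})")
    conn.executemany(f"INSERT INTO {table} VALUES ({marks})", data)


def find_violations(conn):
    """Return (order_id, reason) for orders that fail the hypothetical rules."""
    query = """
        SELECT o.order_id,
               CASE
                   WHEN c.customer_id IS NULL THEN 'unknown customer'
                   WHEN p.order_id IS NULL THEN 'no payment'
                   ELSE 'amount mismatch'
               END AS reason
        FROM orders o
        LEFT JOIN customers c ON c.customer_id = o.customer_id
        LEFT JOIN payments p ON p.order_id = o.order_id
        WHERE c.customer_id IS NULL
           OR p.order_id IS NULL
           OR CAST(o.amount AS REAL) <> CAST(p.paid AS REAL)
        ORDER BY o.order_id
    """
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
load_csv(conn, "customers", CUSTOMERS_CSV)
load_csv(conn, "orders", ORDERS_CSV)
load_csv(conn, "payments", PAYMENTS_CSV)
violations = find_violations(conn)
```

The point of the sketch is that all three inputs must be available together before the query can run, which is the part that does not map naturally onto a stream of independent flow files.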

Hello Wynner,

Just confirming whether joining is an option in NiFi, or whether we should go with Spark for this solution.

We are getting 3 sets of files, and we want to run a number of business validations before ingesting or processing them further. Those validations require SQL joins between the 3 files.

We know how to solve this with Spark SQL, but we are not sure how to build the solution with NiFi.

Super Guru

This is not currently possible inside NiFi (without scripting pretty much the entire capability), but with the Record Reader/Writer capabilities added in NiFi 1.2.0, a JoinRecord processor could be possible, as long as each incoming flow file had a schema associated with it. One tricky part with a data "flow" is knowing that you have three files, and they are the (only) three files you want. Usually a flow will have any number of files coming in at any time. In this case such a JoinRecord processor would have to be configurable to wait for N flow files and assume they can all be joined.
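The "wait for N flow files, then join" behaviour described above can be sketched in plain Python. This is only an illustration of the idea, not NiFi code: `JoinBuffer`, its `offer` method, the batch identifier, and the source names are all hypothetical, and the sketch assumes each source has at most one record per join key.

```python
class JoinBuffer:
    """Hold record sets per batch until every expected source has arrived."""

    def __init__(self, expected_sources, key):
        self.expected = set(expected_sources)
        self.key = key
        self.pending = {}  # batch_id -> {source name: list of records}

    def offer(self, batch_id, source, records):
        """Store one source's records.

        Returns the joined rows once the batch is complete, otherwise None.
        """
        batch = self.pending.setdefault(batch_id, {})
        batch[source] = records
        if set(batch) != self.expected:
            return None  # still waiting for at least one source
        return self._join(self.pending.pop(batch_id))

    def _join(self, by_source):
        """Inner-join all sources on self.key into wide rows."""
        # Index each source by join key (assumes keys are unique per source).
        indexes = [
            {record[self.key]: record for record in by_source[name]}
            for name in sorted(by_source)
        ]
        common_keys = set(indexes[0]).intersection(*indexes[1:])
        rows = []
        for k in sorted(common_keys):
            wide = {}
            for index in indexes:
                wide.update(index[k])
            rows.append(wide)
        return rows


buf = JoinBuffer({"a", "b", "c"}, key="id")
r1 = buf.offer("batch-1", "a", [{"id": 1, "x": "p"}, {"id": 2, "x": "q"}])
r2 = buf.offer("batch-1", "b", [{"id": 1, "y": "r"}])
r3 = buf.offer("batch-1", "c", [{"id": 1, "z": "s"}, {"id": 3, "z": "t"}])
```

Nothing is emitted until the third source arrives, so a real implementation would also need a timeout or expiry policy for batches that never complete.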

In future releases of NiFi, you should have more options as more LookupService implementations are added.

In the meantime, you might consider using Presto. You could set up a DBCPConnectionPool and a SQL processor (such as ExecuteSQL) inside NiFi to use the Presto JDBC driver, and then execute the JOIN(s) against files on the filesystem (using a LocalFileConnector, for example).

Thank you @Matt Burgess. Yes, we pretty much did what you suggested, with highly customized processors for this feature. We created 2 queues (FILE_READY and FILE_NOT_READY), and we only read from FILE_READY.

Hi @Matt Burgess, I am looking for the same functionality described above (i.e. joining files that contain records from three different tables on a common field to get a wide row). Please let us know whether newer versions of NiFi support processors for this activity.

Thanks Sri
