
Joining tables within NiFi

Contributor

Hi,

I am pulling data from tables residing in different data sources. For example, one ExecuteSQL processor queries a Teradata instance and another ExecuteSQL processor queries a table in a DB2 instance. After asking the NiFi forum, I learned that a SQL JOIN feature within NiFi is not available yet, but I wanted to ask if there is any workaround: could I merge the results of these tables on a common field and then perform additional actions on the merged data? I know it would be easier to have all the tables in a single data source and then run a SELECT query with a JOIN directly, but that can't be done in my particular use case.

Thanks


8 REPLIES

Master Guru

There are tools that let you join disjoint tables, such as SAP HANA Vora and Microsoft PolyBase.

What features of the join do you need? If you just want one really wide row, here are some options:

1. NiFi: one ExecuteSQL for Teradata and one for DB2 (use the same number of fields and alias them with the same names), then MergeContent: https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.MergeContent/i...

2. SparkSQL: see https://community.hortonworks.com/repos/29883/sparksql-data-federation-demo.html (a sketch of this approach follows the list).

3. You could join them in Java code with two JDBC connections and some funky coding (a sketch of this also follows the list).

4. Ingest the Teradata data into HDFS with that ExecuteSQL query, ingest the DB2 data into HDFS with another ExecuteSQL query, then write a Hive query on top of both and run it with a third ExecuteSQL processor. The advantage is that you now have a data lake; most people just load those legacy data sources into a Hadoop data lake for analytics and write their queries against that.

5. If you want to do this closer to real time, push those two data sources into an in-memory datastore such as Redis, Geode, or Ignite, and then query that.

6. Send both datasets to Kafka and let a Spark application combine them.

7. Push them all to HBase and query with Phoenix.

8. Try Apache Drill: https://drill.apache.org/docs/using-sql-functions-clauses-and-joins/
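To make options 2 and 3 concrete, here are two rough sketches in Java. The connection URLs, credentials, and table/column names (orders, invoices, order_id) are hypothetical placeholders, so treat these as starting points rather than drop-in solutions.

Option 2, SparkSQL data federation: load each source as a JDBC DataFrame and let Spark do the join.

// Federated join with SparkSQL: both tables are loaded over JDBC and joined
// on the common column. Teradata and DB2 driver jars must be on the classpath.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class FederatedJoin {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("FederatedJoin").getOrCreate();

        // Hypothetical Teradata source.
        Dataset<Row> orders = spark.read().format("jdbc")
                .option("url", "jdbc:teradata://td-host/DATABASE=sales")
                .option("dbtable", "orders")
                .option("user", "user").option("password", "pass")
                .load();

        // Hypothetical DB2 source.
        Dataset<Row> invoices = spark.read().format("jdbc")
                .option("url", "jdbc:db2://db2-host:50000/FINANCE")
                .option("dbtable", "invoices")
                .option("user", "user").option("password", "pass")
                .load();

        // One wide row per match on the common column; write it where NiFi can pick it up.
        orders.join(invoices, "order_id").write().parquet("hdfs:///joined/orders_invoices");

        spark.stop();
    }
}

Option 3, the hand-rolled JDBC version: build a hash map from one source and probe it with rows from the other. This only works when the smaller result set fits in memory.

// Hash join over two plain JDBC connections: read the Teradata rows into a map,
// then stream the DB2 rows past it and emit matches (inner-join semantics).
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.HashMap;
import java.util.Map;

public class TwoSourceJoin {
    public static void main(String[] args) throws SQLException {
        Map<String, String> ordersById = new HashMap<>();

        // Build side: key each Teradata row by the common column.
        try (Connection td = DriverManager.getConnection(
                     "jdbc:teradata://td-host/DATABASE=sales", "user", "pass");
             Statement st = td.createStatement();
             ResultSet rs = st.executeQuery("SELECT order_id, customer FROM orders")) {
            while (rs.next()) {
                ordersById.put(rs.getString("order_id"), rs.getString("customer"));
            }
        }

        // Probe side: for each DB2 row, emit the joined row if the key matches.
        try (Connection db2 = DriverManager.getConnection(
                     "jdbc:db2://db2-host:50000/FINANCE", "user", "pass");
             Statement st = db2.createStatement();
             ResultSet rs = st.executeQuery("SELECT order_id, amount FROM invoices")) {
            while (rs.next()) {
                String customer = ordersById.get(rs.getString("order_id"));
                if (customer != null) {
                    System.out.println(rs.getString("order_id") + "," + customer
                            + "," + rs.getString("amount"));
                }
            }
        }
    }
}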

Contributor

@Timothy Spann I have one common column in both tables and would prefer to join them into one wide row. After that, I have certain CASE statements that I need to run on those rows, which should be possible using SplitText and UpdateAttribute's advanced rule settings. My preference was to get this done within NiFi without using any external service.

Regarding your first point: do you mean to give the common columns the same aliases in the SELECT statement within ExecuteSQL, and would MergeContent then be able to identify those common names within the Avro flowfiles? Also, the two tables wouldn't necessarily have the same number of fields, and I am not sure why that is necessary here; if you could expand on that, it would be great.

Finally, I don't have much knowledge of Drill, but can Drill be triggered by NiFi to perform the join and return the result? Or will I have to send the data, run the JOIN independently in Drill, and then have it send the result back to NiFi? I am asking because I want a continuous flow and don't want to stop in the middle to run another service.

Master Guru

There is no SQL join available across two different SQL sources within NiFi, so we have to merge the resulting files together and then use routing to keep only the ones that match.

https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.MergeContent/i...

See these parameters:

Attribute Strategy (default: Keep Only Common Attributes)
  • Keep Only Common Attributes
  • Keep All Unique Attributes
Determines which FlowFile attributes should be added to the bundle. If 'Keep All Unique Attributes' is selected, any attribute on any FlowFile that gets bundled will be kept unless its value conflicts with the value from another FlowFile. If 'Keep Only Common Attributes' is selected, only the attributes that exist on all FlowFiles in the bundle, with the same value, will be preserved.

Correlation Attribute Name
If specified, like FlowFiles will be binned together, where 'like FlowFiles' means FlowFiles that have the same value for this Attribute. If not specified, FlowFiles are bundled by the order in which they are pulled from the queue. Supports Expression Language: true

The Correlation Attribute Name could be used for joining.

Really, though, this needs to be pushed down to the source, or you should ingest both data sources into HDFS and then run the SQL query there.
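As a sketch, the merge-based flow could look something like this (the processor chain and the customer_id key are illustrative; the exact steps depend on your data):

ExecuteSQL (Teradata) -> ConvertAvroToJSON -> SplitJson -> EvaluateJsonPath
ExecuteSQL (DB2)      -> ConvertAvroToJSON -> SplitJson -> EvaluateJsonPath
    (both EvaluateJsonPath processors write the join key, e.g. customer_id, to a flowfile attribute)
Both branches then feed one MergeContent configured with:
    Merge Strategy: Bin-Packing Algorithm
    Correlation Attribute Name: customer_id
    Minimum Number of Entries: 2
    Attribute Strategy: Keep All Unique Attributes
    Max Bin Age: set, so unmatched rows eventually flush

Rows sharing a customer_id value land in the same bin, which approximates an inner join; a RouteOnAttribute on the merge.count attribute afterwards can discard rows that never found a partner.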

Contributor

Well, it seems there's no workable workaround for the joins; I would have to use Spark, or push the data to HDFS or some other datastore and query it there. I should also look into whether I can just trigger Drill and run it as a service. Thanks for the help, though.

Master Guru

You could try Presto, which should let you hook up to multiple databases. Then you can set up a DBCPConnectionPool in NiFi to connect to Presto and issue the query from there. Apache Calcite may also have this functionality, but I'm not sure.
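For illustration, once Presto has catalogs configured for both systems, a single query can join across them; ExecuteSQL pointed at a Presto DBCPConnectionPool would issue the same SQL. In the sketch below, the catalog names (teradata, db2) and the schema/table/column names are all hypothetical:

// Cross-catalog join through Presto's JDBC driver. Assumes catalogs named
// "teradata" and "db2" are defined in Presto's etc/catalog directory with
// suitable connectors; adjust names to your configuration.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Properties;

public class PrestoFederatedQuery {
    public static void main(String[] args) throws SQLException {
        // An unsecured Presto cluster only needs a user name.
        Properties props = new Properties();
        props.setProperty("user", "nifi");

        String sql = "SELECT o.order_id, o.customer, i.amount "
                   + "FROM teradata.sales.orders o "
                   + "JOIN db2.finance.invoices i ON o.order_id = i.order_id";

        try (Connection conn = DriverManager.getConnection(
                     "jdbc:presto://presto-host:8080", props);
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(sql)) {
            while (rs.next()) {
                System.out.println(rs.getString("order_id") + ","
                        + rs.getString("customer") + "," + rs.getString("amount"));
            }
        }
    }
}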

Contributor

Are there any plans to add the ability to JOIN different datasets within NiFi in future releases? Because that would be swell.

Rising Star

Hi All,

I am also looking to join data from two flowfiles based on one common column. Reading the comments above, I learned that SQL joins for this kind of scenario are not available. Can anyone suggest whether NiFi provides some way to do this that I may not be aware of?

Thanks in advance!!

Explorer

Hey Raghav, can you please share your template file for Apache Drill connectivity with the ExecuteSQL NiFi processor? We are doing the same thing but are facing an issue somewhere, and I just want to cross-check against your configuration.