I have several datasets that together can be used to build a hierarchy, and in a typical RDMBS we would be able to use a recursive query or more proprietary method (CONNECT_BY) to build the hierarchy. Unfortunately the datasets are so huge that performance is terrible and it would be much better served in a Hadoop environment.
Seems that most of the Apache stack does not yet support recursion on WITH statements, so is there a more programmatic method of building a hierarchy that people typically use in this situation?
You can build a helper/temporary table to store the hierarchy for you as suggested here: https://www.pythian.com/blog/recursion-in-hive/
Your data would end up looking like this:
STATE1 STATE2 LEVEL
------ ------ -----
S1 S1 0
S2 S2 0
S2 S1 1
S3 S3 0
S3 S1 1
S4 S4 0
S4 S2 1
S4 S1 2
There are a couple of way that you could solve this problem.
1) Explode and duplicate the data. Hive is big data, it's NoSQL. You don't have to solve this problem in a SQL way. You could explode the data and see if you get a performance increase. (Don't forget to choose good partitions). This may feel wrong but really when your using big data the rules change, you don't have to solve this with SQL.
2) lateral view explode - This may give you some of the table structure you need.
3) Look at using JSON with Hive to help solve your problems. This would enable you to have the nested hierarchical data you are talking about.
4) Combination of some of the above.
Hope this helps.
I would try to do this hierarchical problem in a different way. We have array datatype in Hive. Lets make use of it rather than having a temp table where we could store the hierarchy data. Perform a group by operation, then capture and convert the hierarchy field which you want into array data type. If you need to select any specific level like 1st or 2nd then split the array field based on your need. Hope it helps!!