We frequently perform Cartesian products involving geospatial functions (e.g. ST_Intersects) in the WHERE clause of our Hive queries. What are the best approaches for tuning those queries for response time and concurrency?

1 ACCEPTED SOLUTION


Gopal and I gave a couple of tips in this thread on increasing parallelism (Hive is not normally tuned for Cartesian joins and creates too few mappers):

https://community.hortonworks.com/questions/44749/hive-query-running-on-tez-contains-a-mapper-that-h...

Apart from that, my second point still holds: you should add some pre-filtering to reduce the number of points you need to compare. There are many different ways to do this:

https://en.wikipedia.org/wiki/Spatial_database#Spatial_index

For example, you can assign points to grid cells and size the cells so that a point in one cell cannot be closer than your maximum distance to any point in a non-neighboring cell. You then only need to compare points that fall in the same or neighboring cells.
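To make the grid idea concrete, here is a minimal sketch in Python rather than Hive. It assumes planar (x, y) coordinates and a plain Euclidean distance test, and the names `cell_of` and `pairs_within` are made up for this illustration. The cell side is set to the maximum distance, so any matching pair must sit in the same cell or in one of the eight surrounding cells.

```python
from collections import defaultdict
from itertools import product
from math import floor, hypot


def cell_of(x, y, cell_size):
    """Return the integer grid cell that contains the point (x, y)."""
    return (floor(x / cell_size), floor(y / cell_size))


def pairs_within(left, right, max_dist):
    """Return sorted (i, j) pairs where left[i] and right[j] are within max_dist.

    Assumption: points are planar (x, y) tuples and max_dist > 0.
    """
    # Bucket the right-hand points by grid cell. The cell side equals
    # max_dist, so two points within max_dist differ by at most one cell
    # along each axis.
    buckets = defaultdict(list)
    for j, (x, y) in enumerate(right):
        buckets[cell_of(x, y, max_dist)].append(j)

    matches = []
    for i, (x, y) in enumerate(left):
        cx, cy = cell_of(x, y, max_dist)
        # Only the 3 x 3 block of cells around the point can hold a match.
        for dx, dy in product((-1, 0, 1), repeat=2):
            for j in buckets.get((cx + dx, cy + dy), ()):
                rx, ry = right[j]
                # The exact (expensive) test runs only on pre-filtered candidates.
                if hypot(x - rx, y - ry) <= max_dist:
                    matches.append((i, j))
    return sorted(matches)
```

In a Hive query the same idea would probably take the form of a cell key computed per row and used as an equality join condition, with the exact geospatial predicate applied only to the rows that survive that join. Treat that as a direction to try, not a tested recipe.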
