Created 08-07-2016 07:57 PM
We frequently run Cartesian products with geospatial functions (e.g. ST_Intersects) in the WHERE clause of our Hive queries. What are the best approaches for tuning those queries for response time and concurrency?
Created 08-08-2016 10:00 AM
Gopal and I gave a couple of tips in here to increase the parallelism (Hive is normally not tuned for Cartesian joins and creates too few mappers).
Apart from that, my second point still holds: you should add some pre-filtering to reduce the number of points you need to compare. There are a ton of different ways to do this:
https://en.wikipedia.org/wiki/Spatial_database#Spatial_index
For example, you can put the points into a grid whose cells are sized so that a point in one cell cannot be within your max distance of any point in a non-neighboring cell. Then you only need to compare points that fall in the same or adjacent cells.
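To make the grid idea concrete, here is a minimal sketch in plain Python rather than Hive. It assumes planar (x, y) coordinates, Euclidean distance, and a cell edge equal to the max distance; the names `cell_of` and `pairs_within` are made up for illustration and are not part of any Hive or ESRI API.

```python
import math
from collections import defaultdict
from itertools import product


def cell_of(x, y, cell_size):
    """Map a point to the integer (column, row) of its grid cell."""
    return (math.floor(x / cell_size), math.floor(y / cell_size))


def pairs_within(left, right, max_dist):
    """Return sorted (i, j) pairs where left[i] and right[j] are within max_dist.

    Assumption: the cell edge equals max_dist, so any match for a point
    must lie in the 3x3 block of cells centred on that point's own cell.
    """
    # Bucket the right-hand points by grid cell.
    grid = defaultdict(list)
    for j, (x, y) in enumerate(right):
        grid[cell_of(x, y, max_dist)].append(j)

    result = []
    for i, (x, y) in enumerate(left):
        cx, cy = cell_of(x, y, max_dist)
        # Pre-filter: look only at the point's cell and its 8 neighbours.
        for dx, dy in product((-1, 0, 1), repeat=2):
            for j in grid.get((cx + dx, cy + dy), ()):
                rx, ry = right[j]
                # Exact check, run only on the surviving candidates.
                if math.hypot(x - rx, y - ry) <= max_dist:
                    result.append((i, j))
    return sorted(result)
```

In a Hive query the same idea would presumably become a precomputed cell-id column on both tables and an equi-join on that column (with the neighbor cells expanded on one side), so the expensive geospatial function only runs on candidates that share a cell key. That mapping is a suggestion, not something tested here.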