Hello, I am currently trying to associate each lat/long ping from a device with its ZIP code.
I have de-normalized the lat/long device ping data and created a cross-product (Cartesian) join table in which each row has ST_Point(long, lat), the geometry shape of the ZIP, and the ZIP code associated with that geometry. For testing purposes I have around 45 million rows in the table, and in production it will grow to about 1 billion rows every day.
Even though the data is flattened and there are no join conditions, the query takes about 2 hours to complete. Is there a faster way to compute spatial queries, or how can I optimize the following query?
Inline are some of the optimization steps I have already performed. With these optimizations, every other operation completes in at most 5 minutes except this one step. I am using an AWS cluster with 2 master nodes and 5 data nodes.
```sql
set hive.vectorized.execution.enabled = true;
set hive.execution.engine=tez;
set hive.enforce.sorting=true;
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;

analyze table tele_us_zipmatch compute statistics for columns;

CREATE TABLE zipcheck (
  `long4` double,
  `lat4` double,
  state_name string,
  country_code string,
  country_name string,
  region string,
  zip int,
  countyname string)
PARTITIONED BY (state_id string)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "SNAPPY",
               'orc.create.index'='true',
               'orc.bloom.filter.columns'='');

INSERT OVERWRITE TABLE zipcheck PARTITION(state_id)
select long4, lat4, state_name, country_code, country_name, region,
       zip, countyname, state_id
from tele_us_zipmatch
where ST_Contains(wkt_shape, zip_point) = TRUE;
```
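One technique that often helps here is pruning candidates with a cheap bounding-box comparison before the expensive point-in-polygon test, so ST_Contains only runs on rows that could plausibly match. This is a sketch, not your actual schema: it assumes bounding-box columns (the `bbox_*` names below are hypothetical) were materialized per ZIP polygon when building tele_us_zipmatch.

```sql
-- Hypothetical sketch: prune with a cheap bounding-box filter first.
-- Assumes bbox_xmin/bbox_xmax/bbox_ymin/bbox_ymax were precomputed
-- per ZIP polygon when tele_us_zipmatch was built.
INSERT OVERWRITE TABLE zipcheck PARTITION(state_id)
select long4, lat4, state_name, country_code, country_name, region,
       zip, countyname, state_id
from tele_us_zipmatch
where long4 BETWEEN bbox_xmin AND bbox_xmax   -- cheap filter, prunes most rows
  and lat4  BETWEEN bbox_ymin AND bbox_ymax
  and ST_Contains(wkt_shape, zip_point);      -- exact test only on survivors
```

Since most of the 45M rows pair a point with a far-away ZIP polygon, the numeric range predicates should discard them before any geometry parsing happens.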
ST_Contains is the function from esri (ref: https://github.com/Esri/spatial-framework-for-hadoop/wiki/UDF-Documentation#relationship-tests ).
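Beyond per-row filtering, the bigger win is usually to avoid materializing the full cross product at all and instead join on a coarse spatial grid key, so each point is only compared against polygons in the same grid cell. A minimal sketch, assuming separate point and polygon tables and precomputed grid columns (the table and column names `pings`, `zip_shapes`, `grid_long`, `grid_lat` are hypothetical; a polygon spanning several cells would need one row per cell it overlaps):

```sql
-- Hypothetical sketch: bucket points and ZIP polygons into ~1-degree
-- grid cells so the join is an equality join on the cell key instead
-- of a full Cartesian product.
select p.long4, p.lat4, z.zip
from pings p
join zip_shapes z
  on floor(p.long4) = z.grid_long      -- coarse spatial bucketing
 and floor(p.lat4)  = z.grid_lat
where ST_Contains(z.wkt_shape, ST_Point(p.long4, p.lat4));
```

An equality join lets Hive/Tez shuffle and distribute the work instead of scanning every point-polygon pair, which matters a lot at 1 billion rows per day.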
Any help is greatly appreciated.
From a platform standpoint, you can look at memory and CPU utilization. How much memory do you have on each node? Out of that, how much have you allocated to YARN? (Try allocating more memory per node.)
I would then vary the minimum and maximum container sizes you've allocated to see whether that has an impact on performance.
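For concreteness, the knobs involved look roughly like this (the values below are placeholders, not recommendations; tune them to the memory actually available per node after OS overhead):

```sql
-- Per-session Tez sizing (illustrative values only):
set hive.tez.container.size=4096;    -- MB per Tez container
set hive.tez.java.opts=-Xmx3277m;    -- JVM heap, roughly 80% of container size

-- Cluster-wide YARN limits live in yarn-site.xml, not in a Hive session:
--   yarn.nodemanager.resource.memory-mb   -- total MB handed to YARN per node
--   yarn.scheduler.minimum-allocation-mb  -- minimum container size
--   yarn.scheduler.maximum-allocation-mb  -- maximum container size
```

The Tez container size must fit within the YARN min/max allocation range, otherwise container requests will be rounded or rejected.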