Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

How to optimize scan of 1 huge file / table in Hive to confirm/check if lat long point is contained in a wkt geometry shape.

Highlighted

How to optimize scan of 1 huge file / table in Hive to confirm/check if lat long point is contained in a wkt geometry shape.

New Contributor

Hello, I am currently trying to associate each lat long ping from a device to its ZIP code.

I have de-normalized lat long device ping data and created a cross-product/ Cartesian product join table in which each row has the ST_Point(long,lat), geometry_shape_of_ZIP and associated zip code for that geometry. for testing purpose I have around 45 million rows in the table and it'll increase in production to about 1 billion every day.

Even though the data is flattened and no join conditions, the query takes about 2 hours to complete. Is there any faster way to compute spatial queries? Or how can I optimize the following query.

Inline is some of the optimizations steps I have already performed. Using the optimizations all the other operations gets done in max 5 minutes except for this one step. I am using aws cluster 2 mater nodes and 5 data nodes.

set hive.vectorized.execution.enabled = true;
set hive.execution.engine=tez;
set hive.enforce.sorting=true;
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
analyze table tele_us_zipmatch compute statistics for
columns;
CREATE TABLE zipcheck  (
  `long4` double,
  `lat4` double,
  state_name string, 
  country_code string,
  country_name string,
  region string, 
  zip int, 
  countyname string)  PARTITIONED by (state_id string)
STORED AS ORC TBLPROPERTIES ("orc.compress" =
"SNAPPY",
'orc.create.index'='true',
'orc.bloom.filter.columns'='');
INSERT OVERWRITE TABLE zipcheck PARTITION(state_id)
select long4, lat4, state_name, country_code, country_name,
region, zip, countyname, state_id  from
tele_us_zipmatch 
where ST_Contains(wkt_shape,zip_point)=TRUE;

ST_Contains is the function from esri (ref: https://github.com/Esri/spatial-framework-for-hadoop/wiki/UDF-Documentation#relationship-tests ).

Any help is greatly appreciated.

Thanks.

1 REPLY 1

Re: How to optimize scan of 1 huge file / table in Hive to confirm/check if lat long point is contained in a wkt geometry shape.

Expert Contributor

From a platform standpoint you can look at memory and cpu utilization. How much memory do you have on each node? Out of that, how much have you allocated to YARN? (try allocating more memory per node)

I then would look into varying the min and max container size you've allocated to see if there's impact on performance

Don't have an account?
Coming from Hortonworks? Activate your account here