I have driving data and I am trying to find in which state the vehicle is driving. Later I would drill down to find the county and zip code they are driving through as well. I am using HIVE to do the operations and have written a code which does what I want but is very slow. I have vehicle ping information approximately 9500000 rows each file and I need to associate each ping to a state (96 records), county and zip (50000 rows).
To optimize, I have tried several things like mapjoin and other techniques, and still the query takes a lot of time to complete. In join when I use 'ON', HIVE throws an error so I use it in WHERE clause because of which it acts like a cross join. Is there a better way to do this? Any input to optimize or a different approach will be helpful. Thanks.
The code in use:
hive> CREATE TABLE driving_state > STORED AS ORC > AS SELECT /*+ MAPJOIN(m) */ a.vehicle_DeviceId, a.VIN, a.Longitude, a.Latitude, a.Time_stamp, m.country_code, m.country_name, m.state_name, m.postal, m.type_en FROM vehicle_ping a LEFT OUTER JOIN northamerica_state m where ST_Contains(ST_GeomFromGeoJSON(m.geometry), ST_Point(a.longitude, a.latitude)) ;
I am using Esri spatial framework. The where condition ST_GeomFromGeoJson and ST_Point are from functions from ESRI framework.
Is there a way I can index geometry polygon and lat / long ?
While the answer is technically "Yes", you can create indexes in Hive, I would go with tables stored as ORC and not recommend indexes in Hive for your case. ORC has built-in Indexes which allow the format to skip blocks of data during read, they also support Bloom filters. Together this pretty much replicates what Hive Indexes do.
Anyhow, to respond your original question and assuming that you decide to ignore the use of ORC, Latitude and Longitude data type is double and you seem to store geometry data as json because Hive does not support an actual binary geometry like Oracle, SQL Server etc. The index on double data type is probably ok. On Json, I don't think so. I I would create another field of bigint that is associated with an unique geometry. You may have to be creative about generating a number based on json content. Json does not make much sense for indexing, neither text.
Here is an example of Compact Index:
create index YourTableIndex on table YourTable (portfolio_id) as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' with deferred rebuild;
Let me know if I addressed your question, if so don't forget to vote/accept answer.