Created on 03-25-2016 07:01 PM - edited 08-17-2019 12:57 PM
March 23, 2016
Eran Orgad
Geographical distance is the distance measured along the surface of the earth.
The distances between points are defined by geographical coordinates (Geo Location cords) in terms of latitude and longitude.
Each pair represent a location as:
Lexington MA:
Latitude : 42.4428, Longitude: -71.2317
Mountain View CA:
Latitude: 37.405990600586, Longitude: -122.07851409912
Distance: 2676.09225228497 Miles
To be able to calculate the distance on Hortonworks HDP using Hive (and Tez) let’s generate some sample data that includes some pairs of Geolocation coordinates.
Data file called: distance1.csv
42.28,-71.87,42.28,-71.86
42.00,-71.87,42.28,-71.11
42.28,-71.87,42.28,-71.86
42.28,-71.87,42.28,-71.22
42.00,-71.87,42.28,-72.33
42.28,-71.87,42.28,-70.44
42.00,-71.87,42.28,-71.55
42.28,-71.87,41.28,-71.66
42.00,-71.87,43.28,-71.77
42.28,-71.87,44.28,-71.88
42.00,-71.87,45.28,-71.99
42.28,-71.87,46.11,-71.00
42.00,-71.87,47.22,-71.00
42.4428,-71.2317,37.405990600586,-122.07851409912
….
We can create a schema to be able to read the content in Hive
Here is the table I created:
hive> show create table distancecalc;
OK
CREATE EXTERNAL TABLE distancecalc(
src_lat double,
src_long double,
dest_lat double,
dest_long double)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://sandbox.hortonworks.com:8020/tmp/dist2'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='false',
'numFiles'='1',
'numRows'='-1',
'rawDataSize'='-1',
'totalSize'='338',
'transient_lastDdlTime'='1458866694')
After placing my distance1.csv in the /tmp/dist2/ directory I can query the content in hive:
The following query will produce a calculation of distance with every given pair of geo location coordinates:
select src_lat, src_long, dest_lat, dest_long,
60*1.1515*(180*(acos(((sin(radians(src_lat))*sin(radians(dest_lat))) + (cos(radians(src_lat))*cos(radians(dest_lat))*cos(radians(src_long-dest_long))))))/PI()) as distancecalc
from distancecalc;
screen-shot-2016-03-25-at-122643-pm.png
Last entry is the distance between Lexington MA and Mountain View CA.
Java code to calculate the distance:
/*
* Class to calculate geo-location distance
* eorgad - Hortonworks.com 9/30/2014
*/
package geodistance;
import java.sql.SQLException;
public class calcGeoDistance {
public static double distance(double lat1, double lon1, double lat2, double lon2, String string) {
double theta = lon1 - lon2;
double dist = Math.sin(deg2rad(lat1)) * Math.sin(deg2rad(lat2)) + Math.cos(deg2rad(lat1)) * Math.cos(deg2rad(lat2)) * Math.cos(deg2rad(theta));
dist = Math.acos(dist);
dist = rad2deg(dist);
dist = dist * 60 * 1.1515;
if (string == "K") {
dist = dist * 1.609344;
} else if (string == "N") {
dist = dist * 0.8684;
}
return (dist);
}
/*###################################################################*/
/*# This function converts decimal degrees to radians #*/
/*###################################################################*/
privatestaticdouble deg2rad(double deg) {
return (deg * Math.PI / 180.0);
}
/*###################################################################*/
/*# This function converts radians to decimal degrees #*/
/*###################################################################*/
privatestaticdouble rad2deg(double rad) {
return (rad * 180 / Math.PI);
}
publicstaticvoid main(String[] args) throws SQLException {
System.out.println(distance(42.4428, -71.2317, 37.405990600586, -122.07851409912, "M") + " Miles\n");
// lexington to mountain view
// 42.4428 -71.2317 37.405990600586 -122.07851409912
}
}
Created on 04-01-2016 06:08 PM
@eorgadn You should wrap the geoDistance functions as hive UDF’s it will be a lot friendlier for most people that will want to use it in hive.