Member since 07-24-2016
7 Posts
9 Kudos Received
0 Solutions
08-08-2016
09:10 PM
5 Kudos
@Mahipal Ramidi I assume you mean functions that are missing or behave differently, as documented here: https://github.com/Esri/spatial-framework-for-hadoop/wiki/ST_Geometry-in-Hive-versus-SQL. Their number is not that high. I think it is more pragmatic to contribute to the ESRI open source project (https://github.com/Esri/spatial-framework-for-hadoop); others may then contribute too, so everybody wins. Another option is to write the few missing functions yourself, or to convince ESRI to add them. Adding another framework into the mix can complicate your implementation. You could also keep a small SQL Server database where these functions are available, pre-process the data before bringing it into Hive, and denormalize your tables to add the columns needed for the missing functions. A sketch of that denormalization approach is below.
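To make the pre-processing option concrete, here is a minimal sketch, assuming a hypothetical column transformed_wkt computed in SQL Server (where the missing function exists) and exported alongside the raw data; all table and column names are illustrative:

    -- External table over data exported from SQL Server; the missing
    -- function's result was precomputed there as an extra (denormalized) column.
    CREATE EXTERNAL TABLE points_enriched (
      id BIGINT,
      shape_wkt STRING,       -- original geometry as WKT
      transformed_wkt STRING  -- precomputed in SQL Server before the load
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/data/points_enriched';

    -- Hive queries then read the precomputed column instead of calling the missing function.
    SELECT id, transformed_wkt FROM points_enriched;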
08-08-2016
09:04 PM
4 Kudos
@Mahipal Ramidi Best is to write a Java UDF for Hive; however, you can actually write WIDTH_BUCKET in SQL if you know your number of buckets and can assume it is static for your histogram. Here is an example:

    SELECT whatever,
           CASE WHEN whatever >= ${hiveconf:mymin} AND whatever <= ${hiveconf:mymax}
                THEN CASE WHEN floor((whatever - ${hiveconf:mymin}) / ((${hiveconf:mymax} - ${hiveconf:mymin}) / ${hiveconf:mybuckets}) + 1) > ${hiveconf:mybuckets}
                          THEN floor((whatever - ${hiveconf:mymin}) / ((${hiveconf:mymax} - ${hiveconf:mymin}) / ${hiveconf:mybuckets}))
                          ELSE floor((whatever - ${hiveconf:mymin}) / ((${hiveconf:mymax} - ${hiveconf:mymin}) / ${hiveconf:mybuckets}) + 1)
                     END
                ELSE ${hiveconf:mybuckets} + 1
           END AS whateverlabel
    FROM (whatever table or SQL)
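For completeness, here is one way the hiveconf variables above might be supplied before running the query; the values are illustrative:

    -- Histogram parameters for the session; substituted textually at parse time.
    SET hiveconf:mymin=0;
    SET hiveconf:mymax=1000;
    SET hiveconf:mybuckets=10;

They can equally be passed on the command line, e.g. hive -hiveconf mymin=0 -hiveconf mymax=1000 -hiveconf mybuckets=10.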
08-08-2016
10:00 AM
1 Kudo
Gopal and I gave a couple of tips here on increasing the parallelism (Hive is normally not tuned for cartesian joins and creates too few mappers): https://community.hortonworks.com/questions/44749/hive-query-running-on-tez-contains-a-mapper-that-h.html#comment-45388 Apart from that, my second point still holds: you should do some pre-filtering to reduce the number of points you need to compare. There are a ton of different ways to do this: https://en.wikipedia.org/wiki/Spatial_database#Spatial_index For example, you can put points into grid cells and make sure that a point in one cell cannot be closer than your maximum distance to any point in a non-neighboring cell; a sketch follows.
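A rough HiveQL sketch of that grid idea, assuming point tables points_a and points_b with numeric x/y columns and a maximum distance in ${hiveconf:maxdist}; all names are illustrative, and the grid is one-dimensional (on x) to keep the equi-join simple:

    -- Bucket both sides on a grid cell whose width equals the max distance.
    -- Each left-side point is replicated into its own cell and both neighbors,
    -- so any pair within maxdist is guaranteed to share a join key.
    SELECT a.id AS a_id,
           b.id AS b_id,
           sqrt(pow(a.x - b.x, 2) + pow(a.y - b.y, 2)) AS dist
    FROM (
      SELECT id, x, y, floor(x / ${hiveconf:maxdist}) + off AS cell
      FROM points_a
      LATERAL VIEW explode(array(-1, 0, 1)) offsets AS off
    ) a
    JOIN (
      SELECT id, x, y, floor(x / ${hiveconf:maxdist}) AS cell
      FROM points_b
    ) b
    ON a.cell = b.cell
    WHERE sqrt(pow(a.x - b.x, 2) + pow(a.y - b.y, 2)) <= ${hiveconf:maxdist};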
08-08-2016
08:59 PM
4 Kudos
@Mahipal Ramidi As Sindhu suggested, you can write your own UDF, specifically leveraging the Java math library. NTILE divides an ordered data set into a number of buckets and assigns the appropriate bucket number to each row; it can be used to split rows into equal sets and number each row. WIDTH_BUCKET is not far off from NTILE, but here you actually supply the range of values (start and end), and it splits that range into N groups. You can actually write WIDTH_BUCKET in SQL if you know your number of buckets and can assume it is static for your histogram. Here is an example:

    SELECT whatever,
           CASE WHEN whatever >= ${hiveconf:mymin} AND whatever <= ${hiveconf:mymax}
                THEN CASE WHEN floor((whatever - ${hiveconf:mymin}) / ((${hiveconf:mymax} - ${hiveconf:mymin}) / ${hiveconf:mybuckets}) + 1) > ${hiveconf:mybuckets}
                          THEN floor((whatever - ${hiveconf:mymin}) / ((${hiveconf:mymax} - ${hiveconf:mymin}) / ${hiveconf:mybuckets}))
                          ELSE floor((whatever - ${hiveconf:mymin}) / ((${hiveconf:mymax} - ${hiveconf:mymin}) / ${hiveconf:mybuckets}) + 1)
                     END
                ELSE ${hiveconf:mybuckets} + 1
           END AS whateverlabel
    FROM (whatever table or SQL)

Even though it is not exactly WIDTH_BUCKET, https://developers.google.com/chart/interactive/docs/gallery/histogram provides a bucketing JavaScript function useful for histograms; check the histograms section. Another good resource: https://developers.google.com/api-client-library/java/apis/analytics/v3 I believe I once knew a good Java library that had WIDTH_BUCKET among other analytical functions, but look at the Google resources mentioned above. Most likely you could leverage those and add your own custom UDFs.
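As a quick illustration of the NTILE behavior described above (table and column names are illustrative):

    -- Splits the ordered rows into 4 equally sized groups and tags each row
    -- with its group number (1..4).
    SELECT id,
           val,
           NTILE(4) OVER (ORDER BY val) AS quartile
    FROM my_table;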
08-17-2016
01:33 PM
@nizar saddiki In other databases (other than Hive), ST_Transform converts two-dimensional ST_Geometry data into the spatial reference specified by the spatial reference ID (SRID). The SRID parameter is not supported in Hive. As such, you need to pre-process the data in another system before uploading it to Hive. Usually that leads to denormalization: you would add a new column for each SRID. However, if there are way too many, it is probably better to write your own ST_Transform service or function. I wish I could give better news. Check this article: https://community.hortonworks.com/articles/44319/geo-spatial-queries-with-hive-using-esri-geometry.html It shows how to add the jar and create the functions, as well as how to use them, and it covers some limitations.
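For reference, the registration step covered by the article looks roughly like this; the jar paths are illustrative, while the class names come from the Esri spatial-framework-for-hadoop project:

    -- Make the Esri jars available to the session.
    ADD JAR /tmp/esri/esri-geometry-api.jar;
    ADD JAR /tmp/esri/spatial-sdk-hive.jar;

    -- Register the UDFs under the usual ST_* names.
    CREATE TEMPORARY FUNCTION ST_Point AS 'com.esri.hadoop.hive.ST_Point';
    CREATE TEMPORARY FUNCTION ST_Contains AS 'com.esri.hadoop.hive.ST_Contains';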
08-08-2016
08:20 PM
4 Kudos
@Mahipal Ramidi Taking into account how the ST_GeomFromText and ST_GeomFromJson functions are used, to convert text or JSON to geometry which is then fed into the various other functions, they provide the same functionality. If the text or JSON value had to be parsed under some special conditions, then JSON would be the better choice for processing, but that is not a known use case. As such, used as explained above, text makes more sense overall and takes even less space (no structure included). If a case arises where the above assumption does not hold, add another column to your table, shape_json, and convert the text to JSON, or apply a text-to-JSON function for the specific scenario if performance is not impacted. If performance is impacted, denormalize by adding the shape_json column. A sketch of the text path is below.
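To make the text path concrete, a minimal sketch, assuming the Esri functions are registered as in the earlier answer; table and column names are illustrative:

    -- shape_wkt is stored as a plain STRING (WKT) and only converted to a
    -- geometry where a spatial function actually needs it.
    CREATE TEMPORARY FUNCTION ST_GeomFromText AS 'com.esri.hadoop.hive.ST_GeomFromText';
    CREATE TEMPORARY FUNCTION ST_Area AS 'com.esri.hadoop.hive.ST_Area';

    SELECT id, ST_Area(ST_GeomFromText(shape_wkt)) AS area
    FROM parcels;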
09-19-2016
01:36 PM
@All I also have a similar issue. Can you please help me with it? http://stackoverflow.com/questions/39547001/why-hive-staging-file-is-missing-in-aws-emr