Member since: 03-16-2016
Posts: 707
Kudos Received: 1753
Solutions: 203

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 5126 | 09-21-2018 09:54 PM |
| | 6492 | 03-31-2018 03:59 AM |
| | 1968 | 03-31-2018 03:55 AM |
| | 2176 | 03-31-2018 03:31 AM |
| | 4821 | 03-27-2018 03:46 PM |
08-10-2016
03:34 PM
@Frank Lu Confluent is a different platform. That would mean the customer migrates from the Kafka included in HDP to Confluent. Confluent's version of Kafka is their own and not necessarily the same version the open source community is using.
08-10-2016
01:23 PM
4 Kudos
@Frank Lu Apache Kafka monitoring still needs a lot of work, and there is no single tool that covers all community requirements. I have used a combination of Burrow and Ambari Metrics with the most recent versions of HDP (2.4.2+).

Burrow is a monitoring tool for keeping track of consumer lag in Apache Kafka. It is designed to monitor every consumer group that is committing offsets to either Kafka or ZooKeeper, and to monitor every topic and partition consumed by those groups. This provides a comprehensive view of consumer status. Burrow also provides several HTTP endpoints for getting information about the Kafka cluster and consumers, separate from the lag status. This can be very useful for building applications that assist with managing your Kafka clusters when it is not convenient (or possible) to run a Java Kafka client.

Please check these articles for starters:
https://community.hortonworks.com/articles/28103/monitoring-kafka-with-burrow.html
https://community.hortonworks.com/articles/36725/kafka-monitoring-per-topic-and-per-broker.html
08-09-2016
02:32 PM
1 Kudo
@sivakumar sudhakarannair girijakumari This is a great finding. I did not realize that setting tez.grouping.min-size at the session level will not override the global value; it should. This may be a bug, and a rare one, because nobody would normally override tez.grouping.max-size at the session level with a value so low that it falls below the tez.grouping.min-size set at the global level. It is a small issue and it can be worked around as I described in my response below.
08-09-2016
02:29 PM
5 Kudos
@Mahipal Ramidi Ideally, keep the Tez global settings as they are and set tez.grouping.max-size at the session level to a value that makes sense for the query you execute, always higher than the tez.grouping.min-size set globally. If your global tez.grouping.min-size is not low enough to allow you to set your session tez.grouping.max-size above it, you may want to change the global tez.grouping.min-size to a lower value to satisfy the condition. Low values of min and max create a lot of small tasks, and each task has a container allocated to it. A lot of parallel tasks will get the work done, but they could also consume all the resources of the cluster, so this approach always needs a careful analysis of how many tasks are created and how many resources are used. Anyhow, mappers will chunk the input data into sizes between min and max, and most likely there will be no impact on other jobs that require larger chunks. Your query does not seem to have a very large data volume, but it requires a lot of parallelism to complete faster. A sketch of the session-level approach follows below.
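A minimal sketch, assuming the global defaults stay untouched and only the session overrides the grouping sizes; the property values and the query are illustrative placeholders, not recommendations:

```sql
-- Session-level overrides (hypothetical values); the global defaults remain unchanged.
-- tez.grouping.min-size must stay below tez.grouping.max-size.
SET tez.grouping.min-size=16777216;    -- 16 MB
SET tez.grouping.max-size=134217728;   -- 128 MB

-- The query that needs the extra parallelism; table and column names are placeholders.
SELECT some_column, COUNT(*)
FROM some_table
GROUP BY some_column;
```

Smaller values produce more (smaller) splits and therefore more parallel tasks; check the resulting task count against the cluster capacity before lowering them further.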
08-08-2016
09:10 PM
5 Kudos
@Mahipal Ramidi I assume you mean functions that are missing or that have a different behavior, as documented here: https://github.com/Esri/spatial-framework-for-hadoop/wiki/ST_Geometry-in-Hive-versus-SQL. Their number is not that high. I think it is more pragmatic to contribute to the ESRI open source project (https://github.com/Esri/spatial-framework-for-hadoop), and maybe others will contribute too, so that everybody wins. Another option is to write the few missing functions yourself, or to convince ESRI to add them. Adding another framework into the mix can complicate your implementation. You could also keep a small SQL Server database where these functions are available, pre-process the data before bringing it to Hive, and denormalize your tables to add the columns needed for the missing functions.
08-08-2016
09:04 PM
4 Kudos
@Mahipal Ramidi Best is to write a Java UDF for Hive; however, you can actually write WIDTH_BUCKET in SQL if you know your number of buckets and can assume it is static for your histogram. Here is an example (the column reference "whatever" and the hiveconf variables are placeholders):

SELECT whatever,
       CASE
         WHEN (whatever) >= ${hiveconf:mymin} AND (whatever) <= ${hiveconf:mymax} THEN
           CASE
             WHEN floor((whatever) / ((${hiveconf:mymax} - ${hiveconf:mymin}) / ${hiveconf:mybuckets}) + 1) > ${hiveconf:mybuckets}
               THEN floor((whatever) / ((${hiveconf:mymax} - ${hiveconf:mymin}) / ${hiveconf:mybuckets}))
             ELSE floor((whatever) / ((${hiveconf:mymax} - ${hiveconf:mymin}) / ${hiveconf:mybuckets}) + 1)
           END
         ELSE ${hiveconf:mybuckets} + 1
       END AS whateverlabel
FROM (whatever table or sql)
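A quick usage sketch, assuming the hiveconf variables above are supplied at the session level (mymin, mymax, and mybuckets are just the placeholder names used in the example):

```sql
-- Illustrative values; Hive variable substitution resolves ${hiveconf:...} at query time.
SET mymin=0;
SET mymax=1000;
SET mybuckets=10;
```

The same variables can also be passed on the command line with --hiveconf when running the script.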
08-08-2016
08:59 PM
4 Kudos
@Mahipal Ramidi As Sindhu suggested, you can write your own UDF, specifically leveraging a Java math library. NTILE divides an ordered data set into a number of buckets and assigns the appropriate bucket number to each row; it can be used to divide rows into equal sets and number each row. WIDTH_BUCKET is not far off from NTILE, but here we actually supply the range of values (start and end values); it takes that range and splits it into N groups. You can actually write WIDTH_BUCKET in SQL if you know your number of buckets and can assume it is static for your histogram. Here is an example (the column reference "whatever" and the hiveconf variables are placeholders):

SELECT whatever,
       CASE
         WHEN (whatever) >= ${hiveconf:mymin} AND (whatever) <= ${hiveconf:mymax} THEN
           CASE
             WHEN floor((whatever) / ((${hiveconf:mymax} - ${hiveconf:mymin}) / ${hiveconf:mybuckets}) + 1) > ${hiveconf:mybuckets}
               THEN floor((whatever) / ((${hiveconf:mymax} - ${hiveconf:mymin}) / ${hiveconf:mybuckets}))
             ELSE floor((whatever) / ((${hiveconf:mymax} - ${hiveconf:mymin}) / ${hiveconf:mybuckets}) + 1)
           END
         ELSE ${hiveconf:mybuckets} + 1
       END AS whateverlabel
FROM (whatever table or sql)

Even if it is not exactly WIDTH_BUCKET, https://developers.google.com/chart/interactive/docs/gallery/histogram provides a bucketing JavaScript function useful for histograms; check the histograms section. Another good resource: https://developers.google.com/api-client-library/java/apis/analytics/v3. I believe there was a good Java library that had WIDTH_BUCKET among other analytical functions, but look at the Google resources mentioned above first. Most likely, you could leverage those and add your own custom UDFs.
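For comparison, a minimal sketch of NTILE, which Hive already supports as a windowing function; the table and column names are placeholders:

```sql
-- Assigns each row to one of 10 equal-sized buckets based on the ordering of the column.
SELECT some_column,
       NTILE(10) OVER (ORDER BY some_column) AS ntile_bucket
FROM some_table;
```

Unlike the WIDTH_BUCKET emulation above, the bucket boundaries here come from the row distribution, not from a fixed min/max range.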
08-08-2016
08:29 PM
5 Kudos
@Mahipal Ramidi Actually, ST_Transform is supported, but it has a slightly different behavior when used in Hive. In traditional implementations like Netezza, Oracle, or SQL Server, ST_Transform converts two-dimensional ST_Geometry data into the spatial reference specified by the spatial reference ID (SRID). The SRID parameter is not supported in Hive. As such, I suggest you pre-process the data and denormalize your table structure to account for your SRID. This is a good approach if you have a limited number of SRIDs to support. If the number is high, then you may need to write a custom UDF and use it in Hive; that is, if you need to implement ST_Transform with an SRID in SQL. There are other options if the amount of geometry subject to conversion is small and it is only a matter of how it is reflected in the UI: you may consider implementing a JavaScript function or a REST web service. Overall, it is a matter of good design; a sketch of the denormalization approach follows below. Check this reference for functions that behave differently in Hive, and more: https://github.com/Esri/spatial-framework-for-hadoop/wiki/ST_Geometry-in-Hive-versus-SQL
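A minimal sketch of the denormalization idea, assuming the transformed geometry is produced during pre-processing (outside Hive) and stored as WKT text; the table and column names are hypothetical, and ST_GeomFromText is assumed to be registered from the Esri spatial framework:

```sql
-- Hypothetical table holding the original geometry plus a pre-transformed copy per target SRID.
CREATE TABLE parcels (
  parcel_id       STRING,
  shape_wkt       STRING,   -- original geometry as WKT
  shape_wkt_3857  STRING    -- same geometry pre-transformed to SRID 3857 during ingest
);

-- Queries then build geometries from the column that already matches the needed spatial reference.
SELECT parcel_id, ST_GeomFromText(shape_wkt_3857) FROM parcels;
```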
08-08-2016
08:20 PM
4 Kudos
@Mahipal Ramidi Taking into account how the ST_GeomFromText and ST_GeomFromJson functions are used to convert text or JSON to geometry, and how the geometry is then used in various functions, they provide the same functionality. If the value, as text or JSON, had to be parsed under some special conditions, then JSON would have been a better choice for processing, but that is not a known use case. As such, used as explained above, text makes more sense and even takes less space (no structure included). If a case arises where the above assumption does not hold, add another column to your table, shape_json, and convert the text to JSON, or apply a text-to-JSON function for that specific scenario if performance is not impacted. If performance is impacted, denormalize by adding the shape_json column, as sketched below.
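A minimal sketch of the denormalization fallback, assuming the geometry is stored as WKT text and the Esri spatial framework's UDFs are registered; ST_AsGeoJson is an assumption used for illustration, so verify it is available in your version before relying on it:

```sql
-- Build a denormalized copy that also carries the JSON form, so the conversion happens once.
-- shapes, shape_id, and shape_wkt are hypothetical names; ST_AsGeoJson availability should be verified.
CREATE TABLE shapes_denorm AS
SELECT shape_id,
       shape_wkt,
       ST_AsGeoJson(ST_GeomFromText(shape_wkt)) AS shape_json
FROM shapes;
```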
08-06-2016
07:15 PM
@sivakumar sudhakarannair girijakumari Could you add an excerpt from the log, please?