Member since: 03-16-2016
Posts: 707
Kudos Received: 1753
Solutions: 203
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 4997 | 09-21-2018 09:54 PM
 | 6329 | 03-31-2018 03:59 AM
 | 1932 | 03-31-2018 03:55 AM
 | 2142 | 03-31-2018 03:31 AM
 | 4733 | 03-27-2018 03:46 PM
05-18-2016
10:09 PM
This post should not be categorized as an "article". An article should present, in a structured manner, the why, what, how, benefits, conclusions, etc. of a topic. This is a support issue, at most a question: "What to do when the ODBC driver fails to install on OS X 10.11?" @dtraver please reclassify it as a question, for clarity's sake and in fairness to what an article should be.
05-18-2016
09:12 PM
@bsaini Additionally, the RHive framework delivers the libraries and algorithms of R to data stored in Hadoop by extending Hive's SQL-like query language (HiveQL) with R-specific functions. Through the RHive functions, you can use HiveQL to apply R statistical models to data in your Hadoop cluster that you have catalogued using Hive. Regarding RHadoop, there is more than rmr2. There is also the rhdfs package, which provides an R language API for file management over HDFS stores. Using rhdfs, users can read from HDFS stores into an R data frame (matrix), and similarly write data from these R matrices back into HDFS storage. The rhbase package likewise provides an R language API, but its purpose is database management for HBase stores, rather than HDFS files.
05-18-2016
07:17 PM
4 Kudos
@mhendricksT Try this: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.2/bk_hdfs_admin_tools/content/ch07.html
05-18-2016
07:11 PM
2 Kudos
@sankar rao In the most recent versions of Ambari, SmartSense can be installed like any other service. If you don't have an Ambari version capable of doing so, you could upgrade Ambari and then install SmartSense, or install SmartSense manually. Instructions here: https://docs.hortonworks.com/HDPDocuments/SS1/SmartSense-1.1.0/bk_smartsense_admin/content/ch01s02s01.html
05-18-2016
07:08 PM
2 Kudos
This is a very loaded question. Many things can change and impact performance. You can suddenly have more data that meets the criteria for that job's processing, some other process may be using more resources, or your network bandwidth can be impacted by another process or by dropped packets. You can also have a setting that, due to changed conditions, is now insufficient, e.g. JVM heap, file descriptors, etc. You must analyze the data provided by your monitoring tools and identify what seems to be the bottleneck: CPU, I/O, network, RAM, etc. Check your Ganglia and Nagios monitoring and see what they report. In the most recent versions of HDP, you could benefit from SmartSense, which you may have based on your subscription level. This is an awesome tool designed by engineers who know the pain of administering clusters and dealing with misconfiguration and performance issues. As a short-term approach, you should troubleshoot the current issues, address as much as you can, and then plan for an HDP upgrade to use SmartSense. Read more about SmartSense here: http://hortonworks.com/services/smartsense/
05-18-2016
06:54 PM
4 Kudos
@Dominika B The bug where the Hortonworks Hive ODBC Driver fails to install on OS X 10.11 (El Capitan) has been tracked in HWX Engineering's Jira since November 2015: https://hortonworks.jira.com/browse/BUG-47990. The issue was escalated to Simba. They have released an updated ODBC driver, version 2.1.x, which is not yet on their website; their latest on the website is 2.0.3. The 2.1.x version provided to Hortonworks and certified with HDP 2.3.2 addresses the inability to install properly on OS X El Capitan. If you are a Hortonworker, you can download the driver (2.1.2.1002) at: https://simba.app.box.com/s/2e1m5sjimcrm0m5gsnirhpxuduw0aoo7. Setup instructions: https://hortonworks.com/wp-content/uploads/2016/03/Hortonworks-Hive-ODBC-Driver-User-Guide.pdf (page 33) and http://cdn.simba.com/products/Hive/doc/Simba_Hive_ODBC_InstallGuide.pdf
05-18-2016
06:33 PM
Simba has released an updated ODBC driver, version 2.1.0. Link: https://simba.box.com/s/sq9zyyogft7ee42i18geiyevx371txrd This fix addresses critical Windows problems as well as the inability to install properly on OS X El Capitan. Unfortunately, it does not address CentOS 7 compatibility.
05-18-2016
06:06 PM
2 Kudos
@Shannon Wright HDP 2.4.2 is about to be launched. Your best chance of getting a precise answer is to post this question on the Trifacta support site: https://www.trifacta.com/support/. It is common practice for certifications to happen in the following one or two quarters. HDP 2.4.2 is a maintenance release, so the certification may come sooner than it would for a major release.
05-17-2016
10:59 PM
16 Kudos
This article is a continuation of Monitoring Kafka with Burrow - Part 1. Before diving into evaluation rules, the HTTP endpoint API, and notifiers, I would like to point out a few other tools that utilize Burrow.
- Burrower (http://github.com/splee/burrower): a tool for gathering consumer lag information from Burrow and sending it into InfluxDB.
- ansible-burrow (https://github.com/slb350/ansible-burrow): an Ansible role for installing Burrow.

Consumer Lag Evaluation Status

The status of a consumer group in Burrow is determined by evaluating several rules against the offsets for each partition the group consumes. Thus, there is no need to set a discrete threshold for the number of messages a consumer is allowed to fall behind before alerts go off. By evaluating every partition the group consumes, the health of the entire consumer group is assessed, not just the topics that are being monitored. This is very important for wildcard consumers, such as Kafka Mirror Maker.

Window

The lagcheck configuration determines the length of the sliding window, specifying the number of offsets to store for each partition that a consumer group consumes. This window moves forward with each offset the consumer commits (the oldest offset is removed when the new offset is added). For each consumer offset, the following are stored: the offset itself, the timestamp at which the consumer committed it, and the lag at the point Burrow received it. The lag is calculated as the difference between the head offset of the broker and the consumer's offset. Because broker offsets are fetched on a fixed interval, the result could be a negative number; in that case, by convention, the
stored lag value is zero.

Rules

The following rules are used to evaluate a group's status for a given partition:

- If any lag within the window is zero, the status is considered to be OK.
- If the consumer offset does not change over the window, and the lag is either fixed or increasing, the consumer is in an ERROR state, and the partition is marked as STALLED.
- If the consumer offsets are increasing over the window, but the lag either stays the same or increases between every pair of offsets, the consumer is in a WARNING state. This means that the consumer is slow and is falling behind.
- If the difference between the time now and the time of the most recent offset is greater than the difference between the most recent offset and the oldest offset in the window, the consumer is in an ERROR state and the partition is marked as STOPPED. However, if the consumer offset and the current broker offset for the partition are equal, the partition is not considered to be in error.
- If the lag is -1, this is a special value that means we do not have a broker offset yet for that partition. This only happens when Burrow is starting up, and the status is considered to be OK.

HTTP Endpoint API

The HTTP server in Burrow provides a convenient way to interact
with both Burrow and the Kafka and Zookeeper clusters. Requests are simple HTTP calls, and all responses are formatted as JSON. For bad requests, Burrow will return an appropriate HTTP status code in the 400 or 500 range; the response body will contain a JSON object with more detail on the particular error encountered. Examples of requests:

Request | URL Path | Description
---|---|---
Healthcheck | GET /burrow/admin | Healthcheck of Burrow, whether for monitoring or load balancing within a VIP.
List Clusters | GET /v2/kafka or GET /v2/zookeeper | List of the Kafka clusters that Burrow is configured with.
Kafka Cluster Detail | GET /v2/kafka/(cluster) | Detailed information about a single cluster, specified in the URL. This will include a list of the brokers and zookeepers that Burrow is aware of.
List Consumers | GET /v2/kafka/(cluster)/consumer | List of the consumer groups that Burrow is aware of from offset commits in the specified Kafka cluster.
Remove Consumer Group | DELETE /v2/kafka/(cluster)/consumer/(group) | Removes the offsets for a single consumer group from a cluster. This is useful in the case where the topic list for a consumer has changed, and Burrow believes the consumer is consuming topics that it no longer is. The consumer group will be removed, but it will automatically be repopulated if the consumer continues to commit offsets.
List Consumer Topics | GET /v2/kafka/(cluster)/consumer/(group)/topic | List of the topics that Burrow is aware of from offset commits consumed by the specified consumer group in the specified Kafka cluster.
Consumer Topic Detail | GET /v2/kafka/(cluster)/consumer/(group)/topic/(topic) | Most recent offsets for each partition in the specified topic, as committed by the specified consumer group.
Consumer Group Status | GET /v2/kafka/(cluster)/consumer/(group)/status or GET /v2/kafka/(cluster)/consumer/(group)/lag | Current status of the consumer group, based on evaluation of all partitions it consumes. The evaluation is performed on request, and the result is calculated based on the consumer lag evaluation rules. The "/status" endpoint returns an object that includes only the partitions in a bad state; the "/lag" endpoint returns an object that includes all partitions for the consumer, regardless of their evaluated state, and can be used for full reporting of consumer message lag on all partitions.
List Cluster Topics | GET /v2/kafka/(cluster)/topic | List of the topics in the specified Kafka cluster.
Cluster Topic Detail | GET /v2/kafka/(cluster)/topic/(topic) | Head offsets for each partition in the specified topic, as retrieved from the brokers. Note that these offsets may be up to the number of seconds old specified by the broker-offsets configuration parameter.

Notifiers

Two notifier modules are available to check and report consumer group status: email and HTTP.

Email Notifier

The email notifier is used
to send out emails to a specified address whenever a consumer group is in a bad state. Multiple groups can be configured for a single email address, and the interval on which to check the status (and send out emails) is configurable per email address. Before configuring any email notifiers, the [smtp] section needs to be configured in the Burrow configuration file. Example configuration:

[smtp]
server=mailserver.example.com
port=25
auth-type=plain
username=emailuser
password=s3cur3!
from=burrow-noreply@example.com
template=config/default-email.tmpl

Multiple email notifiers can be configured in the Burrow configuration file. Each notifier configuration resides in its own section. Example configuration:

[email "bofh@example.com"]
group=local,critical-consumer-group
group=local,other-consumer-group
interval=60

The email that is sent is formatted according to the template specified in the [smtp] configuration section. A default template is provided as part of the Burrow distribution in the config/default-email.tmpl file. The template format is the
standard Golang
text template. There are several good posts available on how to
compose Golang templates:
http://andlabs.lostsig.com/blog/2014/05/26/8/the-go-templates-post
http://jan.newmarch.name/go/template/chapter-template.html
http://golangtutorials.blogspot.com/2011/06/go-templates.html

A timer is set up inside Burrow to fire every `interval` seconds and check the listed consumer groups. The current status is requested for each group, and if any group in the list is not in an OK state, an email is sent out with the status of all groups. This means that the email can contain listings for both good and bad groups, but no email will be sent if everything is OK.

HTTP Notifier

The HTTP notifier reports error states for all consumer groups to an external HTTP endpoint via POST requests. DELETE requests can also be sent to the same endpoint when a consumer group returns to normal. The HTTP notifier is used to
send POST requests to an external endpoint, such as for a monitoring or
notification system, on a specified interval whenever a consumer group is in a
bad state. This notifier operates on all consumer groups in all clusters
(excluding groups matched by the blacklist). Incidents of a consumer group
going bad are assigned a generated unique ID that is maintained until the group transitions back to a good state. This allows notification systems to handle incidents, rather than individual reports of consumer group status, if needed.

The configuration for the HTTP notifier is specified under a [httpnotifier] heading. This is where the URL to connect to is configured, as well as the templates to use for POST and DELETE request bodies. Extra fields can also be specified; they are passed through to the templates. An example HTTP notifier configuration looks like this:

[httpnotifier]
url=http://notification.server.example.com:9000/v1/alert
interval=60
extra=field1=custom information
extra=field2=special info to pass to template
template-post=config/default-http-post.tmpl
template-delete=config/default-http-delete.tmpl
timeout=5
keepalive=30
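The incident lifecycle described above (a unique ID generated when a group goes bad, reused while it stays bad, and a DELETE emitted when the group recovers) can be sketched as follows. This is an illustrative Python model of the behavior, not Burrow's actual Go implementation; the IncidentTracker name and return values are invented for the example:

```python
import uuid

# Hypothetical sketch of the HTTP notifier's incident lifecycle: a unique ID
# is generated when a group first goes bad, reused while it stays bad, and
# retired (with a DELETE) once the group returns to OK.
class IncidentTracker:
    def __init__(self):
        self.open_incidents = {}  # (cluster, group) -> incident id

    def evaluate(self, cluster, group, status):
        """Return the request the notifier would send, or None."""
        key = (cluster, group)
        if status != "OK":
            # Reuse the existing incident ID, or open a new incident.
            incident_id = self.open_incidents.setdefault(key, uuid.uuid4().hex)
            return ("POST", incident_id)
        if key in self.open_incidents:
            # Group recovered: close the incident and emit a DELETE.
            return ("DELETE", self.open_incidents.pop(key))
        return None  # healthy and no open incident: nothing to send
```

With this model, consecutive bad evaluations for the same group reuse the same incident ID, so a downstream system can correlate the POSTs into a single incident rather than treating each report separately.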
The request body sent with each HTTP request is formatted according to the templates specified. Default templates are provided as part of the Burrow distribution in the config/default-http-post.tmpl and config/default-http-delete.tmpl files. The template format is the standard Golang text template. There are several good posts available on how to compose Golang templates:
http://andlabs.lostsig.com/blog/2014/05/26/8/the-go-templates-post
http://jan.newmarch.name/go/template/chapter-template.html
http://golangtutorials.blogspot.com/2011/06/go-templates.html

A timer is set up inside Burrow to fire every `interval` seconds. When the timer fires, all
consumer groups in all Kafka clusters are enumerated and the current status is
requested for each group. For each group that is not in an OK state, a unique
ID is generated (if it does not already exist) and a POST request is generated
for that group. For each group that is in an OK state, a check is performed as
to whether or not an ID exists for that group currently. If it does, the ID is
removed (as the group has transitioned to OK). If the DELETE template is
specified, a DELETE request is generated for that group.

Conclusion

The most important metric to watch is whether the consumer is keeping up with the messages being produced. Before Burrow, the fundamental approach was to monitor the consumer lag and alert on that number. Burrow monitors the consumer lag and keeps track of the health of the consuming application automatically, for every consumer and every partition it consumes. It does this by consuming the special internal Kafka topic to which consumer offsets are written. Burrow provides consumer information as a centralized service that is separate from any single consumer, based on the offsets the consumers are committing and the brokers' state.
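To make the lag evaluation rules described earlier concrete, here is a simplified sketch in Python. It is illustrative only: Burrow itself is written in Go and its real evaluation handles more edge cases; the function name, window format, and status strings here are invented for the example:

```python
# Simplified sketch of the per-partition evaluation rules described above.
# `window` is the sliding window of committed offsets, oldest first, as
# (offset, timestamp, lag) tuples; `broker_offset` is the partition head.
def evaluate_partition(window, broker_offset, now):
    offsets = [o for o, _, _ in window]
    lags = [l for _, _, l in window]

    # Rule: lag of -1 means no broker offset yet (Burrow is starting up).
    if any(l == -1 for l in lags):
        return "OK"
    # Rule: if any lag within the window is zero, the group is keeping up.
    if any(l == 0 for l in lags):
        return "OK"
    # Rule: commits have stopped for longer than the window covers, and the
    # consumer has not already caught up to the broker head offset.
    newest_ts, oldest_ts = window[-1][1], window[0][1]
    if now - newest_ts > newest_ts - oldest_ts and offsets[-1] != broker_offset:
        return "ERROR (STOPPED)"
    lag_never_drops = all(b >= a for a, b in zip(lags, lags[1:]))
    # Rule: offsets never move while lag is fixed or increasing.
    if offsets[0] == offsets[-1] and lag_never_drops:
        return "ERROR (STALLED)"
    # Rule: offsets advance but lag never shrinks: slow, falling behind.
    if offsets[-1] > offsets[0] and lag_never_drops:
        return "WARNING"
    return "OK"
```

For example, a window whose offsets never change while lag grows evaluates to the stalled error state, whereas a window containing even one zero-lag commit evaluates to OK regardless of the other samples.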
05-16-2016
01:44 PM
2 Kudos
@Kirk Haslbeck The interval data type is not supported in Hive yet. See https://issues.apache.org/jira/browse/HIVE-5021. Until the HIVE-5021 feature is added, I would use two BIGINT fields in the Hive target table: startInterval and endInterval. Queries using these two fields in WHERE clauses would perform better, since integer columns are more amenable to indexing and fast scans. For bit[n] in HAWQ, I would use a char, varchar, or string data type in Hive, depending on how big the string needs to be.