Member since: 02-09-2016
Posts: 559
Kudos Received: 421
Solutions: 98
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 555 | 03-02-2018 01:19 AM
 | 731 | 03-02-2018 01:04 AM
 | 676 | 08-02-2017 05:40 PM
 | 604 | 07-17-2017 05:35 PM
 | 457 | 07-10-2017 02:49 PM
11-01-2017
09:29 PM
1 Kudo
@Maryem Mary I see why you have some confusion; it was harder to find than it should be. This link tells you where to find the repositories: https://docs.hortonworks.com/HDPDocuments/HDF3/HDF-3.0.1.1/bk_installing-hdf/content/ch_using-local-repos.html. The relevant section is "Obtain Public Repositories from the HDF Release Notes". The direct link to the repositories is here: https://docs.hortonworks.com/HDPDocuments/HDF3/HDF-3.0.1.1/bk_release-notes/content/ch_hdf_relnotes.html#repo-location
... View more
10-27-2017
03:54 PM
@Joe P The HDP MPack is bundled with Ambari itself. I'm not aware of any direct links to the HDP MPacks.
... View more
10-25-2017
05:29 PM
1 Kudo
@Rahul Raj HDF is very flexible. It comes bundled with over 200 processors that allow you to pull and push data, do transformations, and so on. Some processors let you interact with web services via standard HTTP requests. On top of that, you can write scripts (Python, JavaScript, etc.) and run them from within HDF, and you can easily write your own processors to extend HDF. Sqoop is generally used to bulk-pull data from an RDBMS, with HDF then pulling the incremental changes. It depends on the volume of data and the use case: in some cases you can just use HDF; in others you might use Sqoop first, then HDF for the updates. Anything Flume can do, HDF can do better, so if you are considering Flume, HDF can handle it. HDF can land data directly into Kafka, HDFS, HBase, Hive, etc. So again, the exact flow depends on your use case and data volume, but HDF can land the data into HDP after pulling it from another source (DB, Cassandra, HTTP, Elasticsearch, Solr, etc.).
... View more
10-23-2017
02:38 PM
2 Kudos
@Rahul Raj To correct something in your question, Flume and Sqoop are part of HDP, not HDF. HDF lets you manage your data in motion; it is primarily a data flow tool, although it is capable of simple event processing. Think of HDF as a way to manage and route data between different systems. Those systems can be two different HDP clusters, an HDF and an HDP cluster, a REST API endpoint and an HDP cluster, and so on. With over 200 processors, HDF is flexible and allows you to manage data flows between any number of systems. HDF can pull data from web servers, RESTful APIs, and data warehouses (RDBMS), and it can handle streaming data or bulk pulls of data at rest from an HDFS filesystem.
... View more
10-20-2017
03:21 PM
1 Kudo
@Joe P Installation docs are available here: https://docs.hortonworks.com/HDPDocuments/HDF3/HDF-3.0.1.1/bk_planning-your-deployment/content/ch_deployment-scenarios.html. The order in which you install MPacks matters. You can install the HDF MPack on an HDP cluster; however, I don't believe installing an HDP MPack on an HDF cluster is supported.
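For reference, registering a management pack is done from the ambari-server CLI. A minimal sketch, assuming you have already downloaded the HDF MPack bundle (the path and filename below are placeholders):
# run on the Ambari server host; stop Ambari, register the MPack, then start it again
ambari-server stop
ambari-server install-mpack --mpack=/tmp/hdf-ambari-mpack-<version>.tar.gz --verbose
ambari-server start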
... View more
10-19-2017
01:57 PM
1 Kudo
@Eyad Garelnabi It seems like the storage devices are not mounted properly on the instance. Is this a one-off occurrence, or can you reproduce it? If you log into each of the nodes in the cluster with ssh, you can run df -h to see which devices are mounted.
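For example, a quick check on one node might look like this (the key name and IP are placeholders):
ssh -i <your-key> cloudbreak@<node-ip>
df -h    # show mounted filesystems and their mount points
lsblk    # list the block devices the instance sees, mounted or not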
... View more
10-18-2017
10:13 PM
@Tomas You can find more logging information by running cbd logs, and you can narrow the output to the Cloudbreak container with cbd logs cloudbreak (use Ctrl-C to break out of the log stream). That should show you more logging information. You can also run cbd ps to make sure all of your containers are running. What does the output of that look like?
... View more
10-18-2017
09:19 PM
13 Kudos
Objectives
Every day, Hortonworks customers take advantage of the flexibility and elasticity that cloud platforms provide. Many of these customers use Cloudbreak to manage their HDP clusters and to provide autoscaling capability.
Cloudbreak's autoscaling features are tied to Ambari Alerts. Ambari ships with a set of alerts out of the box, but you may want an autoscaling policy based on a condition Ambari doesn't cover by default. The good news is that Ambari supports creating custom alerts, and custom alerts created in Ambari are visible to Cloudbreak and usable with Cloudbreak autoscaling policies.
A common desire with autoscaling is to scale the cluster based on memory used, cores used, or perhaps the number of running applications. You can use the YARN ResourceManager JMX data to determine these values. For example, you may have a typical cluster with 5 NodeManagers. You also know that your cluster usage will sometimes spike and you want to add 3 more NodeManagers, but to keep costs down you don't want to run 8 NodeManagers all the time. You can create an alert based on the JMX data from the YARN ResourceManager, and Cloudbreak can then scale the cluster whenever that alert is triggered.
This tutorial will walk you through the process of creating a custom Ambari Alert for use by Cloudbreak autoscaling policies.
Prerequisites
You should have a properly running instance of Cloudbreak with credentials for your cloud provider of choice.
You should have an Ambari 2.5/HDP 2.6 cluster already deployed with Cloudbreak.
Scope
This tutorial was tested in the following environment:
Cloudbreak 1.16.4
AWS EC2
Ambari 2.5
HDP 2.6
Steps
Log in to Ambari
As mentioned in the prerequisites, you should already have a cluster built using Cloudbreak. Click on the cluster summary box in the Cloudbreak UI to display the cluster details. Now click on the link to your Ambari cluster. You may see something similar to this:
Your screen may vary depending on your browser of choice; I'm using Chrome. This warning appears because Cloudbreak uses self-signed certificates, which browsers do not trust by default. Click on the Advanced link. You should see something similar to this:
Click on the Proceed link to open the Ambari login screen. You should be able to login to Ambari using the default username and password of admin unless you changed it.
Once you have logged into Ambari, you should see something similar to this:
NOTE: Your specific cluster may look different.
View the YARN ResourceManager Dashboard
YARN is the central component used to manage resource availability on an HDP cluster. In Ambari, you can see a high-level summary of the resources available to YARN by clicking on the YARN link in the service list on the Ambari dashboard. You should see something similar to this:
If you take a look at the upper right corner, you can see a summary of containers, applications and cluster memory. For this tutorial, I would like Cloudbreak to autoscale my cluster when the number of pending applications is greater than 2. To do this, I'm going to create a custom Ambari Alert based on that value. To get that value, I need to look at the YARN ResourceManager JMX data.
View YARN ResourceManager JMX Data
You can view available JMX data for the YARN ResourceManager via the Ambari Quick Links. You should already have the YARN ResourceManager dashboard visible from the last step. Click on the Quick Links drop down menu in the top middle of the screen. You should see something similar to this:
As you can see, ResourceManager JMX is available in the list. If you click that link you will see something similar to this:
You should see a fairly large JSON output. If you search for q0=root, you should see something similar to this:
This is a list of YARN-related metrics associated with the root queue. If you look in the list of values, you should see AppsPending. This is the metric I want to use for my Ambari Alert.
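If you prefer the command line, you can also pull the same value straight from the ResourceManager's JMX servlet. A sketch, assuming the ResourceManager web UI is listening on its default port 8088 (replace the host placeholder):
# query just the root-queue metrics bean and pull out the AppsPending value
curl -s 'http://<resourcemanager-host>:8088/jmx?qry=Hadoop:service=ResourceManager,name=QueueMetrics,q0=root' | grep AppsPending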
Review existing Alert definitions
You can view the definition for any Ambari-provided alert.
To get a list of all alerts on the system, you make a call to the Ambari API:
curl -u admin:admin -i -k -H 'X-Requested-By:ambari' https://#.#.#.#/ambari/api/v1/clusters/autoscaling/alert_definitions/
You should see something similar to this:
HTTP/1.1 200 OK
Server: nginx
Date: Wed, 18 Oct 2017 17:15:21 GMT
Content-Type: text/plain
Content-Length: 21595
Connection: keep-alive
Vary: Accept-Encoding
X-Frame-Options: DENY
X-XSS-Protection: 1; mode=block
X-Content-Type-Options: nosniff
Cache-Control: no-store
Pragma: no-cache
Set-Cookie: AMBARISESSIONID=1gprc4wefyoiqmb1kj6plu95j;Path=/;HttpOnly
Expires: Thu, 01 Jan 1970 00:00:00 GMT
User: admin
Vary: Accept-Encoding, User-Agent
{
"href" : "http://#.#.#.#/api/v1/clusters/autoscaling/alert_definitions/",
"items" : [
{
"href" : "http://#.#.#.#/api/v1/clusters/autoscaling/alert_definitions/1",
"AlertDefinition" : {
"cluster_name" : "autoscaling",
"id" : 1,
"label" : "HBase Master Process",
"name" : "hbase_master_process"
}
},
...
NOTE: Your username and password may be different. You need to update the curl call to use your IP address for the Ambari server and your cluster name. In this example, my cluster name is autoscaling. Also notice the use of https for Cloudbreak clusters and the need for the -k flag.
As you can see, each alert is assigned a unique id. To view the configuration of a specific alert, you make a curl call to the href link with the alert id provided in the output.
To see the definition of Alert id 1, make the following curl call:
curl -u admin:admin -i -k -H 'X-Requested-By:ambari' https://#.#.#.#/ambari/api/v1/clusters/autoscaling/alert_definitions/1
NOTE: With Cloudbreak, Ambari is using HTTPS and is proxied so change http to https and /api to /ambari/api .
You should see something similar to this:
HTTP/1.1 200 OK
Server: nginx
Date: Wed, 18 Oct 2017 17:24:00 GMT
Content-Type: text/plain
Content-Length: 1156
Connection: keep-alive
Vary: Accept-Encoding
X-Frame-Options: DENY
X-XSS-Protection: 1; mode=block
X-Content-Type-Options: nosniff
Cache-Control: no-store
Pragma: no-cache
Set-Cookie: AMBARISESSIONID=bcdh6wmyxpnd1ioufen9hikva;Path=/;HttpOnly
Expires: Thu, 01 Jan 1970 00:00:00 GMT
User: admin
Vary: Accept-Encoding, User-Agent
{
"href" : "http://#.#.#.#/api/v1/clusters/autoscaling/alert_definitions/1",
"AlertDefinition" : {
"cluster_name" : "autoscaling",
"component_name" : "HBASE_MASTER",
"description" : "This alert is triggered if the HBase master processes cannot be confirmed to be up and listening on the network for the configured critical threshold, given in seconds.",
"enabled" : true,
"help_url" : null,
"id" : 1,
"ignore_host" : false,
"interval" : 1,
"label" : "HBase Master Process",
"name" : "hbase_master_process",
"repeat_tolerance" : 1,
"repeat_tolerance_enabled" : false,
"scope" : "ANY",
"service_name" : "HBASE",
"source" : {
"default_port" : 60000.0,
"reporting" : {
"ok" : {
"text" : "TCP OK - {0:.3f}s response on port {1}"
},
"warning" : {
"text" : "TCP OK - {0:.3f}s response on port {1}",
"value" : 1.5
},
"critical" : {
"text" : "Connection failed: {0} to {1}:{2}",
"value" : 5.0
}
},
"type" : "PORT",
"uri" : "{{hbase-site/hbase.master.port}}"
}
}
}
The alert definitions vary depending on the component. My advice is to look for an existing alert definition for the component you are interested in and use it as a base for your custom alert.
Create Custom Alert JSON file
To submit a custom alert to Ambari, we define the alert in a JSON file and upload it via the Ambari API. You can copy and paste the following alert definition into your alert file:
{
"AlertDefinition" : {
"cluster_name" : "autoscaling",
"component_name" : "RESOURCEMANAGER",
"description" : "This queue-level alert is triggered if the number of root queue pending applications exceeds 1.",
"enabled" : true,
"help_url" : null,
"ignore_host" : false,
"interval" : 5,
"label" : "[CUSTOM] ResourceManager Pending Applications",
"name" : "queue_pending_applications",
"repeat_tolerance" : 1,
"repeat_tolerance_enabled" : false,
"scope" : "ANY",
"service_name" : "YARN",
"source" : {
"jmx" : {
"property_list" : ["Hadoop:service=ResourceManager,name=QueueMetrics,q0=root/AppsPending"],
"value" : "{0}"
},
"reporting" : {
"ok" : {
"text" : "YARN Pending Applications: {0}"
},
"warning" : {
"text" : "YARN Pending Applications: {0}",
"value" : 2
},
"critical" : {
"text" : "YARN Pending Applications: {0}",
"value" : 3
},
"units" : "Applications"
},
"type" : "METRIC",
"uri" : {
"http" : "{{yarn-site/yarn.resourcemanager.webapp.address}}",
"https" : "{{yarn-site/yarn.resourcemanager.webapp.https.address}}",
"https_property" : "{{yarn-site/yarn.http.policy}}",
"https_property_value" : "HTTPS_ONLY",
"kerberos_keytab" : "{{yarn-site/yarn.resourcemanager.webapp.spnego-keytab-file}}",
"kerberos_principal" : "{{yarn-site/yarn.resourcemanager.webapp.spnego-principal}}",
"default_port" : 0.0,
"connection_timeout" : 5.0,
"high_availability" : {
"alias_key" : "{{yarn-site/yarn.resourcemanager.ha.rm-ids}}",
"http_pattern" : "{{yarn-site/yarn.resourcemanager.webapp.address.{{alias}}}}",
"https_pattern" : "{{yarn-site/yarn.resourcemanager.webapp.https.address.{{alias}}}}"
}
}
}
}
}
You will need to change the value of cluster_name to match the name of your cluster. The label and name values can be whatever you like, but they should be unique among the alerts in the system. The label is what will be displayed in Ambari; I like to prepend [CUSTOM] to custom alerts to make them easy to spot. Once you have made the appropriate changes, save the file as alert.json (or any other filename you like).
This alert, as defined, will raise a WARNING when the number of pending applications reaches 2 and a CRITICAL alert when it reaches 3.
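Before uploading, it is worth confirming that the file is valid JSON. One quick check, assuming Python is available on the machine where you saved the file:
python -m json.tool alert.json    # prints the parsed JSON, or an error if the file is malformed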
Upload Custom Alert JSON file
Now that we have the custom alert file, we can submit it to the Ambari API to create the new alert. You submit the alert by using the following curl call:
curl -u admin:admin -i -k -H 'X-Requested-By:ambari' -X POST -d @alert.json https://#.#.#.#/ambari/api/v1/clusters/autoscaling/alert_definitions
You should see something similar to the following:
HTTP/1.1 100 Continue
HTTP/1.1 201 Created
Server: nginx
Date: Wed, 18 Oct 2017 17:52:47 GMT
Content-Type: text/plain
Content-Length: 0
Connection: keep-alive
X-Frame-Options: DENY
X-XSS-Protection: 1; mode=block
X-Content-Type-Options: nosniff
Cache-Control: no-store
Pragma: no-cache
Set-Cookie: AMBARISESSIONID=18utggom97x7z33z3d2x9h1mf;Path=/;HttpOnly
Expires: Thu, 01 Jan 1970 00:00:00 GMT
User: admin
Verify Custom Alert Exists
You can verify the alert exists using the API call we used before:
curl -u admin:admin -i -k -H 'X-Requested-By:ambari' https://#.#.#.#/ambari/api/v1/clusters/autoscaling/alert_definitions
You should see the new alert at the bottom of the list:
{
"href" : "http://#.#.#.#/api/v1/clusters/autoscaling/alert_definitions/75",
"AlertDefinition" : {
"cluster_name" : "autoscaling",
"id" : 75,
"label" : "[CUSTOM] ResourceManager Pending Applications",
"name" : "queue_pending_applications"
}
}
]
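If you don't want to scan the whole list, the Ambari API also accepts predicates, so you can request the new definition by name. A sketch using the same placeholders as before (adjust credentials, IP address, and cluster name to your environment):
curl -u admin:admin -i -k -H 'X-Requested-By:ambari' 'https://#.#.#.#/ambari/api/v1/clusters/autoscaling/alert_definitions?AlertDefinition/name=queue_pending_applications'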
You can also verify via the Ambari Alerts page. In the upper right-hand menu of Ambari, click on Alerts. You should see something similar to this:
Now filter for CUSTOM and you should see something similar to this:
As you can see, the alert exists in Ambari. After a few minutes, the status should change from NONE to OK.
Create Cloudbreak Autoscaling Policy
Now that our custom alert exists in Ambari, we can create a Cloudbreak autoscaling policy based on that alert. In the Cloudbreak UI, show the details for the cluster you have running. You should see something similar to this:
Click on the autoscaling SLA policies link to the right of details. You should see something similar to this:
By default, the policies should be disabled. You can click on the enable button to enable autoscaling. You should see something similar to this:
Before creating the policy, you have to define the Ambari Alert on which you want to trigger. Click the create alert button. You should see something similar to this:
You have the option to choose between metric-based and time-based alerts. Time-based alerts let you define a cron-based time period during which autoscaling events will happen. For this tutorial, I'm going to use a metric-based alert.
The Alert Name and Description are up to you; I recommend using something informative. The Metric - Desired State is a drop-down where you select from the list of available Ambari Alerts and choose which alert state should trigger the policy. The Period is how long, in minutes, the alert must be active before an autoscaling event is triggered. Use a reasonable value; you don't want scaling events happening too quickly, as that can cause a lot of churn.
You can see what I've used as an example:
When you have everything entered, click on the create alert button. Now we can define the scaling policy itself. Click on the create policy button. You should see something similar to this:
The Policy Name is up to you; again, I recommend something informative. The Scaling Adjustment is how many nodes to add to the cluster, and the drop-down to the right specifies how that number is interpreted: an exact node count, a percentage of the current cluster size, or a total cluster node count. The Host Group defines which kind of nodes should be added; this maps back to the Blueprint used to build the cluster, where you may have compute-only or data-only nodes you want to add. The Alert is the Cloudbreak alert we created in the previous step.
You can see what I used as an example:
When you have everything entered, click on the create policy button. You should now have an Alert and Scaling Policy defined. You should see something similar to this:
Run Jobs On The Cluster
To trigger the alert, I'm going to run some jobs on my cluster. A simple test is to run a couple of copies of TeraGen; because of the size of my cluster, I shouldn't have the capacity to run more than one of them at a time. This should create pending applications, which will trigger the alert.
To do this, I'm going to log into one of the nodes in my cluster using ssh. You can do this using something similar to this:
ssh -i cloudbreak cloudbreak@#.#.#.#
NOTE: Your keyname and ip will be different.
You should see something similar to this:
The authenticity of host '#.#.#.# (#.#.#.#)' can't be established.
ECDSA key fingerprint is SHA256:C10UDnRxnTTaxkWqv5cPgw/FItKWvEdyWmeS2BKVUU8.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '#.#.#.#' (ECDSA) to the list of known hosts.
__| __|_ )
_| ( / Amazon Linux AMI
___|\___|___|
https://aws.amazon.com/amazon-linux-ami/2017.03-release-notes/
27 package(s) needed for security, out of 61 available
Run "sudo yum update" to apply all updates.
Amazon Linux version 2017.09 is available.
I'm going to need 4 sessions because I want to have 4 submitted jobs at the same time. In each session I'm going to run the following command:
sudo -u hdfs hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar teragen 1000000000 /tmp/terasort1-input
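Alternatively, instead of juggling four interactive sessions, you could launch all four jobs in the background from a single shell. A rough sketch, assuming the same example jar and the hdfs user:
# submit four TeraGen jobs, each writing to its own output directory
for i in 1 2 3 4; do
  sudo -u hdfs hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar \
    teragen 1000000000 /tmp/terasort${i}-input &
done
wait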
However you launch them, each job needs a unique output directory; in my case I used terasort1-input, terasort2-input, and so on. We need enough jobs running for the alert to trigger and stay active for at least 5 minutes, which is the period we specified in Cloudbreak. In Ambari, click on YARN to see the summary dashboard. You should see something similar to this:
If you click on the red 1 alert you can get more details. You should see something similar to this:
As you can see, this has been CRITICAL for 3 minutes. Cloudbreak won't trigger an autoscale event until the alert has been active for 5 minutes. Once 5 minutes have passed and the alert is still present, Cloudbreak should start autoscaling. If you look at the HISTORY section on the cluster autoscaling page, you should see something similar to this:
As you can see, Cloudbreak has started the autoscaling process. It will add 1 node to the cluster based on our policy. You can also see this on the cluster details page in the Event History. You should see something similar to this:
After a couple of minutes, you should notice Ambari showing the addition of another node in the list of operations. You should see something similar to this:
Once the new node is added, you should notice that one of the other jobs is picked up and the alert changes from CRITICAL to WARNING. You should see something similar to this:
Next Steps
The autoscaling policy we set up only addresses the addition of new nodes. You should consider multiple policies that adjust the cluster both up and down. For example, you could have a policy that sets the cluster to a specific total node count when an alert returns to OK.
Cloudbreak also allows you to adjust the scaling configuration to allow for a cooldown time as well as minimum and maximum cluster sizes. This helps you control the amount of cluster churn created by autoscaling events. Combined with adjusting the period for the Cloudbreak alert, this gives you a reasonable amount of control over autoscaling on the cluster.
Review
If you have successfully followed along with this tutorial, you have created a custom Ambari Alert, created a Cloudbreak autoscaling policy based on that alert, and watched the alert fire and Cloudbreak scale the cluster while multiple TeraGen jobs were running.
... View more
- Find more articles tagged with:
- Ambari
- ambari-alerts
- autoscaling
- Cloud & Operations
- Cloudbreak
- How-ToTutorial
10-18-2017
12:13 PM
@Sanaz Janbakhsh You may find this series of articles helpful: https://community.hortonworks.com/articles/75032/executescript-cookbook-part-1.html. This article is also useful: https://community.hortonworks.com/articles/35568/python-script-in-nifi.html
... View more
10-17-2017
05:17 PM
@Ashal Singh I'm not aware of any plans for Metron to be part of Azure HDInsight. However, you should be able to deploy Metron on Azure IaaS without difficulty. Can you share more information about what you are trying to do? Do you have more specific questions you are looking to answer?
... View more
10-10-2017
02:12 PM
@Sanaz Janbakhsh There is an open JIRA to create a processor to convert between data formats, but the work has not been completed yet: https://issues.apache.org/jira/browse/NIFI-2997. I think you should be able to do what you need using my recommendation above. Python can easily handle the task you describe (https://docs.python.org/3/library/base64.html); you just need to plug a Python script into your NiFi flow using ExecuteScript.
... View more
10-09-2017
05:40 PM
@Sanaz Janbakhsh Have you considered ExecuteScript or ExecuteProcess to use Python or Java to do the conversion?
... View more
09-22-2017
01:21 PM
@Prakash Punj Elasticsearch, Kibana, and Logstash are not provided as out-of-the-box services in Ambari. However, Ambari is extensible and designed to let you create your own Management Packs for custom services, so you can create custom services for each of those products and install/manage them from Ambari. I've created a very minimal service definition (not a full Management Pack) for Elasticsearch here: https://github.com/Jaraxal/ambari-elasticsearch-service. My service definition is based on this one, which is very dated: https://github.com/Symantec/ambari-elasticsearch-service. You could use either of these as an example to get started; neither is supported by Hortonworks. All of these services can be installed on a single server, but there are a couple of points to consider. Depending on the amount of data you intend to store in Elasticsearch, you may want more than a single server, and you may want additional servers for index replicas. Depending on the number of users hitting Kibana, you may want more than one copy of Kibana running. Logstash typically scales by adding additional Logstash instances, so you may want multiple copies of Logstash running across more than one server, depending on your performance needs. As a general rule, I would have servers dedicated to ELK and try not to mix ELK and HDP components together. If this is a dev/test cluster, then you have a lot of flexibility to ignore the points above.
... View more
09-22-2017
01:12 PM
@Deepak Tripathi I would install Anaconda Python 2.7.x. You can install it in your home directory or in /opt. This won't interfere with the system default python.
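For example, a user-local install might look something like this (the installer version and URL are only examples; pick whichever Python 2.7 build you need from the Anaconda archive):
# download the installer and run it in batch mode into a prefix under your home directory
wget https://repo.continuum.io/archive/Anaconda2-4.4.0-Linux-x86_64.sh
bash Anaconda2-4.4.0-Linux-x86_64.sh -b -p $HOME/anaconda2
$HOME/anaconda2/bin/python --version    # call it explicitly, or add its bin directory to PATH when needed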
... View more
08-27-2017
02:36 PM
@Sofian Benabdelhak When you are running the sandbox on VirtualBox, you are actually running a Linux VM which has a Docker container running inside of it. When you use the VirtualBox UI to connect to the VM console, you are connecting to the VM, not the Docker instance. This configuration is new as of the HDP 2.5 version of the sandbox and has caused some confusion. Is there something specific you are trying to do with a GUI? The sandbox is intended for learning the ropes of HDP and is primarily validated against the associated sandbox tutorials; it has limitations for uses outside of that expected use case.
... View more
08-26-2017
02:47 PM
@Nagesh Kumar So it looks like you are installing the Python version of TensorFlow using Anaconda3 as your Python distribution. The error you are getting is a GLIBC error, and it isn't that uncommon; it's a general OS/Anaconda/TensorFlow issue. What OS are you running? You are likely running into this: https://stackoverflow.com/questions/39807621/glibc-2-14-not-recognized-by-tensorflow-installation-in-redhat. The problem is fairly common, and a Google search yields a number of similar results. This appears to suggest a workaround: https://stackoverflow.com/questions/33655731/error-while-importing-tensorflow-in-python2-7-in-ubuntu-12-04-glibc-2-17-not-f. If you are not running Ubuntu, you should be able to adapt the specific versions as needed for your OS.
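As a quick first step, you can confirm which glibc version your OS actually provides before trying any of the workarounds:
ldd --version    # the first line reports the GNU libc (glibc) version
rpm -q glibc     # on RPM-based systems, shows the installed glibc package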
... View more
08-26-2017
02:42 PM
@Ryan Sequeira Yes, upgrading your system memory to 16GB should resolve the problem. If you have 8GB of memory, you can't give all 8GB to the VM because your host computer needs a minimum of 2-3GB just to run its base OS.
... View more
08-25-2017
04:09 PM
1 Kudo
@Julien Champ
I spun up a cluster using Cloudbreak 1.16.1. I do not see either of those options available. I will see if I can track this down and get clarification.
... View more
08-02-2017
10:05 PM
@Hadoop User Ah, that helps clarify things. You can use the SplitText processor (http://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.3.0/org.apache.nifi.processors.standard.SplitText/index.html) to split a file into individual record lines. You could probably use the ExtractText processor (https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.3.0/org.apache.nifi.processors.standard.ExtractText/index.html) to extract the K:V pairs and create attributes. At that point, you should be able to put the data into Hive using PutHiveQL. While this article isn't doing exactly what you want, you should be able to follow its general flow as an example: https://community.hortonworks.com/questions/80211/from-csv-to-hive-via-nifi.html. There are also some new processors in NiFi 1.3 around RecordReaders and RecordWriters; they may be a little more complicated to set up at first, but you'll see significantly better performance: https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-record-serialization-services-nar/1.3.0/org.apache.nifi.csv.CSVReader/index.html. You might find that using ScriptedReader lets you use Python as an easier way to parse the data: https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-scripting-nar/1.3.0/org.apache.nifi.record.script.ScriptedReader/index.html
... View more
08-02-2017
05:47 PM
@Venkata Sudheer Kumar M Have you tried printing out the JSON the Python script is pushing to the API to validate it's correct? Then take that printed output and try pushing it to the API using curl. That will eliminate the JSON as the problem.
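For example, once the script has written its payload to a file, you could replay it with curl against the same endpoint (the filename, credentials, and URL below are placeholders, not the actual API you are calling):
curl -i -u <user>:<password> -H 'Content-Type: application/json' -X POST -d @payload.json https://<api-host>/<endpoint>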
... View more
08-02-2017
05:40 PM
@Hadoop User Do all records for "data1" have the same structure? In other words, while data1, data2, and data3 differ from each other, are all data1 records like each other and all data2 records like each other? You could use NiFi to route the data using regular expressions with the RouteText processor (https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.3.0/org.apache.nifi.processors.standard.RouteText/index.html) or the RouteOnContent processor (https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.3.0/org.apache.nifi.processors.standard.RouteOnContent/index.html). This would allow you to land each data type in an appropriate Hive table.
... View more
07-27-2017
10:55 PM
1 Kudo
@Hovo Khachikyan It depends on the use cases you intend for the cluster. A 6-node cluster may be OK for a dev or POC cluster, but it is generally considered underpowered for a production use case. When you say you want to install "most components", do you know what you want to do with the cluster? I generally recommend that you not install components just to "check the box"; you can easily add components later if you find you need them.
... View more
07-17-2017
05:35 PM
2 Kudos
@Constantin Stanca Zeppelin in HDP 2.6 supports user impersonation. When properly configured, the audit logs should capture the actual end-user submitting the hive queries.
... View more
07-10-2017
03:24 PM
@priyanshu hasija You can manually call a commit after indexing data using something like http://localhost:8983/solr/collection_name/update?commit=true. Here is a link to information on autoCommit: https://cwiki.apache.org/confluence/display/solr/UpdateHandlers+in+SolrConfig
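For example, issuing the commit from the command line (adjust the host, port, and collection name to your setup):
curl 'http://localhost:8983/solr/collection_name/update?commit=true'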
... View more
07-10-2017
02:49 PM
1 Kudo
@Tim Maher I highly recommend you look into HDF/NiFi as a possible solution to this problem. You can easily use NiFi to pull from source systems, do basic transformations and then store the data in HDFS, Hive, HBase, etc. In terms of feeding the data to BI tools, you may want to consider storing the data in Hive for the best performance. Take a look at this article for one way to solve some of what you are trying to do: https://community.hortonworks.com/articles/52856/stream-data-into-hive-like-a-king-using-nifi.html
... View more
07-10-2017
02:45 PM
@swathi thukkaraju Have you tried using the built-in TOBAG functions? https://pig.apache.org/docs/r0.8.1/piglatin_ref2.html#Bag+and+Tuple+Functions You could also write your own UDF.
... View more
07-10-2017
02:38 PM
@priyanshu hasija Likely the data you have indexed hasn't been flushed to disk, so when Solr crashes the data is lost. You should consider changing the flush settings to ensure the data is written to the index more frequently: https://cwiki.apache.org/confluence/display/solr/IndexConfig+in+SolrConfig
... View more
07-10-2017
02:21 PM
1 Kudo
@Alexey Vovchenko The latest version of Ambari supports installing and managing both HDP and HDF. If you upgrade Ambari on the HDP cluster, you could install the HDF management pack and then install Kafka via Ambari that way.
... View more
06-02-2017
06:46 PM
1 Kudo
@Daniel Beaulieu I think you'll find that recommendations on cloud providers vary depending on experience and preference. From a Hortonworks Infrastructure as a Service (IaaS) perspective, it's all just virtual machines (instances) to us. If you deploy HDP as IaaS, then both clouds are 100% supported; you won't find missing capabilities, and our software works equally well across AWS, Azure, Google, etc. In fact, our Cloudbreak component makes it very easy to deploy across multiple cloud vendors; I can demonstrate deploying the exact same cluster configuration on AWS and Azure simultaneously from a single web-based UI. I have used both AWS and Azure quite a bit over the last couple of years. I personally prefer AWS and tend to use it more; that doesn't mean AWS is better, and if you talk to someone else, they are just as likely to say they prefer Azure. In terms of performance, I haven't seen any real benchmarks suggesting that Hortonworks on Azure is faster than Hortonworks on AWS; if you optimally configure both environments, I would expect performance to be within 1 or 2 percentage points. Some customers use both vendors, with clusters in different clouds. From a technical perspective, I think AWS is easier to use; Azure tries to be user friendly but isn't as easy as AWS. Just my 2 cents.
... View more
05-26-2017
04:19 PM
@Chris Houston Can you confirm the HDFS plugin shows it is active? On the Ranger Admin audit page, you can see when the plugins last updated. Do you see HDFS in the list?
... View more