Member since: 02-09-2016
Posts: 559
Kudos Received: 421
Solutions: 98
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 555 | 03-02-2018 01:19 AM
 | 731 | 03-02-2018 01:04 AM
 | 676 | 08-02-2017 05:40 PM
 | 604 | 07-17-2017 05:35 PM
 | 457 | 07-10-2017 02:49 PM
11-01-2017
09:29 PM
1 Kudo
@Maryem Mary I see why you have some confusion; it was harder to find than it should be. This link tells you where to find the repositories: https://docs.hortonworks.com/HDPDocuments/HDF3/HDF-3.0.1.1/bk_installing-hdf/content/ch_using-local-repos.html. The relevant section is "Obtain Public Repositories from the HDF Release Notes". The direct link to the repositories is here: https://docs.hortonworks.com/HDPDocuments/HDF3/HDF-3.0.1.1/bk_release-notes/content/ch_hdf_relnotes.html#repo-location
... View more
10-27-2017
03:54 PM
@Joe P The HDP MPack is bundled with Ambari itself. I'm not aware of any direct links to the HDP MPacks.
... View more
10-25-2017
05:29 PM
1 Kudo
@Rahul Raj HDF is very flexible. It comes bundled with over 200 processors that allow you to pull and push data, do transformations, and so on. Some processors let you interact with web services via standard HTTP requests. On top of that, you can write scripts (Python, JavaScript, etc.) and run them from within HDF, and you can easily write your own processors to extend HDF. Sqoop is generally used to bulk-pull data from an RDBMS, with HDF then pulling the incremental changes. It depends on the volume of data and the use case: in some cases you can just use HDF; in others you might use Sqoop first, then HDF for the updates. Anything Flume can do, HDF can do better, so if you are considering Flume, HDF can handle it. HDF can land data directly into Kafka, HDFS, HBase, Hive, etc. So again, the exact flow depends on your use case and data volume, but HDF can land the data into HDP after pulling it from another source (DB, Cassandra, HTTP, Elasticsearch, Solr, etc.).
... View more
10-23-2017
02:38 PM
2 Kudos
@Rahul Raj To correct something in your question, Flume and Sqoop are part of HDP, not HDF. HDF lets you manage your data in motion; it is primarily a data flow tool, although it is capable of simple event processing. Think of HDF as a way to manage and route data between different systems. Those systems can be two different HDP clusters, an HDF and an HDP cluster, a REST API endpoint and an HDP cluster, and so on. With over 200 processors, HDF is flexible and allows you to manage data flows between any number of systems. HDF can pull data from web servers, RESTful APIs, and data warehouses (RDBMS), and it can handle streaming data or bulk pulls of data at rest from an HDFS filesystem.
... View more
10-20-2017
03:21 PM
1 Kudo
@Joe P Installation docs are available here: https://docs.hortonworks.com/HDPDocuments/HDF3/HDF-3.0.1.1/bk_planning-your-deployment/content/ch_deployment-scenarios.html. The order in which you install MPacks matters. You can install the HDF MPack on an HDP cluster; however, I don't believe installing an HDP MPack on an HDF cluster is supported.
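For reference, registering a management pack is done from the ambari-server CLI. A minimal sketch, assuming you have already downloaded the HDF MPack bundle (the path and filename below are placeholders):
# run on the Ambari server host; stop Ambari, register the MPack, then start it again
ambari-server stop
ambari-server install-mpack --mpack=/tmp/hdf-ambari-mpack-<version>.tar.gz --verbose
ambari-server start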
... View more
10-19-2017
01:57 PM
1 Kudo
@Eyad Garelnabi It seems like the storage devices are not mounted properly on the instance. Is this a one-off occurrence, or can you reproduce it? If you log into each of the nodes in the cluster with ssh, you can run df -h to see which devices are mounted.
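For example, a quick check on one node might look like this (the key name and IP are placeholders):
ssh -i <your-key> cloudbreak@<node-ip>
df -h    # show mounted filesystems and their mount points
lsblk    # list the block devices the instance sees, mounted or not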
... View more
10-18-2017
10:13 PM
@Tomas You can find more logging information by running cbd logs, and you can narrow the output to the Cloudbreak container with cbd logs cloudbreak (use Ctrl-C to break out of the log stream). That should show you more logging information. You can also run cbd ps to make sure all of your containers are running. What does the output of that look like?
... View more
10-18-2017
09:19 PM
13 Kudos
Objectives
Every day, Hortonworks customers take advantage of the flexibility and elasticity that cloud platforms provide. Many of these customers use Cloudbreak to manage their HDP clusters and to provide autoscaling capability.
Cloudbreak's autoscaling features are tied to Ambari Alerts. Ambari ships with a set of alerts out of the box, but you may want an autoscaling policy based on a condition Ambari doesn't cover by default. The good news is that Ambari supports creating custom alerts, and custom alerts created in Ambari are visible to Cloudbreak and usable with Cloudbreak autoscaling policies.
A common desire with autoscaling is to scale the cluster based on memory used, cores used, or perhaps the number of running applications. You can use the YARN ResourceManager JMX data to determine these values. For example, you may have a typical cluster with 5 NodeManagers. You also know that your cluster usage will sometimes spike and you want to add 3 more NodeManagers, but to keep costs down you don't want to run 8 NodeManagers all the time. You can create an alert based on the JMX data from the YARN ResourceManager, and Cloudbreak can then scale the cluster whenever that alert is triggered.
This tutorial will walk you through the process of creating a custom Ambari Alert for use by Cloudbreak autoscaling policies.
Prerequisites
You should have a properly running instance of Cloudbreak with credentials for your cloud provider of choice.
You should have an Ambari 2.5/HDP 2.6 cluster already deployed with Cloudbreak.
Scope
This tutorial was tested in the following environment:
Cloudbreak 1.16.4
AWS EC2
Ambari 2.5
HDP 2.6
Steps
Log in to Ambari
As mentioned in the prerequisites, you should already have a cluster built using Cloudbreak. Click on the cluster summary box in the Cloudbreak UI to display the cluster details. Now click on the link to your Ambari cluster. You may see something similar to this:
Your screen may vary depending on your browser of choice; I'm using Chrome. This warning appears because Cloudbreak uses self-signed certificates, which browsers do not trust by default. Click on the Advanced link. You should see something similar to this:
Click on the Proceed link to open the Ambari login screen. You should be able to login to Ambari using the default username and password of admin unless you changed it.
Once you have logged into Ambari, you should see something similar to this:
NOTE: Your specific cluster may look different.
View the YARN ResourceManager Dashboard
YARN is the central component used to manage resource availability on an HDP cluster. In Ambari, you can see a high-level summary of the resources available to YARN by clicking on the YARN link in the service list on the Ambari dashboard. You should see something similar to this:
If you take a look at the upper right corner, you can see a summary of containers, applications and cluster memory. For this tutorial, I would like Cloudbreak to autoscale my cluster when the number of pending applications is greater than 2. To do this, I'm going to create a custom Ambari Alert based on that value. To get that value, I need to look at the YARN ResourceManager JMX data.
View YARN ResourceManager JMX Data
You can view available JMX data for the YARN ResourceManager via the Ambari Quick Links. You should already have the YARN ResourceManager dashboard visible from the last step. Click on the Quick Links drop down menu in the top middle of the screen. You should see something similar to this:
As you can see, ResourceManager JMX is available in the list. If you click that link you will see something similar to this:
You should see a fairly large JSON output. If you search for q0=root, you should see something similar to this:
This is a list of YARN-related metrics associated with the root queue. If you look in the list of values, you should see AppsPending. This is the metric I want to use for my Ambari Alert.
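If you prefer the command line, you can also pull the same value straight from the ResourceManager's JMX servlet. A sketch, assuming the ResourceManager web UI is listening on its default port 8088 (replace the host placeholder):
# query just the root-queue metrics bean and pull out the AppsPending value
curl -s 'http://<resourcemanager-host>:8088/jmx?qry=Hadoop:service=ResourceManager,name=QueueMetrics,q0=root' | grep AppsPending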
Review existing Alert definitions
You can view the definition for any Ambari-provided alert.
To get a list of all alerts on the system, you make a call to the Ambari API:
curl -u admin:admin -i -k -H 'X-Requested-By:ambari' https://#.#.#.#/ambari/api/v1/clusters/autoscaling/alert_definitions/
You should see something similar to this:
HTTP/1.1 200 OK
Server: nginx
Date: Wed, 18 Oct 2017 17:15:21 GMT
Content-Type: text/plain
Content-Length: 21595
Connection: keep-alive
Vary: Accept-Encoding
X-Frame-Options: DENY
X-XSS-Protection: 1; mode=block
X-Content-Type-Options: nosniff
Cache-Control: no-store
Pragma: no-cache
Set-Cookie: AMBARISESSIONID=1gprc4wefyoiqmb1kj6plu95j;Path=/;HttpOnly
Expires: Thu, 01 Jan 1970 00:00:00 GMT
User: admin
Vary: Accept-Encoding, User-Agent
{
"href" : "http://#.#.#.#/api/v1/clusters/autoscaling/alert_definitions/",
"items" : [
{
"href" : "http://#.#.#.#/api/v1/clusters/autoscaling/alert_definitions/1",
"AlertDefinition" : {
"cluster_name" : "autoscaling",
"id" : 1,
"label" : "HBase Master Process",
"name" : "hbase_master_process"
}
},
...
NOTE: Your username and password may be different. You need to update the curl call to use your IP address for the Ambari server and your cluster name. In this example, my cluster name is autoscaling. Also notice the use of https for Cloudbreak clusters and the need for the -k flag.
As you can see, each alert is assigned a unique id. To view the configuration of a specific alert, you make a curl call to the href link with the alert id provided in the output.
To see the definition of Alert id 1, make the following curl call:
curl -u admin:admin -i -k -H 'X-Requested-By:ambari' https://#.#.#.#/ambari/api/v1/clusters/autoscaling/alert_definitions/1
NOTE: With Cloudbreak, Ambari is using HTTPS and is proxied so change http to https and /api to /ambari/api .
You should see something similar to this:
HTTP/1.1 200 OK
Server: nginx
Date: Wed, 18 Oct 2017 17:24:00 GMT
Content-Type: text/plain
Content-Length: 1156
Connection: keep-alive
Vary: Accept-Encoding
X-Frame-Options: DENY
X-XSS-Protection: 1; mode=block
X-Content-Type-Options: nosniff
Cache-Control: no-store
Pragma: no-cache
Set-Cookie: AMBARISESSIONID=bcdh6wmyxpnd1ioufen9hikva;Path=/;HttpOnly
Expires: Thu, 01 Jan 1970 00:00:00 GMT
User: admin
Vary: Accept-Encoding, User-Agent
{
"href" : "http://#.#.#.#/api/v1/clusters/autoscaling/alert_definitions/1",
"AlertDefinition" : {
"cluster_name" : "autoscaling",
"component_name" : "HBASE_MASTER",
"description" : "This alert is triggered if the HBase master processes cannot be confirmed to be up and listening on the network for the configured critical threshold, given in seconds.",
"enabled" : true,
"help_url" : null,
"id" : 1,
"ignore_host" : false,
"interval" : 1,
"label" : "HBase Master Process",
"name" : "hbase_master_process",
"repeat_tolerance" : 1,
"repeat_tolerance_enabled" : false,
"scope" : "ANY",
"service_name" : "HBASE",
"source" : {
"default_port" : 60000.0,
"reporting" : {
"ok" : {
"text" : "TCP OK - {0:.3f}s response on port {1}"
},
"warning" : {
"text" : "TCP OK - {0:.3f}s response on port {1}",
"value" : 1.5
},
"critical" : {
"text" : "Connection failed: {0} to {1}:{2}",
"value" : 5.0
}
},
"type" : "PORT",
"uri" : "{{hbase-site/hbase.master.port}}"
}
}
}
The alert definitions vary depending on the component. My advice is to look for an existing alert definition for the component you are interested in and use it as a base for your custom alert.
Create Custom Alert JSON file
To submit a custom alert to Ambari, we define the alert in a JSON file and upload it via the Ambari API. You can copy and paste the following alert definition into your alert file:
{
"AlertDefinition" : {
"cluster_name" : "autoscaling",
"component_name" : "RESOURCEMANAGER",
"description" : "This queue-level alert is triggered if the number of root queue pending applications exceeds 1.",
"enabled" : true,
"help_url" : null,
"ignore_host" : false,
"interval" : 5,
"label" : "[CUSTOM] ResourceManager Pending Applications",
"name" : "queue_pending_applications",
"repeat_tolerance" : 1,
"repeat_tolerance_enabled" : false,
"scope" : "ANY",
"service_name" : "YARN",
"source" : {
"jmx" : {
"property_list" : ["Hadoop:service=ResourceManager,name=QueueMetrics,q0=root/AppsPending"],
"value" : "{0}"
},
"reporting" : {
"ok" : {
"text" : "YARN Pending Applications: {0}"
},
"warning" : {
"text" : "YARN Pending Applications: {0}",
"value" : 2
},
"critical" : {
"text" : "YARN Pending Applications: {0}",
"value" : 3
},
"units" : "Applications"
},
"type" : "METRIC",
"uri" : {
"http" : "{{yarn-site/yarn.resourcemanager.webapp.address}}",
"https" : "{{yarn-site/yarn.resourcemanager.webapp.https.address}}",
"https_property" : "{{yarn-site/yarn.http.policy}}",
"https_property_value" : "HTTPS_ONLY",
"kerberos_keytab" : "{{yarn-site/yarn.resourcemanager.webapp.spnego-keytab-file}}",
"kerberos_principal" : "{{yarn-site/yarn.resourcemanager.webapp.spnego-principal}}",
"default_port" : 0.0,
"connection_timeout" : 5.0,
"high_availability" : {
"alias_key" : "{{yarn-site/yarn.resourcemanager.ha.rm-ids}}",
"http_pattern" : "{{yarn-site/yarn.resourcemanager.webapp.address.{{alias}}}}",
"https_pattern" : "{{yarn-site/yarn.resourcemanager.webapp.https.address.{{alias}}}}"
}
}
}
}
}
You will need to change the value of cluster_name to match the name of your cluster. The label and name values can be whatever you like, but they should be unique among the alerts in the system. The label is what will be displayed in Ambari; I like to prepend [CUSTOM] to custom alerts to make them easy to spot. Once you have made the appropriate changes, save the file as alert.json (or any other filename you like).
This alert, as defined, will raise a WARNING when the number of pending applications reaches 2 and a CRITICAL alert when it reaches 3.
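Before uploading, it is worth confirming that the file is valid JSON. One quick check, assuming Python is available on the machine where you saved the file:
python -m json.tool alert.json    # prints the parsed JSON, or an error if the file is malformed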
Upload Custom Alert JSON file
Now that we have the custom alert file, we can submit it to the Ambari API to create the new alert. You submit the alert by using the following curl call:
curl -u admin:admin -i -k -H 'X-Requested-By:ambari' -X POST -d @alert.json https://#.#.#.#/ambari/api/v1/clusters/autoscaling/alert_definitions
You should see something similar to the following:
HTTP/1.1 100 Continue
HTTP/1.1 201 Created
Server: nginx
Date: Wed, 18 Oct 2017 17:52:47 GMT
Content-Type: text/plain
Content-Length: 0
Connection: keep-alive
X-Frame-Options: DENY
X-XSS-Protection: 1; mode=block
X-Content-Type-Options: nosniff
Cache-Control: no-store
Pragma: no-cache
Set-Cookie: AMBARISESSIONID=18utggom97x7z33z3d2x9h1mf;Path=/;HttpOnly
Expires: Thu, 01 Jan 1970 00:00:00 GMT
User: admin
Verify Custom Alert Exists
You can verify the alert exists using the API call we used before:
curl -u admin:admin -i -k -H 'X-Requested-By:ambari' https://#.#.#.#/ambari/api/v1/clusters/autoscaling/alert_definitions
You should see the new alert at the bottom of the list:
{
"href" : "http://#.#.#.#/api/v1/clusters/autoscaling/alert_definitions/75",
"AlertDefinition" : {
"cluster_name" : "autoscaling",
"id" : 75,
"label" : "[CUSTOM] ResourceManager Pending Applications",
"name" : "queue_pending_applications"
}
}
]
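If you don't want to scan the whole list, the Ambari API also accepts predicates, so you can request the new definition by name. A sketch using the same placeholders as before (adjust credentials, IP address, and cluster name to your environment):
curl -u admin:admin -i -k -H 'X-Requested-By:ambari' 'https://#.#.#.#/ambari/api/v1/clusters/autoscaling/alert_definitions?AlertDefinition/name=queue_pending_applications'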
You can also verify via the Ambari Alerts page. In the upper right-hand menu of Ambari, click on Alerts. You should see something similar to this:
Now filter for CUSTOM and you should see something similar to this:
As you can see, the alert exists in Ambari. After a few minutes, the status should change from NONE to OK.
Create Cloudbreak Autoscaling Policy
Now that our custom alert exists in Ambari, we can create a Cloudbreak autoscaling policy based on that alert. In the Cloudbreak UI, show the details for the cluster you have running. You should see something similar to this:
Click on the autoscaling SLA policies link to the right of details. You should see something similar to this:
By default, the policies should be disabled. You can click on the enable button to enable autoscaling. You should see something similar to this:
Before creating the policy, you have to define the Ambari Alert on which you want to trigger. Click the create alert button. You should see something similar to this:
You have the option to choose between metric-based and time-based alerts. Time-based alerts let you define a cron-based time period during which autoscaling events will happen. For this tutorial, I'm going to use a metric-based alert.
The Alert Name and Description are up to you; I recommend using something informative. The Metric - Desired State is a drop-down where you select from the list of available Ambari Alerts and choose which alert state should trigger the policy. The Period is how long, in minutes, the alert must be active before an autoscaling event is triggered. Use a reasonable value; you don't want scaling events happening too quickly, as that can cause a lot of churn.
You can see what I've used as an example:
When you have everything entered, click on the create alert button. Now we can define the scaling policy itself. Click on the create policy button. You should see something similar to this:
The Policy Name is up to you; again, I recommend something informative. The Scaling Adjustment is how many nodes to add to the cluster, and the drop-down to the right specifies how that number is interpreted: an exact node count, a percentage of the current cluster size, or a total cluster node count. The Host Group defines which kind of nodes should be added; this maps back to the Blueprint used to build the cluster, where you may have compute-only or data-only nodes you want to add. The Alert is the Cloudbreak alert we created in the previous step.
You can see what I used as an example:
When you have everything entered, click on the create policy button. You should now have an Alert and Scaling Policy defined. You should see something similar to this:
Run Jobs On The Cluster
To trigger the alert, I'm going to run some jobs on my cluster. A simple test is to run a couple of copies of TeraGen; because of the size of my cluster, I shouldn't have the capacity to run more than one of them at a time. This should create pending applications, which will trigger the alert.
To do this, I'm going to log into one of the nodes in my cluster using ssh. You can do this using something similar to this:
ssh -i cloudbreak cloudbreak@#.#.#.#
NOTE: Your keyname and ip will be different.
You should see something similar to this:
The authenticity of host '#.#.#.# (#.#.#.#)' can't be established.
ECDSA key fingerprint is SHA256:C10UDnRxnTTaxkWqv5cPgw/FItKWvEdyWmeS2BKVUU8.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '#.#.#.#' (ECDSA) to the list of known hosts.
__| __|_ )
_| ( / Amazon Linux AMI
___|\___|___|
https://aws.amazon.com/amazon-linux-ami/2017.03-release-notes/
27 package(s) needed for security, out of 61 available
Run "sudo yum update" to apply all updates.
Amazon Linux version 2017.09 is available.
I'm going to need 4 sessions because I want to have 4 submitted jobs at the same time. In each session I'm going to run the following command:
sudo -u hdfs hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar teragen 1000000000 /tmp/terasort1-input
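Alternatively, instead of juggling four interactive sessions, you could launch all four jobs in the background from a single shell. A rough sketch, assuming the same example jar and the hdfs user:
# submit four TeraGen jobs, each writing to its own output directory
for i in 1 2 3 4; do
  sudo -u hdfs hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar \
    teragen 1000000000 /tmp/terasort${i}-input &
done
wait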
However you launch them, each job needs a unique output directory; in my case I used terasort1-input, terasort2-input, and so on. We need enough jobs running for the alert to trigger and stay active for at least 5 minutes, which is the period we specified in Cloudbreak. In Ambari, click on YARN to see the summary dashboard. You should see something similar to this:
If you click on the red 1 alert you can get more details. You should see something similar to this:
As you can see, this has been CRITICAL for 3 minutes. Cloudbreak won't trigger an autoscale event until the alert has been active for 5 minutes. Once 5 minutes have passed and the alert is still present, Cloudbreak should start autoscaling. If you look at the HISTORY section on the cluster autoscaling page, you should see something similar to this:
As you can see, Cloudbreak has started the autoscaling process. It will add 1 node to the cluster based on our policy. You can also see this on the cluster details page in the Event History. You should see something similar to this:
After a couple of minutes, you should notice Ambari showing the addition of another node in the list of operations. You should see something similar to this:
Once the new node is added, you should notice that one of the other jobs is picked up and the alert changes from CRITICAL to WARNING. You should see something similar to this:
Next Steps
The autoscaling policy we set up only addresses the addition of new nodes. You should consider multiple policies that adjust the cluster both up and down. For example, you could have a policy that sets the cluster to a specific total node count when an alert returns to OK.
Cloudbreak also allows you to adjust the scaling configuration to allow for a cooldown time as well as minimum and maximum cluster sizes. This helps you control the amount of cluster churn created by autoscaling events. Combined with adjusting the period for the Cloudbreak alert, this gives you a reasonable amount of control over autoscaling on the cluster.
Review
If you have successfully followed along with this tutorial, you have created a custom Ambari Alert, created a Cloudbreak autoscaling policy based on that alert, and watched the alert fire and Cloudbreak scale the cluster while multiple TeraGen jobs were running.
... View more
- Find more articles tagged with:
- Ambari
- ambari-alerts
- autoscaling
- Cloud & Operations
- Cloudbreak
- How-ToTutorial
10-18-2017
12:13 PM
@Sanaz Janbakhsh You may find this series of articles helpful: https://community.hortonworks.com/articles/75032/executescript-cookbook-part-1.html. This article is also useful: https://community.hortonworks.com/articles/35568/python-script-in-nifi.html
... View more
10-17-2017
05:17 PM
@Ashal Singh I'm not aware of any plans for Metron to be part of Azure HDInsight. However, you should be able to deploy Metron on Azure IaaS without difficulty. Can you share more information about what you are trying to do? Do you have more specific questions you are looking to answer?
... View more
10-10-2017
02:12 PM
@Sanaz Janbakhsh There is an open JIRA to create a processor to convert between data formats, but the work has not been completed yet: https://issues.apache.org/jira/browse/NIFI-2997. I think you should be able to do what you need using my recommendation above. Python can easily handle the task you describe (https://docs.python.org/3/library/base64.html); you just need to plug a Python script into your NiFi flow using ExecuteScript.
... View more
10-09-2017
05:40 PM
@Sanaz Janbakhsh Have you considered ExecuteScript or ExecuteProcess to use Python or Java to do the conversion?
... View more
09-22-2017
01:21 PM
@Prakash Punj Elasticsearch, Kibana, and Logstash are not provided as out-of-the-box services in Ambari. However, Ambari is extensible and designed to let you create your own Management Packs for custom services, so you can create custom services for each of those products and install/manage them from Ambari. I've created a very minimal service definition (not a full Management Pack) for Elasticsearch here: https://github.com/Jaraxal/ambari-elasticsearch-service. My service definition is based on this one, which is very dated: https://github.com/Symantec/ambari-elasticsearch-service. You could use either of these as an example to get started; neither is supported by Hortonworks. All of these services can be installed on a single server, but there are a couple of points to consider. Depending on the amount of data you intend to store in Elasticsearch, you may want more than a single server, and you may want additional servers for index replicas. Depending on the number of users hitting Kibana, you may want more than one copy of Kibana running. Logstash typically scales by adding additional Logstash instances, so you may want multiple copies of Logstash running across more than one server, depending on your performance needs. As a general rule, I would have servers dedicated to ELK and try not to mix ELK and HDP components together. If this is a dev/test cluster, then you have a lot of flexibility to ignore the points above.
... View more
09-22-2017
01:12 PM
@Deepak Tripathi I would install Anaconda Python 2.7.x. You can install it in your home directory or in /opt. This won't interfere with the system default python.
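For example, a user-local install might look something like this (the installer version and URL are only examples; pick whichever Python 2.7 build you need from the Anaconda archive):
# download the installer and run it in batch mode into a prefix under your home directory
wget https://repo.continuum.io/archive/Anaconda2-4.4.0-Linux-x86_64.sh
bash Anaconda2-4.4.0-Linux-x86_64.sh -b -p $HOME/anaconda2
$HOME/anaconda2/bin/python --version    # call it explicitly, or add its bin directory to PATH when needed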
... View more
08-27-2017
02:36 PM
@Sofian Benabdelhak When you are running the sandbox on VirtualBox, you are actually running a Linux VM which has a Docker container running inside of it. When you use the VirtualBox UI to connect to the VM console, you are connecting to the VM, not the Docker instance. This configuration is new as of the HDP 2.5 version of the sandbox and has caused some confusion. Is there something specific you are trying to do with a GUI? The sandbox is intended for learning the ropes of HDP and is primarily validated against the associated sandbox tutorials; it has limitations for uses outside of that expected use case.
... View more
08-26-2017
02:47 PM
@Nagesh Kumar So it looks like you are installing the Python version of TensorFlow using Anaconda3 as your Python distribution. The error you are getting is a GLIBC error, and it isn't that uncommon; it's a general OS/Anaconda/TensorFlow issue. What OS are you running? You are likely running into this: https://stackoverflow.com/questions/39807621/glibc-2-14-not-recognized-by-tensorflow-installation-in-redhat. The problem is fairly common, and a Google search yields a number of similar results. This appears to suggest a workaround: https://stackoverflow.com/questions/33655731/error-while-importing-tensorflow-in-python2-7-in-ubuntu-12-04-glibc-2-17-not-f. If you are not running Ubuntu, you should be able to adapt the specific versions as needed for your OS.
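As a quick first step, you can confirm which glibc version your OS actually provides before trying any of the workarounds:
ldd --version    # the first line reports the GNU libc (glibc) version
rpm -q glibc     # on RPM-based systems, shows the installed glibc package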
... View more
08-26-2017
02:42 PM
@Ryan Sequeira Yes, upgrading your system memory to 16GB should resolve the problem. If you have 8GB of memory, you can't give all 8GB to the VM because your host computer needs a minimum of 2-3GB just to run its base OS.
... View more
08-25-2017
04:09 PM
1 Kudo
@Julien Champ
I spun up a cluster using Cloudbreak 1.16.1. I do not see either of those options available. I will see if I can track this down and get clarification.
... View more
08-02-2017
10:05 PM
@Hadoop User Ah, that helps clarify things. You can use the SplitText processor (http://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.3.0/org.apache.nifi.processors.standard.SplitText/index.html) to split a file into individual record lines. You could probably use the ExtractText processor (https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.3.0/org.apache.nifi.processors.standard.ExtractText/index.html) to extract the K:V pairs and create attributes. At that point, you should be able to put the data into Hive using PutHiveQL. While this article isn't doing exactly what you want, you should be able to follow its general flow as an example: https://community.hortonworks.com/questions/80211/from-csv-to-hive-via-nifi.html. There are also some new processors in NiFi 1.3 around RecordReaders and RecordWriters; they may be a little more complicated to set up at first, but you'll see significantly better performance: https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-record-serialization-services-nar/1.3.0/org.apache.nifi.csv.CSVReader/index.html. You might find that using ScriptedReader lets you use Python as an easier way to parse the data: https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-scripting-nar/1.3.0/org.apache.nifi.record.script.ScriptedReader/index.html
... View more
08-02-2017
05:47 PM
@Venkata Sudheer Kumar M Have you tried printing out the JSON the Python script is pushing to the API to validate it's correct? Then take that printed output and try pushing it to the API using curl. That will eliminate the JSON as the problem.
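For example, once the script has written its payload to a file, you could replay it with curl against the same endpoint (the filename, credentials, and URL below are placeholders, not the actual API you are calling):
curl -i -u <user>:<password> -H 'Content-Type: application/json' -X POST -d @payload.json https://<api-host>/<endpoint>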
... View more
08-02-2017
05:40 PM
@Hadoop User Do all records for "data1" have the same structure? In other words, while data1, data2, and data3 differ from each other, are all data1 records like each other and all data2 records like each other? You could use NiFi to route the data using regular expressions with the RouteText processor (https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.3.0/org.apache.nifi.processors.standard.RouteText/index.html) or the RouteOnContent processor (https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.3.0/org.apache.nifi.processors.standard.RouteOnContent/index.html). This would allow you to land each data type in an appropriate Hive table.
... View more
07-27-2017
10:55 PM
1 Kudo
@Hovo Khachikyan It depends on the use cases you intend for the cluster. A 6-node cluster may be OK for a dev or POC cluster, but it is generally considered underpowered for a production use case. When you say you want to install "most components", do you know what you want to do with the cluster? I generally recommend that you not install components just to "check the box"; you can easily add components later if you find you need them.
... View more
07-17-2017
05:35 PM
2 Kudos
@Constantin Stanca Zeppelin in HDP 2.6 supports user impersonation. When properly configured, the audit logs should capture the actual end-user submitting the hive queries.
... View more
07-10-2017
03:24 PM
@priyanshu hasija You can manually call a commit after indexing data using something like http://localhost:8983/solr/collection_name/update?commit=true. Here is a link to information on autoCommit: https://cwiki.apache.org/confluence/display/solr/UpdateHandlers+in+SolrConfig
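For example, issuing the commit from the command line (adjust the host, port, and collection name to your setup):
curl 'http://localhost:8983/solr/collection_name/update?commit=true'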
... View more
07-10-2017
02:49 PM
1 Kudo
@Tim Maher I highly recommend you look into HDF/NiFi as a possible solution to this problem. You can easily use NiFi to pull from source systems, do basic transformations and then store the data in HDFS, Hive, HBase, etc. In terms of feeding the data to BI tools, you may want to consider storing the data in Hive for the best performance. Take a look at this article for one way to solve some of what you are trying to do: https://community.hortonworks.com/articles/52856/stream-data-into-hive-like-a-king-using-nifi.html
... View more
07-10-2017
02:45 PM
@swathi thukkaraju Have you tried using the built-in TOBAG functions? https://pig.apache.org/docs/r0.8.1/piglatin_ref2.html#Bag+and+Tuple+Functions You could also write your own UDF.
... View more
07-10-2017
02:38 PM
@priyanshu hasija Likely the data you have indexed hasn't been flushed to disk, so when Solr crashes the data is lost. You should consider changing the flush settings to ensure the data is written to the index more frequently: https://cwiki.apache.org/confluence/display/solr/IndexConfig+in+SolrConfig
... View more
07-10-2017
02:21 PM
1 Kudo
@Alexey Vovchenko The latest version of Ambari supports installing and managing both HDP and HDF. If you upgrade Ambari on the HDP cluster, you could install the HDF management pack and then install Kafka via Ambari that way.
... View more
06-02-2017
06:46 PM
1 Kudo
@Daniel Beaulieu I think you'll find that recommendations on cloud providers vary depending on experience and preference. From a Hortonworks Infrastructure as a Service (IaaS) perspective, it's all just virtual machines (instances) to us. If you deploy HDP as IaaS, then both clouds are 100% supported; you won't find missing capabilities, and our software works equally well across AWS, Azure, Google, etc. In fact, our Cloudbreak component makes it very easy to deploy across multiple cloud vendors; I can demonstrate deploying the exact same cluster configuration on AWS and Azure simultaneously from a single web-based UI. I have used both AWS and Azure quite a bit over the last couple of years. I personally prefer AWS and tend to use it more; that doesn't mean AWS is better, and if you talk to someone else, they are just as likely to say they prefer Azure. In terms of performance, I haven't seen any real benchmarks suggesting that Hortonworks on Azure is faster than Hortonworks on AWS; if you optimally configure both environments, I would expect performance to be within 1 or 2 percentage points. Some customers use both vendors, with clusters in different clouds. From a technical perspective, I think AWS is easier to use; Azure tries to be user friendly but isn't as easy as AWS. Just my 2 cents.
... View more
05-26-2017
04:19 PM
@Chris Houston Can you confirm the HDFS plugin shows it is active? On the Ranger Admin audit page, you can see when the plugins last updated. Do you see HDFS in the list?
... View more