Member since: 09-18-2015
Posts: 191
Kudos Received: 81
Solutions: 40
My Accepted Solutions
Views | Posted |
---|---|
1993 | 08-04-2017 08:40 AM |
5311 | 05-02-2017 01:18 PM |
1067 | 04-24-2017 08:35 AM |
1080 | 04-24-2017 08:21 AM |
1296 | 06-01-2016 08:54 AM |
04-24-2016
11:54 AM
So you've just got all excited about the Metron Tech Preview, downloaded the code and maybe even followed the Apache Metron - First Steps in the Cloud article, but then EC2 instances pop up where you don't expect to see them! Unfortunately, at this stage the Metron Tech Preview deployment ignores some of the system-wide settings you might have in your .aws directory, including the preferred region and availability zone. Thankfully all of the relevant settings live in the setup scripts instead, so this is easy enough to correct.

The two settings I'd suggest updating (region and image) are in ./deployment/amazon-ec2/conf/defaults.yaml, which contains the key values towards the top of the file:

# ec2
env: test-metron
region: us-west-2
instance_type: m4.xlarge
image: ami-05cf2265

For example, to run in the EU (Ireland) region your settings would need to look like this:

# ec2
env: mynamehere-metron
region: eu-west-1
instance_type: m4.xlarge
image: ami-edb9069e

Note that I've also changed the "env" tag to ensure that the Metron environment is easier to identify when I spin it up.

The only slightly tricky bit is identifying the right AMI for your chosen region. You need to find the current active AMI for "CentOS Linux 6 x86_64 HVM EBS". To do this, follow these steps:

- Log into your AWS console
- Go to the EC2 Dashboard
- Click on the AMIs link in the left-hand menu (under the Images section)
- Change the filter on the search box from "Owned by me" to "Public images"
- In the search box, enter CentOS Linux 6 x86_64 HVM EBS (without quotes)

You should get one entry returned; use the AMI ID presented there and enter it into your config file. Then you're good to go! If you'd rather script the lookup, see the sketch below.
04-21-2016
03:08 PM
Hi @Klaus Lucas, I can't delete the question, but if you mark this answer as accepted, the question will be removed from the unanswered questions list. Just in case anyone else comes across the same issue: you are supposed to access the Ambari UI from the web browser on your own machine, not from within the VM. For more information on using the Hortonworks Sandbox, please go here: http://hortonworks.com/hadoop-tutorial/learning-the-ropes-of-the-hortonworks-sandbox/
04-21-2016
10:08 AM
1 Kudo
Camus is effectively dead or dying; I haven't seen much of a community develop around it since LinkedIn moved on. Gobblin, which is indeed a brand-new project, seems to have all of LinkedIn's focus now.
04-20-2016
04:40 PM
3 Kudos
Hi @Dim Guy, this is a reasonably complex process, depending on your knowledge of RPM building.

First of all you'll need to grab the particular SRPM for the package you're interested in: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.0/bk_HDP_Reference_Guide/content/ch03s05.html The above URL covers the differences between the types of source RPMs that Hortonworks provides.

The next set of steps to follow is here: https://wiki.centos.org/HowTos/RebuildSRPM If you follow this carefully, you should be able to rebuild a new RPM containing the modifications you require. Good luck!
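As a rough illustration of the flow the CentOS wiki page above walks through (not a substitute for it), the rebuild boils down to unpacking the source RPM, making your changes, and running rpmbuild against the spec file. A minimal sketch, assuming rpm-build and the package's build dependencies are installed; the SRPM and spec file names are hypothetical placeholders:

```python
import os
import subprocess

# Hypothetical SRPM name - substitute the Hortonworks source RPM you downloaded.
srpm = "example-component-1.0.0.src.rpm"

# Unpack the SRPM into the rpmbuild tree (SPECS/ and SOURCES/); the exact location
# depends on your ~/.rpmmacros, with ~/rpmbuild being the common default.
subprocess.run(["rpm", "-ivh", srpm], check=True)

# ...apply your modifications to the sources and/or the spec file here...

# Rebuild binary RPMs from the (hypothetical) spec file.
spec = os.path.expanduser("~/rpmbuild/SPECS/example-component.spec")
subprocess.run(["rpmbuild", "-ba", spec], check=True)
```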
04-19-2016
10:01 PM
Hi there @Hassan Faouaz, I had an exchange with a colleague of mine, @Aaron Wiebe, quite a while ago that I often refer back to whenever this topic comes up. I'll paste it below and can certainly answer questions on it, as I'm pretty familiar with the approaches. I also know that people have built one-off chargeback/showback style dashboards that pull data from clusters, but they tend to be heavily bespoke in each case, not something that could be easily shared. Aaron's discussion of the topic follows.

There are effectively two portions to implementing a chargeback system for Hadoop.

Building the Model

The first is to build a full TCO model for the Hadoop implementation, inclusive of both capital and operational costs. From this number, you should be able to calculate a full cost per month to run the Hadoop system. Using the fully loaded cost per month, you then set two targets, both somewhat arbitrary:

- Target system utilization to reach a break-even cost metric - aka your target margin. For most people this ranges from 60% to 80%. The goal of this target is to provide a realistic resource utilization point, while also giving yourself space to initiate an expansion once average utilization in a month exceeds the target.
- A resource split between CPU and storage, driven primarily by the intended use cases. Most people will split the costs at 75% storage and 25% CPU. Most want to encourage CPU use on the platform, meaning that analysis is performed rather than Hadoop simply being used as a storage mechanism.

Once you've built this model, you can calculate two costs: the cost per GB of storage per month, and the cost per GB of memory per hour for CPU utilization (a worked example is sketched below).

Implementing the Model

This step is relatively easy in comparison to the first part. There are two ways to implement the model: charge by reservation or charge by use.

Charging by reservation is fairly straightforward: a given use case or customer requests a certain amount of space and is provided a quota. They are then charged as if their quota was fully utilized, since it is reserved for their use. For CPU/memory, the calculation can be based on full use of their processing queue over the month.

Charging by use is more complex and less common, but it is possible. Today, content from the logging subsystems needs to be pulled to determine processing usage, and the filesystem needs to be traversed to determine storage usage. One danger in this approach is that users could, in theory, dump their datasets just before the end of the month, resulting in a skewed result - but that is also easy to detect.
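To make the arithmetic concrete, here is a minimal sketch of the model described above. All of the numbers (monthly TCO, cluster capacity, 70% target utilization, the 75/25 split, the tenant quota) are hypothetical placeholders, not figures from Aaron's note:

```python
# Hypothetical inputs - substitute your own TCO model and cluster capacity.
monthly_tco = 100_000.0                 # fully loaded cost per month, in your currency
target_utilization = 0.70               # break-even utilization target (60-80% is typical)
storage_share, cpu_share = 0.75, 0.25   # resource split between storage and CPU

usable_storage_gb = 500_000.0           # usable HDFS capacity in GB
total_memory_gb = 10_000.0              # total YARN memory in GB
hours_per_month = 730

# Cost per GB of storage per month, at the target utilization.
storage_cost_per_gb_month = (monthly_tco * storage_share) / (usable_storage_gb * target_utilization)

# Cost per GB of memory per hour, at the target utilization.
cpu_cost_per_gb_hour = (monthly_tco * cpu_share) / (total_memory_gb * hours_per_month * target_utilization)

print(f"Storage: {storage_cost_per_gb_month:.4f} per GB-month")
print(f"Compute: {cpu_cost_per_gb_hour:.6f} per GB-hour of memory")

# Charge-by-reservation example: a tenant with a 20 TB quota and a queue capped at 512 GB of memory,
# billed as if both were fully used for the whole month.
quota_gb, queue_memory_gb = 20_000.0, 512.0
monthly_charge = (quota_gb * storage_cost_per_gb_month
                  + queue_memory_gb * hours_per_month * cpu_cost_per_gb_hour)
print(f"Monthly charge for the tenant: {monthly_charge:.2f}")
```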
04-14-2016
11:09 AM
1 Kudo
Hi @Davide Isoardi, that's particularly strange! Does the state still persist after restarting (or indeed stopping and starting) the services? Let us know how it goes!
04-12-2016
10:39 AM
1 Kudo
Hi there @nyakkanti, you do indeed need to re-run the LDAP sync so that Ambari knows about the new user. This is covered in more detail at the link below: https://docs.hortonworks.com/HDPDocuments/Ambari-2.2.1.0/bk_Ambari_Security_Guide/content/_specific_set_of_users_and_groups.html
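For reference, the specific-users-and-groups sync described in that guide is run on the Ambari server host with the sync-ldap command. A minimal sketch (the text file names are hypothetical; each file lists one user or group name per line):

```python
import subprocess

# Re-run the LDAP sync for a specific set of users and groups, as per the guide linked above.
# Run this on the Ambari server host; you'll be prompted for the Ambari admin credentials.
subprocess.run(
    ["ambari-server", "sync-ldap", "--users", "users.txt", "--groups", "groups.txt"],
    check=True,
)
```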
04-06-2016
10:33 AM
Hi there @kael lee, to delete a service or client you need to use the Ambari API. There are worked examples in the Apache Ambari documentation linked below: https://cwiki.apache.org/confluence/display/AMBARI/Using+APIs+to+delete+a+service+or+all+host+components+on+a+host I used this recently, so it should work fine for you.
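As a rough sketch of what that wiki page walks through: stop the service first (put it into the INSTALLED state), then issue a DELETE against its service resource. The host, cluster name, service name and credentials below are hypothetical placeholders:

```python
import requests

# Hypothetical values - substitute your Ambari host, cluster, service and credentials.
ambari = "http://ambari-host.example.com:8080"
cluster = "mycluster"
service = "FLUME"
auth = ("admin", "admin")
headers = {"X-Requested-By": "ambari"}  # the Ambari API requires this header on modifying calls

base = f"{ambari}/api/v1/clusters/{cluster}/services/{service}"

# 1. Stop the service by moving it to the INSTALLED state.
stop_body = '{"RequestInfo":{"context":"Stop service"},"Body":{"ServiceInfo":{"state":"INSTALLED"}}}'
requests.put(base, data=stop_body, auth=auth, headers=headers).raise_for_status()

# 2. Once the stop request has finished, delete the service definition.
requests.delete(base, auth=auth, headers=headers).raise_for_status()
```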
03-31-2016
12:06 PM
6 Kudos
Hi there @Rahul Pathak, there are really three components worth discussing on this topic:

1) Service Level Authorisation
2) Knox for access to Hadoop services
3) Ranger for RBAC (Role Based Access Control) security policies

First of all, these can all be used together; they provide complementary layers of security for Hadoop.

Service Level Authorisation deals with the most basic set of permissions: every Hadoop service lets you define the users and groups who are authorised to make RPC calls to that service. The RPC call only goes through if the calling user belongs to an authorised user or group; once someone is through, however, no further check is made (a sample of this configuration is sketched below).

Knox takes this up a level and exposes user/group based policies for access to Hadoop services (Hive, HDFS, Storm etc) in a far easier way. The policy is created and applied by Ranger and enacted by Knox. This is true perimeter security, as users can be denied before they are even able to connect to the Hadoop cluster.

Ranger then gives the final level of granularity. Once someone is granted access to a particular service, you can control at a very granular level which Hive databases, tables and table columns they have access to, which HDFS paths and with what level of access, which Kafka queues, and much more. This gives you fine-grained control over exactly which data and services your users are granted access to. Hope that helps.
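For illustration, the Service Level Authorisation layer is configured through hadoop-policy.xml. A minimal sketch restricting client RPC access to HDFS - the user and group names here are hypothetical examples, and the default for these ACLs is "*" (everyone):

```xml
<!-- hadoop-policy.xml (sketch): comma-separated users, a space, then comma-separated groups -->
<property>
  <name>security.client.protocol.acl</name>
  <value>alice,bob hadoopusers</value>
</property>
```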
10-05-2015
08:15 PM
1 Kudo
Should be mentioned in our release notes in that case imho.