Member since: 09-18-2015
Posts: 191
Kudos Received: 81
Solutions: 40

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 2045 | 08-04-2017 08:40 AM |
| | 5421 | 05-02-2017 01:18 PM |
| | 1109 | 04-24-2017 08:35 AM |
| | 1116 | 04-24-2017 08:21 AM |
| | 1334 | 06-01-2016 08:54 AM |
05-26-2016
10:47 AM
Hi @Aniruddh Kendurkar. The first time you log in, the credentials should be username root and password hadoop; upon successful login it will ask you to change your password. I have just created a fresh HDP 2.4 sandbox to verify this behaviour. Are you certain you haven't already done that? If you re-import the machine image from scratch, does it still do the same thing? Please let me know.
05-25-2016
03:46 PM
1 Kudo
Hi @Andrew Watson As long as it's installed outside the default system paths, and only called by whatever processing scripts or tools you have in mind, I don't see that being an issue at all. It will require careful installation and ongoing maintenance though, from a security and patching perspective.
05-25-2016
02:24 PM
Hi @azza messaoudi, sorry this isn't a direct answer to your question, but unless you REALLY want to use Flume for this, have you looked at NiFi? There is already a GetTwitter processor that you can configure with search terms etc., and there are a number of really good demos showing how to put all this together. A 3-part demo series is here: https://www.linkedin.com/pulse/apache-nifi-part-1-introduction-neeraj-sabharwal ... and a full tutorial, including indexing in Solr, is here: http://hortonworks.com/hadoop-tutorial/how-to-refine-and-visualize-sentiment-data/ I know that's not exactly what you asked for, but I promise it's absolutely the easiest way to get Twitter data into Hadoop that I've ever used. Hope that helps.
05-25-2016
01:15 PM
1 Kudo
Hi @Smart Solutions. Right now there isn't really a good, quick, clean, easy way to achieve this. You've already identified the thread that I would otherwise point you towards for ideas. You're just too good!

There are two main approaches I would recommend thinking about. The first is to stop making any changes via the web UI and only make changes via the API; that way you can simply call both of your clusters one after the other to apply the configuration changes. The second is to use some of the ideas from the thread you linked to: continue maintaining the configs on your "master" cluster, but extract them on a regular basis (or on a config-changed trigger), diff them against the previous "master" config version, and then push the resulting deltas to your "slave" cluster, again via API calls. Either way, a fair amount of automation would be required.

If you want to go down this path, I'd strongly suggest doing the work out in the open; this is something I see come up now and again, so you may well find others interested in working with you on it. Longer term, Ambari will no doubt support multi-cluster management, and this functionality would be a natural extension of that, but progress on those public JIRAs has been slow, with other, more important items taking priority.

Happy to hear if you have other ideas too. Sorry I couldn't be more direct help, but let me know if you plan on cutting some code; I'm sure it'd be an interesting project. Many thanks.
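To make that second approach concrete, here's a minimal sketch of the extract-and-diff step, assuming Ambari's standard REST API and the Python requests library. The hosts, cluster names and credentials below are placeholders for your environment, and pushing the deltas back (a PUT with a new desired_config) is left as the next step:

```python
# Sketch: pull the live desired configs from a "master" Ambari and diff them
# against a "slave" Ambari. Hosts, cluster names, and credentials below are
# placeholders -- adapt them to your environment.
import requests

def desired_configs(base, cluster, auth):
    """Return {config_type: {property: value}} for a cluster's current configs."""
    r = requests.get(
        "%s/api/v1/clusters/%s" % (base, cluster),
        params={"fields": "Clusters/desired_configs"},
        auth=auth,
    )
    r.raise_for_status()
    tags = r.json()["Clusters"]["desired_configs"]
    configs = {}
    for ctype, info in tags.items():
        # Fetch the actual property set behind each config type's current tag.
        c = requests.get(
            "%s/api/v1/clusters/%s/configurations" % (base, cluster),
            params={"type": ctype, "tag": info["tag"]},
            auth=auth,
        )
        c.raise_for_status()
        items = c.json().get("items", [])
        configs[ctype] = items[0].get("properties", {}) if items else {}
    return configs

master = desired_configs("http://master-ambari:8080", "master_cluster", ("admin", "admin"))
slave = desired_configs("http://slave-ambari:8080", "slave_cluster", ("admin", "admin"))

# Report every property where the slave has drifted from the master.
for ctype, props in master.items():
    for key, value in props.items():
        if slave.get(ctype, {}).get(key) != value:
            print("DIFF %s/%s: master=%r slave=%r"
                  % (ctype, key, value, slave.get(ctype, {}).get(key)))
```

You'd run something like this on a schedule (or on your config-changed trigger) and feed the resulting diffs into whatever pushes the changes to the slave cluster.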
05-25-2016
12:25 PM
Just as an aside, if you also happen to be a paying Hortonworks support customer, I can't speak highly enough about SmartSense, which will analyse the configs of your cluster and provide performance, stability and security recommendations specific to your exact environment. This service is included in every support contract; for more info take a look at: http://hortonworks.com/services/smartsense/ There was also a recent session at the Dublin Hadoop Summit which is worth watching for general tuning suggestions and recommendations (not security-specific): https://www.youtube.com/watch?v=sCB6HmfdTZ4
05-25-2016
12:21 PM
Hi @Smart Solutions. It's tricky to give generic best-practice recommendations without knowing a lot of detail about what you are doing or have already done, but there are a few things I can think of off the top of my head:

- Ensure you're using HDFS data encryption for especially sensitive locations (though you needn't apply it everywhere).
- Look at the refresh rate of your Ranger policies and make sure it's in line with your expectations; setting the refresh interval too low could impact the performance of the Ranger admin, so bear that in mind.
- Make sure you're actually looking at the logging and auditing that Ranger produces, and review the policies themselves periodically (see the sketch after this list).
- Start thinking about how Atlas could play a part in your security story; with things like the upcoming tag-based policy control, Ranger + Atlas will be a very powerful combination. For more info take a look at: http://hortonworks.com/hadoop-tutorial/tag-based-policies-atlas-ranger/
- Standard practices apply: ensure that people have the least permissions they need (both in terms of access to data and services) to do their job, no more, no less.

The fact that you already have Kerberos, Ranger and Knox in place suggests you're already on the right path. Good luck, and hope that helps.
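On the policy-review point, here's a minimal sketch, assuming Ranger Admin's public v2 REST API and the Python requests library; the host, credentials and service details are placeholders to adapt for your environment:

```python
# Sketch: dump the current Ranger policies for periodic review, via Ranger
# Admin's public v2 REST API. Host and credentials are placeholders.
import requests

RANGER = "http://ranger-admin:6080"
AUTH = ("admin", "admin")

r = requests.get("%s/service/public/v2/api/policy" % RANGER, auth=AUTH)
r.raise_for_status()
for policy in r.json():
    # Collect every user named in any policy item, for a least-privilege review.
    users = sorted({u for item in policy.get("policyItems", [])
                      for u in item.get("users", [])})
    print("%-30s service=%-15s enabled=%s users=%s"
          % (policy.get("name"), policy.get("service"),
             policy.get("isEnabled"), ",".join(users)))
```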
05-24-2016
04:29 PM
1 Kudo
Hi @Smart Solutions. Generally for that level, 3 ZooKeepers should be fine. You could bump it up to 5 if you start seeing issues, but we rarely see clusters go higher than that for a ZooKeeper ensemble, as much larger ensembles start to create quorum overheads. There's no simple rule of thumb here; it's as much an art as it is a science, as it depends on the workloads and how chatty they are with your current ZKs. Rather than growing the ensemble, people tend to split certain services out to their own ZKs when those services put pressure on an otherwise fairly quiet ZooKeeper cluster. In short, even up to 40 nodes could be fine, but keep an eye on your ZK response times (see the sketch below); if you start to see issues, then maybe move to 5 ZK nodes, or consider splitting the heaviest service out to its own ZK cluster. Finally, keep an eye on the I/O requirements: dedicate spindles to the ZK quorum, and don't make it share other, busier disks. Hope that helps!
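If it helps, here's a minimal sketch of the kind of response-time check I mean, using ZooKeeper's mntr four-letter command over a plain socket. The hostnames are placeholders, and on newer ZooKeeper releases mntr may need to be whitelisted (4lw.commands.whitelist):

```python
# Sketch: poll each ZooKeeper server with the "mntr" four-letter command and
# report average/max request latency, to spot a quorum under pressure.
import socket

ZK_SERVERS = ["zk1.example.com", "zk2.example.com", "zk3.example.com"]

def mntr(host, port=2181, timeout=5.0):
    """Send the mntr command and return its tab-separated key/value output."""
    with socket.create_connection((host, port), timeout=timeout) as s:
        s.sendall(b"mntr")
        data = b""
        while True:
            chunk = s.recv(4096)
            if not chunk:  # ZK closes the connection after responding
                break
            data += chunk
    stats = {}
    for line in data.decode().splitlines():
        key, _, value = line.partition("\t")
        stats[key] = value
    return stats

for host in ZK_SERVERS:
    stats = mntr(host)
    print("%s: state=%s avg_latency=%s max_latency=%s outstanding=%s"
          % (host, stats.get("zk_server_state"),
             stats.get("zk_avg_latency"), stats.get("zk_max_latency"),
             stats.get("zk_outstanding_requests")))
```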
05-24-2016
11:44 AM
Latency-wise, that should be very easy to achieve. As for the event-driven nature of what it sounds like you're building, I don't see why not. Hope that helps.
05-24-2016
11:01 AM
Hi @Kaliyug Antagonist. I would suggest implementing Knox, restricting a given set of users to only the services you want to expose to them. Both http://hortonworks.com/apache/knox-gateway/ and http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.2/bk_Security_Guide/content/perimeter_security_with_apache_knox.html ... should get you started. Hope that helps.
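As a taste of what that looks like from the end-user side, here's a minimal sketch of listing an HDFS directory through Knox with the Python requests library; the gateway host, topology name ("default"), credentials and path are all placeholders for your setup:

```python
# Sketch: end users hit one SSL endpoint with their LDAP credentials instead
# of talking to each cluster service directly. Placeholders throughout.
import requests

KNOX = "https://knox.example.com:8443/gateway/default"

r = requests.get(
    "%s/webhdfs/v1/tmp" % KNOX,
    params={"op": "LISTSTATUS"},
    auth=("guest", "guest-password"),
    verify=False,  # self-signed cert in a sandbox; use a proper CA in production
)
r.raise_for_status()
for entry in r.json()["FileStatuses"]["FileStatus"]:
    print(entry["pathSuffix"], entry["type"])
```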
05-24-2016
10:57 AM
Hi @Farrukh Mahmood. When you get these errors, are you using the standard version of Sqoop, or the IBM Netezza connector? As mentioned in the documentation (https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.2/bk_dataintegration/content/netezza-connector.html), you will need to put the IBM jar (named nzjdbc.jar) in the right location for it to be picked up; there's a quick check sketched below. The standard version of Sqoop shipped with HDP does not support those additional arguments. Hope that helps.
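For a quick sanity check, here's a small sketch that looks for the driver in a couple of common HDP locations; both paths are assumptions on my part, so confirm the exact directory against the documentation above:

```python
# Sketch: check whether the IBM Netezza JDBC driver is somewhere Sqoop can
# see it. Both paths are assumed/common HDP locations, not gospel -- verify
# against the Netezza connector documentation for your HDP version.
import os.path

CANDIDATES = [
    "/usr/hdp/current/sqoop-client/lib/nzjdbc.jar",
    "/var/lib/sqoop/nzjdbc.jar",
]

found = [p for p in CANDIDATES if os.path.isfile(p)]
if found:
    print("nzjdbc.jar found at: %s" % ", ".join(found))
else:
    print("nzjdbc.jar not found -- Sqoop's Netezza-specific arguments will fail.")
```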