Member since: 04-26-2016
Posts: 6
Kudos Received: 3
Solutions: 0
10-04-2019
02:46 PM
Hello, I'm looking at your answer 3 years later because I'm in a similar situation :). In my company (a telco) we're planning to use 2 hot clusters with dual ingest because our RTO is demanding, and we're looking for mechanisms to monitor both clusters and keep them in sync. We ingest data in real time with Kafka + Spark Streaming, load it into HDFS and consume it with Hive/Impala. As a first approach I'm thinking of running simple counts over the Hive/Impala tables on both clusters every hour or half hour and comparing the results. If something is missing on one of the clusters, we will have to "manually" re-ingest the missing data (or copy it with Cloudera BDR from one cluster to the other) and re-process the enriched data. I'm wondering whether you have dealt with similar scenarios, or if you have any suggestions. Thanks in advance!
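To make the idea concrete, here is a rough sketch of the kind of periodic check I have in mind (the hostnames, database and table names are just placeholders, and in practice we would filter on the latest hour's partition rather than counting whole tables):

```python
#!/usr/bin/env python
# Hypothetical sketch only: compare row counts between two "hot" clusters.
# The hostnames and table names below are placeholders, not real endpoints.
import subprocess

CLUSTERS = {"cluster_a": "impala-a.example.com", "cluster_b": "impala-b.example.com"}
TABLES = ["telco.cdr_raw", "telco.cdr_enriched"]

def count_rows(impalad_host, table):
    """Run COUNT(*) through impala-shell (-B = plain delimited output) and return it as an int."""
    out = subprocess.check_output(
        ["impala-shell", "-i", impalad_host, "-B", "--quiet",
         # in a real check this query would be restricted to the last hour's partition
         "-q", "SELECT COUNT(*) FROM {0}".format(table)],
        universal_newlines=True)
    return int(out.strip().splitlines()[-1])

for table in TABLES:
    counts = {name: count_rows(host, table) for name, host in CLUSTERS.items()}
    if len(set(counts.values())) > 1:
        # this is where the "manual" re-ingest or BDR copy for the gap would be triggered
        print("MISMATCH on {0}: {1}".format(table, counts))
    else:
        print("OK {0}: {1} rows".format(table, list(counts.values())[0]))
```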
05-19-2016
06:59 PM
6 Kudos
Hi @David Lays ... wow, you don't want to know much in a single question! 🐵 I'll try to give you an overview of each one; if you want to go deeper on any of them, I suggest a separate question per topic area. First of all, the below applies to all environments, regardless of dev/test/pre-prod/prod.

HA of Masters
Do it. Full NN HA is production ready and I barely ever see clusters without it nowadays.

HA of edge nodes
Knox is stateless; to make it HA you just spin up multiple instances of it and put them behind a session-aware load balancer. All clients, wherever possible, should go through Knox or Ambari views. Ambari views can also be spun up in the same way as Knox (as described above).

Security
Kerberos - do it. Without Kerberos it doesn't matter what else you layer on top, your cluster is insecure. Think of a large safe with a big secure door on the front but no side walls... that is the kind of security you have without Kerberos.
Ranger - do it. Policies can be group based, and groups can be inherited from AD or LDAP.
Knox - strongly recommended. It also saves you from having to update clients each time your internal cluster services move around; they just keep talking to Knox, and Knox does the internal mappings.

Now you have questions about master and slave node mixing and the percentage of global data to store in each environment. What I would say here is that there is a very strong emerging pattern in a lot of organisations that guides the decisions you make. First, you still need Dev, Test, Pre-Prod, Prod etc., but that's for testing your infrastructure: whenever you upgrade to a new version of HDP, add a new generation or vendor of hardware, or update a third-party component such as SAS, you run that through your Dev/Test clusters.

When it comes to your user base, that's a very different conversation. With the data lake being a very real concept nowadays, data lakes being truly multi-tenant, and people being able to store and safely control access to a wide range of data, what we're seeing is that data scientists, developers and many other categories of users, including those that would usually have been on a separate scaled-down silo, are actually using resources on the production data lake. Their resource queues are managed so they can't impact production jobs or users, and in some cases they can only access anonymised data rather than data containing the full PII (personally identifiable information). But they can also test and develop their programs and hypotheses against a scale of data that just isn't possible in a "data lab".

One thing that you don't mention is DR (disaster recovery). We often see these assets also being used as areas for developers and data science users to work in, and in the event of a DR situation, a separate set of Capacity Scheduler queues is deployed so production workloads take precedence until the DR conditions are resolved.

Hope that helps. It's a complex situation, but this should set you on the right path. Good luck!
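To make that last point concrete, here is an illustrative Capacity Scheduler fragment (the queue names and percentages are made up, not a recommendation) showing how a production queue can be guaranteed the bulk of the cluster while dev/data-science queues are capped:

```xml
<!-- Illustrative capacity-scheduler.xml fragment: queue names and percentages are placeholders -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>production,analytics,datascience</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.production.capacity</name>
  <value>70</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.analytics.capacity</name>
  <value>20</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.datascience.capacity</name>
  <value>10</value>
</property>
<property>
  <!-- cap how far the non-production queue can grow, so production workloads keep precedence -->
  <name>yarn.scheduler.capacity.root.datascience.maximum-capacity</name>
  <value>30</value>
</property>
```

In a DR scenario you would swap in a stricter version of these queue definitions (or adjust the percentages) so that production jobs failing over to this cluster always get their capacity first.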
05-20-2016
04:01 AM
1 Kudo
This is a superb distillation of things you should think about and do for a new cluster design and installation. Read through it carefully, as there are nuances and reasons behind his recommendations.
10-08-2016
11:49 AM
@David Lays Please let me know which Kafka design approach you finally went with: Kafka on the cluster nodes or a separate Kafka cluster. We are facing exactly the same design dilemma with regard to the Kafka installation for our cluster. Thanks very much in advance.