I am not sure if this is the right section to post this, but I am currently running Cloudera on AWS and have a few questions I am trying to wrap my head around. I am open to suggestions on how to do things better with Cloudera and make it easier to manage and support with better performance, but it is very expensive to run on AWS (even with reserved instances, EBS volumes, etc.). Cloudera seems like a good solution for an on-premise deployment, but I am having trouble with the items listed below. Have others solved these problems?
Cloudera on "Cloud" (Public or Private)
- Managed suite
- Open Source
- Great Support community
- Have to buy enterprise edition (very costly!)
- Cost prohibitive at massive scale with hundreds of nodes - Licensing model is $3,500+/node
- Automated backups - very difficult without enterprise
- Disaster Recovery - very difficult without enterprise
- If a single node goes down, the cluster has to re-replicate its data; if multiple nodes go down, you are in trouble
- Does not support AWS Snapshot / Restore
- Cannot attach a snapshot volume to an instance - Cloudera won't accept it and the system goes haywire due to mismatched info (root or HDFS volumes)
- Requires the database to be in sync with Cloudera, and the two are very dependent on each other
- System is not truly "stateless" - where a component can fail and resume where it left off
- EBS volumes are very expensive to run (10 nodes x 3 volumes each @ 3000 GB each = $3,375.90/month in EBS costs!!)
- No instance auto-scaling supported (can't turn off instances when the system is idle; difficult to put new instances in service to handle spikes in demand)
- Ephemeral storage is not an option for Cloudera, as it takes a long time to recover from failure
- Does not support multi-zone deployment (for HA deployments), e.g. 10 servers split across 2 AZs
- Cost prohibitive to deploy 2 environments for HA
- Does not support replication between 2 independent environments (multi-zone deployment)
- Not much documentation on all the levers and settings within the product, as there are hundreds of them
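For anyone wanting to sanity-check the cost figures in the list, here is a rough bit of arithmetic in Python. It takes the EBS total and the $3,500/node license figure from the list as given; the 100-node cluster size is only an assumed example (the list just says "hundreds of nodes"), and none of these are official AWS or Cloudera prices.

```python
# Rough sanity check of the cost figures quoted in the list above.
# Inputs are taken from the post; nothing here is an official price.

NODES = 10
VOLUMES_PER_NODE = 3
GB_PER_VOLUME = 3000
EBS_MONTHLY_TOTAL_USD = 3375.90  # monthly EBS figure quoted in the post

total_gb = NODES * VOLUMES_PER_NODE * GB_PER_VOLUME
implied_rate = EBS_MONTHLY_TOTAL_USD / total_gb  # USD per GB-month

LICENSE_PER_NODE_USD = 3500  # "$3,500+/node" from the post, treated as a floor
ASSUMED_NODE_COUNT = 100     # hypothetical cluster size, not from the post

license_floor = LICENSE_PER_NODE_USD * ASSUMED_NODE_COUNT

print(f"Total EBS capacity: {total_gb:,} GB")
print(f"Implied EBS rate: ${implied_rate:.4f} per GB-month")
print(f"License floor for {ASSUMED_NODE_COUNT} nodes: ${license_floor:,}")
```

Changing the constants at the top makes it easy to compare other volume sizes or node counts.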
I've asked team members to review this too. The issue is that you are mixing Hadoop issues with Cloudera product issues, and asserting things that we need to evaluate.
Agreed, snapshotting is not going to work anywhere in the Hadoop platform. Having a node suddenly return to a previous historical state of data and configuration, outside of the context of the cluster, is going to be destructive. That is not your recovery model; the recovery model is based on the integrated components of the cluster, not moment-in-time snapshots.
You do not have to buy the enterprise edition to use Cloudera or to deploy into the cloud. We offer additional tools and functionality, with support and services available if your organization requires them, to make your project successful. Enterprise adds features around deployment automation in the cloud, auditing, and encryption at rest, as well as security that is managed centrally by CM. But there is nothing preventing the community version from being used within AWS.
DistCp works in the non-enterprise version, and sites have been scripting around it for backups for quite a while now. BDR in the enterprise release enhances the management and scheduling of replication for HDFS data and the Hive metastore's metadata structure, and smooths over security configuration issues. There is nothing preventing you from replicating data with DistCp between clusters in the community version.
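As an illustration of the kind of scripting around DistCp mentioned above, here is a minimal Python sketch that assembles and runs an incremental copy between two clusters. The NameNode addresses, paths, and the map count are made-up placeholders, and the function names are hypothetical; it assumes the `hadoop` command is on the PATH of the machine running the script.

```python
import subprocess


def build_distcp_command(source, target, max_maps=20):
    """Return the argv list for an incremental DistCp copy."""
    return [
        "hadoop", "distcp",
        "-update",            # copy only files that are missing or differ on the target
        "-p",                 # preserve file status such as ownership and permissions
        "-m", str(max_maps),  # cap the number of simultaneous copy tasks
        source,
        target,
    ]


def run_backup(source, target):
    """Run the copy and raise CalledProcessError if DistCp exits non-zero."""
    return subprocess.run(build_distcp_command(source, target), check=True)


if __name__ == "__main__":
    # Placeholder cluster addresses; substitute your own NameNodes and paths.
    command = build_distcp_command(
        "hdfs://prod-nn:8020/data",
        "hdfs://backup-nn:8020/data",
    )
    print(" ".join(command))
```

A script like this can be scheduled with cron or any other scheduler; it only covers HDFS files, so Hive metastore metadata would still need to be backed up separately.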
As I read through the rest of the issues there are assertions that I think are a combination of AWS features and security configuration that you would need to manage outside of the platform and within AWS. This is literally why we have a solutions architecture team to work with sites who need to achieve more advanced configurations like the cloud based operating environment you are defining below.
The API and the tools provided around it address the deployment issues you point out as well, so I'm not sure we are actually missing all the things you list. It might just take some deeper evaluation to identify which are true gaps and which are opinions that stem from needing more information about the API and the examples we provide for automating deployment through it. http://cloudera.github.io/cm_api/ https://github.com/cloudera/cm_api/tree/master/python/examples/aut...
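To give a feel for what talking to the API looks like, here is a small Python sketch that asks Cloudera Manager for its API version and then lists cluster names over plain REST. Treat it as a sketch under assumptions rather than a reference: the host name and credentials are placeholders, the default port 7180 and the `/api/version` and `/api/<version>/clusters` paths reflect my understanding of the CM REST API and should be checked against the documentation for your release, and the helper names are made up for this example.

```python
import base64
import json
import urllib.request


def cm_url(host, path, port=7180):
    """Build a Cloudera Manager API URL (assumes plain HTTP on the default port)."""
    return f"http://{host}:{port}/api/{path.lstrip('/')}"


def cm_get(host, path, username, password, port=7180):
    """Issue an authenticated GET and return the response body as text."""
    request = urllib.request.Request(cm_url(host, path, port))
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    request.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode()


def list_cluster_names(host, username, password):
    """Return the names of the clusters this Cloudera Manager instance manages."""
    # Ask the server which API version it supports, e.g. "v10".
    version = cm_get(host, "version", username, password).strip()
    clusters = json.loads(cm_get(host, f"{version}/clusters", username, password))
    return [cluster["name"] for cluster in clusters.get("items", [])]
```

Calling `list_cluster_names("cm.example.com", "admin", "admin")` against a reachable Cloudera Manager would return the cluster names; the same pattern extends to the other endpoints the deployment examples use.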
So there is much you can take control over and manage within the tools available with the community edition. For enterprises who have strategic initiatives that are cloud based and represent critical business infrastructure that mandates regulatory compliance and demands enterprise support... there is the enterprise edition.
Thank you so much for getting back to me!!
"So there is much you can take control over and manage within the tools available with the community edition. For enterprises who have strategic initiatives that are cloud based and represent critical business infrastructure that mandates regulatory compliance and demands enterprise support... there is the enterprise edition."
Does this mean that the freeware version does not meet regulatory compliance (i.e. HIPAA / SOX etc.) and that compliance is only covered by the Enterprise version? I am guessing this is because of the added features (encryption, central management control, etc.)?
I am going to go through and look into more of the solutions you provided and will get back to you soon!
One other thing; the Cloudera Director team pointed out that it is not tied to enterprise licensing and will work with the community edition, and it is available for free as well.
Does the Apache web server address HIPAA requirements? Does MySQL address Sarbanes-Oxley financial controls regulation? Those things are not implemented within a product; they are implemented with technology through sound business practices, IT methodology, and staff training on the regulatory requirements of the market a business is participating in.
To lay those things at the foot of any product and assume that it "takes care of them" is folly. We provide tools to help achieve those objectives. You're welcome to design your own software and tools around the community edition to achieve those things... It's just that sites want to concentrate on using Hadoop for analysis of data, and not have to build out all the additional elements required for them to meet their objectives.
The platform is open; you're free to write your own implementation of encryption, key management, and HSM integration, hook it up to your enterprise data, and convert it all to encrypted data. If your business objective is to spend budget and man-hours on those things, great. For sites who want to focus on data science and analysis of the enterprise data at hand... there's the enterprise version of our product.