We're currently planning and running a POC with Hortonworks as the basis for a data lake. The lake will be multi-tenant, and we want to shift somewhat towards flexibility. Each tenant will have many users with widely differing requirements. For example, we have scientists whose code uses many different sets of libraries, and many different versions of the same libs/tools.
The problem is that HDP currently ships a fixed set of tools at specific versions. For example, current HDP provides Spark 1.6.3 and 2.1.1. Some of our users want to run versions other than those provided.
Also, we're looking to use GPUs and containers/Kubernetes as an easy way to provide flexibility, all of this backed by a common data lake.
Reading through the HWX blog, we found an interesting series of blog posts labelled as Data Lake 3.0:
Is that technology available? If not, when will it be available as a tech preview?
We also want our common data lake to run permanent services for unified security and compliance, to be used by ephemeral clusters like we see with HD Cloud. They call this "shared services" in these slides:
For regulatory reasons, we must implement it on-premises, not on AWS. So we cannot use HD Cloud, which seems to do the kind of thing we are looking for.
For the on-premises option, Cloudbreak is nice and can be part of the solution, but it has no features for a common data lake, common services and ephemeral clusters. So what we want to do is a kind of Cloudbreak with the features of HD Cloud, on top of some containerization/virtualization for ephemeral clusters. For virtualization and containers, we plan to use the OpenStack and OpenShift platforms.
Also, we envision using S3 storage backed by Ceph + RadosGW rather than HDFS.
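
To make the storage idea concrete, here is a rough sketch of how we imagine pointing Spark's Hadoop S3A connector at a RadosGW endpoint. This is an assumption on our side, not something we have validated on HDP: the helper name `s3a_conf_for_radosgw`, the endpoint `rgw.example.internal:7480` and the credential placeholders are hypothetical, and `fs.s3a.path.style.access` needs a Hadoop build that includes it (Hadoop 2.8+ upstream).

```python
def s3a_conf_for_radosgw(endpoint, access_key, secret_key, use_ssl=True):
    """Build Spark settings that point the Hadoop S3A connector at an
    S3-compatible endpoint such as Ceph RadosGW (hypothetical helper)."""
    hadoop_props = {
        # Use our own gateway instead of the default AWS endpoint.
        "fs.s3a.endpoint": endpoint,
        "fs.s3a.access.key": access_key,
        "fs.s3a.secret.key": secret_key,
        # Assumption: the gateway is addressed path-style
        # (http://host/bucket/key) rather than via per-bucket DNS names.
        "fs.s3a.path.style.access": "true",
        "fs.s3a.connection.ssl.enabled": str(use_ssl).lower(),
    }
    # Spark forwards any "spark.hadoop.*" setting to the Hadoop configuration.
    return {"spark.hadoop." + key: value for key, value in hadoop_props.items()}


if __name__ == "__main__":
    # Placeholder endpoint and credentials, for illustration only.
    conf = s3a_conf_for_radosgw(
        "rgw.example.internal:7480",
        "ACCESS_KEY_PLACEHOLDER",
        "SECRET_KEY_PLACEHOLDER",
        use_ssl=False,
    )
    for key, value in sorted(conf.items()):
        print(key + "=" + value)
```

Each resulting pair could be passed to `spark-submit` as `--conf key=value`, after which jobs would read and write `s3a://bucket/path` URIs instead of `hdfs://` ones. Whether the permanent shared services can work against such a store is part of what we are asking.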