Created 08-22-2016 03:54 PM
Hi
I have a small application that generates some reports without using any MapReduce code.
I want to understand the real benefits of using a data lake. I think it would be useful for an enterprise when many products are writing data to various Hadoop clusters, since it gives a unified view of the various issues and a common data store. Apart from this, what are the other real benefits?
How does a data lake work if I want a particular HDP version?
I think it is easier to switch to a particular HDP version in a separate cluster via Ambari, but what about a data lake?
Also, if multiple applications use the data lake and just one application requires frequent changes, such as an HBase coprocessor for testing various things, is it advisable to go for a data lake?
We get HA in a dedicated cluster as well, so what are the main technical advantages if we don't consider cost?
Created 08-22-2016 04:05 PM
Hi @ripunjay godhani, there may be some confusion. Data Lake is a concept, not a technology - unless you are referring to Azure Data Lake: https://azure.microsoft.com/en-us/solutions/data-lake/.
Anytime you bring siloed data together and store it on a single HDFS cluster, you are creating a data lake. The benefits of this include:
1. Centralized security across all your data.
2. Centralized data governance and data lineage.
3. Centralized cluster monitoring and tuning.
4. Multiple data access patterns on single data sets (batch, real-time, in-memory, interactive, ad hoc...).
5. Central data repository for 3rd-party BI tools and other visualization applications.
6. Centralized CoE for development, management, operations, and control.
7. Centralized budgeting and charge-back models.
This list is by no means exhaustive. In summary, a data lake is a break away from application-driven silos and a move to a data-driven, data-centric architecture.
Hope this helps!
Created 08-22-2016 04:16 PM
Thanks, but please see the questions below:
1. Can I get the same performance I get in my optimized, purpose-built HDP cluster infrastructure? Since the data lake is central, can I tune it specifically for one application?
2. How can I manage different HDP versions in a data lake?
3. If something goes wrong with security or configuration because of one application, will my whole data lake be impacted?
Created 08-22-2016 04:40 PM
1. YARN provides resource isolation for most data access. The exception is streaming processes, for which you will want to size and dedicate your hardware appropriately. You can use the Capacity Scheduler for fine-grained resource allocation. Node labels are also available if you want to run certain jobs on certain nodes based on their hardware architecture. You will still want to do your due diligence around service co-location, and proactively monitor and maintain your environment for proper performance. SmartSense is vital in this regard.
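As a rough sketch of the Capacity Scheduler point, queues and their shares are defined in capacity-scheduler.xml along these lines (the queue name "appA" and the 70/30/50 percentages are illustrative assumptions, not a recommendation):

```xml
<!-- capacity-scheduler.xml (sketch; queue "appA" and the percentages are hypothetical) -->
<configuration>
  <!-- Define two queues under the root queue -->
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>default,appA</value>
  </property>
  <!-- Guarantee 70% of cluster resources to the default queue -->
  <property>
    <name>yarn.scheduler.capacity.root.default.capacity</name>
    <value>70</value>
  </property>
  <!-- Guarantee 30% to appA... -->
  <property>
    <name>yarn.scheduler.capacity.root.appA.capacity</name>
    <value>30</value>
  </property>
  <!-- ...but cap it at 50% so elastic growth cannot starve other tenants -->
  <property>
    <name>yarn.scheduler.capacity.root.appA.maximum-capacity</name>
    <value>50</value>
  </property>
</configuration>
```

A job is then directed to its queue at submit time (e.g. -Dmapreduce.job.queuename=appA), which is how each application gets its own guaranteed slice of the shared cluster.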
2. You cannot have multiple HDP versions under a single Ambari server. HDP 2.5 will allow for multiple Spark versions, but the core HDP version will need to be the same. You would use development and/or test environments to test upgrades or varying tech-preview components.
3. Again, the YARN Capacity Scheduler can fence off application resources so that no one application can consume all your cluster resources. Security is always a concern, but following best practices around encryption, authentication, authorization, auditing, RBAC policies, etc. will mitigate most scenarios. If you think about it, we've been using shared storage (SAN) for over a decade. HDFS is similar to, but much more versatile than, SAN. Still, the same basic centralized storage concept applies.
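To illustrate the fencing point: queue ACLs in capacity-scheduler.xml can restrict which users or groups may submit to a queue, so one application's jobs stay in their own lane (the queue "appA" and groups "appa-devs"/"appa-admins" below are hypothetical names, assuming the same queue layout as above):

```xml
<!-- capacity-scheduler.xml fragment (sketch; queue and group names are hypothetical) -->
<!-- ACL value format is "users groups": a comma-separated user list,
     a space, then a comma-separated group list -->
<property>
  <name>yarn.scheduler.capacity.root.appA.acl_submit_applications</name>
  <value> appa-devs</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.appA.acl_administer_queue</name>
  <value> appa-admins</value>
</property>
```

Combined with the per-queue capacity caps, this keeps a misbehaving application contained, though data-level security (Ranger policies, HDFS permissions, encryption) still needs its own controls.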
Created 08-23-2016 03:02 AM
@Scott Shaw Thanks a lot