Support Questions


Hortonworks Hadoop platform vs datalake (looking for some guidance)


Has anyone built a data lake in their organization? I'm looking for some best practices around it.

Should we have one single data lake or multiple data lakes within an organization? Looking for some guidance; I'd appreciate it if someone could answer or point me in the right direction. I have googled many links but couldn't find a clear answer, so I'm looking for some real-life examples.

1 ACCEPTED SOLUTION

Contributor

One is a large environment, 20+ PB in size, whose data is completely different from the other environment's data. The reasons for separate lakes: they fall under different internal departments, the data is different, and the customers are different. Depending on the data, the cluster servers are located in different data centers, and one is open to the company-wide enterprise network while the other is open only to an internal network within the enterprise network.


5 REPLIES

Contributor

Hi,

Not sure if this answers your question, but I can give you some details.

We built an enterprise data lake using HDP 2.x. How many data lakes (environments) you should build depends on the data and the requirements. At my workplace we have multiple production environments with different kinds of data, and we enabled DistCp between a couple of them to get some data feeds from one environment to another. However, the end users and requirements are clearly different for these environments, as are the ways users access them: some need the data in NRT (near real time), while others can wait for results. So we provided multiple ways to access and retrieve data from our data lake, and end users chose whichever best met their requirements.
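As a rough sketch, the DistCp feed between environments mentioned above is a single command. The NameNode endpoints and paths here are hypothetical, and the command is only echoed as a dry run rather than executed:

```shell
# Hypothetical NameNode endpoints and feed path -- adjust for your clusters.
SRC="hdfs://nn-cluster-a:8020/landing_Zone/Raw_data/Trusted/feed1"
DST="hdfs://nn-cluster-b:8020/landing_Zone/Raw_data/Trusted/feed1"

# -update copies only files that differ; -p preserves permissions/ownership.
# Echoed as a dry run here; run the real command from an edge node.
echo hadoop distcp -update -p "$SRC" "$DST"
```

On a real cluster you would schedule this (e.g. via Oozie or cron) to keep the downstream environment's copy of the feed current.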

Hope this helps.


Are you referring to multiple data lakes in your organization, or to processing the same feed in the same data lake for different kinds of requirements?

Why not store the data in the same data lake, since, as you mentioned, you have multiple production environments?

Master Mentor

@Pankaj Singh

A data lake is simply a storage repository that holds a vast amount of raw data in its native format until it is needed. While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture to store data.

Enterprise data lakes run to petabytes (PB) or even exabytes (EB). Think of a data lake as a massive storage area where your ingestion jobs and data pipelines land all the data. You can create a directory-like structure such as the one below (see the attached screenshot for a visual aid):

/landing_Zone/Raw_data/refined 
/landing_Zone/Raw_data/Trusted 
/landing_Zone/Raw_data/sandbox 

Typically you apply Ranger policies to manage access, data encryption, etc. There are also tools such as Alation, to mention but a few, for managing the catalog by data stewards.
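As a minimal sketch, the zone layout above can be created in one pass. On a real cluster you would use `hdfs dfs -mkdir -p` and then scope Ranger HDFS policies to each zone path; plain `mkdir` under `/tmp` is used here only so the sketch runs locally:

```shell
# Zone layout from the post, rooted under a throwaway local base dir.
# On a cluster: hdfs dfs -mkdir -p /landing_Zone/Raw_data/refined  (etc.),
# then attach a Ranger HDFS policy per zone (read-only for consumers of
# Trusted, broader write access in sandbox, and so on).
BASE=/tmp/landing_Zone/Raw_data
mkdir -p "$BASE/refined" "$BASE/Trusted" "$BASE/sandbox"
ls "$BASE"
```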

The data lake can also feed upstream systems for real-time monitoring, or long-term storage like HDFS or Hive for analytics.

HTH


datalake02.jpg


Thanks for your answer. Our dataset is not huge, maybe 10-15 TB at the moment, and over a period of 2-3 years it may grow to 15-20 TB at most. We intend to build this data lake in the cloud, but not using the cloud provider's platform, as they don't provide data governance and lineage tracking. Our data set is mostly going to be unstructured.
